Setup of cc-backend

How to configure and deploy cc-backend

Introduction

cc-backend is the main hub within the ClusterCockpit framework. Its configuration consists of a general part in config.json and the cluster configurations in cluster.json files, which are part of the job archive. The job archive is the long-term persistent storage for all job metadata and metric data. Job metadata, including job statistics, and user data are stored in a SQL database. Secrets such as passwords and tokens are provided as environment variables. Environment variables can be initialized using a .env file residing in the same directory as cc-backend. If a .env file is used, environment variables that are already set take precedence over the values in the file.
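The precedence rule for .env files can be illustrated with a short Python sketch. This is not cc-backend's own loader: the function name load_dotenv and the parsing rules (comment lines, optional double quotes) are assumptions made for illustration only.

```python
import os


def load_dotenv(path, environ=None):
    """Read KEY=VALUE lines from path without overriding existing variables."""
    environ = os.environ if environ is None else environ
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            # Skip blank lines and comments (assumed syntax).
            if not line or line.startswith("#"):
                continue
            key, sep, value = line.partition("=")
            if not sep:
                continue  # not a KEY=VALUE line
            key = key.strip()
            value = value.strip().strip('"')
            # Variables that are already set take precedence over the file.
            environ.setdefault(key, value)
    return environ
```

Calling load_dotenv(".env") after exporting a variable in the shell leaves the exported value untouched and only fills in variables that are still missing.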

Note (cc-backend before v1.5.0)

In versions before v1.5.0, the .env file was the only way to set these environment variables; they could not be set by other means!

Configuration

cc-backend provides a command line switch to generate an initial template for all required configuration files apart from the job archive:

./cc-backend -init

This creates the ./var folder, generates initial versions of the config.json and .env files, and initializes a sqlite database file.

`config.json`

Below is a production configuration enabling the following functionality:

Use HTTPS only
Mark jobs as short jobs if they run for less than 5 minutes
Enable authentication and user syncing via an LDAP directory
Allow a user session to be initiated via a JWT token, e.g. by an IDM portal
Drop privileges after the privileged ports are bound
Use compression for metric data files in the job archive
Allow access to the REST API from all IPs
Enable re-sampling of time-series metric data for long jobs
Configure three clusters using one local cc-metric-store
Use a sqlite database (this is the default)

{
    "addr":            "0.0.0.0:443",
    "short-running-jobs-duration": 300,
    "ldap": {
        "url":        "ldaps://hpcldap.rrze.uni-erlangen.de",
        "user_base":   "ou=people,ou=hpc,dc=rrze,dc=uni-erlangen,dc=de",
        "search_dn":   "cn=hpcmonitoring,ou=roadm,ou=profile,ou=hpc,dc=rrze,dc=uni-erlangen,dc=de",
        "user_bind":   "uid={username},ou=people,ou=hpc,dc=rrze,dc=uni-erlangen,dc=de",
        "user_filter": "(&(objectclass=posixAccount))",
        "sync_interval": "24h"
    },
    "jwts": {
        "syncUserOnLogin": true,
        "updateUserOnLogin":true,
        "validateUser": false,
        "trustedIssuer": "https://portal.hpc.fau.de/",
        "max-age": "168h"
    },
    "https-cert-file": "/etc/letsencrypt/live/monitoring.nhr.fau.de/fullchain.pem",
    "https-key-file":  "/etc/letsencrypt/live/monitoring.nhr.fau.de/privkey.pem",
    "user":            "clustercockpit",
    "group":           "clustercockpit",
    "archive": {
        "kind": "file",
        "path": "./var/job-archive",
        "compression": 7,
        "retention": {
            "policy": "none"
        }
    },
    "apiAllowedIPs": [
        "*"
    ],
    "enable-resampling": {
        "trigger": 30,
        "resolutions": [
            600,
            300,
            120,
            60
        ]
    },
    "emission-constant": 317,
    "clusters": [
        {
            "name": "fritz",
            "metricDataRepository": {
                "kind": "cc-metric-store",
                "url": "http://localhost:8082",
                "token": "XYZ"
            },
            "filterRanges": {
                "numNodes": { "from": 1, "to": 64 },
                "duration": { "from": 0, "to": 86400 },
                "startTime": { "from": "2022-01-01T00:00:00Z", "to": null }
            }
        },
        {
            "name": "alex",
            "metricDataRepository": {
                "kind": "cc-metric-store",
                "url": "http://localhost:8082",
                "token": "XYZ"
            },
            "filterRanges": {
                "numNodes": { "from": 1, "to": 64 },
                "duration": { "from": 0, "to": 86400 },
                "startTime": { "from": "2022-01-01T00:00:00Z", "to": null }
            }
        },
        {
            "name": "woody",
            "metricDataRepository": {
                "kind": "cc-metric-store",
                "url": "http://localhost:8082",
                "token": "XYZ"
            },
            "filterRanges": {
                "numNodes": { "from": 1, "to": 1 },
                "duration": { "from": 0, "to": 172800 },
                "startTime": { "from": "2020-01-01T00:00:00Z", "to": null }
            }
        }
    ]
}

The cluster names have to match the clusters configured in the job-archive. The filter ranges in the cluster configuration affect the filter UI limits in frontend views and should reflect your typical job properties.
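A quick consistency check between config.json and the job archive might look like the following Python sketch. The function name missing_cluster_dirs is hypothetical, and the check only assumes the directory layout described on this page (one directory per cluster containing a cluster.json file); it is not part of cc-backend.

```python
import json
from pathlib import Path


def missing_cluster_dirs(config_path, archive_path):
    """Return configured cluster names without a cluster.json in the archive."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    archive = Path(archive_path)
    missing = []
    for cluster in config.get("clusters", []):
        name = cluster["name"]
        # Each cluster needs <archive>/<name>/cluster.json.
        if not (archive / name / "cluster.json").is_file():
            missing.append(name)
    return missing
```

An empty result from missing_cluster_dirs("config.json", "./var/job-archive") means every configured cluster has a matching archive directory.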

Job archive

If you place the job-archive in the ./var folder, create the folder with:

mkdir -p ./var/job-archive

The job-archive is versioned; the current version is documented in the Release Notes. Currently you have to create the version file manually when initializing the job-archive:

echo 2 > ./var/job-archive/version.txt

Directory layout

ClusterCockpit supports multiple clusters. For each cluster you need to create a directory named after the cluster and a cluster.json file specifying the metric list and the hardware partitions within the cluster. Hardware partitions are subsets of a cluster with homogeneous hardware (CPU type, memory capacity, GPUs) and are called subclusters in ClusterCockpit.

For the above configuration the job archive directory hierarchy looks like the following:

./var/job-archive/
     version.txt
     fritz/
        cluster.json
     alex/
        cluster.json
     woody/
        cluster.json
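The directory skeleton above can also be created programmatically. The following Python sketch uses a hypothetical helper init_job_archive; the default version 2 is taken from the echo command shown earlier and should be checked against the Release Notes. The cluster.json files still have to be written separately.

```python
from pathlib import Path


def init_job_archive(root, clusters, version=2):
    """Create the archive root, version.txt, and one directory per cluster."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    # The version file holds a single number followed by a newline.
    (root / "version.txt").write_text(f"{version}\n", encoding="utf-8")
    for name in clusters:
        (root / name).mkdir(exist_ok=True)
    return root
```

For the configuration on this page the call would be init_job_archive("./var/job-archive", ["fritz", "alex", "woody"]).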

`cluster.json`: Basics

The cluster.json file contains three top-level parts: the name of the cluster, the metric configuration, and the subcluster list. You can find the latest cluster.json schema here. The basic layout of a cluster.json file is:

{
  "name": "fritz",
  "metricConfig": [
    {
      "name": "cpu_load",
      ...
    },
    {
      "name": "mem_used",
      ...
    }
  ],
  "subClusters": [
    {
      "name": "main",
      ...
    },
    {
      "name": "spr",
      ...
    }
  ]
}
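As a rough sanity check of this layout, the three top-level parts can be verified with a few lines of Python. This sketch is not a replacement for validation against the official schema; the function name check_cluster_json is hypothetical, and the requirement that "name" equals the directory name is an assumption derived from the directory layout described above.

```python
import json

REQUIRED_KEYS = ("name", "metricConfig", "subClusters")


def check_cluster_json(text, expected_name):
    """Return a list of problems found in the cluster.json content."""
    cluster = json.loads(text)
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in cluster]
    # Assumption: the cluster name equals the name of its archive directory.
    if "name" in cluster and cluster["name"] != expected_name:
        problems.append(f"name is {cluster['name']!r}, expected {expected_name!r}")
    return problems
```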

`cluster.json`: Metric configuration

Example for a metric list entry with only the required attributes:

"metricConfig": [
    {
        "name": "flops_sp",
        "unit": {
            "base": "Flops/s",
            "prefix": "G"
        },
        "scope": "hwthread",
        "timestep": 60,
        "aggregation": "sum",
        "peak": 5600,
        "normal": 1000,
        "caution": 200,
        "alert": 50
    }
]

Explanation of required attributes:

name: The metric name. This must match the metric name in cc-metric-store!
unit: The metric's unit. Base can be: B (for bytes), F (for flops), B/s, F/s, CPI (for cycles per instruction), IPC (for instructions per cycle), Hz, W (for Watts), °C, or an empty string for no unit. Prefix can be: K, M, G, T, P, or E.
scope: The native metric measurement resolution. Can be node, socket, memoryDomain, core, hwthread, or accelerator.
timestep: The measurement interval in seconds.
aggregation: How the metric is aggregated within the node topology. Can be one of sum, avg, or an empty string for no aggregation (node level metrics).
Metric thresholds: Whether a threshold applies to larger or smaller values depends on the optional attribute lowerIsBetter (default false).
- peak: The maximum possible metric value
- normal: A common metric value level
- caution: Metric value requires attention
- alert: Metric value requires immediate attention
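How the thresholds and lowerIsBetter interact can be sketched in Python as follows. The exact comparison rules used by cc-backend and its frontend are not spelled out here, so the logic below (strict comparisons, three result classes, the function name classify) is an assumption meant only to illustrate the direction of the thresholds.

```python
def classify(value, metric):
    """Classify a metric value as 'alert', 'caution', or 'normal'."""
    if metric.get("lowerIsBetter", False):
        # Lower is better: values above a threshold are a problem.
        if value > metric["alert"]:
            return "alert"
        if value > metric["caution"]:
            return "caution"
    else:
        # Higher is better: values below a threshold are a problem.
        if value < metric["alert"]:
            return "alert"
        if value < metric["caution"]:
            return "caution"
    return "normal"
```

With the flops_sp entry above (caution 200, alert 50), a value of 30 would be classified as alert and a value of 100 as caution under these assumptions.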

Optional attributes:

footprint: Marks the metric as a job footprint metric. Set it to how the footprint is aggregated: avg, min, or max. Footprint metrics are shown in the footprint UI component and the job view polar plot.
energy: Whether the metric is used to calculate the job energy. Can be power (metric has unit Watts) or energy (metric has unit Joules).
lowerIsBetter: Whether lower values are better. Influences the frontend UI and the evaluation of metric thresholds.
subClusters (Type: array of objects): Overwrites for specific subClusters. By default, metrics are valid for all subClusters. It is possible to overwrite or remove metrics for specific subClusters. If a metric is overwritten for a subCluster, all attributes have to be set; partial overwrites are not supported. Example for a metric overwrite:

{
    "name": "mem_used",
    "unit": {
        "base": "B",
        "prefix": "G"
    },
    "scope": "node",
    "aggregation": "sum",
    "footprint": "max",
    "timestep": 60,
    "lowerIsBetter": true,
    "peak": 256,
    "normal": 128,
    "caution": 200,
    "alert": 240,
    "subClusters": [
        {
            "name": "spr1tb",
            "footprint": "max",
            "peak": 1024,
            "normal": 512,
            "caution": 900,
            "alert": 1000
        },
        {
            "name": "spr2tb",
            "footprint": "max",
            "peak": 2048,
            "normal": 1024,
            "caution": 1800,
            "alert": 2000
        }
    ]
}

This metric characterizes the memory capacity used by a job. The aggregation for a job is the sum of all node values. The footprint is the largest allocated memory capacity. Because lowerIsBetter is set, jobs whose values exceed the metric thresholds are marked. Two subClusters with 1 TB and 2 TB memory capacity overwrite the default thresholds, which assume 256 GB.

Example for removing metrics for a subcluster:

{
     "name": "vectorization_ratio",
     "unit": {
         "base": ""
     },
     "scope": "hwthread",
     "aggregation": "avg",
     "timestep": 60,
     "peak": 100,
     "normal": 60,
     "caution": 40,
     "alert": 10,
     "subClusters": [
         {
             "name": "icelake",
             "remove": true
         }
     ]
}
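The effect of overwrites and removals can be pictured with a small Python sketch that resolves the settings of one metric for a given subcluster. It is an illustration, not cc-backend's implementation: the function name effective_metric is hypothetical, and the sketch assumes that an overwrite must carry all four thresholds and that the remaining attributes (unit, scope, timestep, ...) are taken from the cluster-wide entry, as the examples above suggest.

```python
THRESHOLDS = ("peak", "normal", "caution", "alert")


def effective_metric(metric, subcluster):
    """Return the metric settings for one subcluster, or None if removed."""
    base = {key: value for key, value in metric.items() if key != "subClusters"}
    for override in metric.get("subClusters", []):
        if override["name"] != subcluster:
            continue
        if override.get("remove", False):
            return None  # metric is not available on this subcluster
        missing = [key for key in THRESHOLDS if key not in override]
        if missing:
            raise ValueError(f"overwrite for {subcluster} lacks {missing}")
        # "name" in the overwrite is the subcluster name, not the metric name.
        settings = {k: v for k, v in override.items() if k != "name"}
        return {**base, **settings}
    return base  # no overwrite: cluster-wide settings apply
```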

`cluster.json`: subcluster configuration

SubClusters in ClusterCockpit are subsets of a cluster with homogeneous hardware. The subCluster part specifies the node topology, the list of nodes that are part of the subCluster, and the node capabilities that are used to draw the roofline diagrams.

Here is an example:

{
    "name": "icelake",
    "nodes": "w22[01-35],w23[01-35],w24[01-20],w25[01-20]",
    "processorType": "Intel Xeon Gold 6326",
    "socketsPerNode": 2,
    "coresPerSocket": 16,
    "threadsPerCore": 1,
    "flopRateScalar": {
        "unit": {
            "base": "F/s",
            "prefix": "G"
        },
        "value": 432
    },
    "flopRateSimd": {
        "unit": {
            "base": "F/s",
            "prefix": "G"
        },
        "value": 9216
    },
    "memoryBandwidth": {
        "unit": {
            "base": "B/s",
            "prefix": "G"
        },
        "value": 350
    },
    "topology": {
        "node": [
            0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
           21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
           41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
           61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71
        ],
        "socket": [
            [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
             20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35 ],
            [ 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
             54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71 ]
        ],
        "memoryDomain": [
            [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17 ],
            [ 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35 ],
            [ 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53 ],
            [ 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71 ]
        ],
        "core": [
            [ 0 ], [ 1 ], [ 2 ], [ 3 ], [ 4 ], [ 5 ], [ 6 ], [ 7 ], [ 8 ], [ 9 ], [ 10 ],
           [ 11 ], [ 12 ], [ 13 ], [ 14 ], [ 15 ], [ 16 ], [ 17 ], [ 18 ], [ 19 ], [ 20 ],
           [ 21 ], [ 22 ], [ 23 ], [ 24 ], [ 25 ], [ 26 ], [ 27 ], [ 28 ], [ 29 ], [ 30 ],
           [ 31 ], [ 32 ], [ 33 ], [ 34 ], [ 35 ], [ 36 ], [ 37 ], [ 38 ], [ 39 ], [ 40 ],
           [ 41 ], [ 42 ], [ 43 ], [ 44 ], [ 45 ], [ 46 ], [ 47 ], [ 48 ], [ 49 ], [ 50 ],
           [ 51 ], [ 52 ], [ 53 ], [ 54 ], [ 55 ], [ 56 ], [ 57 ], [ 58 ], [ 59 ], [ 60 ],
           [ 61 ], [ 62 ], [ 63 ], [ 64 ], [ 65 ], [ 66 ], [ 67 ], [ 68 ], [ 69 ], [ 70 ], [ 71 ]
        ]
    }
}
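The nodes attribute in the example above uses a compact range notation. To see which host names such an expression covers, it can be expanded with a short Python sketch. The interpretation of the syntax (comma-separated terms, bracketed zero-padded ranges) is inferred from the examples on this page and may not cover everything cc-backend accepts; the function names are hypothetical.

```python
import re


def split_terms(expr):
    """Split on commas that are not inside square brackets."""
    terms, depth, current = [], 0, ""
    for ch in expr:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            terms.append(current)
            current = ""
        else:
            current += ch
    if current:
        terms.append(current)
    return terms


def expand_nodes(expr):
    """Expand e.g. 'w22[01-03],a0731' into a list of host names."""
    hosts = []
    for term in split_terms(expr):
        match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", term)
        if match is None:
            hosts.append(term)  # plain host name without a range
            continue
        prefix, body, suffix = match.groups()
        for part in body.split(","):
            start, _, end = part.partition("-")
            end = end or start
            width = len(start)  # preserve zero padding
            for number in range(int(start), int(end) + 1):
                hosts.append(f"{prefix}{number:0{width}d}{suffix}")
    return hosts
```

Under these assumptions, expand_nodes("w22[01-03],a0731") yields w2201, w2202, w2203, and a0731.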

Because writing this by hand is tedious, we provide a Perl script as part of cc-backend that generates a subCluster template. The script only works if the LIKWID tools are installed and in the PATH. You also need the LIKWID library for cc-metric-store. You can find instructions on how to install LIKWID here.
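For nodes with a completely regular layout, the structure of the topology lists can also be illustrated in Python. The sketch below is not the Perl script mentioned above and does not query the hardware: it assumes consecutive OS processor IDs, one hardware thread per core, and memory domains that split each socket evenly. Real systems often number processors differently, so output generated on the actual hardware remains authoritative. The function name regular_topology is hypothetical.

```python
def regular_topology(sockets, cores_per_socket, domains_per_socket=1):
    """Build topology lists for a node with consecutive processor IDs."""
    if cores_per_socket % domains_per_socket != 0:
        raise ValueError("cores per socket must be divisible by domains per socket")
    total = sockets * cores_per_socket
    ids = list(range(total))
    per_domain = cores_per_socket // domains_per_socket
    return {
        "node": ids,
        "socket": [ids[i:i + cores_per_socket]
                   for i in range(0, total, cores_per_socket)],
        "memoryDomain": [ids[i:i + per_domain]
                         for i in range(0, total, per_domain)],
        # One hardware thread per core is assumed.
        "core": [[i] for i in ids],
    }
```

For example, regular_topology(2, 36, 2) produces 72 processor IDs, two sockets with 36 IDs each, and four memory domains with 18 IDs each.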

The resource ID for cores is the OS processor ID. For GPUs we recommend using the PCI-E address as the resource ID.

Here is an example for a subCluster with GPU accelerators:

{
    "name": "a100m80",
    "nodes": "a[0531-0537],a[0631-0633],a0731,a[0831-0833],a[0931-0934]",
    "processorType": "AMD Milan",
    "socketsPerNode": 2,
    "coresPerSocket": 64,
    "threadsPerCore": 1,
    "flopRateScalar": {
        "unit": {
            "base": "F/s",
            "prefix": "G"
        },
        "value": 432
    },
    "flopRateSimd": {
        "unit": {
            "base": "F/s",
            "prefix": "G"
        },
        "value": 9216
    },
    "memoryBandwidth": {
        "unit": {
            "base": "B/s",
            "prefix": "G"
        },
        "value": 400
    },
    "topology": {
        "node": [
            0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
         21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
         41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
         61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
         81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100,
        101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
        117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127
        ],
        "socket": [
            [
               0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
              21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
              41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
              61, 62, 63
            ],
            [
              64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
              81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100,
             101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
             117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127
            ]
        ],
        "memoryDomain": [
            [
              0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
             21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
             41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
             61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
             81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100,
            101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
            117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127
            ]
        ],
        "core": [
            [ 0 ], [ 1 ], [ 2 ], [ 3 ], [ 4 ], [ 5 ], [ 6 ], [ 7 ], [ 8 ], [ 9 ], [ 10 ], [ 11 ],
            [ 12 ], [ 13 ], [ 14 ], [ 15 ], [ 16 ], [ 17 ], [ 18 ], [ 19 ], [ 20 ], [ 21 ], [ 22 ],
            [ 23 ], [ 24 ], [ 25 ], [ 26 ], [ 27 ], [ 28 ], [ 29 ], [ 30 ], [ 31 ], [ 32 ], [ 33 ],
            [ 34 ], [ 35 ], [ 36 ], [ 37 ], [ 38 ], [ 39 ], [ 40 ], [ 41 ], [ 42 ], [ 43 ], [ 44 ],
            [ 45 ], [ 46 ], [ 47 ], [ 48 ], [ 49 ], [ 50 ], [ 51 ], [ 52 ], [ 53 ], [ 54 ], [ 55 ],
            [ 56 ], [ 57 ], [ 58 ], [ 59 ], [ 60 ], [ 61 ], [ 62 ], [ 63 ], [ 64 ], [ 65 ], [ 66 ],
            [ 67 ], [ 68 ], [ 69 ], [ 70 ], [ 71 ], [ 72 ], [ 73 ], [ 74 ], [ 75 ], [ 76 ], [ 77 ], [ 78 ],
            [ 79 ], [ 80 ], [ 81 ], [ 82 ], [ 83 ], [ 84 ], [ 85 ], [ 86 ], [ 87 ], [ 88 ], [ 89 ],
            [ 90 ], [ 91 ], [ 92 ], [ 93 ], [ 94 ], [ 95 ], [ 96 ], [ 97 ], [ 98 ], [ 99 ], [ 100 ],
           [ 101 ], [ 102 ], [ 103 ], [ 104 ], [ 105 ], [ 106 ], [ 107 ], [ 108 ], [ 109 ], [ 110 ],
           [ 111 ], [ 112 ], [ 113 ], [ 114 ], [ 115 ], [ 116 ], [ 117 ], [ 118 ], [ 119 ], [ 120 ],
           [ 121 ], [ 122 ], [ 123 ], [ 124 ], [ 125 ], [ 126 ], [ 127 ]
        ],
        "accelerators": [
            {
                "id": "00000000:0E:00.0",
                "type": "Nvidia GPU",
                "model": "A100"
            },
            {
                "id": "00000000:13:00.0",
                "type": "Nvidia GPU",
                "model": "A100"
            },
            {
                "id": "00000000:49:00.0",
                "type": "Nvidia GPU",
                "model": "A100"
            },
            {
                "id": "00000000:4F:00.0",
                "type": "Nvidia GPU",
                "model": "A100"
            },
            {
                "id": "00000000:90:00.0",
                "type": "Nvidia GPU",
                "model": "A100"
            },
            {
                "id": "00000000:96:00.0",
                "type": "Nvidia GPU",
                "model": "A100"
            },
            {
                "id": "00000000:CC:00.0",
                "type": "Nvidia GPU",
                "model": "A100"
            },
            {
                "id": "00000000:D1:00.0",
                "type": "Nvidia GPU",
                "model": "A100"
            }
        ]
    }
}

You have to ensure that the metric collector also uses the PCI-E address as a resource ID.
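A simple way to catch typos in the accelerator list is to check that every id looks like a PCI-E address and occurs only once. The Python sketch below assumes the address format used in the example above (eight hex digits for the domain, then bus, device, and function); other tools may print a shorter domain part, and the function name check_accelerator_ids is hypothetical.

```python
import re

# Format as in the example: 00000000:0E:00.0 (domain:bus:device.function)
PCI_ADDRESS = re.compile(r"[0-9A-Fa-f]{8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-7]")


def check_accelerator_ids(accelerators):
    """Return a list of problems found in the accelerator entries."""
    problems = []
    seen = set()
    for accelerator in accelerators:
        acc_id = accelerator["id"]
        if PCI_ADDRESS.fullmatch(acc_id) is None:
            problems.append(f"{acc_id}: not a PCI-E address")
        # Compare case-insensitively so 0e and 0E count as the same device.
        if acc_id.upper() in seen:
            problems.append(f"{acc_id}: duplicate id")
        seen.add(acc_id.upper())
    return problems
```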

Environment variables

Secrets are provided as environment variables. The only two required secrets are JWT_PUBLIC_KEY and JWT_PRIVATE_KEY, which are used to sign generated JWT tokens and to validate JWT authentication.

Please refer to the environment reference for details.
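Before starting the service, a deployment script could verify that both required secrets are present. The following Python sketch only checks that the variables are set and non-empty; it does not validate the key material, and the function name missing_secrets is hypothetical.

```python
import os

REQUIRED_SECRETS = ("JWT_PUBLIC_KEY", "JWT_PRIVATE_KEY")


def missing_secrets(environ=None):
    """Return the names of required secrets that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_SECRETS if not environ.get(name)]
```

An empty list from missing_secrets() indicates that both variables are available in the current environment.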
