Tutorials
1 - ClusterCockpit installation manual
Introduction
ClusterCockpit requires the following components:
- A node agent running on all compute nodes that measures the required metrics and forwards all data to a time series metrics database. ClusterCockpit provides its own node agent cc-metric-collector. This is the recommended setup, but ClusterCockpit can also be integrated with other node agents, e.g. collectd, prometheus or telegraf. In this case you have to use it with the accompanying time series database.
- A metric time series database. ClusterCockpit provides its own solution, cc-metric-store, which is the recommended choice. There is also metric store support for Prometheus and InfluxDB; InfluxDB is currently barely tested. Usually only one instance of the time series database is required.
- The API and web interface backend cc-backend. Only one instance of cc-backend is required. It provides the HTTP server at the desired monitoring URL for serving the web interface.
- A SQL database. It is recommended to use the built-in sqlite database for ClusterCockpit. You can set up Litestream as a service which performs continuous replication of the sqlite database to multiple storage backends. Optionally, cc-backend also supports MariaDB/MySQL as SQL database backends.
- A batch job scheduler adapter that provides the job meta information to cc-backend. This is done by using the provided REST API for starting and stopping jobs (an example call is sketched after this list). Currently available adapters:
  - Slurm: Python based solution (cc-slurm-sync) maintained by PC2 Paderborn
  - Slurm: Golang based solution (cc-slurm-adapter) maintained by NHR@FAU
  - HTCondor: cc-condor-sync maintained by Saarland University
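To give a rough idea of what such an adapter does, the sketch below shows hypothetical REST calls notifying cc-backend about a job start and stop. The endpoint paths, payload fields and values are illustrative assumptions only; the authoritative payload format is defined in the cc-backend REST API reference.

# Hypothetical job start notification (fields shown are illustrative and not complete)
curl -X POST "https://monitoring.example.org/api/jobs/start_job/" \
  -H "Authorization: Bearer $JWT" -H "Content-Type: application/json" \
  -d '{ "jobId": 123456, "user": "jdoe", "project": "abc123", "cluster": "fritz",
        "startTime": 1700000000, "resources": [ { "hostname": "f0101" } ] }'

# Hypothetical job stop notification
curl -X POST "https://monitoring.example.org/api/jobs/stop_job/" \
  -H "Authorization: Bearer $JWT" -H "Content-Type: application/json" \
  -d '{ "jobId": 123456, "cluster": "fritz", "stopTime": 1700003600, "jobState": "completed" }'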
Server Hardware
cc-backend is threaded and therefore profits from multiple cores. It does not require a lot of memory. It is possible to run it in a virtual machine. For best performance the ./var folder of cc-backend, which contains the sqlite database file and the file based job archive, should be located on a fast storage device, ideally a NVMe SSD. The sqlite database file and the job archive will grow over time (if you are not removing old jobs using a retention policy). Our setup covering five clusters over 4 years takes 50GB for the sqlite database and around 700GB for the job archive.
cc-metric-store is also threaded and requires a fixed amount of main memory. How much depends on your configuration, but 128GB should be enough for most setups. We run cc-backend and cc-metric-store on the same server as systemd services.
Planning and initial configuration
We recommend the following order for planning and configuring a ClusterCockpit installation:
- Set up your metric list: With two exceptions you are in general free to choose which metrics you want. Those exceptions are mem_bw for main memory bandwidth and flops_any for flop throughput (double precision flops are upscaled to single precision rates). The metric list is an integral component for the configuration of all ClusterCockpit components.
- Plan the deployment
- Configure and deploy cc-metric-store
- Configure and deploy cc-metric-collector
- Configure and deploy cc-backend
- Set up the batch job scheduler adapter
You can find complete example production configurations in the cc-examples repository.
Common problems
Up front, here is a list of common issues people face when installing ClusterCockpit for the first time.
Inconsistent metric names across components
At the moment you need to configure the metric list in every component separately. In cc-metric-collector the metrics that are sent to the cc-metric-store are determined by the collector configuration and possible renaming in the router configuration. For cc-metric-store you need to specify a metric list in config.json in order to configure the native metric frequency and how a metric is aggregated. Metrics that are sent to cc-metric-store and do not appear in its configuration are silently dropped!
In cc-backend you need to create a cluster.json configuration in the job archive for every cluster. There you set up which metrics are shown in the web frontend, including many additional properties for the metrics. For running jobs cc-backend will query cc-metric-store for exactly those metric names, and if there is no match there will be an error.
We provide a JSON schema based specification as part of the job meta and metric schema. This specification recommends a minimal set of metrics and we suggest using the metric names provided there. While it is up to you whether you want to adhere to the metric names suggested in the schema, there are two exceptions: mem_bw (main memory bandwidth) and flops_any (total flop rate with DP flops scaled to SP flops) are required for the roofline plots to work.
Inconsistent device naming between cc-metric-collector and batch job scheduler adapter
The batch job scheduler adapter (e.g. cc-slurm-sync) provides a list of resources that are used by the job. cc-backend will query cc-metric-store with exactly those resource IDs to get all metrics for a job. As a consequence, if cc-metric-collector uses a different naming scheme, the metrics will not be found.
If you have GPU accelerators, cc-slurm-sync should use the PCI-E device addresses as IDs. The option use_pci_info_as_type_id for the nvidia and rocm-smi collectors in the collector configuration must be set to true, as sketched below.
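As a minimal sketch, the relevant fragment of a collector configuration could look like the following. Only the option discussed above is shown; the exact collector key names and any additional required options should be checked against the cc-metric-collector documentation.

"nvidia": {
  "use_pci_info_as_type_id": true
},
"rocm_smi": {
  "use_pci_info_as_type_id": true
}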
To validate and debug problems you can use the cc-metric-store debug endpoint:
curl -H "Authorization: Bearer $JWT" -D - "http://localhost:8080/api/debug"
This will return the current state of cc-metric-store. You can search for a hostname and then scroll through all topology leaf nodes that are available.
Missing nodes in subcluster node lists
ClusterCockpit supports multiple subclusters as part of a cluster. A subcluster in this context is a homogeneous hardware partition with a dedicated metric and device configuration. cc-backend dynamically matches the nodes a job runs on to the subcluster node lists to figure out on which subcluster a job is running. If nodes are missing in a subcluster node list, this matching fails and the metric list used may be wrong.
2 - Decide on metric list
Introduction
Deciding on a sensible and meaningful set of metrics is a decisive factor for how useful the monitoring will be. As part of a collaborative project, several academic HPC centers came up with a minimal set of metrics including their naming. Using consistent names is crucial for establishing what the metrics mean, and we urge you to adhere to the metric names suggested there. You can find this list as part of the ClusterCockpit job data structure JSON schemas.
ClusterCockpit supports multiple clusters within one instance of cc-backend. You have to create separate metric lists for each of them. In cc-backend the metric lists are provided as part of the cluster configuration. Every cluster is configured as part of the job archive using one cluster.json file per cluster. This how-to describes in detail how to create a cluster.json file.
Required Metrics
Flop throughput rate: flops_any
Memory bandwidth: mem_bw
Memory capacity used: mem_used
Requested CPU core utilization: cpu_load
Total fast network bandwidth: net_bw
Total file IO bandwidth: file_bw
Recommended CPU Metrics
Instruction throughput in instructions per cycle: ipc
User active CPU core utilization: cpu_user
Double precision flop throughput rate: flops_dp
Single precision flop throughput rate: flops_sp
Average core frequency: clock
CPU power consumption: rapl_power
Recommended GPU Metrics
GPU utilization: acc_used
GPU memory capacity used: acc_mem_used
GPU power consumption: acc_power
Recommended node level metrics
Ethernet read bandwidth: eth_read_bw
Ethernet write bandwidth: eth_write_bw
Fast network read bandwidth: ic_read_bw
Fast network write bandwidth: ic_write_bw
File system metrics
Warning
A file system metric tree is currently not yet supported in cc-backend.
In the schema a tree of file system metrics is suggested. This allows providing a similar set of metrics for the different file systems used in a cluster. The suggested file system type names are:
- nfs
- lustre
- gpfs
- nvme
- ssd
- hdd
- beegfs
File system read bandwidth: read_bw
File system write bandwidth: write_bw
File system read requests: read_req
File system write requests: write_req
File system inodes used: inodes
File system open and close: accesses
File system file syncs: fsync
File system file creates: create
File system file open: open
File system file close: close
File system seek operations: seek
3 - Deployment
Why we do not provide a docker container
The ClusterCockpit web backend binary has no external dependencies; everything is included in the binary. The external assets, the SQL database and the job archive, would also be external in a docker setup. The only advantage of a docker setup would be that the initial configuration is automated, but this only needs to be done one time. We therefore think that setting up docker, securing and maintaining it is not worth the effort.
It is recommended to install all ClusterCockpit components in a common directory, e.g. /opt/monitoring, /var/monitoring or /var/clustercockpit. In the following we use /opt/monitoring.
Two Systemd services run on the central monitoring server:
- clustercockpit: the cc-backend binary in /opt/monitoring/cc-backend.
- cc-metric-store: the cc-metric-store binary in /opt/monitoring/cc-metric-store.
ClusterCockpit is deployed as a single binary that embeds all static assets.
We recommend keeping all cc-backend binary versions in a folder named archive and linking the currently active one from the cc-backend root. This allows for easy roll-back in case something doesn't work.
Please Note
cc-backend is started with root rights to open the privileged ports (80 and 443). It is recommended to set the configuration options user and group, in which case cc-backend will drop root permissions once the ports are taken. You have to take care that the ownership of the ./var folder and its contents is set accordingly.
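For example, assuming user and group are both set to clustercockpit (as in the example config.json later in this manual) and cc-backend lives in /opt/monitoring/cc-backend, the ownership could be adjusted like this:

sudo chown -R clustercockpit:clustercockpit /opt/monitoring/cc-backend/var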
Workflow to deploy new version
This example assumes the DB and job archive versions did not change.
- Stop the systemd service:
sudo systemctl stop clustercockpit.service
- Backup the sqlite DB file! This is as simple as copying it.
- Copy the new cc-backend binary to /opt/monitoring/cc-backend/archive (Tip: Use a date tag like YYYYMMDD-cc-backend). Here is an example:
cp ~/cc-backend /opt/monitoring/cc-backend/archive/20231124-cc-backend
- Link from the cc-backend root to the current version:
ln -s /opt/monitoring/cc-backend/archive/20231124-cc-backend /opt/monitoring/cc-backend/cc-backend
- Start the systemd service:
sudo systemctl start clustercockpit.service
- Check if everything is ok:
sudo systemctl status clustercockpit.service
- Check the log for issues:
sudo journalctl -u clustercockpit.service
- Check in the ClusterCockpit web frontend and your Slurm adapters whether anything is broken!
4 - Setup of cc-metric-store
Introduction
The cc-metric-store provides an in-memory metric time series cache. It is configured via a JSON configuration file (config.json). Metrics are received via messages using the ClusterCockpit ccMessage protocol. It can receive messages via an HTTP REST API or by subscribing to a NATS subject. Requesting data is at the moment only possible via the HTTP REST API.
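For illustration, a single metric sample could be pushed to the write endpoint roughly as follows. This is a hedged sketch: the listen address matches the example configuration below, $JWT is assumed to hold a valid signed token, and the InfluxDB line protocol body with cluster, hostname and type tags follows the conventions used by the cc-metric-collector HTTP sink. Note that the timestamp is in seconds.

curl -X POST "http://localhost:8082/api/write?cluster=fritz" \
  -H "Authorization: Bearer $JWT" \
  --data-binary 'cpu_load,cluster=fritz,hostname=f0101,type=node value=1.5 1700000000'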
Configuration
For a complete list of configuration options see here. Minimal example of a configuration file:
{
"metrics": {
"clock": {
"frequency": 60,
"aggregation": "avg"
},
"mem_bw": {
"frequency": 60,
"aggregation": "sum"
},
"flops_any": {
"frequency": 60,
"aggregation": "sum"
},
"flops_dp": {
"frequency": 60,
"aggregation": "sum"
},
"flops_sp": {
"frequency": 60,
"aggregation": "sum"
},
"mem_used": {
"frequency": 60
}
},
"checkpoints": {
"interval": "12h",
"directory": "./var/checkpoints",
"restore": "48h"
},
"archive": {
"interval": "50h",
"directory": "./var/archive"
},
"http-api": {
"address": "localhost:8082"
},
"retention-in-memory": "48h",
"jwt-public-key": "kzfYrYy+TzpanWZHJ5qSdMj5uKUWgq74BWhQG6copP0="
}
The cc-metric-store will only accept metrics that are specified in its metric list. The metric names must match exactly! The frequency for a metric specifies how incoming values are binned. If multiple values are received in the same interval, older values are overwritten; if no value is received in an interval, there is a gap. cc-metric-store can aggregate metrics across topological entities, e.g., to compute an aggregate node scope value from core scope metrics. The aggregation attribute specifies how the aggregate value is computed. Resource metrics usually require sum, whereas diagnostic metrics (e.g., clock) require avg. For clock a sum would obviously make no sense. Metrics that are only available at node scope can omit the aggregation attribute.
The most important configuration option is the retention-in-memory setting. It specifies for which duration back in time metrics should be provided. This should be long enough to cover common job durations plus a safety margin. This option also influences the main memory footprint. cc-metric-store will accept any scope for any cluster for all configured metrics. The memory footprint scales with the number of nodes, the number of native metric scopes (cores, sockets), the number of metrics, and the memory retention time divided by the frequency.
The cc-metric-store supports checkpoints and archiving. Currently checkpoints and archives are in JSON format. Checkpoints are always performed on shutdown. To not lose data on a crash or other failure, checkpoints are also written regularly in fixed intervals. The restore option indicates which duration should be loaded into memory on startup. Usually this should match the retention-in-memory setting. Checkpoints that are not needed anymore are moved and compressed into an archive directory in regular intervals. This keeps the raw metric data. There is currently no support for reading or processing this data. We are also considering replacing the current JSON format with a binary file format (e.g. Apache Arrow). You may want to set up a cron job to delete older archive files.
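As a sketch, such a cleanup could be a daily cron entry that deletes archive files older than a chosen retention period (one year here). The path assumes cc-metric-store runs in /opt/monitoring/cc-metric-store with the archive directory from the example configuration above; adjust the path, user and age to your setup.

# /etc/cron.d/cc-metric-store-cleanup (illustrative)
0 3 * * * clustercockpit find /opt/monitoring/cc-metric-store/var/archive -type f -mtime +365 -delete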
Finally, the http-api section specifies the address and port on which the server should listen. Optionally, for HTTPS, paths to TLS cert and key files can be specified. The REST API uses JWT token based authentication. The option jwt-public-key provides the public key to check the signed JWT token.
Authentication
For authentication, signed (but unencrypted) JWT tokens are used. Only Ed25519/EdDSA cryptographic key pairs are supported. A client has to sign the token with its private key; on the server side it is checked whether the configured public key matches the private key with which the token was signed, whether the token was altered after signing, and whether the token has expired. All other token attributes are ignored.
We provide an article on how to generate JWTs. There is also a background info article on JWT usage in ClusterCockpit. Tokens are cached in cc-metric-store to minimize overhead.
NATS
As an alternative to HTTP REST, cc-metric-store can also receive metrics via NATS. You can find more information about NATS in this background article. To enable NATS in cc-metric-store add the following section to the configuration file:
{
"nats": [
{
"address": "nats://localhost:4222",
"creds-file-path": "test.creds",
"subscriptions": [
{
"subscribe-to": "ee-hpc-nats",
"cluster-tag": "fritz2"
}
]
}
]
}
5 - Setup of cc-metric-collector
Introduction
cc-metric-collector is a node agent for measuring, processing and forwarding node level metrics. It is currently mostly documented via Markdown documents in its GitHub repository.
The configuration consists of the following parts:
- collectors: Metric sources. There is a large number of collectors available. Important, and also most demanding to configure, is the likwid collector for measuring hardware performance counter metrics.
- router: Rename, drop and modify metrics.
- sinks: Configuration of where to send the metrics.
- receivers: Receive metrics. Useful as a proxy to connect different metric sinks. Can be left empty in most cases.
A sketch of the top-level configuration that ties these parts together follows below.
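As a minimal sketch (the file names shown are the conventional ones; check the repository documentation for the authoritative layout), the top-level cc-metric-collector.json references one configuration file per part together with the global measurement interval and duration:

{
  "sinks": "sinks.json",
  "collectors": "collectors.json",
  "receivers": "receivers.json",
  "router": "router.json",
  "interval": "10s",
  "duration": "1s"
}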
Build and deploy
Since cc-metric-collector needs to be installed on every compute node and requires configuration specific to the node hardware, it is demanding to install and configure. The Makefile supports generating RPM and DEB packages. There is also a systemd service file included which you may take as a blueprint. More information on deployment is available here.
Collectors
You may want to have a look at our collector configurations, which include configurations for many different systems, Intel and AMD CPUs and NVIDIA GPUs. The general recommendation is to first decide on the metrics you need and then figure out which collectors are required. For hardware performance counter metrics you may want to have a look at the likwid-perfctr performance groups for inspiration on how to compute the required derived metrics on your target processor architecture.
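For illustration, a minimal collectors configuration enabling a few simple collectors could look like the sketch below. The collector names and options should be verified against the cc-metric-collector documentation, and the likwid collector in particular requires a much more detailed, hardware specific configuration.

{
  "cpustat": {},
  "memstat": {},
  "loadavg": {}
}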
Router
The router enables renaming, dropping and modifying metrics. Top level configuration attributes (these can usually be left at their defaults; a combined sketch follows this list):
- interval_timestamp: If true, metrics received within the same interval get the same identical time stamp. Default is true.
- num_cache_intervals: Number of intervals that are cached in the router. Default is 1. Set to 0 to disable the router cache.
- hostname_tag: Set a host name different from what is returned by hostname.
- max_forward: Number of metrics read at once from a Golang channel. Default is 50. The option has to be larger than 1. Recommendation: leave at the default!
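A sketch of a router.json combining these top-level attributes with a still empty message processor section could look like this; verify the exact structure against the cc-metric-collector documentation.

{
  "interval_timestamp": true,
  "num_cache_intervals": 1,
  "max_forward": 50,
  "process_messages": {}
}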
Below you find the operations that are supported by the message processor.
Rename metrics
To rename metric names, add a rename_messages section mapping the old metric name to the new name.
"process_messages" : {
"rename_messages" : {
"load_one" : "cpu_load",
"net_bytes_in_bw" : "net_bytes_in",
"net_bytes_out_bw" : "net_bytes_out",
"net_pkts_in_bw" : "net_pkts_in",
"net_pkts_out_bw" : "net_pkts_out",
"ib_recv_bw" : "ib_recv",
"ib_xmit_bw" : "ib_xmit",
"lustre_read_bytes_diff" : "lustre_read_bytes",
"lustre_read_requests_diff" : "lustre_read_requests",
"lustre_write_bytes_diff" : "lustre_write_bytes",
"lustre_write_requests_diff" : "lustre_write_requests",
}
Drop metrics
Sometimes collectors provide a lot of metrics that are not needed. To save data volume, metrics can be dropped. Some collectors also support excluding metrics at the collector level using the exclude_metrics option.
Note
If you are using the cc-metric-store, all metrics that are not configured in its metric list are also silently dropped.
"process_messages" : {
"drop_messages" : [
"load_five",
"load_fifteen",
"proc_run",
"proc_total"
]
}
Normalize unit naming
Enforce a consistent naming of units in metrics. This option should always be set to true, which is the default. The metric value itself is not altered!
"process_messages" : {
"normalize_units": true
}
Change metric unit
The collectors usually do not alter the unit of a metric. To change the unit prefix, set the change_unit_prefix key. The value is automatically scaled correctly, depending on the old unit prefix.
"process_messages" : {
"change_unit_prefix": {
"name == 'mem_used'": "G",
"name == 'swap_used'": "G",
"name == 'mem_total'": "G",
"name == 'swap_total'": "G",
"name == 'cpufreq'": "M"
}
}
Add tags
To add tags set the add_tags_if configuration attribute. The following statement unconditionally sets a cluster name tag for all metrics.
Note
You always want to set the cluster tag if you are using cc-metric-collector within the ClusterCockpit framework!
"process_messages" : {
"add_tags_if": [
{
"key": "cluster",
"value": "alex",
"if": "true"
}
]
}
Sinks
A simple example configuration for two sinks: HTTP cc-metric-store and NATS:
{
"fritzstore": {
"type": "http",
"url": "http://monitoring.nhr.fau.de:8082/api/write?cluster=fritz",
"jwt": "XYZ",
"idle_connection_timeout": "60s"
},
"fritznats": {
"type": "nats",
"host": "monitoring.nhr.fau.de",
"database": "fritz",
"nkey_file": "/etc/cc-metric-collector/nats.nkey",
}
}
All metrics are concurrently sent to all configured sinks.
Note
cc-metric-store only accepts timestamps in seconds.
6 - Setup of cc-backend
Introduction
cc-backend is the main hub within the ClusterCockpit framework. Its configuration consists of the general part in config.json and the cluster configurations in cluster.json files, which are part of the job archive. The job archive is a long-term persistent storage for all job meta and metric data. The job meta data, including job statistics, as well as the user data are stored in a SQL database. Secrets such as passwords and tokens are provided as environment variables. Environment variables can be initialized using a .env file residing in the same directory as cc-backend. If using a .env file, environment variables that are already set take precedence.
Note (cc-backend before v1.5.0)
For versions before v1.5.0 the .env file was the only option to set environment variables, and they could not be set by other means!
Configuration
cc-backend provides a command line switch to generate an initial template for all required configuration files apart from the job archive:
./cc-backend -init
This will create the ./var folder, generate initial versions of the config.json and .env files, and initialize a sqlite database file.
config.json
Below is a production configuration enabling the following functionality:
- Use HTTPS only
- Mark jobs as short jobs if they are shorter than 5 minutes
- Enable authentication and user syncing via an LDAP directory
- Enable initiating a user session via a JWT token, e.g. by an IDM portal
- Drop permissions after the privileged ports are taken
- Use compression for metric data files in the job archive
- Enable re-sampling of time series metric data for long jobs
- Configure three clusters using one local cc-metric-store
- Use a sqlite database (this is the default)
{
"addr": "0.0.0.0:443",
"short-running-jobs-duration": 300,
"ldap": {
"url": "ldaps://hpcldap.rrze.uni-erlangen.de",
"user_base": "ou=people,ou=hpc,dc=rrze,dc=uni-erlangen,dc=de",
"search_dn": "cn=hpcmonitoring,ou=roadm,ou=profile,ou=hpc,dc=rrze,dc=uni-erlangen,dc=de",
"user_bind": "uid={username},ou=people,ou=hpc,dc=rrze,dc=uni-erlangen,dc=de",
"user_filter": "(&(objectclass=posixAccount))",
"sync_interval": "24h"
},
"jwts": {
"syncUserOnLogin": true,
"updateUserOnLogin":true,
"validateUser": false,
"trustedIssuer": "https://portal.hpc.fau.de/",
"max-age": "168h"
},
"https-cert-file": "/etc/letsencrypt/live/monitoring.nhr.fau.de/fullchain.pem",
"https-key-file": "/etc/letsencrypt/live/monitoring.nhr.fau.de/privkey.pem",
"user": "clustercockpit",
"group": "clustercockpit",
"archive": {
"kind": "file",
"path": "./var/job-archive",
"compression": 7,
"retention": {
"policy": "none"
}
},
"enable-resampling": {
"trigger": 30,
"resolutions": [
600,
300,
120,
60
]
},
"emission-constant": 317,
"clusters": [
{
"name": "fritz",
"metricDataRepository": {
"kind": "cc-metric-store",
"url": "http://localhost:8082",
"token": "XYZ"
},
"filterRanges": {
"numNodes": { "from": 1, "to": 64 },
"duration": { "from": 0, "to": 86400 },
"startTime": { "from": "2022-01-01T00:00:00Z", "to": null }
}
},
{
"name": "alex",
"metricDataRepository": {
"kind": "cc-metric-store",
"url": "http://localhost:8082",
"token": "XYZ"
},
"filterRanges": {
"numNodes": { "from": 1, "to": 64 },
"duration": { "from": 0, "to": 86400 },
"startTime": { "from": "2022-01-01T00:00:00Z", "to": null }
}
},
{
"name": "woody",
"metricDataRepository": {
"kind": "cc-metric-store",
"url": "http://localhost:8082",
"token": "XYZ"
},
"filterRanges": {
"numNodes": { "from": 1, "to": 1 },
"duration": { "from": 0, "to": 172800 },
"startTime": { "from": "2020-01-01T00:00:00Z", "to": null }
}
}
]
}
The cluster names have to match the clusters configured in the job-archive. The filter ranges in the cluster configuration affect the filter UI limits in frontend views and should reflect your typical job properties.
Job archive
In case you place the job-archive in the ./var folder, create the folder with:
mkdir -p ./var/job-archive
The job-archive is versioned; the current version is documented in the Release Notes. Currently you have to create the version file manually when initializing the job-archive:
echo 2 > ./var/job-archive/version.txt
Directory layout
ClusterCockpit supports multiple clusters; for each cluster you need to create a directory named after the cluster and a cluster.json file specifying the metric list and hardware partitions within the cluster. Hardware partitions are subsets of a cluster with homogeneous hardware (CPU type, memory capacity, GPUs) that are called subclusters in ClusterCockpit.
For the above configuration the job archive directory hierarchy looks like the following:
./var/job-archive/
version.txt
fritz/
cluster.json
alex/
cluster.json
woody/
cluster.json
cluster.json: Basics
The cluster.json file contains three top level parts: the name of the cluster, the metric configuration, and the subcluster list. You find the latest cluster.json schema here. Basic layout of cluster.json files:
{
"name": "fritz",
"metricConfig": [
{
"name": "cpu_load",
...
},
{
"name": "mem_used",
...
}
],
"subClusters": [
{
"name": "main",
...
},
{
"name": "spr",
...
}
]
}
cluster.json: Metric configuration
Example for a metric list entry with only the required attributes:
"metricConfig": [
{
"name": "flops_sp",
"unit": {
"base": "Flops/s",
"prefix": "G"
},
"scope": "hwthread",
"timestep": 60,
"aggregation": "sum",
"peak": 5600,
"normal": 1000,
"caution": 200,
"alert": 50
}
]
Explanation of required attributes:
- name: The metric name. This must match the metric name in cc-metric-store!
- unit: The metric's unit. Base can be: B (for bytes), F (for flops), B/s, F/s, CPI (for cycles per instruction), IPC (for instructions per cycle), Hz, W (for Watts), °C, or an empty string for no unit. Prefix can be: K, M, G, T, P, or E.
- scope: The native metric measurement resolution. Can be node, socket, memoryDomain, core, hwthread, or accelerator.
- timestep: The measurement frequency in seconds.
- aggregation: How the metric is aggregated within the node topology. Can be one of sum, avg, or an empty string for no aggregation (node level metrics).
- Metric thresholds. Whether a threshold applies to larger or smaller values depends on the optional attribute lowerIsBetter (default false):
  - peak: The maximum possible metric value
  - normal: A common metric value level
  - caution: Metric value requires attention
  - alert: Metric value requiring immediate attention
Optional attributes:
- footprint: Whether this is a job footprint metric. Set to how the footprint is aggregated: can be avg, min, or max. Footprint metrics are shown in the footprint UI component and the job view polar plot.
- energy: Whether the metric should be used to calculate the job energy. Can be power (metric has unit Watts) or energy (metric has unit Joules).
- lowerIsBetter: Whether lower values are better. Influences the frontend UI and the evaluation of metric thresholds.
- subClusters (Type: array of objects): Overwrites for specific subClusters. By default the metrics are valid for all subClusters. It is possible to overwrite or remove metrics for specific subClusters. If a metric is overwritten for a subCluster all attributes have to be set; partial overwrites are not supported. Example for a metric overwrite:
{
"name": "mem_used",
"unit": {
"base": "B",
"prefix": "G"
},
"scope": "node",
"aggregation": "sum",
"footprint": "max",
"timestep": 60,
"lowerIsBetter": true,
"peak": 256,
"normal": 128,
"caution": 200,
"alert": 240,
"subClusters": [
{
"name": "spr1tb",
"footprint": "max",
"peak": 1024,
"normal": 512,
"caution": 900,
"alert": 1000
},
{
"name": "spr2tb",
"footprint": "max",
"peak": 2048,
"normal": 1024,
"caution": 1800,
"alert": 2000
}
]
},
This metric characterizes the memory capacity used by a job. Aggregation for a job is the sum of all node values. As footprint the largest allocated memory capacity is used. For this configuration lowerIsBetter is set, which results in jobs being marked if they exceed the metric thresholds. There exist two subClusters with 1TB and 2TB memory capacity compared to the default 256GB.
Example for removing metrics for a subcluster:
{
"name": "vectorization_ratio",
"unit": {
"base": ""
},
"scope": "hwthread",
"aggregation": "avg",
"timestep": 60,
"peak": 100,
"normal": 60,
"caution": 40,
"alert": 10,
"subClusters": [
{
"name": "icelake",
"remove": true
}
]
}
cluster.json: Subcluster configuration
SubClusters in ClusterCockpit are subsets of a cluster with homogeneous hardware. The subCluster part specifies the node topology, a list of nodes that are part of a subCluster, and the node capabilities that are used to draw the roofline diagrams.
Here is an example:
{
"name": "icelake",
"nodes": "w22[01-35],w23[01-35],w24[01-20],w25[01-20]",
"processorType": "Intel Xeon Gold 6326",
"socketsPerNode": 2,
"coresPerSocket": 16,
"threadsPerCore": 1,
"flopRateScalar": {
"unit": {
"base": "F/s",
"prefix": "G"
},
"value": 432
},
"flopRateSimd": {
"unit": {
"base": "F/s",
"prefix": "G"
},
"value": 9216
},
"memoryBandwidth": {
"unit": {
"base": "B/s",
"prefix": "G"
},
"value": 350
},
"topology": {
"node": [
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71
],
"socket": [
[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35 ],
[ 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71 ]
],
"memoryDomain": [
[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17 ],
[ 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35 ],
[ 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53 ],
[ 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71 ]
],
"core": [
[ 0 ], [ 1 ], [ 2 ], [ 3 ], [ 4 ], [ 5 ], [ 6 ], [ 7 ], [ 8 ], [ 9 ], [ 10 ],
[ 11 ], [ 12 ], [ 13 ], [ 14 ], [ 15 ], [ 16 ], [ 17 ], [ 18 ], [ 19 ], [ 20 ],
[ 21 ], [ 22 ], [ 23 ], [ 24 ], [ 25 ], [ 26 ], [ 27 ], [ 28 ], [ 29 ], [ 30 ],
[ 31 ], [ 32 ], [ 33 ], [ 34 ], [ 35 ], [ 36 ], [ 37 ], [ 38 ], [ 39 ], [ 40 ],
[ 41 ], [ 42 ], [ 43 ], [ 44 ], [ 45 ], [ 46 ], [ 47 ], [ 48 ], [ 49 ], [ 50 ],
[ 51 ], [ 52 ], [ 53 ], [ 54 ], [ 55 ], [ 56 ], [ 57 ], [ 58 ], [ 59 ], [ 60 ],
[ 61 ], [ 62 ], [ 63 ], [ 64 ], [ 65 ], [ 66 ], [ 67 ], [ 68 ], [ 69 ], [ 70 ], [ 71 ]
]
}
}
Since it is tedious to write this by hand, we provide a Perl script as part of cc-backend that generates a subCluster template. This script only works if the LIKWID tools are installed and in the PATH.
The resource ID for cores is the OS processor ID. For GPUs we recommend using the PCI-E address as the resource ID.
Here is an example for a subCluster with GPU accelerators:
{
"name": "a100m80",
"nodes": "a[0531-0537],a[0631-0633],a0731,a[0831-0833],a[0931-0934]",
"processorType": "AMD Milan",
"socketsPerNode": 2,
"coresPerSocket": 64,
"threadsPerCore": 1,
"flopRateScalar": {
"unit": {
"base": "F/s",
"prefix": "G"
},
"value": 432
},
"flopRateSimd": {
"unit": {
"base": "F/s",
"prefix": "G"
},
"value": 9216
},
"memoryBandwidth": {
"unit": {
"base": "B/s",
"prefix": "G"
},
"value": 400
},
"topology": {
"node": [
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100,
101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127
],
"socket": [
[
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
61, 62, 63
],
[
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100,
101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127
]
],
"memoryDomain": [
[
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100,
101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127
]
],
"core": [
[ 0 ], [ 1 ], [ 2 ], [ 3 ], [ 4 ], [ 5 ], [ 6 ], [ 7 ], [ 8 ], [ 9 ], [ 10 ], [ 11 ],
[ 12 ], [ 13 ], [ 14 ], [ 15 ], [ 16 ], [ 17 ], [ 18 ], [ 19 ], [ 20 ], [ 21 ], [ 22 ],
[ 23 ], [ 24 ], [ 25 ], [ 26 ], [ 27 ], [ 28 ], [ 29 ], [ 30 ], [ 31 ], [ 32 ], [ 33 ],
[ 34 ], [ 35 ], [ 36 ], [ 37 ], [ 38 ], [ 39 ], [ 40 ], [ 41 ], [ 42 ], [ 43 ], [ 44 ],
[ 45 ], [ 46 ], [ 47 ], [ 48 ], [ 49 ], [ 50 ], [ 51 ], [ 52 ], [ 53 ], [ 54 ], [ 55 ],
[ 56 ], [ 57 ], [ 58 ], [ 59 ], [ 60 ], [ 61 ], [ 62 ], [ 63 ], [ 64 ], [ 65 ], [ 66 ],
[ 67 ], [ 68 ], [ 69 ], [ 70 ], [ 71 ], [ 72 ], [ 73 ], [ 74 ], [ 75 ], [ 76 ], [ 77 ], [ 78 ],
[ 79 ], [ 80 ], [ 81 ], [ 82 ], [ 83 ], [ 84 ], [ 85 ], [ 86 ], [ 87 ], [ 88 ], [ 89 ],
[ 90 ], [ 91 ], [ 92 ], [ 93 ], [ 94 ], [ 95 ], [ 96 ], [ 97 ], [ 98 ], [ 99 ], [ 100 ],
[ 101 ], [ 102 ], [ 103 ], [ 104 ], [ 105 ], [ 106 ], [ 107 ], [ 108 ], [ 109 ], [ 110 ],
[ 111 ], [ 112 ], [ 113 ], [ 114 ], [ 115 ], [ 116 ], [ 117 ], [ 118 ], [ 119 ], [ 120 ],
[ 121 ], [ 122 ], [ 123 ], [ 124 ], [ 125 ], [ 126 ], [ 127 ]
],
"accelerators": [
{
"id": "00000000:0E:00.0",
"type": "Nvidia GPU",
"model": "A100"
},
{
"id": "00000000:13:00.0",
"type": "Nvidia GPU",
"model": "A100"
},
{
"id": "00000000:49:00.0",
"type": "Nvidia GPU",
"model": "A100"
},
{
"id": "00000000:4F:00.0",
"type": "Nvidia GPU",
"model": "A100"
},
{
"id": "00000000:90:00.0",
"type": "Nvidia GPU",
"model": "A100"
},
{
"id": "00000000:96:00.0",
"type": "Nvidia GPU",
"model": "A100"
},
{
"id": "00000000:CC:00.0",
"type": "Nvidia GPU",
"model": "A100"
},
{
"id": "00000000:D1:00.0",
"type": "Nvidia GPU",
"model": "A100"
}
]
}
}
You have to ensure that the metric collector also uses the PCI-E address as a resource ID.
Environment variables
Secrets are provided in terms of environment variables. The only two required secrets are JWT_PUBLIC_KEY and JWT_PRIVATE_KEY, used for signing generated JWT tokens and validating JWT authentication.
Please refer to the environment reference for details.
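As a sketch, a minimal .env file next to the cc-backend binary could look like the following. The values are placeholders; generate your own Ed25519 key pair as described in the JWT how-to.

# .env (placeholder values, do not use in production)
JWT_PUBLIC_KEY="<base64 encoded Ed25519 public key>"
JWT_PRIVATE_KEY="<base64 encoded Ed25519 private key>"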