Tutorials
1 - Plan overall ClusterCockpit architecture
Introduction
When deploying ClusterCockpit in production, two key architectural decisions need to be made:
- Transport mechanism: How metrics flow from collectors to the metric store (REST API vs NATS)
- Metric store deployment: Where the metric store runs (internal to cc-backend vs external standalone)
This guide helps you understand the trade-offs so you can make informed decisions based on your cluster size, administrative capacity, and requirements. You may also use a mix of the options described below.
Transport: REST API vs NATS
The cc-metric-collector can send metrics to cc-metric-store using either direct HTTP REST API calls or via NATS messaging.
REST API Transport
With REST transport, each collector node sends HTTP POST requests directly to the metric store endpoint.
┌─────────────┐ HTTP POST ┌──────────────────┐
│ Collector │ ─────────────────► │ cc-metric-store │
│ (Node 1) │ │ │
└─────────────┘ │ │
┌─────────────┐ HTTP POST │ │
│ Collector │ ─────────────────► │ │
│ (Node 2) │ └──────────────────┘
└─────────────┘
...
Advantages:
- Simple setup with no additional infrastructure
- Direct point-to-point communication
- Easy to debug and monitor
- Works well for smaller clusters (< 500 nodes)
Disadvantages:
- Each collector needs direct network access to metric store
- No buffering: if metric store is unavailable, metrics are lost
- Scales linearly with node count (many concurrent connections)
- Higher load on metric store during burst scenarios
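As a sketch of what a single REST ingest request looks like, the snippet below builds one sample in InfluxDB line protocol (the wire format used for metric ingestion) and shows the corresponding POST. The URL, token, cluster name, and endpoint path are placeholders; check the cc-metric-store REST API documentation for your version.

```shell
# All names below are placeholders for illustration only.
CCMS_URL="http://ccms.example.org:8082"   # assumed metric store address
JWT="<token>"                             # Ed25519-signed JWT (see Security Considerations)

# One sample in InfluxDB line protocol: metric name, tags, value
PAYLOAD='flops_any,cluster=mycluster,hostname=node001,type=node value=244.0'
echo "$PAYLOAD"

# Hypothetical ingest call (endpoint path may differ between versions):
# curl -s -X POST -H "Authorization: Bearer $JWT" \
#      --data "$PAYLOAD" "$CCMS_URL/api/write/?cluster=mycluster"
```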
NATS Transport
With NATS, collectors publish metrics to a NATS server, and the metric store subscribes to receive them.
┌─────────────┐ ┌─────────────┐
│ Collector │ ──► publish ──► │ │
│ (Node 1) │ │ │
└─────────────┘ │ NATS │ subscribe ┌──────────────────┐
┌─────────────┐ │ Server │ ◄───────────────► │ cc-metric-store │
│ Collector │ ──► publish ──► │ │ └──────────────────┘
│ (Node 2) │ │ │
└─────────────┘ └─────────────┘
...
Advantages:
- Decoupled architecture: collectors and metric store are independent
- Built-in buffering and message persistence
- Better scalability for large clusters (1000+ nodes)
- Supports multiple subscribers (e.g., external metric store for redundancy)
- Collectors continue working even if metric store is temporarily down
- Lower connection overhead (single connection per collector to NATS)
- Integrated key management via NKeys (Ed25519-based authentication):
- No need to generate and distribute JWT tokens to each collector
- Centralized credential management in NATS server configuration
- Support for accounts with fine-grained publish/subscribe permissions
- Credential revocation without redeploying collectors
- Simpler key rotation compared to JWT token redistribution
Disadvantages:
- Additional infrastructure component to deploy and maintain
- More complex initial setup and configuration
- Additional point of failure (NATS server)
- Requires NATS expertise for troubleshooting
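On the collector side, switching to NATS means configuring a NATS sink instead of an HTTP sink in cc-metric-collector. The fragment below is a sketch only; the exact field names are an assumption and must be verified against the cc-metric-collector sink documentation for your version.

```json
{
  "ccms-nats": {
    "type": "nats",
    "host": "nats-server.example.org",
    "port": "4222",
    "subject": "hpc-metrics",
    "user": "collector",
    "password": "..."
  }
}
```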
Recommendation
| Cluster Size | Recommended Transport |
|---|---|
| < 200 nodes | REST API |
| 200-500 nodes | Either (based on preference) |
| > 500 nodes | NATS |
For large clusters or environments requiring high availability, NATS provides better resilience and scalability. For smaller deployments or when minimizing complexity is important, REST API is sufficient.
Metric Store: Internal vs External
The cc-metric-store storage engine can run either integrated within cc-backend (internal) or as a separate standalone service (external).
Internal Metric Store
The metric store runs as part of the cc-backend process, sharing the same configuration and lifecycle.
┌────────────────────────────────────────┐
│ cc-backend │
│ ┌──────────────┐ ┌────────────────┐ │
│ │ Web UI & │ │ metric-store │ │
│ │ GraphQL │ │ (internal) │ │
│ └──────────────┘ └────────────────┘ │
└────────────────────────────────────────┘
│ ▲
▼ │
┌─────────┐ ┌─────────┐
│ Browser │ │Collector│
└─────────┘ └─────────┘
Advantages:
- Single process to deploy and manage
- Unified configuration
- Simplified networking (metrics received on same endpoint)
- Lower resource overhead
- Easier initial setup
Disadvantages:
- Restarting the metric store requires restarting cc-backend, and vice versa
- A cc-backend restart can take more than a minute, since the metric store checkpoints have to be loaded on startup
- Resource contention between web serving and metric ingestion
- No horizontal scaling of metric ingestion
- Single point of failure for entire system
External Metric Store
The metric store runs as a separate process, communicating with cc-backend via its REST API.
┌──────────────────┐ ┌──────────────────┐
│ cc-backend │ ◄─────► │ cc-metric-store │
│ (Web UI/API) │ query │ (external) │
└──────────────────┘ └──────────────────┘
│ ▲
▼ │
┌─────────┐ ┌─────────┐
│ Browser │ │Collector│
└─────────┘ └─────────┘
Advantages:
- Independent scaling and resource allocation
- Can restart the metric store without affecting the web interface, and vice versa
- Enables redundancy with multiple metric store instances
- Better isolation for security and resource management
- Can run on dedicated hardware optimized for in-memory workloads
Disadvantages:
- Two services to deploy and manage
- Separate configuration files
- Additional network communication between components
- More complex setup and monitoring
Recommendation
| Scenario | Recommended Deployment |
|---|---|
| Development/Testing | Internal |
| Small production (< 200 nodes) | Internal |
| Medium production (200-1000 nodes) | Either |
| Large production (> 1000 nodes) | External |
| Resource-constrained head node | External (on dedicated host) |
Security Considerations
Network Exposure
| Component | REST Transport | NATS Transport |
|---|---|---|
| Metric Store | Exposed to all collector nodes | Only exposed to NATS server |
| NATS Server | N/A | Exposed to all collectors and metric stores |
| cc-backend | Exposed to users | Exposed to users |
With NATS, the metric store can be isolated from the cluster network, reducing the attack surface. The NATS server becomes the single point of ingress for metrics. Another option to isolate the web backend from the cluster network is to set up cc-metric-collector proxies.
Authentication
- REST API: Uses JWT tokens (Ed25519 signed). Each collector needs a valid token configured and distributed to it.
- NATS: Supports multiple authentication methods:
- Username/password (simple, suitable for smaller deployments)
- NKeys (Ed25519 key pairs managed centrally in NATS server)
- Credential files (.creds) for decentralized authentication
- Accounts for multi-tenancy with isolated namespaces
NKeys Advantage: With NATS NKeys, authentication keys are managed in the NATS server configuration rather than distributed to each collector. This simplifies credential management significantly:
- Add/remove collectors by editing NATS server config
- Revoke access instantly without touching collector nodes
- No JWT token expiration to manage
- Keys can be scoped to specific subjects (publish-only for collectors)
For both transports, ensure:
- Keys are properly generated and securely stored
- TLS is enabled for production deployments
- Network segmentation isolates monitoring traffic
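One way to generate an Ed25519 key pair for JWT signing is with openssl (1.1.1 or newer). This is a sketch, not the official ClusterCockpit tooling; it extracts the raw 32-byte public key from the DER encoding, base64-encoded as expected by the jwt-public-key setting.

```shell
cd "$(mktemp -d)"                     # throwaway working directory for this demo

# Generate an Ed25519 private key
openssl genpkey -algorithm ed25519 -out jwt-private.pem

# The last 32 bytes of the DER-encoded public key are the raw key;
# base64-encode them for use as "jwt-public-key"
openssl pkey -in jwt-private.pem -pubout -outform DER | tail -c 32 | base64 > jwt-public.b64
cat jwt-public.b64
```

Keep the private key on the token-issuing host only; collectors and the metric store never need it.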
Privilege Separation
Both cc-backend and the external cc-metric-store support dropping
privileges after binding to privileged ports (via user and group
configuration). This limits the impact of potential vulnerabilities.
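In cc-backend this is configured with the top-level user and group options; a minimal sketch (the user and group names are placeholders, and the exact option placement should be checked against the cc-backend configuration reference):

```json
{
  "addr": "0.0.0.0:443",
  "user": "clustercockpit",
  "group": "clustercockpit"
}
```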
Performance Considerations
Memory Usage
The metric store keeps data in memory based on retention-in-memory. Memory
usage scales with:
- Number of nodes
- Number of metrics per node
- Number of hardware scopes (cores, sockets, accelerators)
- Retention duration
- Metric frequency
For a 1000-node cluster with 20 metrics at 60-second intervals and 48-hour retention, expect approximately 10-20 GB of memory usage. For larger setups with many core-level metrics this can increase to up to 100 GB, which must fit into main memory.
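The estimate above can be reproduced with back-of-the-envelope arithmetic. The 8 bytes per sample and the scope multiplier of 32 (accounting for core-level metrics and per-buffer overhead) are illustrative assumptions, not measured constants:

```shell
# Rough memory sizing for the in-memory metric store (assumptions noted above)
NODES=1000; METRICS=20; FREQ=60; RETENTION_H=48; SCOPE=32
SAMPLES=$(( RETENTION_H * 3600 / FREQ ))            # samples kept per metric: 2880
BYTES=$(( NODES * METRICS * SAMPLES * 8 * SCOPE ))  # total buffer bytes
echo "$(( BYTES / 1073741824 )) GiB"                # prints "13 GiB" (order of magnitude only)
```

This lands in the 10-20 GB range quoted above; drop the scope multiplier for a node-scope-only installation.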
CPU Usage
- Internal: Competes with cc-backend web serving
- External: Dedicated resources for metric processing
For clusters with high query load (many users viewing job details), external deployment prevents metric ingestion from impacting user experience.
Disk I/O
Checkpoints are written periodically. For large deployments:
- Use fast storage (SSD) for checkpoint directory
- Consider separate disks for checkpoints and archives
- Monitor disk space for archive growth
- ClusterCockpit supports storing archives on an external S3-compatible object store
Example Configurations
Small Cluster (Internal + REST)
Single cc-backend with internal metric store, collectors using REST:
// cc-backend config
{
"metric-store": {
"retention-in-memory": "48h",
"memory-cap": 100,
"checkpoints": {
"directory": "./var/checkpoints"
}
}
}
Large Cluster (External + NATS)
Separate cc-metric-store with NATS transport:
// cc-metric-store config
{
"main": {
"addr": "0.0.0.0:8080",
"jwt-public-key": "..."
},
"nats": {
"address": "nats://nats-server:4222",
"username": "ccms",
"password": "..."
},
"metric-store": {
"retention-in-memory": "48h",
"memory-cap": 80,
"checkpoints": {
"directory": "/data/checkpoints"
},
"cleanup": {
"mode": "archive",
"directory": "/data/archive"
},
"nats-subscriptions": [
{
"subscribe-to": "hpc-metrics",
"cluster-tag": "mycluster"
}
]
}
}
Decision Checklist
Use this checklist to guide your architecture decision:
- Cluster size: How many nodes need monitoring?
- Availability requirements: Is downtime acceptable?
- Administrative capacity: Can you manage additional services?
- Network topology: Can collectors reach the metric store directly?
- Resource constraints: Is the head node resource-limited?
- Security requirements: Do you need network isolation?
- Growth plans: Will the cluster expand significantly?
For most new deployments, starting with internal metric store and REST transport is recommended. You can migrate to external deployment and/or NATS later as needs evolve.
2 - ClusterCockpit installation manual
Introduction
ClusterCockpit requires the following components:
- A node agent running on all compute nodes that measures the required metrics and forwards all data to a time series metrics database. ClusterCockpit provides its own node agent cc-metric-collector. This is the recommended setup, but ClusterCockpit can also be integrated with other node agents, e.g. collectd, prometheus or telegraf. In this case you have to use them with the accompanying time series database and ensure the metric data is sent or forwarded to cc-backend.
- The API and web interface backend cc-backend. Only one instance of cc-backend is required. It provides the HTTP server at the desired monitoring URL for serving the web interface and also integrates an in-memory metric store.
- A SQL database. The only supported option is the builtin sqlite database of ClusterCockpit. It is recommended to set up LiteStream as a service which performs a continuous replication of the sqlite database to multiple storage backends.
- (Optional) Metric store: One or more cc-metric-store instances. Advantages of using an external cc-metric-store are:
  - Independent scaling and resource allocation
  - Can restart the metric store without affecting the web interface, and vice versa
  - Enables redundancy with multiple metric store instances
  - Better isolation for security and resource management
  - Can run on dedicated hardware optimized for in-memory workloads
- (Optional) NATS message broker: Apart from REST APIs ClusterCockpit also
supports NATS as a way to connect components. Using NATS brings a number of
advantages:
- More flexible deployment and testing. Instances can have different URLs or IP addresses. Test instances are easy to deploy in parallel without a need to touch the configuration.
- NATS comes with builtin, sophisticated token and key management. This also makes it possible to restrict authorization to specific subjects.
- NATS may provide a larger message throughput compared to REST over HTTP.
- Upcoming ClusterCockpit components such as the Energy Manager require NATS.
- A batch job scheduler adapter that provides the job meta information to cc-backend. This is done using the provided REST or NATS API for starting and stopping jobs. Currently available adapters:
  - Slurm: Golang based solution (cc-slurm-adapter) maintained by NHR@FAU. This is the recommended option in case you use Slurm. All functionality in cc-backend is supported.
  - Slurm: Python based solution (cc-slurm-sync) maintained by PC2 Paderborn
  - HTCondor: cc-condor-sync maintained by Saarland University
Server Hardware
cc-backend is threaded and therefore profits from multiple cores. Enough memory is required to hold the metric data cache; for most setups 128 GB should be enough. You can set an upper limit for the memory capacity used by the internal in-memory metric cache. How much memory is required depends on the resource count as well as on the frequency of the timeseries data. Starting with cc-backend v1.5.0 you no longer need a safety margin for memory retention.
It is possible to run it in a virtual machine. For best
performance the ./var folder of cc-backend which contains the sqlite
database file and the file based job archive should be located on a fast storage
device, ideally a NVMe SSD. The sqlite database file and the job archive will
grow over time (if you are not removing old jobs using a retention policy).
Our setup, covering multiple clusters over 5 years, takes 75 GB for the sqlite database and around 1.4 TB for the job archive. In case you have very high job counts, we recommend using a retention policy to keep the database and the job archive at a manageable size. If you archive old jobs, the database can easily be restored later using cc-backend.
It is recommended to run cc-backend as a systemd service. Example systemd unit
files are available in the ClusterCockpit component repositories and in the
cc-examples repository.
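A minimal unit file might look like the following sketch. The paths and the -server flag are assumptions based on the directory layout used in this manual; prefer the tested unit files from the component repositories and cc-examples.

```ini
# Hypothetical /etc/systemd/system/clustercockpit.service -- adapt to your setup
[Unit]
Description=ClusterCockpit web backend (cc-backend)
After=network.target

[Service]
WorkingDirectory=/opt/monitoring/cc-backend
ExecStart=/opt/monitoring/cc-backend/cc-backend -server
Restart=on-failure

[Install]
WantedBy=multi-user.target
```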
Planning and initial configuration
We recommend the following order for planning and configuring a ClusterCockpit installation:
- Decide on overall setup: Initially you have to decide on some fundamental design options about how the components communicate with each other and how the data flows from the compute nodes to the backend.
- Setup your metric list: With two exceptions you are in general free to choose which metrics you want. Those exceptions are mem_bw for main memory bandwidth and flops_any for flop throughput (double precision flops are upscaled to single precision rates). The metric list is an integral component for the configuration of all ClusterCockpit components.
- Planning of deployment
- Configure and deploy the internal metric store and optionally an external cc-metric-store
- Configure and deploy cc-metric-collector
- Configure and deploy cc-backend
- Configure and deploy cc-slurm-adapter or another job scheduler adapter of your choice
You can find complete example production configurations in the cc-examples repository.
Common problems
Up front here is a list with common issues people are facing when installing ClusterCockpit for the first time.
Inconsistent metric names across components
At the moment you need to configure the metric list in every component
separately. In cc-metric-collector the metrics that are sent to the
cc-backend are determined by the collector configuration and possible
renaming in the router configuration.
In cc-backend for every cluster you need to create a cluster.json
configuration in the job-archive. There you set up which metrics are shown in the
web-frontend including many additional properties for the metrics. For running
jobs cc-backend will query the internal metric-store for exactly those
metric names and if there is no match there will be an error.
We provide a JSON schema based specification as part of the job meta and metric
schema. This specification recommends a minimal set of metrics, and we suggest
using the metric names provided there. While it is up to you whether you adhere
to the metric names suggested in the schema, there are two exceptions: mem_bw
(main memory bandwidth) and flops_any (total flop rate with DP flops scaled to
SP flops) are required for the roofline plots to work.
Inconsistent device naming between cc-metric-collector and batch job scheduler adapter
The batch job scheduler adapter (e.g. cc-slurm-adapter) provides a list of
resources that are used by the job. cc-backend will query the internal metric-store
with exactly those resource ids for getting all metrics for a job.
As a consequence, if cc-metric-collector uses a different naming scheme, the
metrics will not be found.
If you have GPU accelerators cc-slurm-adapter should use the PCI-E device
addresses as ids. The option gpuPciAddrs for the nvidia and
rocm-smi collectors in the collector configuration must be configured.
To validate and debug problems you can use the cc-backend or cc-metric-store
debug endpoint:
curl -H "Authorization: Bearer $JWT" -D - "http://localhost:8080/api/debug"
This will return the current state of cc-metric-store. You can search for a
hostname and inspect all topology leaf nodes that are available there.
Missing nodes in subcluster node lists
ClusterCockpit supports multiple subclusters as part of a cluster. A subcluster
in this context is a homogeneous hardware partition with a dedicated metric
and device configuration. cc-backend dynamically matches the nodes a job runs
on to a subcluster node list to figure out on which subcluster a job is running.
If nodes are missing in a subcluster node list this fails and the metric list
used may be wrong.
3 - Decide on metric list
Introduction
Deciding on a sensible and meaningful set of metrics is a deciding factor for how useful the monitoring will be. As part of a collaborative project, several academic HPC centers came up with a minimal set of metrics, including their naming. Using consistent naming is crucial for establishing what metrics mean, and we urge you to adhere to the metric names suggested there. You can find this list as part of the ClusterCockpit job data structure JSON schemas.
ClusterCockpit supports multiple clusters within one instance of cc-backend.
You have to create separate metric lists for each of them. In cc-backend the
metric lists are provided as part of the cluster configuration. Every cluster is
configured as part of the
job archive using one
cluster.json file per cluster.
This how-to describes in detail how to create a cluster.json file.
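As a rough illustration of the shape of such a file, a fragment for one required metric might look like the sketch below. The field set and values are assumptions for illustration; consult the linked how-to and the JSON schema for the authoritative structure.

```json
{
  "name": "mycluster",
  "metricConfig": [
    {
      "name": "flops_any",
      "unit": { "base": "Flops/s" },
      "scope": "node",
      "timestep": 60,
      "aggregation": "sum",
      "peak": 5600,
      "normal": 1000,
      "caution": 200,
      "alert": 50
    }
  ],
  "subClusters": []
}
```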
Required Metrics
Flop throughput rate: flops_any
Memory bandwidth: mem_bw
Memory capacity used: mem_used
Requested cpu core utilization: cpu_load
Total fast network bandwidth: net_bw
Total file IO bandwidth: file_bw
Recommended CPU Metrics
Instructions throughput in cycles: ipc
User active CPU core utilization: cpu_user
Double precision flop throughput rate: flops_dp
Single precision flop throughput rate: flops_sp
Average core frequency: clock
CPU power consumption: rapl_power
Recommended GPU Metrics
GPU utilization: acc_used
GPU memory capacity used: acc_mem_used
GPU power consumption: acc_power
Recommended node level metrics
Ethernet read bandwidth: eth_read_bw
Ethernet write bandwidth: eth_write_bw
Fast network read bandwidth: ic_read_bw
Fast network write bandwidth: ic_write_bw
File system metrics
Warning
A file system metric tree is currently not yet supported in cc-backend.
In the schema a tree of file system metrics is suggested. This allows providing a similar set of metrics for the different file systems used in a cluster. The file system type names suggested are:
- nfs
- lustre
- gpfs
- nvme
- ssd
- hdd
- beegfs
File system read bandwidth: read_bw
File system write bandwidth: write_bw
File system read requests: read_req
File system write requests: write_req
File system inodes used: inodes
File system open and close: accesses
File system file syncs: fsync
File system file creates: create
File system file open: open
File system file close: close
File system file seeks: seek
4 - Deployment
Why we do not provide a docker container
The ClusterCockpit web backend binary has no external dependencies; everything is included in the binary. The external assets, SQL database and job archive, would also be external in a docker setup. The only advantage of a docker setup would be that the initial configuration is automated, but this only needs to be done once. We therefore think that setting up docker, securing and maintaining it is not worth the effort.
It is recommended to install all ClusterCockpit components in a common directory, e.g. /opt/monitoring, /var/monitoring or /var/clustercockpit. In the following we use /opt/monitoring.
The following systemd services run on the central monitoring server:
- clustercockpit: Binary cc-backend in /opt/monitoring/cc-backend.
- (Optional, with external metric store) cc-metric-store: Binary cc-metric-store in /opt/monitoring/cc-metric-store.
ClusterCockpit is deployed as a single binary that embeds all static assets.
We recommend keeping all cc-backend binary versions in a folder archive and
linking the currently active one from the cc-backend root.
This allows for easy roll-back in case something doesn’t work.
Please Note
cc-backend is started with root rights to open the privileged ports (80 and
443). It is recommended to set the configuration options user and group, in
which case cc-backend will drop root permissions once the ports are taken.
You have to take care that the ownership of the ./var folder and
its contents is set accordingly.
You can also run cc-backend behind a reverse proxy. In this case it can be
started as an unprivileged user and the reverse proxy takes care of TLS
encryption. This also makes it possible to automatically show a maintenance
page in case ClusterCockpit is not reachable.
Workflow to deploy a new version
This example assumes you are deploying ClusterCockpit for the first time or the DB and job archive versions did not change between versions.
- Stop the systemd service:
sudo systemctl stop clustercockpit.service
- Back up the sqlite DB file! This is as simple as copying it. You can also use a continuous replication service such as litestream.
- Copy the new cc-backend binary to /opt/monitoring/cc-backend/archive (Tip: Use a date tag like YYYYMMDD-cc-backend). Here is an example:
cp ~/cc-backend /opt/monitoring/cc-backend/archive/20231124-cc-backend
- Link from the cc-backend root to the current version:
ln -s /opt/monitoring/cc-backend/archive/20231124-cc-backend /opt/monitoring/cc-backend/cc-backend
- If the new version requires a database migration, run it before starting the service:
cd /opt/monitoring/cc-backend
./cc-backend -migrate-db
Check the release notes to find out whether a migration is needed.
- Start systemd service:
sudo systemctl start clustercockpit.service
- Check if everything is ok:
sudo systemctl status clustercockpit.service
- Check log for errors:
sudo journalctl -u clustercockpit.service
- Check the ClusterCockpit web frontend and your Slurm adapters if anything is broken!
5 - Setup of cc-metric-store
Introduction
cc-backend integrates an in-memory metric store that is always available. A
standalone cc-metric-store process can additionally be deployed for distributed
setups, redundancy, or dedicated hardware (see
Introduction for a discussion of the trade-offs).
Both share the same storage engine and the same metric-store configuration
keys. Metrics are received via messages using the ClusterCockpit
ccMessage protocol
either via an HTTP REST API or by subscribing to a NATS subject.
Common Metric Store Configuration
The core storage engine is configured identically for both the built-in store in
cc-backend and the standalone cc-metric-store. In cc-backend this block
appears as "metric-store" inside config.json. In the standalone service it
also lives under "metric-store" in its own config.json.
Minimum Configuration
Only two keys are required:
{
"metric-store": {
"retention-in-memory": "48h",
"memory-cap": 100
}
}
With this minimal configuration the following defaults apply:
- Checkpoints use the WAL format stored in
./var/checkpoints - Old data is deleted when it ages out of the retention window
- Worker count is determined automatically
Memory and Retention
retention-in-memory is a Go duration string (e.g., "48h", "168h")
specifying how far back metrics are kept in RAM. Choose a value long enough to
cover the expected duration of typical jobs on your system. Memory footprint
scales with the number of nodes, the number of metrics and their native scopes
(cores, sockets, …), and retention-in-memory divided by the metric frequency.
memory-cap sets the approximate upper limit in GB on the memory used for
metric buffers. When this limit is exceeded, buffers belonging to nodes that are
not referenced by any active job are freed first. Setting it to 0 disables the
cap.
Checkpointing
Metrics are persisted to disk as checkpoints so that in-memory data survives restarts. Checkpoints are always written on a clean shutdown and additionally at a configurable interval during normal operation.
{
"metric-store": {
"retention-in-memory": "48h",
"memory-cap": 100,
"checkpoints": {
"file-format": "wal",
"directory": "./var/checkpoints"
}
}
}
file-format selects the persistence format:
"wal"(default, recommended) — binary snapshot plus a Write-Ahead Log; crash-safe."json"— human-readable periodic snapshots; easier to inspect and recover manually, but larger on disk and slower to write.
directory sets the path where checkpoint files are written (default:
./var/checkpoints).
max-wal-size (optional integer, bytes) limits the size of a single host’s WAL
file. When exceeded the WAL is force-rotated to prevent unbounded disk growth.
0 means unlimited (default). Only relevant for "wal" format.
checkpoint-interval (optional, Go duration string, e.g., "12h") controls
how often periodic checkpoints are written. The default is "12h". The interval
is also derived from retention-in-memory when not set explicitly.
See Checkpoint Formats below for a detailed description of both formats.
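Putting the optional tuning knobs together, a checkpoint configuration might look like the following sketch. The placement of max-wal-size and checkpoint-interval inside the checkpoints block is an assumption here; verify it against the configuration reference.

```json
{
  "metric-store": {
    "retention-in-memory": "48h",
    "memory-cap": 100,
    "checkpoints": {
      "file-format": "wal",
      "directory": "./var/checkpoints",
      "max-wal-size": 104857600,
      "checkpoint-interval": "6h"
    }
  }
}
```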
Data Cleanup
Data that ages out of the in-memory retention window can either be discarded or moved to a long-term Parquet archive:
{
"metric-store": {
"retention-in-memory": "48h",
"memory-cap": 100,
"cleanup": {
"mode": "archive",
"directory": "./var/archive"
}
}
}
mode accepts:
"delete"(default) — old data is discarded."archive"— old data is written as Parquet files underdirectorybefore being freed.directoryis required whenmodeis"archive".
The cleanup runs at the same interval as retention-in-memory. See
Parquet Archive below for details on the file layout.
Performance Tuning
num-workers (optional integer) controls the number of parallel workers used
for checkpoint and archive I/O. The default of 0 enables automatic sizing:
min(NumCPU/2+1, 10). Increase this on I/O-heavy hosts with many cores.
NATS Metric Ingestion
As an alternative to the HTTP REST ingest endpoint, the metric store can receive
metrics via NATS. This requires a top-level nats section (see below for the
standalone service, or the nats section in cc-backend’s config.json for
the built-in store).
Add nats-subscriptions inside metric-store to enable NATS ingestion:
{
"metric-store": {
"retention-in-memory": "48h",
"memory-cap": 100,
"nats-subscriptions": [
{
"subscribe-to": "hpc-nats",
"cluster-tag": "fritz"
}
]
}
}
Each entry specifies:
subscribe-to(required) — the NATS subject to subscribe to.cluster-tag(optional) — a default cluster name applied to incoming messages that do not carry a cluster tag.
External cc-metric-store
The standalone cc-metric-store process requires the common metric-store
block described above plus two additional top-level sections: main (HTTP
server) and metrics (accepted metric list). The optional nats section
enables NATS server connectivity.
Main Section
{
"main": {
"addr": "0.0.0.0:8082",
"jwt-public-key": "kzfYrYy+TzpanWZHJ5qSdMj5uKUWgq74BWhQG6copP0="
}
}
Required fields:
addr— address and port to listen on.jwt-public-key— Base64-encoded Ed25519 public key for verifying JWT tokens on the REST API.
Optional fields:
https-cert-file/https-key-file— enable HTTPS; both must be set.user/group— drop root permissions to this user/group after binding a privileged port.backend-url— URL ofcc-backend(e.g.,https://localhost:8080); enables dynamic memory retention by querying active job information.
See Authentication below for details on JWT usage.
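Combining the optional fields, a hardened main section could look like the following sketch. The certificate paths and the ccms user/group are placeholders; the public key is taken from the example above.

```json
{
  "main": {
    "addr": "0.0.0.0:8082",
    "jwt-public-key": "kzfYrYy+TzpanWZHJ5qSdMj5uKUWgq74BWhQG6copP0=",
    "https-cert-file": "/etc/ssl/certs/ccms.pem",
    "https-key-file": "/etc/ssl/private/ccms.key",
    "user": "ccms",
    "group": "ccms",
    "backend-url": "https://localhost:8080"
  }
}
```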
Metrics Section
The standalone service requires an explicit list of accepted metrics. In
cc-backend this list is derived automatically from the cluster configurations,
but the external service needs it declared.
Only metrics listed here are accepted; all others are silently dropped. The
frequency (seconds) controls the binning interval — if multiple values arrive
within the same interval the most recent one wins; if no value arrives there is
a gap. The aggregation field specifies how values are combined when
synthesising a coarser-scope value from finer-scope measurements (e.g., core →
socket → node):
"sum"— resource metrics whose values add up (bandwidth, flops, power)."avg"— diagnostic metrics that should be averaged (clock frequency, temperature).null— metric is only available at node scope; no cross-scope aggregation.
{
"metrics": {
"clock": { "frequency": 60, "aggregation": "avg" },
"mem_bw": { "frequency": 60, "aggregation": "sum" },
"flops_any": { "frequency": 60, "aggregation": "sum" },
"flops_dp": { "frequency": 60, "aggregation": "sum" },
"flops_sp": { "frequency": 60, "aggregation": "sum" },
"mem_used": { "frequency": 60, "aggregation": null }
}
}
Complete Minimal Example
Combining all required sections for the standalone cc-metric-store:
{
"main": {
"addr": "0.0.0.0:8082",
"jwt-public-key": "kzfYrYy+TzpanWZHJ5qSdMj5uKUWgq74BWhQG6copP0="
},
"metrics": {
"clock": { "frequency": 60, "aggregation": "avg" },
"mem_bw": { "frequency": 60, "aggregation": "sum" },
"flops_any": { "frequency": 60, "aggregation": "sum" },
"flops_dp": { "frequency": 60, "aggregation": "sum" },
"flops_sp": { "frequency": 60, "aggregation": "sum" },
"mem_used": { "frequency": 60, "aggregation": null }
},
"metric-store": {
"retention-in-memory": "48h",
"memory-cap": 100,
"checkpoints": {
"file-format": "wal",
"directory": "./var/checkpoints"
},
"cleanup": {
"mode": "archive",
"directory": "./var/archive"
}
}
}
NATS Server Connection
To receive metrics via NATS, add a top-level nats section with the server
coordinates alongside nats-subscriptions inside metric-store:
{
"nats": {
"address": "nats://localhost:4222",
"username": "user",
"password": "password"
},
"metric-store": {
"nats-subscriptions": [
{
"subscribe-to": "hpc-nats",
"cluster-tag": "fritz"
}
]
}
}
The nats section configures the NATS server connection:
address(required) — URL of the NATS server.username/password(optional) — credentials for authentication.
You can find background information on NATS in this article.
For a complete list of configuration options see the reference.
Checkpoint Formats
The checkpoints.file-format field controls how in-memory data is persisted to
disk.
"json" — Human-readable JSON snapshots written periodically. Each snapshot
is stored as <dir>/<cluster>/<host>/<timestamp>.json. Easy to inspect and
recover manually, but larger on disk and slower to write.
"wal" (recommended) — Binary Write-Ahead Log format designed for crash
safety. Two file types are used per host:
current.wal— append-only binary log; every incoming data point is appended immediately. Truncated trailing records from unclean shutdowns are silently skipped on restart.<timestamp>.bin— binary snapshot written at each checkpoint interval, containing the complete metric state. Written atomically via a.tmprename.
On startup the most recent .bin snapshot is loaded, then remaining WAL entries
are replayed on top. The WAL is rotated after each successful snapshot. The
"wal" format will become the only supported option in a future release. If you
are migrating from an older installation using JSON checkpoints, switch to "wal"
after a clean restart.
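With the default settings shown above, the checkpoint directory for a single
host might look like this (cluster name, host name, and timestamp are
hypothetical):

```
./var/checkpoints/
    fritz/
        node001/
            current.wal
            1700000000.bin
```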
To inspect the contents of .wal or .bin checkpoint files for debugging, use
the binaryCheckpointReader
tool included in cc-backend.
Parquet Archive
When cleanup.mode is "archive", data that ages out of the in-memory
retention window is written to Apache Parquet
files before being freed, organized as:
<cleanup.directory>/
<cluster>/
<timestamp>.parquet
One Parquet file is produced per cluster per cleanup run, consolidating all hosts. Files are compressed with Zstandard and sorted by cluster, hostname, metric, and timestamp for efficient columnar reads.
Authentication
For authentication, signed (but unencrypted) JWT tokens are used. Only Ed25519/EdDSA cryptographic key-pairs are supported. A client has to sign the token with its private key; the server verifies that the configured public key matches the signing key, that the token was not altered after signing, and that the token has not expired. All other token attributes are ignored. Tokens are cached in cc-metric-store to minimise overhead.
We provide an article on how to generate JWT tokens and a background article on JWT usage in ClusterCockpit.
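As a sketch of generating a suitable key pair with the openssl CLI (one
possible approach; the file names are hypothetical, and the linked article
describes the exact key format ClusterCockpit expects):

```shell
# Generate an Ed25519 private key in PEM format (hypothetical file names)
openssl genpkey -algorithm ed25519 -out jwt-private.pem

# Derive the public key the server will use to verify token signatures
openssl pkey -in jwt-private.pem -pubout -out jwt-public.pem
cat jwt-public.pem
```

The client signs tokens with the private key; the server side only ever needs
the public key.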
6 - Setup of cc-metric-collector
Introduction
cc-metric-collector is a node agent for measuring, processing and forwarding
node level metrics. It is currently mostly documented via Markdown documents in
its GitHub repository.
The configuration consists of the following parts:
- collectors: Metric sources. A large number of collectors is available. The most important, and also the most demanding to configure, is the likwid collector for measuring hardware performance counter metrics.
- router: Rename, drop and modify metrics.
- sinks: Configure where to send the metrics.
- receivers: Receive metrics. Useful as a proxy to connect different metric sinks. Can be left empty in most cases.
Build and deploy
Since the cc-metric-collector needs to be installed on every compute node and
requires configuration specific to the node hardware, it is demanding to install
and configure. The Makefile supports generating RPM and DEB packages. A systemd
service file is also included, which you may take as a blueprint.
More information on deployment is available here.
Collectors
You may want to have a look at our collector configuration
which includes configurations for many different systems, Intel and AMD CPUs and
NVIDIA GPUs. The general recommendation is to first decide on the metrics you
need and then figure out which collectors are required. For hardware performance
counter metrics you may want to have a look at likwid-perfctr
performance groups
for inspiration on how to compute the required derived metrics on your
target processor architecture.
Router
The router enables renaming, dropping and modifying metrics. Top-level configuration attributes (can usually be left at their defaults):
- interval_timestamp: If true, metrics received within the same interval get an identical time stamp. Default is true.
- num_cache_intervals: Number of intervals that are cached in the router. Default is 1. Set to 0 to disable the router cache.
- hostname_tag: Set a host name different from what is returned by hostname.
- max_forward: Number of metrics read at once from a Golang channel. Default is 50. Must be larger than 1. Recommendation: leave at the default!
Below you find the operations that are supported by the message processor.
Rename metrics
To rename metrics, add a rename_messages section mapping the old metric
name to the new name.
"process_messages" : {
"rename_messages" : {
"load_one" : "cpu_load",
"net_bytes_in_bw" : "net_bytes_in",
"net_bytes_out_bw" : "net_bytes_out",
"net_pkts_in_bw" : "net_pkts_in",
"net_pkts_out_bw" : "net_pkts_out",
"ib_recv_bw" : "ib_recv",
"ib_xmit_bw" : "ib_xmit",
"lustre_read_bytes_diff" : "lustre_read_bytes",
"lustre_read_requests_diff" : "lustre_read_requests",
"lustre_write_bytes_diff" : "lustre_write_bytes",
"lustre_write_requests_diff" : "lustre_write_requests"
}
}
Drop metrics
Sometimes collectors provide a lot of metrics that are not needed. To save
data volume, metrics can be dropped. Some collectors also support excluding
metrics at the collector level using the exclude_metrics option.
Note
If you are using cc-metric-store, all metrics that are not configured in
its metric list are also silently dropped.
"process_messages" : {
"drop_messages" : [
"load_five",
"load_fifteen",
"proc_run",
"proc_total"
]
}
Normalize unit naming
Enforces consistent naming of units in metrics. This option should always be set to true, which is the default. The metric value itself is not altered!
"process_messages" : {
"normalize_units": true
}
Change metric unit
The collectors usually do not alter the unit of a metric. To change the unit,
set the change_unit_prefix key. The value is automatically scaled correctly,
depending on the old unit prefix.
"process_messages" : {
"change_unit_prefix": {
"name == 'mem_used'": "G",
"name == 'swap_used'": "G",
"name == 'mem_total'": "G",
"name == 'swap_total'": "G",
"name == 'cpufreq'": "M"
}
}
Add tags
To add tags set the add_tags_if configuration attribute. The following
statement unconditionally sets a cluster name tag for all metrics.
Note
You always want to set the cluster tag if you are using cc-metric-collector
within the ClusterCockpit framework!
"process_messages" : {
"add_tags_if": [
{
"key": "cluster",
"value": "alex",
"if": "true"
}
]
}
Sinks
A simple example configuration for two sinks: HTTP cc-metric-store and NATS:
{
"fritzstore": {
"type": "http",
"url": "http://monitoring.nhr.fau.de:8082/api/write?cluster=fritz",
"jwt": "XYZ",
"idle_connection_timeout": "60s"
},
"fritznats": {
"type": "nats",
"host": "monitoring.nhr.fau.de",
"database": "fritz",
    "nkey_file": "/etc/cc-metric-collector/nats.nkey"
}
}
All metrics are sent concurrently to all configured sinks.
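For illustration, a single sample can be pushed manually to the HTTP sink
endpoint. This is a sketch: the InfluxDB line-protocol tags shown (cluster,
hostname, type) and the bearer-token header are assumptions based on the sink
configuration above, not verbatim API documentation.

```shell
# One measurement in InfluxDB line protocol; the timestamp is in seconds,
# which is the resolution cc-metric-store accepts.
TS=$(date +%s)
LINE="cpu_load,cluster=fritz,hostname=node001,type=node value=1.23 ${TS}"
echo "$LINE"

# Hypothetical manual submission (host and JWT taken from the sink config above):
# curl -s -X POST "http://monitoring.nhr.fau.de:8082/api/write?cluster=fritz" \
#   -H "Authorization: Bearer XYZ" --data-binary "$LINE"
```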
Note
cc-metric-store only accepts timestamps in seconds.
7 - Setup of cc-backend
Introduction
cc-backend is the main hub within the ClusterCockpit framework. Its
configuration consists of the general part in config.json and the cluster
configurations in cluster.json files, which are part of the
job archive.
The job archive is a long-term persistent storage for all job meta and metric data.
The job metadata, including job statistics, as well as the user data are stored
in a SQL database. Secrets such as passwords and tokens are provided as environment
variables. Environment variables can be initialized using a .env file residing
in the same directory as cc-backend. When using a .env file, environment
variables that are already set take precedence.
Note (cc-backend before v1.5.0)
For versions before v1.5.0 the .env file was the only option to set
environment variables, and they could not be set by other means!
Configuration
cc-backend provides a command line switch to generate an initial template for
all required configuration files apart from the job archive:
./cc-backend -init
This will create the ./var folder, generate initial version of the
config.json and .env files, and initialize a sqlite database file.
config.json
Below is a production configuration enabling the following functionality:
- Use HTTPS only
- Mark jobs as short jobs if shorter than 5 minutes
- Enable authentication and user syncing via an LDAP directory
- Enable initiating a user session via a JWT token, e.g. by an IDM portal
- Drop permissions after privileged ports are taken
- Enable re-sampling of time-series metric data for long jobs
- Enable NATS for job and metric store APIs
- Set metric in-memory retention to 48h
- Cap the internal metric store's memory at 100GB
- Enable archiving of metric data
- Use S3 as the job archive backend. Note: the file-based archive in
  ./var/job-archive is the default.
Not included below but set by the default settings:
- Use compression for metric data files in job archive
- Allow access to the REST API from all IPs
{
"main": {
"addr": "0.0.0.0:443",
"https-cert-file": "/etc/letsencrypt/live/url/fullchain.pem",
"https-key-file": "/etc/letsencrypt/live/url/privkey.pem",
"user": "clustercockpit",
"group": "clustercockpit",
"short-running-jobs-duration": 300,
"enable-job-taggers": true,
"resampling": {
"minimum-points": 600,
"trigger": 180,
"resolutions": [240, 60]
},
"api-subjects": {
"subject-job-event": "cc.job.event",
"subject-node-state": "cc.node.state"
}
},
"nats": {
"address": "nats://x.x.x.x:4222",
"username": "root",
"password": "root"
},
"auth": {
"jwts": {
"max-age": "2000h"
},
"ldap": {
"url": "ldaps://hpcldap.rrze.uni-erlangen.de",
"user-base": "ou=people,ou=hpc,dc=rz,dc=uni,dc=de",
"search-dn": "cn=hpcmonitoring,ou=roadm,ou=profile,ou=hpc,dc=rz,dc=uni,dc=de",
"user-bind": "uid={username},ou=people,ou=hpc,dc=rrze,dc=uni,dc=de",
"user-filter": "(&(objectclass=posixAccount))",
"sync-interval": "24h"
}
},
"cron": {
"commit-job-worker": "1m",
"duration-worker": "5m",
"footprint-worker": "10m"
},
"archive": {
"kind": "s3",
"endpoint": "http://x.x.x.x",
"bucket": "jobarchive",
"access-key": "xx",
"secret-key": "xx",
"retention": {
"policy": "move",
"age": 365,
"target-path": "./var/archive"
}
},
"metric-store": {
"memory-cap": 100,
"retention-in-memory": "48h",
"cleanup": {
"mode": "archive",
"directory": "./var/archive"
},
"nats-subscriptions": [
{
"subscribe-to": "hpc-nats",
"cluster-tag": "fritz"
},
{
"subscribe-to": "hpc-nats",
"cluster-tag": "alex"
}
]
},
"ui-file": "ui-config.json"
}
The metric-store block configures the built-in in-memory metric store. The
same options are used by the standalone cc-metric-store service. See the
metric store setup tutorial for a full description of
all metric-store configuration options.
Environment variables
Secrets are provided in terms of environment variables. The only two required
secrets are JWT_PUBLIC_KEY and JWT_PRIVATE_KEY, used for signing generated
JWT tokens and validating JWT authentication.
Please refer to the environment reference for details.
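A minimal .env file placed next to cc-backend might therefore look like this
(the values are placeholders; see the environment reference for the exact key
encoding):

```
JWT_PUBLIC_KEY="<public key>"
JWT_PRIVATE_KEY="<private key>"
```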
8 - Setup of cc-slurm-adapter
Introduction
cc-slurm-adapter is a daemon that feeds cc-backend with job information from
Slurm in real-time. It runs on the same node as
slurmctld and queries Slurm via sacct, squeue, sacctmgr, and scontrol.
slurmrestd is not used and not required, but slurmdbd is mandatory.
The adapter periodically synchronises the Slurm job state with cc-backend via
REST API and can optionally publish job events over
NATS. It is designed to be fault-tolerant: if cc-backend
or Slurm is temporarily unavailable no jobs are lost — they are submitted as
soon as everything is running again.
Limitations
scontrol resource allocation information (node lists, CPU/GPU assignment) is
only available for a few minutes after a job finishes. If the adapter is stopped
during that window, the affected jobs will be registered in cc-backend without
resource details, meaning metric data cannot be associated with those jobs. Keep
downtime of the adapter as short as possible.
Compatible Slurm versions: 24.xx.x and 25.xx.x.
Installation
Build from source
git clone https://github.com/ClusterCockpit/cc-slurm-adapter
cd cc-slurm-adapter
make
Copy the resulting binary and a configuration file to a suitable location, for
example /opt/cc-slurm-adapter/. Because the configuration file contains
a JWT token and optionally NATS credentials, restrict its permissions:
install -m 0750 -o cc-slurm-adapter -g slurm \
cc-slurm-adapter /opt/cc-slurm-adapter/
install -m 0640 -o cc-slurm-adapter -g slurm \
config.json /opt/cc-slurm-adapter/
Configuration
The adapter reads a single JSON configuration file specified with -config.
Minimum configuration
Only ccRestUrl and ccRestJwt are required:
{
"ccRestUrl": "https://my-cc-backend.example",
"ccRestJwt": "eyJ..."
}
With these two keys all other options take their default values:
| Key | Default |
|---|---|
| pidFilePath | /run/cc-slurm-adapter/daemon.pid |
| prepSockListenPath | /run/cc-slurm-adapter/daemon.sock |
| prepSockConnectPath | /run/cc-slurm-adapter/daemon.sock |
| lastRunPath | /var/lib/cc-slurm-adapter/lastrun |
| slurmPollInterval | 60 s |
| slurmQueryDelay | 1 s |
| slurmQueryMaxSpan | 604800 s (7 days) |
| slurmMaxRetries | 10 |
| slurmMaxConcurrent | 10 |
| ccPollInterval | 21600 s (6 h) |
| ccRestSubmitJobs | true |
| natsPort | 4222 |
| natsSubject | "jobs" |
Polling and synchronisation
slurmPollInterval (seconds) controls how often the adapter performs a full
Slurm ↔ cc-backend sync. The default of 60 s is a reasonable starting point;
production sites often raise this to 300 s when the Prolog/Epilog hook is in use
(see below) because real-time events already cover most latency requirements.
slurmQueryMaxSpan limits how far back in time the adapter looks for jobs on
each sync. The default of 7 days prevents accidental flooding when the adapter
has been offline for an extended period. Set this to a shorter value (e.g.,
86400 for 24 h) on busy clusters.
ccPollInterval triggers a full query of active jobs from cc-backend to
detect stuck jobs. It does not need to run often; the default of 6 h is usually
fine.
GPU PCI addresses
cc-backend identifies GPU devices by their PCI bus address. The gpuPciAddrs
map associates a hostname regular expression with the ordered list of PCI
addresses for that group of nodes — ordered the same way as NVML (which matches
nvidia-smi output when all devices are visible):
{
"gpuPciAddrs": {
"^node[0-9]{3}$": [
"00000000:01:00.0",
"00000000:25:00.0",
"00000000:41:00.0",
"00000000:61:00.0"
]
}
}
If a cluster has several node groups with different GPU layouts, use one regex entry per group. See the production examples below.
Ignoring hosts
ignoreHosts is a regular expression matched against hostnames. If all
hosts of a job match, the job is discarded and not reported to cc-backend.
Useful to exclude visualisation or login nodes that may appear in Slurm
allocations:
{
"ignoreHosts": "^viznode1$"
}
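Because ignoreHosts is an anchored regular expression, you can sanity-check
which hostnames it matches using grep -E (the hostnames here are made up):

```shell
# Only matching hostnames are printed; node001 does not match and is kept
printf 'viznode1\nnode001\n' | grep -E '^viznode1$'
```

Here only viznode1 is printed, so a job whose hosts are all viznode1 would be
discarded, while jobs touching node001 would still be reported.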
NATS
When a NATS server is configured, the adapter publishes job start and stop
events to the specified subject. cc-backend can then pick these up instead of
waiting for the REST path. See the
NATS background article for context.
{
"natsServer": "nats.example",
"natsPort": 4222,
"natsSubject": "mycluster",
"natsUser": "mycluster",
"natsPassword": "secret"
}
When NATS is used and cc-backend is configured to register jobs via NATS, you
can set ccRestSubmitJobs to false to disable the REST job-submission path
entirely and rely solely on NATS.
For alternative NATS authentication methods:
natsCredsFile— path to a NATS credentials filenatsNKeySeedFile— path to a file containing an NKey seed (private key)
Systemd Service
Create /etc/systemd/system/cc-slurm-adapter.service:
[Unit]
Description=cc-slurm-adapter
Wants=network.target
After=network.target
[Service]
User=cc-slurm-adapter
Group=slurm
ExecStart=/opt/cc-slurm-adapter/cc-slurm-adapter \
-daemon \
-config /opt/cc-slurm-adapter/config.json
WorkingDirectory=/opt/cc-slurm-adapter/
RuntimeDirectory=cc-slurm-adapter
RuntimeDirectoryMode=0750
Restart=on-failure
RestartSec=15s
[Install]
WantedBy=multi-user.target
RuntimeDirectory=cc-slurm-adapter instructs systemd to create and own
/run/cc-slurm-adapter/ which holds the PID file and the Prolog/Epilog Unix
socket. RuntimeDirectoryMode=0750 with Group=slurm allows the slurm user
(which executes Prolog/Epilog scripts) to connect to the socket.
Enable and start the service:
systemctl daemon-reload
systemctl enable --now cc-slurm-adapter
Slurm User Permissions
Depending on your Slurm configuration, an unprivileged user may not be able to
run sacct or scontrol to query all jobs. Grant the cc-slurm-adapter user
operator-level access:
sacctmgr add user cc-slurm-adapter Account=root AdminLevel=operator
Warning
If the required Slurm permissions are not granted, no jobs will be reported to cc-backend.
Prolog/Epilog Hook (Optional)
The periodic sync has a latency up to slurmPollInterval seconds. To reduce
this, configure slurmctld to call the adapter immediately when a job starts or
ends. Add to slurm.conf:
PrEpPlugins=prep/script
PrologSlurmctld=/opt/cc-slurm-adapter/hook.sh
EpilogSlurmctld=/opt/cc-slurm-adapter/hook.sh
Create /opt/cc-slurm-adapter/hook.sh (executable, readable by the slurm
group):
#!/bin/sh
/opt/cc-slurm-adapter/cc-slurm-adapter
exit 0
The script must exit with 0. A non-zero exit code causes Slurm to deny the job allocation. If the adapter is temporarily stopped or being restarted, the Prolog/Epilog call will fail silently (exit 0) and the periodic sync will catch the job on the next tick.
If you changed prepSockConnectPath from its default you must pass -config to
the hook invocation and ensure the configuration file is readable by the slurm
group:
#!/bin/sh
/opt/cc-slurm-adapter/cc-slurm-adapter -config /opt/cc-slurm-adapter/config.json
exit 0
The slurmQueryDelay option (default 1 s) adds a short pause between the
Prolog/Epilog event and the actual Slurm query to give Slurm time to write the
job record. There is generally no need to change this.
Production Examples
CPU-only cluster, no GPUs (woody)
{
"ccRestUrl": "https://monitoring.example",
"ccRestJwt": "eyJ...",
"lastRunPath": "/var/lib/cc-slurm-adapter/lastrun",
"slurmPollInterval": 300,
"slurmQueryMaxSpan": 86400,
"natsServer": "monitoring.example",
"natsSubject": "woody",
"natsUser": "woody",
"natsPassword": "secret"
}
GPU cluster, single node type (fritz)
{
"ccRestUrl": "https://monitoring.example",
"ccRestJwt": "eyJ...",
"lastRunPath": "/var/lib/cc-slurm-adapter/lastrun",
"slurmPollInterval": 300,
"ignoreHosts": "^viznode1$",
"natsServer": "monitoring.example",
"natsSubject": "fritz",
"natsUser": "fritz",
"natsPassword": "secret",
"gpuPciAddrs": {
"^gpunode\\d+$": [
"00000000:CE:00.0",
"00000000:CF:00.0",
"00000000:D0:00.0",
"00000000:D1:00.0"
]
}
}
GPU cluster, multiple node types with different GPU layouts (alex)
Nodes are divided into groups by hostname pattern. Each group has a distinct set of GPU PCI addresses:
{
"ccRestUrl": "https://monitoring.example",
"ccRestJwt": "eyJ...",
"lastRunPath": "/var/lib/cc-slurm-adapter/lastrun",
"slurmPollInterval": 300,
"natsServer": "monitoring.example",
"natsSubject": "alex",
"natsUser": "alex",
"natsPassword": "secret",
"gpuPciAddrs": {
"^(a0[1-4]\\d\\d|a052\\d|a162\\d|a172\\d)$": [
"00000000:01:00.0",
"00000000:25:00.0",
"00000000:41:00.0",
"00000000:61:00.0",
"00000000:81:00.0",
"00000000:A1:00.0",
"00000000:C1:00.0",
"00000000:E1:00.0"
],
"^(a0[6-9]\\d\\d|a053\\d)$": [
"00000000:0E:00.0",
"00000000:13:00.0",
"00000000:49:00.0",
"00000000:4F:00.0",
"00000000:90:00.0",
"00000000:96:00.0",
"00000000:CC:00.0",
"00000000:D1:00.0"
]
}
}
Debugging
cc-slurm-adapter writes all output to stderr, which systemd captures in the
journal:
journalctl -u cc-slurm-adapter -f
To increase verbosity, change the ExecStart line to add -debug 5:
ExecStart=/opt/cc-slurm-adapter/cc-slurm-adapter \
-daemon \
-config /opt/cc-slurm-adapter/config.json \
-debug 5
Log level 5 enables detailed per-job trace output and is useful for diagnosing
why specific jobs are not appearing in cc-backend.
To verify that the adapter can query Slurm correctly, run the following as the
cc-slurm-adapter user:
sacct --allusers --json | head -5
squeue --json | head -5
If either command fails with a permission error, revisit the Slurm user permissions step.