Decide on metric list
Introduction
Choosing a sensible and meaningful set of metrics is a deciding factor for how useful the monitoring will be. As part of a collaborative project, several academic HPC centers agreed on a minimal set of metrics, including their names. Consistent naming is crucial for establishing what metrics mean, and we urge you to adhere to the metric names suggested there. You can find this list as part of the ClusterCockpit job data structure JSON schemas.
ClusterCockpit supports multiple clusters within one instance of cc-backend. You have to create a separate metric list for each of them. In cc-backend the metric lists are provided as part of the cluster configuration. Every cluster is configured as part of the job archive using one cluster.json file per cluster. A separate how-to describes in detail how to create a cluster.json file.
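To make the relationship between a metric list and the cluster configuration more concrete, the following is a minimal sketch that assembles a metric list for one cluster and serializes it as JSON. The field names (`metricConfig`, `unit`, `scope`, `aggregation`, `timestep`), the units, the scopes, and the cluster name `examplecluster` are assumptions made for illustration and are not taken from this page; check them against the ClusterCockpit JSON schemas before using them in a real cluster.json file.

```python
import json


def metric_entry(name, base_unit, prefix="", scope="node",
                 aggregation="avg", timestep=60):
    """Build one metric list entry.

    Hypothetical helper: the key names used here are assumptions and
    must be verified against the ClusterCockpit cluster schema.
    """
    unit = {"base": base_unit}
    if prefix:
        unit["prefix"] = prefix
    return {
        "name": name,
        "unit": unit,
        "scope": scope,
        "aggregation": aggregation,
        "timestep": timestep,
    }


# One metric list per cluster: repeat this for every cluster you configure.
cluster = {
    "name": "examplecluster",  # placeholder cluster name
    "metricConfig": [
        metric_entry("flops_any", "F/s", prefix="G", aggregation="sum"),
        metric_entry("mem_bw", "B/s", prefix="G", aggregation="sum"),
        metric_entry("cpu_load", ""),
    ],
}

cluster_json = json.dumps(cluster, indent=2)
print(cluster_json)
```

Generating the file from a small script like this keeps the metric names in one place, which helps when several clusters share most of their metric list.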
Required Metrics
Flop throughput rate: flops_any
Memory bandwidth: mem_bw
Memory capacity used: mem_used
Requested CPU core utilization: cpu_load
Total fast network bandwidth: net_bw
Total file IO bandwidth: file_bw
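A simple way to guard against an incomplete metric list is to compare the configured metric names with the required ones listed above. The sketch below does this with a set difference; the function name `missing_required` and the example list `configured` are hypothetical and only serve as illustration.

```python
# Required metric names as listed in this section.
REQUIRED_METRICS = {
    "flops_any",
    "mem_bw",
    "mem_used",
    "cpu_load",
    "net_bw",
    "file_bw",
}


def missing_required(metric_names):
    """Return the required metric names absent from metric_names, sorted."""
    return sorted(REQUIRED_METRICS - set(metric_names))


# Hypothetical metric list for one cluster.
configured = ["flops_any", "mem_bw", "mem_used", "cpu_load", "ipc", "clock"]
print(missing_required(configured))  # ['file_bw', 'net_bw']
```

Running such a check once per cluster makes it obvious which required metrics still need to be collected.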
Recommended CPU Metrics
Instruction throughput per cycle: ipc
User active CPU core utilization: cpu_user
Double precision flop throughput rate: flops_dp
Single precision flop throughput rate: flops_sp
Average core frequency: clock
CPU power consumption: rapl_power
Recommended GPU Metrics
GPU utilization: acc_used
GPU memory capacity used: acc_mem_used
GPU power consumption: acc_power
Recommended node level metrics
Ethernet read bandwidth: eth_read_bw
Ethernet write bandwidth: eth_write_bw
Fast network read bandwidth: ic_read_bw
Fast network write bandwidth: ic_write_bw
File system metrics
Warning
A file system metric tree is not yet supported in cc-backend.
The schema suggests a tree of file system metrics. This makes it possible to provide a similar set of metrics for each of the different file systems used in a cluster. The suggested file system type names are:
- nfs
- lustre
- gpfs
- nvme
- ssd
- hdd
- beegfs
File system read bandwidth: read_bw
File system write bandwidth: write_bw
File system read requests: read_req
File system write requests: write_req
File system inodes used: inodes
File system open and close: accesses
File system file syncs: fsync
File system file creates: create
File system file open: open
File system file close: close
File system file seeks: seek
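The idea of a file system metric tree can be illustrated with a nested mapping from file system type to the per-file-system metric names listed above. This is only a sketch: the plain dictionary layout and the function name `fs_metric_tree` are assumptions, the actual structure is defined by the ClusterCockpit JSON schemas, and cc-backend does not yet support such a tree.

```python
# File system type names suggested in this section.
FS_TYPES = {"nfs", "lustre", "gpfs", "nvme", "ssd", "hdd", "beegfs"}

# Per-file-system metric names listed in this section.
FS_METRICS = [
    "read_bw", "write_bw", "read_req", "write_req", "inodes",
    "accesses", "fsync", "create", "open", "close", "seek",
]


def fs_metric_tree(fs_types):
    """Map each file system type to its own copy of the metric names.

    Hypothetical layout for illustration; it rejects type names that
    are not in the suggested list.
    """
    unknown = sorted(set(fs_types) - FS_TYPES)
    if unknown:
        raise ValueError(f"unknown file system types: {unknown}")
    return {fs_type: list(FS_METRICS) for fs_type in fs_types}


# Example: a cluster with a Lustre parallel file system and NFS home directories.
tree = fs_metric_tree(["lustre", "nfs"])
for fs_type, metrics in tree.items():
    print(fs_type, metrics)
```

Because every file system type gets the same metric names, dashboards and queries can treat different file systems uniformly.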