InfluxDB Line Protocol

Specification of the InfluxDB line-protocol flavor used for messaging between ClusterCockpit components, covering metrics, events, and control messages.

Overview

ClusterCockpit uses an InfluxData line-protocol flavor for transferring messages between its components. All messages share the same text-based format:

<measurement>,<tag_set> <field_set> <timestamp>

Where <tag_set> and <field_set> are comma-separated lists of key=value entries. The timestamp is Unix epoch time in seconds.

Message Categories

Three message categories are distinguished by their field key:

Category  Field Key           Field Type     Purpose
Metric    value=<number>      float/integer  Performance metric time series
Event     event="<json>"      string (JSON)  Actionable job and cluster events
Control   control="<string>"  string         Component configuration requests
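A receiver could use the field key to dispatch messages. The sketch below is one possible way to do this in Python. The function name `classify` is hypothetical, and the code assumes that the measurement and tags contain no escaped spaces, so the first space in the line starts the field set.

```python
CATEGORY_BY_FIELD_KEY = {
    "value": "metric",
    "event": "event",
    "control": "control",
}


def classify(line):
    """Return 'metric', 'event', or 'control' based on the first field key."""
    # Assumption: no escaped spaces before the field set, so the first space
    # separates "<measurement>,<tag_set>" from "<field_set> <timestamp>".
    _, _, rest = line.partition(" ")
    field_key = rest.split("=", 1)[0]
    if field_key not in CATEGORY_BY_FIELD_KEY:
        raise ValueError(f"unknown field key: {field_key!r}")
    return CATEGORY_BY_FIELD_KEY[field_key]


print(classify("flops_any,hostname=e1208,type=core,type-id=23 value=1203.3 1740027951"))
```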

NATS Subject Hierarchy

ClusterCockpit uses NATS for messaging. The subject hierarchy lets components subscribe only to the message types they need:

<cluster name>. |
                --- metrics
                |
                --- events.[job, slurm]
                |
                --- control.[get, put]
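The hierarchy can be turned into concrete subject strings with a small helper. This is a sketch under the assumption that the bracket notation above means one subject per listed leaf, for example `<cluster name>.events.job`. The names `SUBJECT_TREE` and `subject` are hypothetical.

```python
# Sub-levels per category, as read from the hierarchy diagram (assumed meaning).
SUBJECT_TREE = {
    "metrics": (),
    "events": ("job", "slurm"),
    "control": ("get", "put"),
}


def subject(cluster, category, leaf=None):
    """Build a NATS subject such as 'alex.events.job'."""
    if category not in SUBJECT_TREE:
        raise ValueError(f"unknown category: {category!r}")
    leaves = SUBJECT_TREE[category]
    if not leaves:
        if leaf is not None:
            raise ValueError(f"{category!r} has no sub-levels")
        return f"{cluster}.{category}"
    if leaf not in leaves:
        raise ValueError(f"{category!r} expects one of {leaves}, got {leaf!r}")
    return f"{cluster}.{category}.{leaf}"


print(subject("alex", "metrics"))
print(subject("alex", "events", "job"))
print(subject("alex", "control", "get"))
```

NATS subscriptions also accept wildcards: `*` matches a single token and `>` matches all remaining tokens, so a subscriber interested in every event class of one cluster could subscribe to `alex.events.*`.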

Tags

Mandatory Tags

Every message, regardless of category, must include:

Tag       Description                      Values
hostname  Source node hostname             e.g., node01
type      Hardware scope                   node, socket, die, memoryDomain, llc, core, hwthread, accelerator
type-id   Component index within the type  e.g., 0, 1, 2

Although type-id is not strictly required when type=node, sending type=node,type-id=0 is recommended for consistency.
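The tag rules above can be checked before a message is sent. The following sketch is one possible validator. The function name `check_mandatory_tags` is hypothetical, and it treats `type-id` as optional only when `type=node`, as described above.

```python
VALID_TYPES = {
    "node", "socket", "die", "memoryDomain",
    "llc", "core", "hwthread", "accelerator",
}


def check_mandatory_tags(tags):
    """Return a list of problems found in a tag dictionary (empty if none)."""
    problems = []
    if not tags.get("hostname"):
        problems.append("missing hostname")
    scope = tags.get("type")
    if scope is None:
        problems.append("missing type")
    elif scope not in VALID_TYPES:
        problems.append(f"unknown type: {scope}")
    # type-id may be omitted for type=node, although sending 0 is recommended.
    if "type-id" not in tags and scope != "node":
        problems.append("missing type-id")
    return problems


print(check_mandatory_tags({"hostname": "node01", "type": "core"}))
```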

Optional Tags

Some message types require additional tags:

  • function — for Event messages: the event purpose, e.g., start_job, stop_job
  • method — for Control messages: GET or PUT

For sub-typing (e.g., filesystem name or device path), use stype and stype-id rather than free-form tag names:

# Preferred
stype=filesystem,stype-id=/homes

# Avoid
filesystem=/homes

Metric Messages

Identification: value=<number> field where the value is a float or integer.

The measurement name is the metric name. While metric names can be chosen freely, the following core metrics should be present in any ClusterCockpit-compatible system:

Metric     Description                                  Unit
flops_sp   Single-precision floating point rate         Flops/s
flops_dp   Double-precision floating point rate         Flops/s
flops_any  Combined floating point rate                 Flops/s
cpu_load   1-minute load average (/proc/loadavg)
mem_used   Memory used by applications (/proc/meminfo)  Bytes
ipc        Instructions per cycle
mem_bw     Main memory bandwidth (read + write)         MB/s
cpu_power  CPU package power consumption                W
mem_power  Memory subsystem power consumption           W
clock      CPU clock frequency                          MHz

For the complete metric list see the job-data schema reference.

Example:

flops_any,hostname=e1208,type=core,type-id=23 value=1203.3 1740027951

For metrics ingested into cc-metric-store (via REST API or NATS), the cluster tag is additionally required:

flops_any,cluster=alex,hostname=e1208,type=core,type-id=23 value=1203.3 1740027951
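To show how a consumer might read such a line and check for the cluster tag, here is a minimal parsing sketch. The names `parse_metric` and `has_cluster_tag` are hypothetical. The parser assumes simple messages with no escaped spaces, commas, or equals signs, and a plain numeric value without a type suffix.

```python
def parse_metric(line):
    """Split a metric line into (measurement, tags, value, timestamp)."""
    # Assumption: no escaped characters, so plain splitting is sufficient.
    head, field_set, timestamp = line.split(" ")
    measurement, *tag_items = head.split(",")
    tags = dict(item.split("=", 1) for item in tag_items)
    key, _, raw_value = field_set.partition("=")
    if key != "value":
        raise ValueError("not a metric message")
    return measurement, tags, float(raw_value), int(timestamp)


def has_cluster_tag(tags):
    """True if the tags carry the cluster tag needed for cc-metric-store."""
    return "cluster" in tags


name, tags, value, ts = parse_metric(
    "flops_any,cluster=alex,hostname=e1208,type=core,type-id=23 value=1203.3 1740027951"
)
print(name, tags["cluster"], value, ts, has_cluster_tag(tags))
```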

Metric Scopes

We distinguish two primary scopes: Hardware Level and Node Level.

Hardware Level Metrics

These metrics track performance of specific sub-components within a node (e.g., a CPU core, GPU, or memory domain). The type-id tag identifies which component instance.

Schema:

<metric>,cluster=<c>,hostname=<h>,type=<component>,type-id=<index> value=<v> <time>

Example hardware types:

  • hwthread: Logical CPU threads. (IDs: 0..127 for Cluster1, 0..71 for Cluster2)
  • socket: Physical CPU sockets. (IDs: 0..1)
  • accelerator: GPUs or FPGA cards. (IDs: PCI Bus Address, e.g., 00000000:49:00.0)
  • memoryDomain: NUMA nodes. (IDs: 0..7)

Examples:

cpu_user,cluster=alex,hostname=a0603,type=hwthread,type-id=12 value=88.5 1725827464
core_power,cluster=fritz,hostname=f0201,type=socket,type-id=0 value=120.0 1725827464

Node Level Metrics

These metrics represent the aggregate state of the entire node. Set type=node; the type-id tag can be omitted or set to 0.

Schema:

<metric>,cluster=<c>,hostname=<h>,type=node value=<v> <time>

Example:

mem_used,cluster=alex,hostname=a0603,type=node value=64000.0 1725827464

Event Messages

Identification: event="<json>" field where the value is a JSON string.

The measurement name indicates the event class. The function tag specifies the purpose (similar to a REST endpoint path).

Event Class  function values
job          start_job, stop_job
slurm        slurm-specific event types

Example:

job,hostname=mngmt02,type=node,type-id=0,function=stop_job event="{\"jobId\": 69, \"cluster\": \"ccfront\", \"stopTime\": 1738842306, \"jobState\": \"completed\"}" 1740027951
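Because the payload travels as a JSON string inside the event="<json>" field, the double quotes inside the JSON have to be escaped when the line is written and unescaped when it is read. The sketch below shows one way to do both in Python. The function names `build_event` and `read_event` are hypothetical, and the tags are assumed to need no escaping.

```python
import json
import re


def build_event(event_class, tags, payload, timestamp):
    """Serialize `payload` to JSON and embed it as a quoted string field."""
    body = json.dumps(payload)
    # String fields escape backslashes first, then double quotes.
    escaped = body.replace("\\", "\\\\").replace('"', '\\"')
    tag_set = ",".join(f"{k}={v}" for k, v in tags.items())
    return f'{event_class},{tag_set} event="{escaped}" {timestamp}'


def read_event(line):
    """Extract and decode the JSON payload of an event line."""
    match = re.search(r' event="(.*)" \d+$', line)
    if match is None:
        raise ValueError("no event field found")
    # Undo the escaping: \" becomes " and \\ becomes \
    body = re.sub(r'\\(["\\])', r"\1", match.group(1))
    return json.loads(body)


payload = {"jobId": 69, "cluster": "ccfront", "stopTime": 1738842306, "jobState": "completed"}
tags = {"hostname": "mngmt02", "type": "node", "type-id": 0, "function": "stop_job"}
line = build_event("job", tags, payload, 1740027951)
print(line)
print(read_event(line)["jobState"])
```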

Control Messages

Identification: control="<string>" field where the value is the control request payload.

The measurement name is the control class. The method tag is either GET or PUT.

Control Class  Description
rapl           CPU power capping (RAPL interface)
freq           CPU frequency control
prefetcher     Hardware prefetcher control
topology       Topology configuration
config         Component configuration

Example:

rapl,hostname=e1208,type=socket,type-id=2,method=GET control="intel.pkg.energy_status" 1740027951
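A control request can be assembled in the same way as the other categories. The following sketch builds a request in the control="<string>" form and rejects methods other than GET and PUT. The function name `build_control` and its parameter names are hypothetical, and the tag values are assumed to need no escaping.

```python
def build_control(control_class, hostname, scope, scope_id, method, request, timestamp):
    """Build a control message; `method` must be GET or PUT."""
    method = method.upper()
    if method not in ("GET", "PUT"):
        raise ValueError(f"method must be GET or PUT, got {method!r}")
    # The request payload is a string field, so it is double-quoted.
    escaped = request.replace("\\", "\\\\").replace('"', '\\"')
    return (
        f"{control_class},hostname={hostname},type={scope},type-id={scope_id},"
        f'method={method} control="{escaped}" {timestamp}'
    )


msg = build_control("rapl", "e1208", "socket", 2, "get", "intel.pkg.energy_status", 1740027951)
print(msg)
```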

To test metric ingestion with synthetic data, use the Metric Generator Script.