InfluxDB Line Protocol

Specification of the InfluxDB line-protocol flavor used for messaging between ClusterCockpit components, covering metrics, events, and control messages.

Overview

ClusterCockpit uses an InfluxData line-protocol flavor for transferring messages between its components. All messages share the same text-based format:

<measurement>,<tag_set> <field_set> <timestamp>

Here, <tag_set> and <field_set> are comma-separated lists of key=value entries, and <timestamp> is Unix epoch time in seconds.
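
As an illustration of the format above, here is a minimal Python sketch that renders one message. The function name `format_line` is hypothetical and not part of any ClusterCockpit component, and the sketch assumes tag and field values need no escaping (no spaces, commas, or quotes).

```python
def format_line(measurement, tags, fields, timestamp):
    """Render `<measurement>,<tag_set> <field_set> <timestamp>`.

    Assumption: keys and values contain no characters that need escaping.
    """
    tag_set = ",".join(f"{key}={value}" for key, value in tags.items())
    field_set = ",".join(f"{key}={value}" for key, value in fields.items())
    return f"{measurement},{tag_set} {field_set} {timestamp}"


line = format_line(
    "flops_any",
    {"hostname": "e1208", "type": "core", "type-id": 23},
    {"value": 1203.3},
    1740027951,  # Unix epoch time in seconds
)
print(line)
# flops_any,hostname=e1208,type=core,type-id=23 value=1203.3 1740027951
```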

Backward Compatibility

Initially, only metrics (numeric values) were sent. The specification was later extended to support messages with other purposes (events and controls). This extension is backward-compatible: metric messages are unchanged.

Message Categories

Three message categories are distinguished by their field key:

Category	Field Key	Field Type	Purpose
Metric	`value=<number>`	float/integer	Performance metric time series
Event	`event="<json>"`	string (JSON)	Actionable job and cluster events
Control	`control="<string>"`	string	Component configuration requests
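
A receiver can tell the categories apart by looking at the field key alone. The sketch below assumes each message carries exactly one field, so the key before the first `=` in the field set decides the category; `category_of` is a hypothetical helper name.

```python
# Field key -> message category, as listed in the table above.
CATEGORY_BY_FIELD_KEY = {
    "value": "metric",
    "event": "event",
    "control": "control",
}


def category_of(field_set):
    """Classify a message by the key of its (single) field."""
    key = field_set.split("=", 1)[0]
    return CATEGORY_BY_FIELD_KEY.get(key, "unknown")


print(category_of("value=1203.3"))
print(category_of('control="intel.pkg.energy_status"'))
```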

NATS Subject Hierarchy

ClusterCockpit uses NATS for messaging. The subject hierarchy lets components subscribe only to the message types they need:

<cluster name>. |
                --- metrics
                |
                --- events.[job, slurm]
                |
                --- control.[get, put]
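
To make the hierarchy concrete, the following sketch spells out the subject names it implies for one cluster and shows how NATS wildcard patterns select among them (`*` matches one token, `>` matches one or more trailing tokens). The exact subject strings are an interpretation of the diagram above, and both helper names are hypothetical. In a real deployment the matching is done by the NATS server, not by client code.

```python
def subjects_for(cluster):
    """Subject names implied by the hierarchy (an interpretation, not confirmed)."""
    return {
        "metrics": f"{cluster}.metrics",
        "events": [f"{cluster}.events.{name}" for name in ("job", "slurm")],
        "control": [f"{cluster}.control.{name}" for name in ("get", "put")],
    }


def subject_matches(pattern, subject):
    """Mimic NATS wildcard matching for illustration only."""
    pattern_tokens = pattern.split(".")
    subject_tokens = subject.split(".")
    for index, token in enumerate(pattern_tokens):
        if token == ">":
            # '>' must cover at least one remaining token.
            return len(subject_tokens) > index
        if index >= len(subject_tokens):
            return False
        if token != "*" and token != subject_tokens[index]:
            return False
    return len(pattern_tokens) == len(subject_tokens)


print(subjects_for("alex")["events"])
print(subject_matches("alex.events.>", "alex.events.job"))
```

A component interested only in job and Slurm events for the cluster `alex` would, under this reading, subscribe to `alex.events.>` and never receive metric traffic.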

Tags

Tag	Description	Values
`hostname`	Source node hostname	e.g., `node01`
`type`	Hardware scope	`node`, `socket`, `die`, `memoryDomain`, `llc`, `core`, `hwthread`, `accelerator`
`type-id`	Component index within the type	e.g., `0`, `1`, `2`
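
A receiver might sanity-check incoming tags against this table. The sketch below assumes that `hostname` and `type` must always be present; that requirement, and the helper name `tag_problems`, are assumptions made for illustration.

```python
# Hardware scopes listed in the table above.
VALID_TYPES = {
    "node", "socket", "die", "memoryDomain",
    "llc", "core", "hwthread", "accelerator",
}


def tag_problems(tags):
    """Return a list of human-readable problems; empty means the tags look fine."""
    problems = []
    if not tags.get("hostname"):
        problems.append("missing hostname")
    if tags.get("type") not in VALID_TYPES:
        problems.append(f"unknown type: {tags.get('type')!r}")
    return problems


print(tag_problems({"hostname": "node01", "type": "core", "type-id": "0"}))
print(tag_problems({"type": "gpu"}))
```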

Metric Messages

Identification: value=<number> field where the value is a float or integer.

The measurement name is the metric name. While metric names can be chosen freely, the following core metrics should be present in any ClusterCockpit-compatible system:

Metric	Description	Unit
`flops_sp`	Single-precision floating point rate	Flops/s
`flops_dp`	Double-precision floating point rate	Flops/s
`flops_any`	Combined floating point rate	Flops/s
`cpu_load`	1-minute load average (`/proc/loadavg`)	—
`mem_used`	Memory used by applications (`/proc/meminfo`)	Bytes
`ipc`	Instructions per cycle	—
`mem_bw`	Main memory bandwidth (read + write)	MB/s
`cpu_power`	CPU package power consumption	W
`mem_power`	Memory subsystem power consumption	W
`clock`	CPU clock frequency	MHz

For the complete metric list see the job-data schema reference.

Example:

flops_any,hostname=e1208,type=core,type-id=23 value=1203.3 1740027951

For metrics ingested into cc-metric-store (via REST API or NATS), the cluster tag is additionally required:

flops_any,cluster=alex,hostname=e1208,type=core,type-id=23 value=1203.3 1740027951
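
The following sketch parses a metric line and checks for the `cluster` tag before the message would be forwarded to cc-metric-store. It is a simplified parser, not the one used by ClusterCockpit: it assumes no escaped spaces or commas and exactly one `value` field. The trailing `i` handling follows the standard InfluxDB notation for integers; whether ClusterCockpit components emit it is not stated here.

```python
def parse_metric(line):
    """Split a metric line into (measurement, tags, value, timestamp)."""
    head, field_set, timestamp = line.split(" ")
    measurement, *tag_pairs = head.split(",")
    tags = dict(pair.split("=", 1) for pair in tag_pairs)
    key, raw = field_set.split("=", 1)
    if key != "value":
        raise ValueError("not a metric message")
    # InfluxDB line protocol marks integers with a trailing 'i' (e.g. 42i).
    value = int(raw[:-1]) if raw.endswith("i") else float(raw)
    return measurement, tags, value, int(timestamp)


def has_cluster_tag(line):
    """True if the metric carries the cluster tag required by cc-metric-store."""
    _, tags, _, _ = parse_metric(line)
    return "cluster" in tags


with_cluster = "flops_any,cluster=alex,hostname=e1208,type=core,type-id=23 value=1203.3 1740027951"
without_cluster = "flops_any,hostname=e1208,type=core,type-id=23 value=1203.3 1740027951"
print(has_cluster_tag(with_cluster), has_cluster_tag(without_cluster))
```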

Metric Scopes

We distinguish two primary scopes: Hardware Level and Node Level.

Hardware Level Metrics

These metrics track performance of specific sub-components within a node (e.g., a CPU core, GPU, or memory domain). The type-id tag identifies which component instance.

Schema:

<metric>,cluster=<c>,hostname=<h>,type=<component>,type-id=<index> value=<v> <time>

Example hardware types:

hwthread: Logical CPU threads. (IDs: 0..127 for Cluster1, 0..71 for Cluster2)
socket: Physical CPU sockets. (IDs: 0..1)
accelerator: GPUs or FPGA cards. (IDs: PCI Bus Address, e.g., 00000000:49:00.0)
memoryDomain: NUMA nodes. (IDs: 0..7)

Examples:

cpu_user,cluster=alex,hostname=a0603,type=hwthread,type-id=12 value=88.5 1725827464
core_power,cluster=fritz,hostname=f0201,type=socket,type-id=0 value=120.0 1725827464

Node Level Metrics

These metrics represent the aggregate state of the entire node. Set type=node; the type-id tag can be omitted or set to 0.

Schema:

<metric>,cluster=<c>,hostname=<h>,type=node value=<v> <time>

Example:

mem_used,cluster=alex,hostname=a0603,type=node value=64000.0 1725827464
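
The two schemas differ only in the `type` and `type-id` tags. A sender could cover both with one helper, sketched below under the assumption that `type-id` is omitted for node-level metrics (the text also allows setting it to 0). The name `metric_line` and its parameters are hypothetical.

```python
def metric_line(metric, cluster, hostname, value, timestamp,
                component="node", index=None):
    """Render a node-level or hardware-level metric message."""
    tags = f"cluster={cluster},hostname={hostname},type={component}"
    if component != "node":
        if index is None:
            raise ValueError("hardware-level metrics need a type-id")
        tags += f",type-id={index}"
    return f"{metric},{tags} value={value} {timestamp}"


# Node level: aggregate for the whole node, no type-id.
print(metric_line("mem_used", "alex", "a0603", 64000.0, 1725827464))
# Hardware level: one hardware thread, identified by type-id.
print(metric_line("cpu_user", "alex", "a0603", 88.5, 1725827464,
                  component="hwthread", index=12))
```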

Event Messages

Identification: event="<json>" field where the value is a JSON string.

The measurement name indicates the event class. The function tag specifies the purpose (similar to a REST endpoint path).

Event Class	`function` values
`job`	`start_job`, `stop_job`
`slurm`	slurm-specific event types

Example:

job,hostname=mngmt02,type=node,type-id=0,function=stop_job event="{\"jobId\": 69, \"cluster\": \"ccfront\", \"stopTime\": 1738842306, \"jobState\": \"completed\"}" 1740027951
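
Because the JSON payload travels as a line-protocol string field, it has to be wrapped in double quotes, with embedded double quotes and backslashes escaped by a backslash. The sketch below shows one way to encode and decode such a payload in Python. All helper names are hypothetical, and the tag values are assumed to need no escaping.

```python
import json
import re


def quote_field_string(text):
    """Wrap text as a line-protocol string field value."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def unquote_field_string(quoted):
    """Reverse quote_field_string: drop the outer quotes, undo backslash escapes."""
    return re.sub(r"\\(.)", r"\1", quoted[1:-1])


def event_line(event_class, tags, payload, timestamp):
    """Render an event message with a JSON payload."""
    tag_set = ",".join(f"{key}={value}" for key, value in tags.items())
    field = quote_field_string(json.dumps(payload))
    return f"{event_class},{tag_set} event={field} {timestamp}"


payload = {"jobId": 69, "cluster": "ccfront",
           "stopTime": 1738842306, "jobState": "completed"}
tags = {"hostname": "mngmt02", "type": "node", "type-id": 0,
        "function": "stop_job"}
print(event_line("job", tags, payload, 1740027951))
```

On the receiving side, `json.loads(unquote_field_string(field))` recovers the original dictionary.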

Control Messages

Identification: control="<string>" field where the value is the control request payload.

The measurement name is the control class. The method tag is either GET or PUT.

Control Class	Description
`rapl`	CPU power capping (RAPL interface)
`freq`	CPU frequency control
`prefetcher`	Hardware prefetcher control
`topology`	Topology configuration
`config`	Component configuration

Example:

rapl,hostname=e1208,type=socket,type-id=2,method=GET control="intel.pkg.energy_status" 1740027951
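
A sender could validate the control class and method before emitting a request, as in the sketch below. It assumes the class list in the table above is exhaustive and that the payload is sent as a quoted string field; `control_line` is a hypothetical helper, not an existing ClusterCockpit function.

```python
CONTROL_CLASSES = {"rapl", "freq", "prefetcher", "topology", "config"}
CONTROL_METHODS = {"GET", "PUT"}


def control_line(control_class, tags, method, payload, timestamp):
    """Render a control message after checking class and method."""
    if control_class not in CONTROL_CLASSES:
        raise ValueError(f"unknown control class: {control_class}")
    if method not in CONTROL_METHODS:
        raise ValueError(f"method must be GET or PUT, got {method}")
    all_tags = {**tags, "method": method}
    tag_set = ",".join(f"{key}={value}" for key, value in all_tags.items())
    # Escape backslashes and double quotes inside the string field.
    escaped = payload.replace("\\", "\\\\").replace('"', '\\"')
    return f'{control_class},{tag_set} control="{escaped}" {timestamp}'


print(control_line(
    "rapl",
    {"hostname": "e1208", "type": "socket", "type-id": 2},
    "GET",
    "intel.pkg.energy_status",
    1740027951,
))
```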

Related Tools

To test metric ingestion with synthetic data, use the Metric Generator Script.
