cc-node-controller

ClusterCockpit Node Controller References

Reference information regarding the ClusterCockpit component “cc-node-controller” (GitHub Repo).

Overview

cc-node-controller is a daemon that runs on each compute node of an HPC cluster. It subscribes to a NATS messaging subject and applies hardware-level control operations — such as setting CPU power limits via RAPL or adjusting GPU power limits via NVML — using the LIKWID sysfeatures library. It is the enforcement layer in the ClusterCockpit energy optimization stack: cc-energy-manager computes optimal power limits and sends them to cc-node-controller for application.

Architecture

flowchart LR
    cem[cc-energy-manager] -->|NATS SET| ccnc[cc-node-controller]
    rc[remoteclient] -->|NATS GET/SET| ccnc
    ccnc -->|LIKWID sysfeatures| hw[CPU / GPU Hardware\nRAPL / NVML]
    ccnc -->|topology response| cem
    ccnc -->|controls list| rc

How It Works

On startup, cc-node-controller:

  1. Initializes the LIKWID sysfeatures subsystem (reads CPU topology from sysfs, loads hardware access libraries).
  2. Reads the configuration file and connects to the NATS server.
  3. Subscribes to the configured requestSubject.
  4. Enters a message loop, processing incoming control messages.

Each message is processed only if its hostname tag matches the node’s own short hostname — messages directed at other nodes are silently ignored, allowing all nodes in a cluster to share a single NATS subject.
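The hostname filter described above can be sketched in a few lines. This is a minimal illustration, not the daemon's actual code: the helper names `short_hostname`, `parse_tags`, and `is_for_this_node` are hypothetical, and the parser assumes that message names and tags contain no escaped commas or spaces.

```python
def short_hostname(hostname: str) -> str:
    """Return the part of a hostname before the first dot."""
    return hostname.split(".", 1)[0]


def parse_tags(line: str) -> dict:
    """Extract the tag set from a simple line-protocol message.

    Assumption: no escaped commas or spaces in the name or the tags.
    """
    head = line.split(" ", 1)[0]        # "<name>,<tag>=<val>,..."
    parts = head.split(",")[1:]         # drop the message name
    return dict(p.split("=", 1) for p in parts if "=" in p)


def is_for_this_node(line: str, own_hostname: str) -> bool:
    """True if the message's hostname tag matches this node's short hostname."""
    return parse_tags(line).get("hostname") == short_hostname(own_hostname)
```

With this filter, a node named `node01.cluster.example.org` would accept a message tagged `hostname=node01` and drop one tagged `hostname=node02`, even though both arrive on the same subject.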

Shutdown is triggered by SIGTERM or SIGINT.

NATS Message API

Messages use the ClusterCockpit line protocol format. The daemon handles three message types, identified by message name.

topology

Returns the hardware topology of the node.

Required tags:

| Tag | Value |
| --- | --- |
| hostname | Target node hostname |
| method | GET |
| type | node |
| type-id | 0 |

Response: A log message with level=INFO whose value is a JSON object:

{
  "hwthreads": [
    { "cpu_id": 0, "socket": 0, "die": 0, "core": 0, "numa_domain": 0, "smt_id": 0 },
    ...
  ],
  "cpu_info": {
    "num_hwthreads": 128,
    "smt_width": 2,
    "num_sockets": 2,
    "num_dies": 2,
    "num_cores": 64,
    "num_numa_domains": 8
  }
}
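A consumer of this response typically needs to know which hardware threads belong to which socket. The following sketch groups the `hwthreads` entries by their `socket` field. The function name `hwthreads_by_socket` and the four-thread sample data are made up for illustration; only the field names come from the response format shown above.

```python
import json
from collections import defaultdict

# Hypothetical, shortened topology response (two sockets, four hardware threads).
SAMPLE_TOPOLOGY = """
{
  "hwthreads": [
    { "cpu_id": 0, "socket": 0, "die": 0, "core": 0, "numa_domain": 0, "smt_id": 0 },
    { "cpu_id": 1, "socket": 1, "die": 1, "core": 1, "numa_domain": 1, "smt_id": 0 },
    { "cpu_id": 2, "socket": 0, "die": 0, "core": 0, "numa_domain": 0, "smt_id": 1 },
    { "cpu_id": 3, "socket": 1, "die": 1, "core": 1, "numa_domain": 1, "smt_id": 1 }
  ],
  "cpu_info": { "num_hwthreads": 4, "smt_width": 2, "num_sockets": 2,
                "num_dies": 2, "num_cores": 2, "num_numa_domains": 2 }
}
"""


def hwthreads_by_socket(topology_json: str) -> dict:
    """Map each socket ID to the sorted list of cpu_ids located on it."""
    topo = json.loads(topology_json)
    by_socket = defaultdict(list)
    for thread in topo["hwthreads"]:
        by_socket[thread["socket"]].append(thread["cpu_id"])
    return {sock: sorted(ids) for sock, ids in sorted(by_socket.items())}
```

The same grouping works for `die`, `core`, or `numa_domain` by swapping the key that is read from each entry.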

controls

Returns the list of hardware controls available on the node, as enumerated by LIKWID sysfeatures.

Required tags: same as topology (hostname, method=GET, type=node, type-id=0)

Response: A log message with level=INFO whose value is a JSON object:

{
  "controls": [
    {
      "category": "rapl",
      "name": "pkg_power_limit1",
      "device_type": "socket",
      "description": "RAPL package power limit 1",
      "methods": "ALL"
    },
    {
      "category": "cpu_freq",
      "name": "cur_cpu_freq",
      "device_type": "hwthread",
      "description": "Current CPU frequency",
      "methods": "GET"
    }
  ]
}

The full control name used in GET/PUT requests is <category>.<name> (e.g. rapl.pkg_power_limit1).
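As a small illustration of the naming rule, the sketch below derives full control names from a controls response. The function `full_control_names` is hypothetical, and the `writable_only` filter rests on an assumption: a control whose `methods` value is `"GET"` is treated as read-only and any other value as writable.

```python
import json

# The controls response shown above, reused as sample input.
SAMPLE_CONTROLS = """
{
  "controls": [
    { "category": "rapl", "name": "pkg_power_limit1", "device_type": "socket",
      "description": "RAPL package power limit 1", "methods": "ALL" },
    { "category": "cpu_freq", "name": "cur_cpu_freq", "device_type": "hwthread",
      "description": "Current CPU frequency", "methods": "GET" }
  ]
}
"""


def full_control_names(controls_json: str, writable_only: bool = False) -> list:
    """Return '<category>.<name>' for every control in the response.

    Assumption: methods == "GET" means the control cannot be written.
    """
    doc = json.loads(controls_json)
    names = []
    for control in doc["controls"]:
        if writable_only and control["methods"] == "GET":
            continue
        names.append(f'{control["category"]}.{control["name"]}')
    return names
```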

<control_name> (GET / PUT)

Reads or writes a specific hardware control value.

Required tags:

| Tag | Value |
| --- | --- |
| hostname | Target node hostname |
| method | GET or PUT |
| type | Device type (see Device Types) |
| type-id | Device ID (integer as string, e.g. "0") |

For PUT requests, the message must also carry the new control value in its value field.

Response: A log message with tag level=INFO on success (value contains the result for GET) or level=ERROR on failure (value contains the error description).

Examples:

# GET: read RAPL package power limit on socket 0
rapl.pkg_power_limit1,hostname=node01,method=GET,type=socket,type-id=0 1234567890

# PUT: set RAPL package power limit on socket 0 to 150W
rapl.pkg_power_limit1,hostname=node01,method=PUT,type=socket,type-id=0 value="150" 1234567890
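For scripted testing, request lines like the two above can be assembled programmatically. The following is a minimal sketch that reproduces exactly the layout of those examples; `build_request` is a hypothetical helper, it performs no escaping, and it leaves the choice of timestamp to the caller.

```python
def build_request(control, hostname, method, dev_type, dev_id, timestamp, value=None):
    """Format a GET or PUT request in the layout of the examples above.

    Assumption: none of the arguments contain commas, spaces, or quotes,
    so no line-protocol escaping is needed.
    """
    if method not in ("GET", "PUT"):
        raise ValueError(f"unsupported method: {method}")
    if method == "PUT" and value is None:
        raise ValueError("PUT requests need a value")
    head = (f"{control},hostname={hostname},method={method},"
            f"type={dev_type},type-id={dev_id}")
    if method == "PUT":
        # The value is sent as a quoted string field, as in the PUT example.
        return f'{head} value="{value}" {timestamp}'
    return f"{head} {timestamp}"
```

Calling `build_request("rapl.pkg_power_limit1", "node01", "PUT", "socket", 0, 1234567890, value=150)` yields the PUT line shown above.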

Device Types

| Type | Description |
| --- | --- |
| node | Whole node / system level |
| hwthread | Logical CPU / hardware thread |
| core | Physical CPU core |
| socket | CPU socket / package |
| die | CPU die |
| memoryDomain | NUMA domain |

The available device types for any given control are reported in the device_type field of the controls response.

Dependencies

  • LIKWID v5.5.0 or newer, compiled with BUILD_SYSFEATURES=true. The shared library liblikwid.so must be available at runtime (set LD_LIBRARY_PATH if installed to a non-standard path).
  • NATS server accessible from every compute node.
  • cc-energy-manager: Computes optimal power limits and sends SET commands to cc-node-controller. Also queries topology to map hardware threads to CPU sockets.
  • cc-metric-collector: Collects per-node hardware metrics (power consumption, instruction rate, etc.) forwarded to cc-energy-manager for optimization decisions.
  • LIKWID: Provides the sysfeatures abstraction layer for hardware control access.

1 - Commands

ClusterCockpit Node Controller Command Line References

Server Daemon

Build

make

Produces the cc-node-controller binary in the repository root. Debian and RPM packages can be built with make DEB and make RPM respectively.

Run

./cc-node-controller [options]

The daemon must run on the compute node it controls (it only processes messages matching its own hostname) and requires access to liblikwid.so.

Options

| Flag | Default | Description |
| --- | --- | --- |
| -config <path> | ./config.json | Path to the JSON configuration file |
| -loglevel <level> | warn | Log verbosity: debug, info, warn, error |
| -pretend | false | Dry-run mode: process messages and log what would happen, but do not apply any hardware changes |

Example

./cc-node-controller -config /etc/cc-node-controller/config.json -loglevel info

Signals

cc-node-controller handles the following UNIX signals for graceful shutdown:

  • SIGTERM — sent by systemd on systemctl stop
  • SIGINT — sent by Ctrl+C

Remote Client (remoteclient)

remoteclient is a command-line utility for interacting with a running cc-node-controller instance over NATS. It is useful for testing, diagnostics, and manual control operations.

Build

go build ./cmd/remoteclient/

Usage

./remoteclient -host <hostname> [options] <operation>

-host is always required.

Options

| Flag | Default | Description |
| --- | --- | --- |
| -host <hostname> | (required) | Short hostname of the target node |
| -server <ip> | 127.0.0.1 | NATS server IP or hostname |
| -port <port> | 4222 | NATS server port |
| -request-subject <subject> | cc-control | NATS subject used by the target node’s cc-node-controller |
| -debug | false | Enable debug output |

Operations

Exactly one operation flag must be specified:

| Flag | Description |
| --- | --- |
| -topology | Print the hardware topology of the target node |
| -list | List all controls available on the target node |
| -get <control>@<type>-<id> | Read the current value of a control |
| -set <control>@<type>-<id>=<value> | Write a new value to a control |

The control address format is <category>.<name>@<device_type>-<device_id>, for example rapl.pkg_power_limit1@socket-0.
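To make the address format concrete, here is a sketch of a parser for it, including the optional `=<value>` suffix used by `-set`. This is not remoteclient's own parsing code; `parse_address` and the regular expression are illustrative, and they assume that device types are purely alphabetic and device IDs are non-negative integers.

```python
import re

# <category>.<name>@<device_type>-<device_id>[=<value>]
ADDRESS_RE = re.compile(
    r"^(?P<category>[^.@=]+)\.(?P<name>[^@=]+)"
    r"@(?P<device_type>[A-Za-z]+)-(?P<device_id>\d+)"
    r"(?:=(?P<value>.*))?$"
)


def parse_address(address: str) -> dict:
    """Split a control address into its parts; 'value' is None for reads."""
    match = ADDRESS_RE.match(address)
    if match is None:
        raise ValueError(f"invalid control address: {address!r}")
    parts = match.groupdict()
    parts["device_id"] = int(parts["device_id"])
    return parts
```

For `rapl.pkg_power_limit1@socket-0=150` this returns category `rapl`, name `pkg_power_limit1`, device type `socket`, device ID `0`, and value `"150"`.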

Examples

# List hardware topology
./remoteclient -host node01 -topology

# List available controls
./remoteclient -host node01 -list

# Read RAPL package power limit on socket 0
./remoteclient -host node01 -get rapl.pkg_power_limit1@socket-0

# Set RAPL package power limit on socket 0 to 150 W
./remoteclient -host node01 -set rapl.pkg_power_limit1@socket-0=150

# Connect to a remote NATS server
./remoteclient -host node01 -server nats.example.org -port 4222 -list

2 - Configuration

ClusterCockpit Node Controller Configuration Option References

Configuration is provided as a flat JSON file. The default path is ./config.json in the working directory; an alternative path can be specified with the -config flag.

Fields

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| server | string | yes | | IP address or hostname of the NATS server |
| port | integer | yes | | Port of the NATS server |
| requestSubject | string | yes | | NATS subject to subscribe to for incoming control requests. All cc-node-controller instances in a cluster can share the same subject; each daemon only processes messages directed at its own hostname. |
| user | string | no | | Username for NATS basic authentication |
| password | string | no | | Password for NATS basic authentication |
| credsFile | string | no | | Path to a NATS credentials file (for NKey/JWT-based authentication) |
| nkeySeedFile | string | no | | Path to an NKey seed file |
| outstandingMessagesInQueue | integer | no | 1000 | Size of the internal channel buffer for incoming NATS messages |

Minimal Example

{
    "server": "127.0.0.1",
    "port": 4222,
    "requestSubject": "cc-control"
}

Full Example

{
    "server": "nats.example.org",
    "port": 4222,
    "requestSubject": "clustercockpit.control",
    "user": "ccnc-user",
    "password": "s3cr3t",
    "outstandingMessagesInQueue": 500
}
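Before rolling a configuration out to many nodes, it can help to check it against the field table. The sketch below does this for the required fields and the one documented default. It is a pre-deployment check under stated assumptions, not the daemon's loader: `load_config` is a hypothetical name, and how the daemon itself reacts to missing fields is not covered here.

```python
import json

REQUIRED_FIELDS = ("server", "port", "requestSubject")
DEFAULTS = {"outstandingMessagesInQueue": 1000}


def load_config(text: str) -> dict:
    """Parse a config document, check required fields, and fill in defaults."""
    cfg = json.loads(text)
    missing = [key for key in REQUIRED_FIELDS if key not in cfg]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    # Values from the file override the documented defaults.
    return {**DEFAULTS, **cfg}
```

Applied to the minimal example, the result contains `outstandingMessagesInQueue` set to 1000; applied to the full example, the explicit value 500 is kept.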

Authentication

cc-node-controller supports three NATS authentication methods. Only one should be configured at a time:

Basic authentication (user + password):

{
    "server": "nats.example.org",
    "port": 4222,
    "requestSubject": "cc-control",
    "user": "ccnc-user",
    "password": "s3cr3t"
}

Credentials file (NKey + JWT, generated by nsc):

{
    "server": "nats.example.org",
    "port": 4222,
    "requestSubject": "cc-control",
    "credsFile": "/etc/cc-node-controller/nats.creds"
}

NKey seed file:

{
    "server": "nats.example.org",
    "port": 4222,
    "requestSubject": "cc-control",
    "nkeySeedFile": "/etc/cc-node-controller/nkey.seed"
}
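Since only one authentication method should be configured at a time, a configuration check can flag files that mix them. The following sketch is one way to do that; `check_auth` and the method labels `basic`, `creds`, and `nkey` are hypothetical, and whether the daemon itself rejects such configurations is not stated here.

```python
def check_auth(cfg: dict):
    """Return the configured auth method ('basic', 'creds', 'nkey') or None.

    Raises ValueError if several methods are mixed or basic auth is incomplete.
    """
    methods = []
    if "user" in cfg or "password" in cfg:
        methods.append("basic")
    if "credsFile" in cfg:
        methods.append("creds")
    if "nkeySeedFile" in cfg:
        methods.append("nkey")

    if len(methods) > 1:
        raise ValueError("multiple auth methods configured: " + ", ".join(methods))
    if methods == ["basic"] and not ("user" in cfg and "password" in cfg):
        raise ValueError("basic authentication needs both user and password")
    return methods[0] if methods else None
```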

Request Subject and Multi-Cluster Setup

All cc-node-controller instances in a cluster can share the same requestSubject. Each daemon filters incoming messages by hostname and silently ignores messages intended for other nodes.

For multi-cluster deployments, use a different requestSubject per cluster, or use NATS subject namespacing (e.g. cc-control.fritz, cc-control.alex). The cc-energy-manager controller section uses a %c placeholder in its requestSubject to inject the cluster name automatically — the value here must match the expanded form for the respective cluster.
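The relationship between the two settings can be checked with a short sketch. The template `cc-control.%c` and the helper names are assumptions for illustration; the actual cc-energy-manager setting may use a different template, and its expansion rules are only approximated here by a plain string substitution.

```python
def expand_subject(template: str, cluster: str) -> str:
    """Replace the %c placeholder with the cluster name."""
    return template.replace("%c", cluster)


def subjects_match(template: str, cluster: str, node_request_subject: str) -> bool:
    """True if a node's requestSubject equals the expanded template."""
    return expand_subject(template, cluster) == node_request_subject
```

With the hypothetical template `cc-control.%c`, nodes of the cluster `fritz` would need `"requestSubject": "cc-control.fritz"` in their configuration.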