Configuration
Categories:
Configuration is provided as a JSON file. The default path is ./config.json in the working directory; an alternative path can be specified with the -config flag.
The configuration has four required top-level sections: receivers, sinks, controller, and clusters.
Receivers Section
A named map of receiver configurations. Each receiver defines a source from which cc-energy-manager ingests metric messages.
"receivers": {
"<name>": {
"type": "nats",
"address": "nats-server.example.org",
"port": "4222",
"subject": "metrics.subject"
}
}
Fields (NATS receiver):
type(string, required): Receiver type. Currently"nats".address(string, required): Hostname or IP of the NATS server.port(string, required): Port of the NATS server.subject(string, required): NATS subject to subscribe to.
Sinks Section
A named map of sink configurations. Each sink defines a destination for processed metric messages.
"sinks": {
"<name>": {
"type": "stdout",
"meta_as_tags": []
}
}
Fields (stdout sink):
type(string, required): Sink type. Examples:"stdout","influxasync".meta_as_tags(array of strings, optional): Metadata fields to promote to tags.
Controller Section
Configuration for the connection to cc-node-controller, which applies power limit changes on compute nodes.
"controller": {
"nats": {
"url": "nats://nats-server.example.org:4222",
"requestSubject": "cc-node-controller.%c.request"
},
"toposMaxAge": 86400
}
Fields:
nats(object, required): NATS connection settings forcc-node-controller.url(string, required): NATS server URL, e.g."nats://localhost:4222".requestSubject(string, required): NATS subject for sending control commands. Use%cas a placeholder for the cluster name — it is substituted at runtime for each cluster (e.g."cc-node-controller.%c.request"becomes"cc-node-controller.fritz.request"for clusterfritz).
toposMaxAge(integer, optional): How long to cache node hardware topology data in seconds. Default:86400(1 day).
Clusters Section
An array of cluster configurations. Each cluster defines which nodes to manage and how to optimize their power limits.
"clusters": [
{
"name": "fritz",
"powerBudgetTotal": 10000,
"partitionRegex": "^energy_efficient$",
"subclusters": [ ... ]
}
]
Cluster fields:
name(string, required): Cluster identifier. Must match the cluster name used in metric tags.powerBudgetTotal(number, required): Total power budget for this cluster in watts. Used for proportional budget distribution across device types.partitionRegex(string, required): Regular expression matched against the job’s partition/queue name. Only jobs on matching partitions are managed.subclusters(array, required): List of subcluster configurations (see below).
Subcluster Configuration
A subcluster groups nodes within a cluster that share the same hardware configuration.
{
"name": "main",
"hostRegex": "^f\\d\\d\\d\\d$",
"devicetypes": { ... }
}
Fields:
name(string, required): Subcluster identifier.hostRegex(string, required): Regular expression matched against node hostnames to assign nodes to this subcluster.devicetypes(object, required): Map of device type name to device type configuration (see below). Supported keys:"socket","nvidia_gpu","amd_gpu".
Device Type Configuration
Each entry in devicetypes configures how cc-energy-manager optimizes a specific hardware device type.
"socket": {
"scope": "node",
"aggregator": {
"type": "last",
"powerMetric": "cpu_energy",
"performanceMetric": "ips",
"deviceType": "socket"
},
"controlName": "rapl.pkg_power_limit1",
"controlDefaultValue": 300,
"intervalConverged": "10m",
"intervalSearch": "2m",
"powerBudgetWeight": 1,
"optimizer": {
"type": "gssng",
"tolerance": 5,
"borders": {
"lower": 123,
"upper": 800
}
}
}
Fields:
scope(string, required): Optimization granularity. One of:"job"— single optimizer shared across all devices of the job"node"— one optimizer per node, applied to all devices on that node"device"— independent optimizer per device on each node
aggregator(object, required): Metric aggregation configuration.type(string, required): Aggregation strategy:"last"(most recent value) or"median"(median over time window).powerMetric(string, required): Name of the power metric to track (e.g."cpu_energy","acc_power").performanceMetric(string, required): Name of the performance proxy metric (e.g."ips"for instructions per second,"kernels"for CUDA kernel count).deviceType(string, required): Device type from which to read metrics.
controlName(string, required): RAPL or NVML control name used to set the power limit. Examples:"rapl.pkg_power_limit1"— RAPL package power limit for CPU sockets"nvml.power_limit"— NVML power limit for NVIDIA GPUs
controlDefaultValue(number, required): Power limit in watts applied when no optimization is active (e.g. when a job ends).intervalConverged(string, required): How often to run optimization after convergence. Parsed as a Go duration string (e.g."10m","5m").intervalSearch(string, required): How often to run optimization during active search. Should be shorter thanintervalConverged(e.g."2m").powerBudgetWeight(number, required): Relative weight for budget allocation when multiple device types sharepowerBudgetTotal. A device type with weight 2 receives twice the budget fraction of a device type with weight 1.optimizer(object, required): Optimizer algorithm configuration.type(string, required): Algorithm type."gssng"(Golden Section Search with Narrowing/Broadening, recommended) or"gss"(plain Golden Section Search).tolerance(number, required): Convergence tolerance in watts. The optimizer considers itself converged when the search interval is smaller than this value.borders(object, required): Power limit bounds.lower(number, required): Minimum allowed power limit in watts.upper(number, required): Maximum allowed power limit in watts.
Complete Example
The following example configures two clusters: fritz (CPU-only nodes with socket-level optimization) and alex (GPU nodes with both GPU and CPU optimization).
{
"receivers": {
"testnats": {
"type": "nats",
"address": "nats-server.example.org",
"port": "4222",
"subject": "subject"
}
},
"sinks": {
"testoutput": {
"type": "stdout",
"meta_as_tags": []
}
},
"controller": {
"nats": {
"url": "nats://nats-server.example.org:4222",
"requestSubject": "cc-node-controller.%c.request"
},
"toposMaxAge": 86400
},
"clusters": [
{
"name": "fritz",
"powerBudgetTotal": 10000,
"partitionRegex": "^energy_efficient$",
"subclusters": [
{
"name": "main",
"hostRegex": "^f\\d\\d\\d\\d$",
"devicetypes": {
"socket": {
"scope": "node",
"aggregator": {
"type": "last",
"powerMetric": "cpu_energy",
"performanceMetric": "ips",
"deviceType": "socket"
},
"controlName": "rapl.pkg_power_limit1",
"controlDefaultValue": 300,
"intervalConverged": "10m",
"intervalSearch": "2m",
"powerBudgetWeight": 1,
"optimizer": {
"type": "gssng",
"tolerance": 5,
"borders": {
"lower": 123,
"upper": 800
}
}
}
}
}
]
},
{
"name": "alex",
"powerBudgetTotal": 20000,
"partitionRegex": "^only_this_partition_please$",
"subclusters": [
{
"name": "a100",
"hostRegex": "^a\\d\\d\\d\\d$",
"devicetypes": {
"nvidia_gpu": {
"scope": "device",
"aggregator": {
"type": "last",
"powerMetric": "acc_power",
"performanceMetric": "kernels",
"deviceType": "nvidia_gpu"
},
"controlName": "nvml.power_limit",
"controlDefaultValue": 250,
"intervalConverged": "5m",
"intervalSearch": "2m",
"powerBudgetWeight": 2,
"optimizer": {
"type": "gssng",
"tolerance": 5,
"borders": {
"lower": 123,
"upper": 800
}
}
},
"socket": {
"scope": "job",
"aggregator": {
"type": "median",
"powerMetric": "cpu_energy",
"performanceMetric": "ips",
"deviceType": "nvidia_gpu"
},
"controlName": "rapl.pkg_power_limit1",
"controlDefaultValue": 100,
"intervalConverged": "5m",
"intervalSearch": "2m",
"powerBudgetWeight": 1,
"optimizer": {
"type": "gssng",
"tolerance": 5,
"borders": {
"lower": 123,
"upper": 800
}
}
}
}
}
]
}
]
}
Feedback
Was this page helpful?
Glad to hear it! Please tell us how we can improve.
Sorry to hear that. Please tell us how we can improve.