cc-metric-collector
ClusterCockpit Metric Collector References
Reference information regarding the ClusterCockpit component “cc-metric-collector” (GitHub Repo).
Overview
cc-metric-collector is a node agent for measuring, processing and forwarding node level metrics. It is part of the ClusterCockpit ecosystem.
The metric collector sends (and receives) metrics in the InfluxDB line protocol, which offers flexibility while keeping a separation between tags (like index columns in relational databases) and fields (like data columns).
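For illustration, a single measurement in line protocol (hypothetical values) carries the metric name, the tags, the fields, and a nanosecond timestamp:
cpu_used,hostname=node01,type=hwthread,type-id=0 value=42.5 1700000000000000000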
Key Features
- Modular Architecture: Flexible plugin-based system with collectors, sinks, receivers, and router
- Multiple Data Sources: Collect metrics from various sources (procfs, sysfs, hardware libraries, custom commands)
- Flexible Output: Send metrics to multiple sinks simultaneously (InfluxDB, Prometheus, NATS, etc.)
- On-the-fly Processing: Router can tag, filter, aggregate, and transform metrics before forwarding
- Network Receiver: Accept metrics from other collectors to create hierarchical setups
- Low Overhead: Efficient serial collection with single timestamp per interval
Architecture
There is a single timer loop that triggers all collectors serially, collects the data and sends the metrics to the configured sinks. This ensures all data is submitted with a single timestamp. The sinks currently use mostly blocking APIs.
The receiver runs as a goroutine side-by-side with the timer loop and asynchronously forwards received metrics to the sinks.
flowchart LR
subgraph col ["Collectors"]
direction TB
cpustat["cpustat"]
memstat["memstat"]
tempstat["tempstat"]
misc["..."]
end
subgraph Receivers ["Receivers"]
direction TB
nats["NATS"]
httprecv["HTTP"]
miscrecv["..."]
end
subgraph calc["Aggregator"]
direction LR
cache["Cache"]
agg["Calculator"]
end
subgraph sinks ["Sinks"]
direction RL
influx["InfluxDB"]
ganglia["Ganglia"]
logger["Logfile"]
miscsink["..."]
end
cpustat --> CollectorManager["CollectorManager"]
memstat --> CollectorManager
tempstat --> CollectorManager
misc --> CollectorManager
nats --> ReceiverManager["ReceiverManager"]
httprecv --> ReceiverManager
miscrecv --> ReceiverManager
CollectorManager --> newrouter["Router"]
ReceiverManager -.-> newrouter
calc -.-> newrouter
newrouter --> SinkManager["SinkManager"]
newrouter -.-> calc
SinkManager --> influx
SinkManager --> ganglia
SinkManager --> logger
SinkManager --> miscsink
Components
- Collectors: Read data from local system sources (files, commands, libraries) and send to router
- Router: Process metrics by caching, filtering, tagging, renaming, and aggregating
- Sinks: Send metrics to storage backends (InfluxDB, Prometheus, NATS, etc.)
- Receivers: Accept metrics from other collectors via network (HTTP, NATS) and forward to router
The key difference between collectors and receivers is that collectors are called periodically while receivers run continuously and submit metrics at any time.
Supported Metrics
Supported metrics are documented in the cc-specifications.
Deployment Scenarios
The metric collector was designed with flexibility in mind, so it can be used in many scenarios:
Direct to Database
flowchart TD
subgraph a ["Cluster A"]
nodeA[NodeA with CC collector]
nodeB[NodeB with CC collector]
nodeC[NodeC with CC collector]
end
a --> db[(Database)]
db <--> ccweb("Webfrontend")
Hierarchical Collection
flowchart TD
subgraph a [ClusterA]
direction LR
nodeA[NodeA with CC collector]
nodeB[NodeB with CC collector]
nodeC[NodeC with CC collector]
end
subgraph b [ClusterB]
direction LR
nodeD[NodeD with CC collector]
nodeE[NodeE with CC collector]
nodeF[NodeF with CC collector]
end
a --> ccrecv{"CC collector as receiver"}
b --> ccrecv
ccrecv --> db[("Database1")]
ccrecv -.-> db2[("Database2")]
db <-.-> ccweb("Webfrontend")
1 - Configuration
cc-metric-collector Configuration Reference
Configuration Overview
The configuration of cc-metric-collector consists of five configuration files: one global file and four component-related files.
Configuration is implemented using a single JSON document that can be distributed over the network and persisted as a file.
Global Configuration File
The global file contains paths to the other four component files and some global options.
Default location: /etc/cc-metric-collector/config.json (can be overridden with -config flag)
Example
{
"sinks-file": "/etc/cc-metric-collector/sinks.json",
"collectors-file": "/etc/cc-metric-collector/collectors.json",
"receivers-file": "/etc/cc-metric-collector/receivers.json",
"router-file": "/etc/cc-metric-collector/router.json",
"main": {
"interval": "10s",
"duration": "1s"
}
}
Note: Paths are relative to the execution folder of the cc-metric-collector binary, so it is recommended to use absolute paths.
Configuration Reference
| Config Key | Type | Default | Description |
|---|---|---|---|
| sinks-file | string | - | Path to sinks configuration file (relative or absolute) |
| collectors-file | string | - | Path to collectors configuration file (relative or absolute) |
| receivers-file | string | - | Path to receivers configuration file (relative or absolute) |
| router-file | string | - | Path to router configuration file (relative or absolute) |
| main.interval | string | 10s | How often metrics should be read and sent to sinks. Parsed using time.ParseDuration() |
| main.duration | string | 1s | How long one measurement should take. Important for collectors like likwid that measure over time. |
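Both main.interval and main.duration accept any duration string that Go’s time.ParseDuration() understands. A minimal standalone sketch of the accepted formats:
package main

import (
	"fmt"
	"time"
)

func main() {
	// The same parser used for main.interval and main.duration.
	for _, s := range []string{"10s", "1m30s", "500ms"} {
		d, err := time.ParseDuration(s)
		if err != nil {
			fmt.Printf("%s: invalid duration: %v\n", s, err)
			continue
		}
		fmt.Printf("%s parses to %v\n", s, d)
	}
}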
Instead of separate files, you can embed component configurations directly:
{
"sinks": {
"mysink": {
"type": "influxasync",
"host": "localhost",
"port": "8086"
}
},
"collectors": {
"cpustat": {}
},
"receivers": {},
"router": {
"interval_timestamp": false
},
"main": {
"interval": "10s",
"duration": "1s"
}
}
Component Configuration Files
Collectors Configuration
The collectors configuration file specifies which metrics should be queried from the system. See Collectors for available collectors and their configuration options.
Format: Unlike sinks and receivers, the collectors configuration is a set of objects (not a list).
File: collectors.json
Example:
{
"cpustat": {},
"memstat": {},
"diskstat": {
"exclude_metrics": [
"disk_total"
]
},
"likwid": {
"access_mode": "direct",
"liblikwid_path": "/usr/local/lib/liblikwid.so",
"eventsets": [
{
"events": {
"cpu": ["FLOPS_DP"]
}
}
]
}
}
Common Options (available for most collectors):
| Option | Type | Description |
|---|---|---|
| exclude_metrics | []string | List of metric names to exclude from forwarding to sinks |
| send_meta | bool | Send metadata information along with metrics (default varies) |
See: Collectors Documentation for collector-specific configuration options.
Note: Some collectors dynamically load shared libraries. Ensure the library path is part of the LD_LIBRARY_PATH environment variable.
Sinks Configuration
The sinks configuration file defines where metrics should be sent. Multiple sinks of the same or different types can be configured.
Format: Object with named sink configurations
File: sinks.json
Example:
{
"local_influx": {
"type": "influxasync",
"host": "localhost",
"port": "8086",
"organization": "myorg",
"database": "metrics",
"password": "mytoken"
},
"central_prometheus": {
"type": "prometheus",
"host": "0.0.0.0",
"port": "9091"
},
"debug_log": {
"type": "stdout"
}
}
Common Sink Types:
| Type | Description |
|---|---|
| influxasync | InfluxDB v2 asynchronous writer |
| influxdb | InfluxDB v2 synchronous writer |
| prometheus | Prometheus Pushgateway |
| nats | NATS messaging system |
| stdout | Standard output (for debugging) |
| libganglia | Ganglia monitoring system |
| http | Generic HTTP endpoint |
See: cc-lib Sinks Documentation for sink-specific configuration options.
Note: Some sinks dynamically load shared libraries. Ensure the library path is part of the LD_LIBRARY_PATH environment variable.
Router Configuration
The router sits between collectors/receivers and sinks, enabling metric processing such as tagging, filtering, renaming, and aggregation.
File: router.json
Simple Example:
{
"add_tags": [
{
"key": "cluster",
"value": "mycluster",
"if": "*"
}
],
"interval_timestamp": false,
"num_cache_intervals": 0
}
Advanced Example:
{
"num_cache_intervals": 1,
"interval_timestamp": true,
"hostname_tag": "hostname",
"max_forward": 50,
"process_messages": {
"manipulate_messages": [
{
"add_base_tags": {
"cluster": "mycluster"
}
}
]
}
}
Configuration Reference:
| Option | Type | Default | Description |
|---|---|---|---|
| interval_timestamp | bool | false | Use a common timestamp (interval start) for all metrics in an interval |
| num_cache_intervals | int | 0 | Number of past intervals to cache (0 disables the cache; must be > 0 for interval aggregates) |
| hostname_tag | string | "hostname" | Tag name for the hostname (added to locally created metrics) |
| max_forward | int | 50 | Maximum metrics to read from a channel at once (must be > 1) |
| process_messages | object | - | Message processor configuration (see below) |
See: Router Documentation for detailed configuration options and Message Processor for advanced processing.
Receivers Configuration
Receivers enable cc-metric-collector to accept metrics from other collectors via network protocols. For most standalone setups, this file can contain only an empty JSON map ({}).
File: receivers.json
Example:
{
"nats_rack0": {
"type": "nats",
"address": "nats-server.example.org",
"port": "4222",
"subject": "rack0"
},
"http_receiver": {
"type": "http",
"address": "0.0.0.0",
"port": "8080",
"path": "/api/write"
}
}
Common Receiver Types:
| Type | Description |
|---|---|
| nats | NATS subscriber |
| http | HTTP server endpoint for metric ingestion |
See: cc-lib Receivers Documentation for receiver-specific configuration options.
Configuration Examples
Complete example configurations can be found in the example-configs directory of the repository.
Configuration Validation
To validate your configuration before running the collector:
# Test configuration loading
cc-metric-collector -config /path/to/config.json -once
The -once flag runs all collectors only once and exits, useful for testing.
2 - Installation
Building and installing cc-metric-collector
Building from Source
Prerequisites
- Go 1.16 or higher
- Git
- Make
- Standard build tools (gcc, etc.)
Basic Build
In most cases, a simple make in the main folder is enough to get a cc-metric-collector binary:
git clone https://github.com/ClusterCockpit/cc-metric-collector.git
cd cc-metric-collector
make
The build process automatically:
- Downloads dependencies via go get
- Checks for the LIKWID library (for the LIKWID collector)
- Downloads and builds LIKWID as a static library if not found
- Copies required header files for cgo bindings
Build Output
After a successful build, you’ll have:
- the cc-metric-collector binary in the project root
- the LIKWID library and headers (if the LIKWID collector was built)
System Integration
Configuration Files
Create a directory for configuration files:
sudo mkdir -p /etc/cc-metric-collector
sudo cp example-configs/*.json /etc/cc-metric-collector/
Edit the configuration files according to your needs. See Configuration for details.
User and Group Setup
It’s recommended to run cc-metric-collector as a dedicated user:
sudo useradd -r -s /bin/false cc-metric-collector
sudo mkdir -p /var/log/cc-metric-collector
sudo chown cc-metric-collector:cc-metric-collector /var/log/cc-metric-collector
Pre-configuration
The main configuration settings for system integration are pre-defined in scripts/cc-metric-collector.config. This file contains:
- UNIX user and group for execution
- PID file location
- Other system settings
Adjust and install it:
# Edit the configuration
editor scripts/cc-metric-collector.config
# Install to system location
sudo install --mode 644 \
--owner root \
--group root \
scripts/cc-metric-collector.config /etc/default/cc-metric-collector
Systemd Integration
If you are using systemd as your init system:
# Install the systemd service file
sudo install --mode 644 \
--owner root \
--group root \
scripts/cc-metric-collector.service /etc/systemd/system/cc-metric-collector.service
# Reload systemd daemon
sudo systemctl daemon-reload
# Enable the service to start on boot
sudo systemctl enable cc-metric-collector
# Start the service
sudo systemctl start cc-metric-collector
# Check status
sudo systemctl status cc-metric-collector
SysVinit Integration
If you are using an init system based on /etc/init.d daemons:
# Install the init script
sudo install --mode 755 \
--owner root \
--group root \
scripts/cc-metric-collector.init /etc/init.d/cc-metric-collector
# Enable the service
sudo update-rc.d cc-metric-collector defaults
# Start the service
sudo /etc/init.d/cc-metric-collector start
The init script reads basic configuration from /etc/default/cc-metric-collector.
Package Installation
RPM Packages
To build RPM packages:
Requirements:
- RPM tools (rpm and rpmspec)
- Git
Run make RPM in the project root. This target uses the RPM SPEC file scripts/cc-metric-collector.spec and creates packages in the project directory.
Install the generated RPM:
sudo rpm -ivh cc-metric-collector-*.rpm
DEB Packages
To build Debian packages:
Run make DEB in the project root. This target uses the DEB control file scripts/cc-metric-collector.control and creates a binary deb package.
Install the generated DEB:
sudo dpkg -i cc-metric-collector_*.deb
Note: DEB package creation is experimental and not as well tested as RPM packages.
Customizing Packages
To customize RPM or DEB packages for your local system:
- Fork the cc-metric-collector repository
- Enable GitHub Actions in your fork
- Make changes to scripts, code, etc.
- Commit and push your changes
- Tag the commit: git tag v0.x.y-myversion
- Push the tags: git push --tags
- Wait for the Release action to complete
- Download RPMs/DEBs from the Releases page of your fork
Library Dependencies
LIKWID Collector
The LIKWID collector requires the LIKWID library. There is currently no Golang interface to LIKWID, so cgo is used to create bindings.
The build process handles LIKWID automatically:
- Checks if LIKWID is installed system-wide
- If not found, downloads and builds LIKWID with direct access mode
- Copies necessary header files
To use a pre-installed LIKWID:
export LD_LIBRARY_PATH=/path/to/likwid/lib:$LD_LIBRARY_PATH
Other Dynamic Libraries
Some collectors and sinks dynamically load shared libraries:
| Component | Library | Purpose |
|---|---|---|
| LIKWID collector | liblikwid.so | Hardware performance data |
| NVIDIA collector | libnvidia-ml.so | NVIDIA GPU metrics |
| ROCm collector | librocm_smi64.so | AMD GPU metrics |
| Ganglia sink | libganglia.so | Ganglia metric submission |
Ensure required libraries are in your LD_LIBRARY_PATH:
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
Permissions
Hardware Access
Some collectors require special permissions:
| Collector | Requirement | Solution |
|---|---|---|
| LIKWID (direct) | Direct hardware access | Run as root or use capabilities |
| IPMI | Access to IPMI devices | User must be in ipmi group |
| Temperature | Access to /sys/class/hwmon | Usually readable by all users |
| GPU collectors | Access to GPU management libraries | User must have GPU access rights |
Setting Capabilities (Alternative to Root)
For LIKWID direct access without running as root:
sudo setcap cap_sys_rawio=ep /path/to/cc-metric-collector
Warning: Direct hardware access can be dangerous if misconfigured. Use with caution.
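To check that the capability was applied, query the binary with getcap:
getcap /path/to/cc-metric-collector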
Verification
After installation, verify the collector is working:
# Test configuration
cc-metric-collector -config /etc/cc-metric-collector/config.json -once
# Check logs
journalctl -u cc-metric-collector -f
# Or for SysV
tail -f /var/log/cc-metric-collector/collector.log
Troubleshooting
Common Issues
Issue: cannot find liblikwid.so
- Solution: Set LD_LIBRARY_PATH or configure it in the systemd service file
Issue: permission denied accessing hardware
- Solution: Run as root, use capabilities, or adjust file permissions
Issue: Configuration file not found
- Solution: Use the -config flag or place config.json in the execution directory
Issue: Metrics not appearing in sink
- Solution: Check sink configuration, network connectivity, and router settings
Debug Mode
Run in foreground with debug output:
cc-metric-collector -config /path/to/config.json -log stderr
Run collectors only once for testing:
cc-metric-collector -config /path/to/config.json -once
3 - Usage
Running and using cc-metric-collector
Command Line Interface
Basic Usage
cc-metric-collector [options]
Command Line Options
| Flag | Type | Default | Description |
|---|---|---|---|
| -config | string | ./config.json | Path to configuration file |
| -log | string | stderr | Path for logfile (use stderr for console output) |
| -once | bool | false | Run all collectors only once, then exit |
Examples
Run with default configuration:
cc-metric-collector
Run with custom configuration:
cc-metric-collector -config /etc/cc-metric-collector/config.json
Log to file:
cc-metric-collector -config /etc/cc-metric-collector/config.json \
-log /var/log/cc-metric-collector/collector.log
Test configuration (run once):
cc-metric-collector -config /etc/cc-metric-collector/config.json -once
This runs all collectors exactly once and exits. Useful for:
- Testing configuration
- Debugging collector issues
- Validating metric output
- One-time metric collection
Running as a Service
Systemd
Start service:
sudo systemctl start cc-metric-collector
Stop service:
sudo systemctl stop cc-metric-collector
Restart service:
sudo systemctl restart cc-metric-collector
Check status:
sudo systemctl status cc-metric-collector
View logs:
journalctl -u cc-metric-collector -f
Enable on boot:
sudo systemctl enable cc-metric-collector
SysVinit
Start service:
sudo /etc/init.d/cc-metric-collector start
Stop service:
sudo /etc/init.d/cc-metric-collector stop
Restart service:
sudo /etc/init.d/cc-metric-collector restart
Check status:
sudo /etc/init.d/cc-metric-collector status
Operation Modes
Daemon Mode (Default)
In daemon mode, cc-metric-collector runs continuously with a timer loop that:
- Triggers all enabled collectors serially
- Collects metrics with a single timestamp per interval
- Forwards metrics through the router
- Sends processed metrics to all configured sinks
- Sleeps until the next interval
Interval timing is controlled by the main.interval configuration parameter.
One-Shot Mode
Activated with the -once flag, this mode:
- Initializes all collectors
- Runs each collector exactly once
- Processes and forwards metrics
- Exits
Useful for:
- Configuration testing
- Debugging
- Cron-based metric collection (see the sketch below)
- Integration with other monitoring tools
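For the cron-based pattern, an entry like the following (path and schedule are examples, not defaults) collects one round of metrics every five minutes:
# Hypothetical crontab entry: one-shot collection every 5 minutes
*/5 * * * * /usr/local/bin/cc-metric-collector -config /etc/cc-metric-collector/config.json -once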
Metric Collection Flow
sequenceDiagram
participant Timer
participant Collectors
participant Router
participant Sinks
Timer->>Collectors: Trigger (every interval)
Collectors->>Collectors: Read metrics from system
Collectors->>Router: Forward metrics
Router->>Router: Process (tag, filter, aggregate)
Router->>Sinks: Send processed metrics
Sinks->>Sinks: Write to backends
Timer->>Timer: Sleep until next interval
Common Usage Patterns
Basic Monitoring Setup
Collect basic system metrics and send to InfluxDB:
config.json:
{
"collectors-file": "./collectors.json",
"sinks-file": "./sinks.json",
"receivers-file": "./receivers.json",
"router-file": "./router.json",
"main": {
"interval": "10s",
"duration": "1s"
}
}
collectors.json:
{
"cpustat": {},
"memstat": {},
"diskstat": {},
"netstat": {},
"loadavg": {}
}
sinks.json:
{
"influx": {
"type": "influxasync",
"host": "influx.example.org",
"port": "8086",
"organization": "myorg",
"database": "metrics",
"password": "mytoken"
}
}
router.json:
{
"add_tags": [
{
"key": "cluster",
"value": "production",
"if": "*"
}
],
"interval_timestamp": true
}
receivers.json:
{}
HPC Node Monitoring
Extended monitoring for HPC compute nodes:
collectors.json:
{
"cpustat": {},
"memstat": {},
"diskstat": {},
"netstat": {},
"loadavg": {},
"tempstat": {},
"likwid": {
"access_mode": "direct",
"liblikwid_path": "/usr/local/lib/liblikwid.so",
"eventsets": [
{
"events": {
"cpu": ["FLOPS_DP", "CLOCK"]
}
}
]
},
"nvidia": {},
"ibstat": {}
}
Hierarchical Collection
Compute nodes send to aggregation node:
Node config - sinks.json:
{
"nats_aggregator": {
"type": "nats",
"host": "aggregator.example.org",
"port": "4222",
"subject": "cluster.rack1"
}
}
Aggregation node config - receivers.json:
{
"nats_rack1": {
"type": "nats",
"address": "localhost",
"port": "4222",
"subject": "cluster.rack1"
},
"nats_rack2": {
"type": "nats",
"address": "localhost",
"port": "4222",
"subject": "cluster.rack2"
}
}
Aggregation node config - sinks.json:
{
"influx": {
"type": "influxasync",
"host": "influx.example.org",
"port": "8086",
"organization": "myorg",
"database": "metrics",
"password": "mytoken"
}
}
Multi-Sink Configuration
Send metrics to multiple destinations:
sinks.json:
{
"primary_influx": {
"type": "influxasync",
"host": "influx1.example.org",
"port": "8086",
"organization": "myorg",
"database": "metrics",
"password": "token1"
},
"backup_influx": {
"type": "influxasync",
"host": "influx2.example.org",
"port": "8086",
"organization": "myorg",
"database": "metrics",
"password": "token2"
},
"prometheus": {
"type": "prometheus",
"host": "0.0.0.0",
"port": "9091"
}
}
Monitoring and Debugging
Check Collector Status
Use -once mode to test without running continuously:
cc-metric-collector -config /etc/cc-metric-collector/config.json -once
Debug Output
Log to stderr for immediate feedback:
cc-metric-collector -config /etc/cc-metric-collector/config.json -log stderr
Verify Metrics
Check what metrics are being collected:
- Configure stdout sink temporarily
- Run in -once mode
- Observe metric output
Temporary debug sink:
{
"debug": {
"type": "stdout"
}
}
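With a stdout sink configured, a one-shot run prints the collected metrics to the console, where they can be filtered with standard tools (the metric name here is an example):
cc-metric-collector -config /etc/cc-metric-collector/config.json -once | grep cpu_used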
Common Issues
No metrics appearing:
- Check collector configuration
- Verify collectors have required permissions
- Ensure sinks are reachable
- Check router isn’t filtering metrics
High CPU usage:
- Increase the main.interval value
- Disable expensive collectors
- Check for router performance issues
Memory growth:
- Reduce num_cache_intervals in the router
- Check for sink write failures
- Verify metric cardinality isn’t excessive
Interval Adjustment
Faster updates (more overhead):
{
"main": {
"interval": "5s",
"duration": "1s"
}
}
Slower updates (less overhead):
{
"main": {
"interval": "60s",
"duration": "1s"
}
}
Collector Selection
Only enable collectors you need:
{
"cpustat": {},
"memstat": {}
}
Metric Filtering
Use router to exclude unwanted metrics:
{
"process_messages": {
"manipulate_messages": [
{
"drop_by_name": ["cpu_idle", "cpu_iowait"]
}
]
}
}
Security Considerations
Running as Non-Root
Most collectors work without root privileges, except:
- LIKWID (direct mode)
- IPMI collector
- Some hardware-specific collectors
Use capabilities instead of root when possible.
Network Security
When using receivers:
- Use authentication (NATS credentials, HTTP tokens)
- Restrict listening addresses
- Use TLS for encrypted transport
- Firewall receiver ports appropriately
File Permissions
Protect configuration files containing credentials:
sudo chmod 600 /etc/cc-metric-collector/config.json
sudo chown cc-metric-collector:cc-metric-collector /etc/cc-metric-collector/config.json
4 - Metric Router
Routing and processing metrics in cc-metric-collector
Overview
The metric router sits between collectors/receivers and sinks, enabling metric processing such as:
- Adding and removing tags
- Filtering and dropping metrics
- Renaming metrics
- Aggregating metrics across an interval
- Normalizing units
- Setting common timestamps
Basic Configuration
File: router.json
Minimal configuration:
{
"interval_timestamp": false,
"num_cache_intervals": 0
}
Typical configuration:
{
"add_tags": [
{
"key": "cluster",
"value": "mycluster",
"if": "*"
}
],
"interval_timestamp": true,
"num_cache_intervals": 0
}
Configuration Options
Core Settings
| Option | Type | Default | Description |
|---|---|---|---|
| interval_timestamp | bool | false | Use a common timestamp (interval start) for all metrics in an interval |
| num_cache_intervals | int | 0 | Number of past intervals to cache (0 disables the cache; must be > 0 for interval aggregates) |
| hostname_tag | string | "hostname" | Tag name for the hostname (added to locally created metrics) |
| max_forward | int | 50 | Maximum metrics to read from a channel at once (must be > 1) |
The interval_timestamp Option
Collectors’ Read() functions are not called simultaneously, so metrics within an interval can have different timestamps.
When true: All metrics in an interval get a common timestamp (the interval start time)
When false: Each metric keeps its original collection timestamp
Use case: Enable this to simplify time-series alignment in your database.
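For example (hypothetical timestamps), two metrics collected a few milliseconds apart within the same interval would be sent as:
cpu_used,hostname=node01 value=42.5 1700000000012000000
mem_used,hostname=node01 value=16.2 1700000000047000000
With interval_timestamp enabled, both carry the interval start time instead:
cpu_used,hostname=node01 value=42.5 1700000000000000000
mem_used,hostname=node01 value=16.2 1700000000000000000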
The num_cache_intervals Option
Controls metric caching for interval aggregations.
| Value | Behavior |
|---|---|
| 0 | Cache disabled (no aggregations possible) |
| 1 | Cache last interval only (minimal memory, basic aggregations) |
| 2+ | Cache multiple intervals (for complex time-based aggregations) |
Note: Required to be > 0 for interval_aggregates to work.
The hostname_tag Option
By default, the router tags locally created metrics with the hostname.
Default tag name: hostname
Custom tag name:
{
"hostname_tag": "node"
}
The max_forward Option
Performance tuning for metric processing.
How it works: When the router receives a metric, it tries to read up to max_forward additional metrics from the same channel before processing.
Default: 50
Must be: Greater than 1
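Conceptually, the batched channel draining resembles the following sketch (not the actual router code; the Metric type is a stand-in):
package main

import "fmt"

// Metric stands in for the router's internal message type.
type Metric struct{ Name string }

// drainBatch blocks for one metric, then opportunistically reads up to
// maxForward-1 more without blocking; this is the behavior max_forward tunes.
func drainBatch(ch <-chan Metric, maxForward int) []Metric {
	batch := []Metric{<-ch}
	for len(batch) < maxForward {
		select {
		case m := <-ch:
			batch = append(batch, m)
		default: // channel currently empty: process what we have
			return batch
		}
	}
	return batch
}

func main() {
	ch := make(chan Metric, 4)
	for _, n := range []string{"cpu_user", "mem_used", "load_one"} {
		ch <- Metric{Name: n}
	}
	fmt.Println(drainBatch(ch, 50)) // all three arrive in one batch
}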
Metric Processing
Modern Configuration (Recommended)
Use the process_messages section with the message processor:
{
"process_messages": {
"manipulate_messages": [
{
"add_base_tags": {
"cluster": "mycluster",
"partition": "compute"
}
},
{
"drop_by_name": ["cpu_idle", "mem_cached"]
},
{
"rename_by": {
"clock_mhz": "clock"
}
}
]
}
}
Legacy Configuration (Deprecated)
The following options are deprecated but still supported for backward compatibility. They are automatically converted to process_messages format.
Deprecated syntax:
{
"add_tags": [
{
"key": "cluster",
"value": "mycluster",
"if": "*"
},
{
"key": "type",
"value": "socket",
"if": "name == 'temp_package_id_0'"
}
]
}
Modern equivalent:
{
"process_messages": {
"manipulate_messages": [
{
"add_base_tags": {
"cluster": "mycluster"
}
},
{
"add_tags_by": {
"type": "socket"
},
"if": "name == 'temp_package_id_0'"
}
]
}
}
Deprecated syntax:
{
"delete_tags": [
{
"key": "unit",
"if": "*"
}
]
}
Never delete these tags: hostname, type, type-id
Dropping Metrics
By name (deprecated):
{
"drop_metrics": [
"not_interesting_metric",
"debug_metric"
]
}
By condition (deprecated):
{
"drop_metrics_if": [
"match('temp_core_%d+', name)",
"match('cpu', type) && type-id == 0"
]
}
Modern equivalent:
{
"process_messages": {
"manipulate_messages": [
{
"drop_by_name": ["not_interesting_metric", "debug_metric"]
},
{
"drop_by": "match('temp_core_%d+', name)"
}
]
}
}
Renaming Metrics
Deprecated syntax:
{
"rename_metrics": {
"old_name": "new_name",
"clock_mhz": "clock"
}
}
Modern equivalent:
{
"process_messages": {
"manipulate_messages": [
{
"rename_by": {
"old_name": "new_name",
"clock_mhz": "clock"
}
}
]
}
}
Use case: Standardize metric names across different systems or collectors.
Normalizing Units
Deprecated syntax:
{
"normalize_units": true
}
Effect: Normalizes unit names (e.g., byte, Byte, B, bytes → consistent format)
Changing Unit Prefixes
Deprecated syntax:
{
"change_unit_prefix": {
"mem_used": "G",
"mem_total": "G"
}
}
Use case: Convert memory metrics from kB (as reported by /proc/meminfo) to GB for better readability. With SI prefixes, a mem_used value of 16384000 kB becomes 16.384 GB.
Interval Aggregates (Experimental)
Requires: num_cache_intervals > 0
Derive new metrics by aggregating metrics from the current interval.
Configuration
{
"num_cache_intervals": 1,
"interval_aggregates": [
{
"name": "temp_cores_avg",
"if": "match('temp_core_%d+', metric.Name())",
"function": "avg(values)",
"tags": {
"type": "node"
},
"meta": {
"group": "IPMI",
"unit": "degC",
"source": "TempCollector"
}
}
]
}
Parameters
| Field | Type | Description |
|---|---|---|
| name | string | Name of the new derived metric |
| if | string | Condition selecting which metrics to aggregate |
| function | string | Aggregation function (e.g., avg(values), sum(values), max(values)) |
| tags | object | Tags to add to the derived metric |
| meta | object | Metadata for the derived metric (use "<copy>" to copy from source metrics) |
Available Functions
| Function | Description |
|---|---|
| avg(values) | Average of all matching metrics |
| sum(values) | Sum of all matching metrics |
| min(values) | Minimum value |
| max(values) | Maximum value |
| count(values) | Number of matching metrics |
Complex Example
Calculate mem_used from multiple memory metrics:
{
"interval_aggregates": [
{
"name": "mem_used",
"if": "source == 'MemstatCollector'",
"function": "sum(mem_total) - (sum(mem_free) + sum(mem_buffers) + sum(mem_cached))",
"tags": {
"type": "node"
},
"meta": {
"group": "<copy>",
"unit": "<copy>",
"source": "<copy>"
}
}
]
}
Dropping Source Metrics
If you only want the aggregated metric, drop the source metrics:
{
"drop_metrics_if": [
"match('temp_core_%d+', metric.Name())"
],
"interval_aggregates": [
{
"name": "temp_cores_avg",
"if": "match('temp_core_%d+', metric.Name())",
"function": "avg(values)",
"tags": {
"type": "node"
},
"meta": {
"group": "IPMI",
"unit": "degC"
}
}
]
}
Processing Order
The router processes metrics in a specific order:
1. Add the hostname_tag (if the metric was sent by collectors or the cache)
2. Change the timestamp to the interval timestamp (if interval_timestamp == true)
3. Check if the metric should be dropped (drop_metrics, drop_metrics_if)
4. Add tags (add_tags)
5. Delete tags (del_tags)
6. Rename the metric (rename_metrics) and store the old name in meta as oldname
7. Add tags again (to support conditions using the new name)
8. Delete tags again (to support conditions using the new name)
9. Normalize units (if normalize_units == true)
10. Convert the unit prefix (change_unit_prefix)
11. Send to the sinks
12. Move to the cache (if num_cache_intervals > 0)
Legend:
- Operations apply to metrics from collectors (c)
- Operations apply to metrics from receivers (r)
- Operations apply to both (c,r)
Complete Example
{
"interval_timestamp": true,
"num_cache_intervals": 1,
"hostname_tag": "hostname",
"max_forward": 50,
"process_messages": {
"manipulate_messages": [
{
"add_base_tags": {
"cluster": "production",
"datacenter": "dc1"
}
},
{
"drop_by_name": ["cpu_idle", "cpu_guest", "cpu_guest_nice"]
},
{
"rename_by": {
"clock_mhz": "clock"
}
},
{
"add_tags_by": {
"high_temp": "true"
},
"if": "name == 'temp_package_id_0' && value > 70"
}
]
},
"interval_aggregates": [
{
"name": "temp_avg",
"if": "match('temp_core_%d+', name)",
"function": "avg(values)",
"tags": {
"type": "node"
},
"meta": {
"group": "Temperature",
"unit": "degC",
"source": "TempCollector"
}
}
]
}
Performance Considerations
- Caching: Only enable it if you need interval aggregates (memory overhead)
- Complex conditions: Evaluated for every metric (CPU overhead)
- Aggregations: Evaluated at the start of each interval (CPU overhead)
- max_forward: Higher values can improve throughput but increase latency
5 - Collectors
Available metric collectors for cc-metric-collector
Overview
Collectors read data from various sources on the local system, parse it into metrics, and submit these metrics to the router. Each collector is a modular plugin that can be enabled or disabled independently.
File: collectors.json
The collectors configuration is a set of objects (not a list), where each key is the collector type:
{
"collector_type": {
"collector_specific_option": "value"
}
}
Common Configuration Options
Most collectors support these common options:
| Option | Type | Default | Description |
|---|---|---|---|
| exclude_metrics | []string | [] | List of metric names to exclude from forwarding to sinks |
| send_meta | bool | varies | Send metadata information along with metrics |
Example:
{
"cpustat": {
"exclude_metrics": ["cpu_idle", "cpu_guest"]
},
"memstat": {}
}
Available Collectors
System Metrics
| Collector | Description | Source |
|---|---|---|
| cpustat | CPU usage statistics | /proc/stat |
| memstat | Memory usage statistics | /proc/meminfo |
| loadavg | System load average | /proc/loadavg |
| netstat | Network interface statistics | /proc/net/dev |
| diskstat | Disk I/O statistics | /sys/block/*/stat |
| iostat | Block device I/O statistics | /proc/diskstats |
Hardware Monitoring
| Collector | Description | Requirements |
|---|---|---|
| tempstat | Temperature sensors | /sys/class/hwmon |
| cpufreq | CPU frequency | /sys/devices/system |
| cpufreq_cpuinfo | CPU frequency from cpuinfo | /proc/cpuinfo |
| ipmistat | IPMI sensor data | ipmitool command |
| Collector | Description | Requirements |
|---|---|---|
| likwid | Hardware performance counters via LIKWID | liblikwid.so |
| rapl | CPU energy consumption (RAPL) | /sys/class/powercap |
| schedstat | CPU scheduler statistics | /proc/schedstat |
| numastats | NUMA node statistics | /sys/devices/system/node |
GPU Monitoring
| Collector | Description | Requirements |
|---|---|---|
| nvidia | NVIDIA GPU metrics | libnvidia-ml.so (NVML) |
| rocm_smi | AMD ROCm GPU metrics | librocm_smi64.so |
Network & Storage
| Collector | Description | Requirements |
|---|---|---|
| ibstat | InfiniBand statistics | /sys/class/infiniband |
| lustrestat | Lustre filesystem statistics | Lustre client |
| gpfs | GPFS filesystem statistics | GPFS utilities |
| beegfs_meta | BeeGFS metadata statistics | BeeGFS metadata client |
| beegfs_storage | BeeGFS storage statistics | BeeGFS storage client |
| nfs3stat | NFS v3 statistics | /proc/net/rpc/nfs |
| nfs4stat | NFS v4 statistics | /proc/net/rpc/nfs |
| nfsiostat | NFS I/O statistics | nfsiostat command |
Process & Job Monitoring
| Collector | Description | Requirements |
|---|---|---|
| topprocs | Top processes by resource usage | /proc filesystem |
| slurm_cgroup | Slurm cgroup statistics | Slurm cgroups |
| self | Collector’s own resource usage | /proc/self |
Custom Collectors
| Collector | Description | Requirements |
|---|---|---|
| customcmd | Execute custom commands to collect metrics | Any command/script |
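A sketch of a customcmd configuration (option names follow the collector’s README; treat them as indicative rather than authoritative). The listed commands and files must produce metrics in InfluxDB line protocol:
{
  "customcmd": {
    "commands": ["/usr/local/bin/my-metrics.sh"],
    "files": ["/var/run/my-metrics.influx"],
    "exclude_metrics": []
  }
}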
Collector Lifecycle
Each collector implements these functions:
- Init(config): Initializes the collector with its configuration
- Initialized(): Returns whether initialization was successful
- Read(duration, output): Reads metrics and sends them to the output channel
- Close(): Cleanup and shutdown
Example Configurations
Minimal System Monitoring
{
"cpustat": {},
"memstat": {},
"loadavg": {}
}
HPC Node Monitoring
{
"cpustat": {},
"memstat": {},
"diskstat": {},
"netstat": {},
"loadavg": {},
"tempstat": {},
"likwid": {
"access_mode": "direct",
"liblikwid_path": "/usr/local/lib/liblikwid.so",
"eventsets": [
{
"events": {
"cpu": ["FLOPS_DP", "CLOCK"]
}
}
]
},
"nvidia": {},
"ibstat": {}
}
Filesystem-Heavy Workload
{
"cpustat": {},
"memstat": {},
"diskstat": {},
"lustrestat": {},
"nfs4stat": {},
"iostat": {}
}
Minimal Overhead
{
"cpustat": {
"exclude_metrics": ["cpu_guest", "cpu_guest_nice", "cpu_steal"]
},
"memstat": {
"exclude_metrics": ["mem_slab", "mem_sreclaimable"]
}
}
Collector Development
Creating a Custom Collector
Collectors implement the MetricCollector interface. See collectors README for details.
Basic structure:
type SampleCollector struct {
	metricCollector              // embedded base type shared by all collectors
	config SampleCollectorConfig // collector-specific configuration
}

// Init parses the collector’s JSON configuration section.
func (m *SampleCollector) Init(config json.RawMessage) error

// Read collects metrics and sends them to the output channel.
func (m *SampleCollector) Read(interval time.Duration, output chan lp.CCMetric)

// Close releases resources on shutdown.
func (m *SampleCollector) Close()
Registration
Add your collector to collectorManager.go:
var AvailableCollectors = map[string]MetricCollector{
"sample": &SampleCollector{},
}
Metric Format
All collectors submit metrics in InfluxDB line protocol format via the CCMetric type.
Metric components:
- Name: Metric identifier (e.g., cpu_used)
- Tags: Index-like key-value pairs (e.g., type=node, hostname=node01)
- Fields: Data values (typically just value)
- Metadata: Source, group, unit information
- Timestamp: When the metric was collected
Performance Considerations
- Collector overhead: Each enabled collector adds CPU overhead
- I/O impact: Some collectors read many files (e.g., per-core statistics)
- Library overhead: GPU and hardware performance collectors can be expensive
- Selective metrics: Use exclude_metrics to reduce unnecessary data