cc-metric-collector

ClusterCockpit Metric Collector References

Reference information regarding the ClusterCockpit component “cc-metric-collector” (GitHub Repo).

Overview

cc-metric-collector is a node agent for measuring, processing and forwarding node level metrics. It is part of the ClusterCockpit ecosystem.

The metric collector sends (and receives) metrics in the InfluxDB line protocol, which is flexible and cleanly separates tags (similar to index columns in relational databases) from fields (similar to data columns).
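
For illustration, a single (invented) metric in line protocol looks like this — the name and tags before the first space identify the series, the field after it carries the data, and the trailing integer is a nanosecond timestamp:

cpu_load,hostname=node01,type=node value=1.5 1680000000000000000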

Key Features

  • Modular Architecture: Flexible plugin-based system with collectors, sinks, receivers, and router
  • Multiple Data Sources: Collect metrics from various sources (procfs, sysfs, hardware libraries, custom commands)
  • Flexible Output: Send metrics to multiple sinks simultaneously (InfluxDB, Prometheus, NATS, etc.)
  • On-the-fly Processing: Router can tag, filter, aggregate, and transform metrics before forwarding
  • Network Receiver: Accept metrics from other collectors to create hierarchical setups
  • Low Overhead: Efficient serial collection with a single timestamp per interval

Architecture

There is a single timer loop that triggers all collectors serially, collects the data and sends the metrics to the configured sinks. This ensures all data is submitted with a single timestamp. The sinks currently use mostly blocking APIs.

The receiver runs as a go routine side-by-side with the timer loop and asynchronously forwards received metrics to the sink.

flowchart LR
  subgraph col ["Collectors"]
  direction TB
  cpustat["cpustat"]
  memstat["memstat"]
  tempstat["tempstat"]
  misc["..."]
  end
  
  subgraph Receivers ["Receivers"]
  direction TB
  nats["NATS"]
  httprecv["HTTP"]
  miscrecv["..."]
  end

  subgraph calc["Aggregator"]
  direction LR
  cache["Cache"]
  agg["Calculator"]
  end

  subgraph sinks ["Sinks"]
  direction RL
  influx["InfluxDB"]
  ganglia["Ganglia"]
  logger["Logfile"]
  miscsink["..."]
  end

  cpustat --> CollectorManager["CollectorManager"]
  memstat --> CollectorManager
  tempstat --> CollectorManager
  misc --> CollectorManager

  nats  --> ReceiverManager["ReceiverManager"]
  httprecv --> ReceiverManager
  miscrecv --> ReceiverManager

  CollectorManager --> newrouter["Router"]
  ReceiverManager -.-> newrouter
  calc -.-> newrouter
  newrouter --> SinkManager["SinkManager"]
  newrouter -.-> calc

  SinkManager --> influx
  SinkManager --> ganglia
  SinkManager --> logger
  SinkManager --> miscsink
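
The following Go sketch mirrors the timer loop described above. It is illustrative only — all names and types are invented, not taken from the repository:

package main

import (
    "fmt"
    "time"
)

// collector is a stand-in for the real collector interface.
type collector interface {
    Read(ts time.Time) string // returns one metric in line protocol
}

type cpustat struct{}

func (cpustat) Read(ts time.Time) string {
    // a real collector would read /proc/stat here
    return fmt.Sprintf("cpu_load,type=node value=0.5 %d", ts.UnixNano())
}

func main() {
    collectors := []collector{cpustat{}}
    ticker := time.NewTicker(10 * time.Second)
    defer ticker.Stop()
    for ts := range ticker.C {
        // all collectors run serially and share the tick's timestamp
        for _, c := range collectors {
            fmt.Println(c.Read(ts)) // the real loop hands this to the sinks
        }
    }
}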

Components

  • Collectors: Read data from local system sources (files, commands, libraries) and send to router
  • Router: Process metrics by caching, filtering, tagging, renaming, and aggregating
  • Sinks: Send metrics to storage backends (InfluxDB, Prometheus, NATS, etc.)
  • Receivers: Accept metrics from other collectors via network (HTTP, NATS) and forward to router

The key difference between collectors and receivers is that collectors are called periodically while receivers run continuously and submit metrics at any time.

Supported Metrics

Supported metrics are documented in the cc-specifications.

Deployment Scenarios

The metric collector was designed with flexibility in mind, so it can be used in many scenarios:

Direct to Database

flowchart TD
  subgraph a ["Cluster A"]
  nodeA[NodeA with CC collector]
  nodeB[NodeB with CC collector]
  nodeC[NodeC with CC collector]
  end
  a --> db[(Database)]
  db <--> ccweb("Webfrontend")

Hierarchical Collection

flowchart TD
  subgraph a [ClusterA]
  direction LR
  nodeA[NodeA with CC collector]
  nodeB[NodeB with CC collector]
  nodeC[NodeC with CC collector]
  end
  subgraph b [ClusterB]
  direction LR
  nodeD[NodeD with CC collector]
  nodeE[NodeE with CC collector]
  nodeF[NodeF with CC collector]
  end
  a --> ccrecv{"CC collector as receiver"}
  b --> ccrecv
  ccrecv --> db[("Database1")]
  ccrecv -.-> db2[("Database2")]
  db <-.-> ccweb("Webfrontend")

1 - Configuration

cc-metric-collector Configuration Reference

Configuration Overview

The configuration of cc-metric-collector consists of five configuration files: one global file and four component-related files.

Configuration is implemented using a single JSON document that can be distributed over the network and persisted as a file.

Global Configuration File

The global file contains paths to the other four component files and some global options.

Default location: /etc/cc-metric-collector/config.json (can be overridden with -config flag)

Example

{
  "sinks-file": "/etc/cc-metric-collector/sinks.json",
  "collectors-file": "/etc/cc-metric-collector/collectors.json",
  "receivers-file": "/etc/cc-metric-collector/receivers.json",
  "router-file": "/etc/cc-metric-collector/router.json",
  "main": {
    "interval": "10s",
    "duration": "1s"
  }
}

Note: Paths are relative to the execution folder of the cc-metric-collector binary, so it is recommended to use absolute paths.

Configuration Reference

| Config Key | Type | Default | Description |
|-----------------|--------|---------|-------------|
| sinks-file | string | - | Path to the sinks configuration file (relative or absolute) |
| collectors-file | string | - | Path to the collectors configuration file (relative or absolute) |
| receivers-file | string | - | Path to the receivers configuration file (relative or absolute) |
| router-file | string | - | Path to the router configuration file (relative or absolute) |
| main.interval | string | 10s | How often metrics are read and sent to the sinks; parsed with time.ParseDuration() |
| main.duration | string | 1s | How long one measurement takes; important for collectors like likwid that measure over time |

Alternative Configuration Format

Instead of separate files, you can embed component configurations directly:

{
  "sinks": {
    "mysink": {
      "type": "influxasync",
      "host": "localhost",
      "port": "8086"
    }
  },
  "collectors": {
    "cpustat": {}
  },
  "receivers": {},
  "router": {
    "interval_timestamp": false
  },
  "main": {
    "interval": "10s",
    "duration": "1s"
  }
}

Component Configuration Files

Collectors Configuration

The collectors configuration file specifies which metrics should be queried from the system. See Collectors for available collectors and their configuration options.

Format: The collectors configuration is a set of objects keyed by collector type (not a list).

File: collectors.json

Example:

{
  "cpustat": {},
  "memstat": {},
  "diskstat": {
    "exclude_metrics": [
      "disk_total"
    ]
  },
  "likwid": {
    "access_mode": "direct",
    "liblikwid_path": "/usr/local/lib/liblikwid.so",
    "eventsets": [
      {
        "events": {
          "cpu": ["FLOPS_DP"]
        }
      }
    ]
  }
}

Common Options (available for most collectors):

| Option | Type | Description |
|--------|------|-------------|
| exclude_metrics | []string | List of metric names to exclude from forwarding to sinks |
| send_meta | bool | Send metadata information along with metrics (default varies) |
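
For example, a collector entry using both options (whether a given collector supports send_meta varies, as noted above):

{
  "cpustat": {
    "exclude_metrics": ["cpu_idle", "cpu_guest"],
    "send_meta": true
  }
}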

See: Collectors Documentation for collector-specific configuration options.

Note: Some collectors dynamically load shared libraries. Ensure the library path is part of the LD_LIBRARY_PATH environment variable.

Sinks Configuration

The sinks configuration file defines where metrics should be sent. Multiple sinks of the same or different types can be configured.

Format: Object with named sink configurations

File: sinks.json

Example:

{
  "local_influx": {
    "type": "influxasync",
    "host": "localhost",
    "port": "8086",
    "organization": "myorg",
    "database": "metrics",
    "password": "mytoken"
  },
  "central_prometheus": {
    "type": "prometheus",
    "host": "0.0.0.0",
    "port": "9091"
  },
  "debug_log": {
    "type": "stdout"
  }
}

Common Sink Types:

| Type | Description |
|------|-------------|
| influxasync | InfluxDB v2 asynchronous writer |
| influxdb | InfluxDB v2 synchronous writer |
| prometheus | Prometheus Pushgateway |
| nats | NATS messaging system |
| stdout | Standard output (for debugging) |
| libganglia | Ganglia monitoring system |
| http | Generic HTTP endpoint |

See: cc-lib Sinks Documentation for sink-specific configuration options.

Note: Some sinks dynamically load shared libraries. Ensure the library path is part of the LD_LIBRARY_PATH environment variable.

Router Configuration

The router sits between collectors/receivers and sinks, enabling metric processing such as tagging, filtering, renaming, and aggregation.

File: router.json

Simple Example:

{
  "add_tags": [
    {
      "key": "cluster",
      "value": "mycluster",
      "if": "*"
    }
  ],
  "interval_timestamp": false,
  "num_cache_intervals": 0
}

Advanced Example:

{
  "num_cache_intervals": 1,
  "interval_timestamp": true,
  "hostname_tag": "hostname",
  "max_forward": 50,
  "process_messages": {
    "manipulate_messages": [
      {
        "add_base_tags": {
          "cluster": "mycluster"
        }
      }
    ]
  }
}

Configuration Reference:

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| interval_timestamp | bool | false | Use a common timestamp (interval start) for all metrics in an interval |
| num_cache_intervals | int | 0 | Number of past intervals to cache (0 disables the cache; required for interval aggregates) |
| hostname_tag | string | "hostname" | Tag name for the hostname (added to locally created metrics) |
| max_forward | int | 50 | Maximum metrics to read from a channel at once (must be > 1) |
| process_messages | object | - | Message processor configuration (see below) |

See: Router Documentation for detailed configuration options and Message Processor for advanced processing.

Receivers Configuration

Receivers enable cc-metric-collector to accept metrics from other collectors via network protocols. For most standalone setups, this file can contain only an empty JSON map ({}).

File: receivers.json

Example:

{
  "nats_rack0": {
    "type": "nats",
    "address": "nats-server.example.org",
    "port": "4222",
    "subject": "rack0"
  },
  "http_receiver": {
    "type": "http",
    "address": "0.0.0.0",
    "port": "8080",
    "path": "/api/write"
  }
}

Common Receiver Types:

| Type | Description |
|------|-------------|
| nats | NATS subscriber |
| http | HTTP server endpoint for metric ingestion |

See: cc-lib Receivers Documentation for receiver-specific configuration options.

Configuration Examples

Complete example configurations can be found in the example-configs directory of the repository.

Configuration Validation

To validate your configuration before running the collector:

# Test configuration loading
cc-metric-collector -config /path/to/config.json -once

The -once flag runs all collectors only once and exits, useful for testing.

2 - Installation

Building and installing cc-metric-collector

Building from Source

Prerequisites

  • Go 1.16 or higher
  • Git
  • Make
  • Standard build tools (gcc, etc.)

Basic Build

In most cases, a simple make in the main folder is enough to get a cc-metric-collector binary:

git clone https://github.com/ClusterCockpit/cc-metric-collector.git
cd cc-metric-collector
make

The build process automatically:

  • Downloads dependencies via go get
  • Checks for LIKWID library (for LIKWID collector)
  • Downloads and builds LIKWID as a static library if not found
  • Copies required header files for cgo bindings

Build Output

After successful build, you’ll have:

  • cc-metric-collector binary in the project root
  • LIKWID library and headers (if LIKWID collector was built)

System Integration

Configuration Files

Create a directory for configuration files:

sudo mkdir -p /etc/cc-metric-collector
sudo cp example-configs/*.json /etc/cc-metric-collector/

Edit the configuration files according to your needs. See Configuration for details.

User and Group Setup

It’s recommended to run cc-metric-collector as a dedicated user:

sudo useradd -r -s /bin/false cc-metric-collector
sudo mkdir -p /var/log/cc-metric-collector
sudo chown cc-metric-collector:cc-metric-collector /var/log/cc-metric-collector

Pre-configuration

The main configuration settings for system integration are pre-defined in scripts/cc-metric-collector.config. This file contains:

  • UNIX user and group for execution
  • PID file location
  • Other system settings

Adjust and install it:

# Edit the configuration
editor scripts/cc-metric-collector.config

# Install to system location
sudo install --mode 644 \
             --owner root \
             --group root \
             scripts/cc-metric-collector.config /etc/default/cc-metric-collector

Systemd Integration

If you are using systemd as your init system:

# Install the systemd service file
sudo install --mode 644 \
             --owner root \
             --group root \
             scripts/cc-metric-collector.service /etc/systemd/system/cc-metric-collector.service

# Reload systemd daemon
sudo systemctl daemon-reload

# Enable the service to start on boot
sudo systemctl enable cc-metric-collector

# Start the service
sudo systemctl start cc-metric-collector

# Check status
sudo systemctl status cc-metric-collector

SysVinit Integration

If you are using an init system based on /etc/init.d daemons:

# Install the init script
sudo install --mode 755 \
             --owner root \
             --group root \
             scripts/cc-metric-collector.init /etc/init.d/cc-metric-collector

# Enable the service
sudo update-rc.d cc-metric-collector defaults

# Start the service
sudo /etc/init.d/cc-metric-collector start

The init script reads basic configuration from /etc/default/cc-metric-collector.

Package Installation

RPM Packages

To build RPM packages:

make RPM

Requirements:

  • RPM tools (rpm and rpmspec)
  • Git

The command uses the RPM SPEC file scripts/cc-metric-collector.spec and creates packages in the project directory.

Install the generated RPM:

sudo rpm -ivh cc-metric-collector-*.rpm

DEB Packages

To build Debian packages:

make DEB

Requirements:

  • dpkg-deb
  • awk, sed
  • Git

The command uses the DEB control file scripts/cc-metric-collector.control and creates a binary deb package.

Install the generated DEB:

sudo dpkg -i cc-metric-collector_*.deb

Note: DEB package creation is experimental and not as well tested as RPM packages.

Customizing Packages

To customize RPM or DEB packages for your local system:

  1. Fork the cc-metric-collector repository
  2. Enable GitHub Actions in your fork
  3. Make changes to scripts, code, etc.
  4. Commit and push your changes
  5. Tag the commit: git tag v0.x.y-myversion
  6. Push tags: git push --tags
  7. Wait for the Release action to complete
  8. Download RPMs/DEBs from the Releases page of your fork

Library Dependencies

LIKWID Collector

The LIKWID collector requires the LIKWID library. There is currently no Golang interface to LIKWID, so cgo is used to create bindings.

The build process handles LIKWID automatically:

  • Checks if LIKWID is installed system-wide
  • If not found, downloads and builds LIKWID with direct access mode
  • Copies necessary header files

To use a pre-installed LIKWID:

export LD_LIBRARY_PATH=/path/to/likwid/lib:$LD_LIBRARY_PATH

Other Dynamic Libraries

Some collectors and sinks dynamically load shared libraries:

| Component | Library | Purpose |
|-----------|---------|---------|
| LIKWID collector | liblikwid.so | Hardware performance data |
| NVIDIA collector | libnvidia-ml.so | NVIDIA GPU metrics |
| ROCm collector | librocm_smi64.so | AMD GPU metrics |
| Ganglia sink | libganglia.so | Ganglia metric submission |

Ensure required libraries are in your LD_LIBRARY_PATH:

export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH

Permissions

Hardware Access

Some collectors require special permissions:

| Collector | Requirement | Solution |
|-----------|-------------|----------|
| LIKWID (direct) | Direct hardware access | Run as root or use capabilities |
| IPMI | Access to IPMI devices | User must be in the ipmi group |
| Temperature | Access to /sys/class/hwmon | Usually readable by all users |
| GPU collectors | Access to GPU management libraries | User must have GPU access rights |

Setting Capabilities (Alternative to Root)

For LIKWID direct access without running as root:

sudo setcap cap_sys_rawio=ep /path/to/cc-metric-collector

Warning: Direct hardware access can be dangerous if misconfigured. Use with caution.

Verification

After installation, verify the collector is working:

# Test configuration
cc-metric-collector -config /etc/cc-metric-collector/config.json -once

# Check logs
journalctl -u cc-metric-collector -f

# Or for SysV
tail -f /var/log/cc-metric-collector/collector.log

Troubleshooting

Common Issues

Issue: cannot find liblikwid.so

  • Solution: Set LD_LIBRARY_PATH or configure in systemd service file

Issue: permission denied accessing hardware

  • Solution: Run as root, use capabilities, or adjust file permissions

Issue: Configuration file not found

  • Solution: Use -config flag or place config.json in execution directory

Issue: Metrics not appearing in sink

  • Solution: Check sink configuration, network connectivity, and router settings

Debug Mode

Run in foreground with debug output:

cc-metric-collector -config /path/to/config.json -log stderr

Run collectors only once for testing:

cc-metric-collector -config /path/to/config.json -once

3 - Usage

Running and using cc-metric-collector

Command Line Interface

Basic Usage

cc-metric-collector [options]

Command Line Options

| Flag | Type | Default | Description |
|---------|--------|---------------|-------------|
| -config | string | ./config.json | Path to the configuration file |
| -log | string | stderr | Path to the logfile (use stderr for console output) |
| -once | bool | false | Run all collectors only once, then exit |

Examples

Run with default configuration:

cc-metric-collector

Run with custom configuration:

cc-metric-collector -config /etc/cc-metric-collector/config.json

Log to file:

cc-metric-collector -config /etc/cc-metric-collector/config.json \
                    -log /var/log/cc-metric-collector/collector.log

Test configuration (run once):

cc-metric-collector -config /etc/cc-metric-collector/config.json -once

This runs all collectors exactly once and exits. Useful for:

  • Testing configuration
  • Debugging collector issues
  • Validating metric output
  • One-time metric collection

Running as a Service

Systemd

Start service:

sudo systemctl start cc-metric-collector

Stop service:

sudo systemctl stop cc-metric-collector

Restart service:

sudo systemctl restart cc-metric-collector

Check status:

sudo systemctl status cc-metric-collector

View logs:

journalctl -u cc-metric-collector -f

Enable on boot:

sudo systemctl enable cc-metric-collector

SysVinit

Start service:

sudo /etc/init.d/cc-metric-collector start

Stop service:

sudo /etc/init.d/cc-metric-collector stop

Restart service:

sudo /etc/init.d/cc-metric-collector restart

Check status:

sudo /etc/init.d/cc-metric-collector status

Operation Modes

Daemon Mode (Default)

In daemon mode, cc-metric-collector runs continuously with a timer loop that:

  1. Triggers all enabled collectors serially
  2. Collects metrics with a single timestamp per interval
  3. Forwards metrics through the router
  4. Sends processed metrics to all configured sinks
  5. Sleeps until the next interval

Interval timing is controlled by the main.interval configuration parameter.

One-Shot Mode

Activated with the -once flag, this mode:

  1. Initializes all collectors
  2. Runs each collector exactly once
  3. Processes and forwards metrics
  4. Exits

Useful for:

  • Configuration testing
  • Debugging
  • Cron-based metric collection
  • Integration with other monitoring tools

Metric Collection Flow

sequenceDiagram
    participant Timer
    participant Collectors
    participant Router
    participant Sinks
    
    Timer->>Collectors: Trigger (every interval)
    Collectors->>Collectors: Read metrics from system
    Collectors->>Router: Forward metrics
    Router->>Router: Process (tag, filter, aggregate)
    Router->>Sinks: Send processed metrics
    Sinks->>Sinks: Write to backends
    Timer->>Timer: Sleep until next interval

Common Usage Patterns

Basic Monitoring Setup

Collect basic system metrics and send to InfluxDB:

config.json:

{
  "collectors-file": "./collectors.json",
  "sinks-file": "./sinks.json",
  "receivers-file": "./receivers.json",
  "router-file": "./router.json",
  "main": {
    "interval": "10s",
    "duration": "1s"
  }
}

collectors.json:

{
  "cpustat": {},
  "memstat": {},
  "diskstat": {},
  "netstat": {},
  "loadavg": {}
}

sinks.json:

{
  "influx": {
    "type": "influxasync",
    "host": "influx.example.org",
    "port": "8086",
    "organization": "myorg",
    "database": "metrics",
    "password": "mytoken"
  }
}

router.json:

{
  "add_tags": [
    {
      "key": "cluster",
      "value": "production",
      "if": "*"
    }
  ],
  "interval_timestamp": true
}

receivers.json:

{}

HPC Node Monitoring

Extended monitoring for HPC compute nodes:

collectors.json:

{
  "cpustat": {},
  "memstat": {},
  "diskstat": {},
  "netstat": {},
  "loadavg": {},
  "tempstat": {},
  "likwid": {
    "access_mode": "direct",
    "liblikwid_path": "/usr/local/lib/liblikwid.so",
    "eventsets": [
      {
        "events": {
          "cpu": ["FLOPS_DP", "CLOCK"]
        }
      }
    ]
  },
  "nvidia": {},
  "ibstat": {}
}

Hierarchical Collection

Compute nodes send to aggregation node:

Node config - sinks.json:

{
  "nats_aggregator": {
    "type": "nats",
    "host": "aggregator.example.org",
    "port": "4222",
    "subject": "cluster.rack1"
  }
}

Aggregation node config - receivers.json:

{
  "nats_rack1": {
    "type": "nats",
    "address": "localhost",
    "port": "4222",
    "subject": "cluster.rack1"
  },
  "nats_rack2": {
    "type": "nats",
    "address": "localhost",
    "port": "4222",
    "subject": "cluster.rack2"
  }
}

Aggregation node config - sinks.json:

{
  "influx": {
    "type": "influxasync",
    "host": "influx.example.org",
    "port": "8086",
    "organization": "myorg",
    "database": "metrics",
    "password": "mytoken"
  }
}

Multi-Sink Configuration

Send metrics to multiple destinations:

sinks.json:

{
  "primary_influx": {
    "type": "influxasync",
    "host": "influx1.example.org",
    "port": "8086",
    "organization": "myorg",
    "database": "metrics",
    "password": "token1"
  },
  "backup_influx": {
    "type": "influxasync",
    "host": "influx2.example.org",
    "port": "8086",
    "organization": "myorg",
    "database": "metrics",
    "password": "token2"
  },
  "prometheus": {
    "type": "prometheus",
    "host": "0.0.0.0",
    "port": "9091"
  }
}

Monitoring and Debugging

Check Collector Status

Use -once mode to test without running continuously:

cc-metric-collector -config /etc/cc-metric-collector/config.json -once

Debug Output

Log to stderr for immediate feedback:

cc-metric-collector -config /etc/cc-metric-collector/config.json -log stderr

Verify Metrics

Check what metrics are being collected:

  1. Configure stdout sink temporarily
  2. Run in -once mode
  3. Observe metric output

Temporary debug sink:

{
  "debug": {
    "type": "stdout"
  }
}

Common Issues

No metrics appearing:

  • Check collector configuration
  • Verify collectors have required permissions
  • Ensure sinks are reachable
  • Check router isn’t filtering metrics

High CPU usage:

  • Increase main.interval value
  • Disable expensive collectors
  • Check for router performance issues

Memory growth:

  • Reduce num_cache_intervals in router
  • Check for sink write failures
  • Verify metric cardinality isn’t excessive

Performance Tuning

Interval Adjustment

Faster updates (more overhead):

{
  "main": {
    "interval": "5s",
    "duration": "1s"
  }
}

Slower updates (less overhead):

{
  "main": {
    "interval": "60s",
    "duration": "1s"
  }
}

Collector Selection

Only enable collectors you need:

{
  "cpustat": {},
  "memstat": {}
}

Metric Filtering

Use router to exclude unwanted metrics:

{
  "process_messages": {
    "manipulate_messages": [
      {
        "drop_by_name": ["cpu_idle", "cpu_iowait"]
      }
    ]
  }
}

Security Considerations

Running as Non-Root

Most collectors work without root privileges, except:

  • LIKWID (direct mode)
  • IPMI collector
  • Some hardware-specific collectors

Use capabilities instead of root when possible.

Network Security

When using receivers:

  • Use authentication (NATS credentials, HTTP tokens)
  • Restrict listening addresses
  • Use TLS for encrypted transport
  • Firewall receiver ports appropriately

File Permissions

Protect configuration files containing credentials:

sudo chmod 600 /etc/cc-metric-collector/config.json
sudo chown cc-metric-collector:cc-metric-collector /etc/cc-metric-collector/config.json

4 - Metric Router

Routing and processing metrics in cc-metric-collector

Overview

The metric router sits between collectors/receivers and sinks, enabling metric processing such as:

  • Adding and removing tags
  • Filtering and dropping metrics
  • Renaming metrics
  • Aggregating metrics across an interval
  • Normalizing units
  • Setting common timestamps

Basic Configuration

File: router.json

Minimal configuration:

{
  "interval_timestamp": false,
  "num_cache_intervals": 0
}

Typical configuration:

{
  "add_tags": [
    {
      "key": "cluster",
      "value": "mycluster",
      "if": "*"
    }
  ],
  "interval_timestamp": true,
  "num_cache_intervals": 0
}

Configuration Options

Core Settings

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| interval_timestamp | bool | false | Use a common timestamp (interval start) for all metrics in an interval |
| num_cache_intervals | int | 0 | Number of past intervals to cache (0 disables the cache; required for interval aggregates) |
| hostname_tag | string | "hostname" | Tag name for the hostname (added to locally created metrics) |
| max_forward | int | 50 | Maximum metrics to read from a channel at once (must be > 1) |

The interval_timestamp Option

Collectors’ Read() functions are not called simultaneously, so metrics within an interval can have different timestamps.

  • When true: all metrics in an interval get a common timestamp (the interval start time)
  • When false: each metric keeps its original collection timestamp

Use case: Enable this to simplify time-series alignment in your database.
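
As a sketch with invented values, two metrics collected a few milliseconds apart within the same 10s interval:

# interval_timestamp = false: each metric keeps its own timestamp
cpu_user,hostname=node01 value=12.5 1680000000123456789
mem_used,hostname=node01 value=4096 1680000000234567890

# interval_timestamp = true: both carry the interval start
cpu_user,hostname=node01 value=12.5 1680000000000000000
mem_used,hostname=node01 value=4096 1680000000000000000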

The num_cache_intervals Option

Controls metric caching for interval aggregations.

| Value | Behavior |
|-------|----------|
| 0 | Cache disabled (no aggregations possible) |
| 1 | Cache the last interval only (minimal memory, basic aggregations) |
| 2+ | Cache multiple intervals (for complex time-based aggregations) |

Note: Required to be > 0 for interval_aggregates to work.

The hostname_tag Option

By default, the router tags locally created metrics with the hostname.

Default tag name: hostname

Custom tag name:

{
  "hostname_tag": "node"
}
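
With this setting, locally created metrics carry the tag node instead of hostname, e.g. (invented values):

cpu_load,node=node01,type=node value=0.7 1680000000000000000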

The max_forward Option

Performance tuning for metric processing.

How it works: When the router receives a metric, it tries to read up to max_forward additional metrics from the same channel before processing.

Default: 50

Must be: Greater than 1
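
A minimal Go sketch of this batching pattern (illustrative; the channel, metric type, and processing step are invented):

package main

import "fmt"

const maxForward = 50

func main() {
    // A buffered channel stands in for the router's input channel.
    ch := make(chan string, 100)
    for i := 0; i < 5; i++ {
        ch <- fmt.Sprintf("metric_%d", i)
    }

    // Block for the first metric, then drain up to maxForward-1 more
    // without blocking, so bursts are handled in one pass.
    batch := []string{<-ch}
drain:
    for len(batch) < maxForward {
        select {
        case m := <-ch:
            batch = append(batch, m)
        default:
            break drain // channel momentarily empty
        }
    }
    fmt.Println("processing", len(batch), "metrics in one pass")
}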

Metric Processing

Use the process_messages section with the message processor:

{
  "process_messages": {
    "manipulate_messages": [
      {
        "add_base_tags": {
          "cluster": "mycluster",
          "partition": "compute"
        }
      },
      {
        "drop_by_name": ["cpu_idle", "mem_cached"]
      },
      {
        "rename_by": {
          "clock_mhz": "clock"
        }
      }
    ]
  }
}

Legacy Configuration (Deprecated)

The following options are deprecated but still supported for backward compatibility. They are automatically converted to process_messages format.

Adding Tags

Deprecated syntax:

{
  "add_tags": [
    {
      "key": "cluster",
      "value": "mycluster",
      "if": "*"
    },
    {
      "key": "type",
      "value": "socket",
      "if": "name == 'temp_package_id_0'"
    }
  ]
}

Modern equivalent:

{
  "process_messages": {
    "manipulate_messages": [
      {
        "add_base_tags": {
          "cluster": "mycluster"
        }
      },
      {
        "add_tags_by": {
          "type": "socket"
        },
        "if": "name == 'temp_package_id_0'"
      }
    ]
  }
}

Deleting Tags

Deprecated syntax:

{
  "delete_tags": [
    {
      "key": "unit",
      "if": "*"
    }
  ]
}

Never delete these tags: hostname, type, type-id

Dropping Metrics

By name (deprecated):

{
  "drop_metrics": [
    "not_interesting_metric",
    "debug_metric"
  ]
}

By condition (deprecated):

{
  "drop_metrics_if": [
    "match('temp_core_%d+', name)",
    "match('cpu', type) && type-id == 0"
  ]
}

Modern equivalent:

{
  "process_messages": {
    "manipulate_messages": [
      {
        "drop_by_name": ["not_interesting_metric", "debug_metric"]
      },
      {
        "drop_by": "match('temp_core_%d+', name)"
      }
    ]
  }
}

Renaming Metrics

Deprecated syntax:

{
  "rename_metrics": {
    "old_name": "new_name",
    "clock_mhz": "clock"
  }
}

Modern equivalent:

{
  "process_messages": {
    "manipulate_messages": [
      {
        "rename_by": {
          "old_name": "new_name",
          "clock_mhz": "clock"
        }
      }
    ]
  }
}

Use case: Standardize metric names across different systems or collectors.

Normalizing Units

Deprecated syntax:

{
  "normalize_units": true
}

Effect: Normalizes unit names (e.g., byte, Byte, B, bytes → consistent format)

Changing Unit Prefixes

Deprecated syntax:

{
  "change_unit_prefix": {
    "mem_used": "G",
    "mem_total": "G"
  }
}

Use case: Convert memory metrics from kB (as reported by /proc/meminfo) to GB for better readability.

Interval Aggregates (Experimental)

Requires: num_cache_intervals > 0

Derive new metrics by aggregating metrics from the current interval.

Configuration

{
  "num_cache_intervals": 1,
  "interval_aggregates": [
    {
      "name": "temp_cores_avg",
      "if": "match('temp_core_%d+', metric.Name())",
      "function": "avg(values)",
      "tags": {
        "type": "node"
      },
      "meta": {
        "group": "IPMI",
        "unit": "degC",
        "source": "TempCollector"
      }
    }
  ]
}

Parameters

| Field | Type | Description |
|-------|------|-------------|
| name | string | Name of the new derived metric |
| if | string | Condition selecting which metrics to aggregate |
| function | string | Aggregation function (e.g., avg(values), sum(values), max(values)) |
| tags | object | Tags to add to the derived metric |
| meta | object | Metadata for the derived metric (use "<copy>" to copy from the source metrics) |

Available Functions

| Function | Description |
|----------|-------------|
| avg(values) | Average of all matching metrics |
| sum(values) | Sum of all matching metrics |
| min(values) | Minimum value |
| max(values) | Maximum value |
| count(values) | Number of matching metrics |

Complex Example

Calculate mem_used from multiple memory metrics:

{
  "interval_aggregates": [
    {
      "name": "mem_used",
      "if": "source == 'MemstatCollector'",
      "function": "sum(mem_total) - (sum(mem_free) + sum(mem_buffers) + sum(mem_cached))",
      "tags": {
        "type": "node"
      },
      "meta": {
        "group": "<copy>",
        "unit": "<copy>",
        "source": "<copy>"
      }
    }
  ]
}

Dropping Source Metrics

If you only want the aggregated metric, drop the source metrics:

{
  "drop_metrics_if": [
    "match('temp_core_%d+', metric.Name())"
  ],
  "interval_aggregates": [
    {
      "name": "temp_cores_avg",
      "if": "match('temp_core_%d+', metric.Name())",
      "function": "avg(values)",
      "tags": {
        "type": "node"
      },
      "meta": {
        "group": "IPMI",
        "unit": "degC"
      }
    }
  ]
}

Processing Order

The router processes metrics in a specific order:

  1. Add hostname_tag (if sent by collectors or cache)
  2. Change timestamp to interval timestamp (if interval_timestamp == true)
  3. Check if metric should be dropped (drop_metrics, drop_metrics_if)
  4. Add tags (add_tags)
  5. Delete tags (del_tags)
  6. Rename metric (rename_metrics) and store old name in meta as oldname
  7. Add tags again (to support conditions using new name)
  8. Delete tags again (to support conditions using new name)
  9. Normalize units (if normalize_units == true)
  10. Convert unit prefix (change_unit_prefix)
  11. Send to sinks
  12. Move to cache (if num_cache_intervals > 0)

Note: Not every step applies to every metric. Some operations act only on metrics coming from collectors, others also on metrics from receivers. In particular, the hostname tag is only added to locally created metrics (from collectors or the cache); metrics arriving via receivers are expected to carry their own hostname tag.

Complete Example

{
  "interval_timestamp": true,
  "num_cache_intervals": 1,
  "hostname_tag": "hostname",
  "max_forward": 50,
  "process_messages": {
    "manipulate_messages": [
      {
        "add_base_tags": {
          "cluster": "production",
          "datacenter": "dc1"
        }
      },
      {
        "drop_by_name": ["cpu_idle", "cpu_guest", "cpu_guest_nice"]
      },
      {
        "rename_by": {
          "clock_mhz": "clock"
        }
      },
      {
        "add_tags_by": {
          "high_temp": "true"
        },
        "if": "name == 'temp_package_id_0' && value > 70"
      }
    ]
  },
  "interval_aggregates": [
    {
      "name": "temp_avg",
      "if": "match('temp_core_%d+', name)",
      "function": "avg(values)",
      "tags": {
        "type": "node"
      },
      "meta": {
        "group": "Temperature",
        "unit": "degC",
        "source": "TempCollector"
      }
    }
  ]
}

Performance Considerations

  • Caching: Only enable if you need interval aggregates (memory overhead)
  • Complex conditions: Evaluated for every metric (CPU overhead)
  • Aggregations: Evaluated at the start of each interval (CPU overhead)
  • max_forward: Higher values can improve throughput but increase latency

5 - Collectors

Available metric collectors for cc-metric-collector

Overview

Collectors read data from various sources on the local system, parse it into metrics, and submit these metrics to the router. Each collector is a modular plugin that can be enabled or disabled independently.

Configuration Format

File: collectors.json

The collectors configuration is a set of objects (not a list), where each key is the collector type:

{
  "collector_type": {
    "collector_specific_option": "value"
  }
}

Common Configuration Options

Most collectors support these common options:

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| exclude_metrics | []string | [] | List of metric names to exclude from forwarding to sinks |
| send_meta | bool | varies | Send metadata information along with metrics |

Example:

{
  "cpustat": {
    "exclude_metrics": ["cpu_idle", "cpu_guest"]
  },
  "memstat": {}
}

Available Collectors

System Metrics

| Collector | Description | Source |
|-----------|-------------|--------|
| cpustat | CPU usage statistics | /proc/stat |
| memstat | Memory usage statistics | /proc/meminfo |
| loadavg | System load average | /proc/loadavg |
| netstat | Network interface statistics | /proc/net/dev |
| diskstat | Disk I/O statistics | /sys/block/*/stat |
| iostat | Block device I/O statistics | /proc/diskstats |

Hardware Monitoring

| Collector | Description | Requirements |
|-----------|-------------|--------------|
| tempstat | Temperature sensors | /sys/class/hwmon |
| cpufreq | CPU frequency | /sys/devices/system |
| cpufreq_cpuinfo | CPU frequency from cpuinfo | /proc/cpuinfo |
| ipmistat | IPMI sensor data | ipmitool command |

Performance Monitoring

| Collector | Description | Requirements |
|-----------|-------------|--------------|
| likwid | Hardware performance counters via LIKWID | liblikwid.so |
| rapl | CPU energy consumption (RAPL) | /sys/class/powercap |
| schedstat | CPU scheduler statistics | /proc/schedstat |
| numastats | NUMA node statistics | /sys/devices/system/node |

GPU Monitoring

| Collector | Description | Requirements |
|-----------|-------------|--------------|
| nvidia | NVIDIA GPU metrics | libnvidia-ml.so (NVML) |
| rocm_smi | AMD ROCm GPU metrics | librocm_smi64.so |

Network & Storage

| Collector | Description | Requirements |
|-----------|-------------|--------------|
| ibstat | InfiniBand statistics | /sys/class/infiniband |
| lustrestat | Lustre filesystem statistics | Lustre client |
| gpfs | GPFS filesystem statistics | GPFS utilities |
| beegfs_meta | BeeGFS metadata statistics | BeeGFS metadata client |
| beegfs_storage | BeeGFS storage statistics | BeeGFS storage client |
| nfs3stat | NFS v3 statistics | /proc/net/rpc/nfs |
| nfs4stat | NFS v4 statistics | /proc/net/rpc/nfs |
| nfsiostat | NFS I/O statistics | nfsiostat command |

Process & Job Monitoring

| Collector | Description | Requirements |
|-----------|-------------|--------------|
| topprocs | Top processes by resource usage | /proc filesystem |
| slurm_cgroup | Slurm cgroup statistics | Slurm cgroups |
| self | Collector’s own resource usage | /proc/self |

Custom Collectors

| Collector | Description | Requirements |
|-----------|-------------|--------------|
| customcmd | Execute custom commands to collect metrics | Any command/script |

Collector Lifecycle

Each collector implements these functions:

  • Init(config): Initializes the collector with configuration
  • Initialized(): Returns whether initialization was successful
  • Read(duration, output): Reads metrics and sends to output channel
  • Close(): Cleanup and shutdown
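
Expressed as a Go interface, the lifecycle could look like the sketch below. The signatures follow the Collector Development section later on this page; the package name and import path are assumptions for illustration:

package collectors

import (
    "encoding/json"
    "time"

    // import path assumed for illustration; check the repository for the real one
    lp "github.com/ClusterCockpit/cc-metric-collector/pkg/ccMetric"
)

// MetricCollector sketches the interface implied by the lifecycle above.
type MetricCollector interface {
    Init(config json.RawMessage) error
    Initialized() bool
    Read(interval time.Duration, output chan lp.CCMetric)
    Close()
}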

Example Configurations

Minimal System Monitoring

{
  "cpustat": {},
  "memstat": {},
  "loadavg": {}
}

HPC Node Monitoring

{
  "cpustat": {},
  "memstat": {},
  "diskstat": {},
  "netstat": {},
  "loadavg": {},
  "tempstat": {},
  "likwid": {
    "access_mode": "direct",
    "liblikwid_path": "/usr/local/lib/liblikwid.so",
    "eventsets": [
      {
        "events": {
          "cpu": ["FLOPS_DP", "CLOCK"]
        }
      }
    ]
  },
  "nvidia": {},
  "ibstat": {}
}

Filesystem-Heavy Workload

{
  "cpustat": {},
  "memstat": {},
  "diskstat": {},
  "lustrestat": {},
  "nfs4stat": {},
  "iostat": {}
}

Minimal Overhead

{
  "cpustat": {
    "exclude_metrics": ["cpu_guest", "cpu_guest_nice", "cpu_steal"]
  },
  "memstat": {
    "exclude_metrics": ["mem_slab", "mem_sreclaimable"]
  }
}

Collector Development

Creating a Custom Collector

Collectors implement the MetricCollector interface. See collectors README for details.

Basic structure:

type SampleCollector struct {
    metricCollector              // embeds the common collector base
    config SampleCollectorConfig // collector-specific settings parsed in Init()
}

// Init parses the JSON configuration and prepares data sources.
func (m *SampleCollector) Init(config json.RawMessage) error
// Read collects the metrics and sends them to the output channel.
func (m *SampleCollector) Read(interval time.Duration, output chan lp.CCMetric)
// Close cleans up when the collector shuts down.
func (m *SampleCollector) Close()

Registration

Add your collector to collectorManager.go:

var AvailableCollectors = map[string]MetricCollector{
    "sample": &SampleCollector{},
}

Metric Format

All collectors submit metrics in InfluxDB line protocol format via the CCMetric type.

Metric components:

  • Name: Metric identifier (e.g., cpu_used)
  • Tags: Index-like key-value pairs (e.g., type=node, hostname=node01)
  • Fields: Data values (typically just value)
  • Metadata: Source, group, unit information
  • Timestamp: When the metric was collected
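
Putting these components together, a sketch of one metric serialized to line protocol (values are invented); metadata is held separately and only sent along when send_meta is enabled:

cpu_used,type=hwthread,type-id=0,hostname=node01 value=42.5 1680000000000000000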

Performance Considerations

  • Collector overhead: Each enabled collector adds CPU overhead
  • I/O impact: Some collectors read many files (e.g., per-core statistics)
  • Library overhead: GPU and hardware performance collectors can be expensive
  • Selective metrics: Use exclude_metrics to reduce unnecessary data
