## Overview
Collectors read data from various sources on the local system, parse it into metrics, and submit these metrics to the router. Each collector is a modular plugin that can be enabled or disabled independently.
## Configuration Format

File: `collectors.json`

The collectors configuration is a JSON object (not a list) in which each key is a collector type:
```json
{
  "collector_type": {
    "collector_specific_option": "value"
  }
}
```
## Common Configuration Options

Most collectors support these common options:

| Option | Type | Default | Description |
|---|---|---|---|
| `exclude_metrics` | `[]string` | `[]` | List of metric names to exclude from forwarding to sinks |
| `send_meta` | `bool` | varies | Send metadata information along with metrics |
Example:

```json
{
  "cpustat": {
    "exclude_metrics": ["cpu_idle", "cpu_guest"]
  },
  "memstat": {}
}
```
## Available Collectors

### System Metrics

| Collector | Description | Source |
|---|---|---|
| `cpustat` | CPU usage statistics | `/proc/stat` |
| `memstat` | Memory usage statistics | `/proc/meminfo` |
| `loadavg` | System load average | `/proc/loadavg` |
| `netstat` | Network interface statistics | `/proc/net/dev` |
| `diskstat` | Disk I/O statistics | `/sys/block/*/stat` |
| `iostat` | Block device I/O statistics | `/proc/diskstats` |
### Hardware Monitoring

| Collector | Description | Requirements |
|---|---|---|
| `tempstat` | Temperature sensors | `/sys/class/hwmon` |
| `cpufreq` | CPU frequency | `/sys/devices/system` |
| `cpufreq_cpuinfo` | CPU frequency from cpuinfo | `/proc/cpuinfo` |
| `ipmistat` | IPMI sensor data | `ipmitool` command |
### Performance Monitoring

| Collector | Description | Requirements |
|---|---|---|
| `likwid` | Hardware performance counters via LIKWID | `liblikwid.so` |
| `rapl` | CPU energy consumption (RAPL) | `/sys/class/powercap` |
| `schedstat` | CPU scheduler statistics | `/proc/schedstat` |
| `numastats` | NUMA node statistics | `/sys/devices/system/node` |
### GPU Monitoring

| Collector | Description | Requirements |
|---|---|---|
| `nvidia` | NVIDIA GPU metrics | `libnvidia-ml.so` (NVML) |
| `rocm_smi` | AMD ROCm GPU metrics | `librocm_smi64.so` |
### Network & Storage

| Collector | Description | Requirements |
|---|---|---|
| `ibstat` | InfiniBand statistics | `/sys/class/infiniband` |
| `lustrestat` | Lustre filesystem statistics | Lustre client |
| `gpfs` | GPFS filesystem statistics | GPFS utilities |
| `beegfs_meta` | BeeGFS metadata statistics | BeeGFS metadata client |
| `beegfs_storage` | BeeGFS storage statistics | BeeGFS storage client |
| `nfs3stat` | NFS v3 statistics | `/proc/net/rpc/nfs` |
| `nfs4stat` | NFS v4 statistics | `/proc/net/rpc/nfs` |
| `nfsiostat` | NFS I/O statistics | `nfsiostat` command |
### Process & Job Monitoring

| Collector | Description | Requirements |
|---|---|---|
| `topprocs` | Top processes by resource usage | `/proc` filesystem |
| `slurm_cgroup` | Slurm cgroup statistics | Slurm cgroups |
| `self` | Collector's own resource usage | `/proc/self` |
### Custom Collectors

| Collector | Description | Requirements |
|---|---|---|
| `customcmd` | Execute custom commands to collect metrics | Any command/script |
## Collector Lifecycle

Each collector implements these functions:

- `Init(config)`: Initializes the collector with its configuration
- `Initialized()`: Returns whether initialization was successful
- `Read(duration, output)`: Reads metrics and sends them to the output channel
- `Close()`: Cleans up and shuts down the collector
## Example Configurations

### Minimal System Monitoring

```json
{
  "cpustat": {},
  "memstat": {},
  "loadavg": {}
}
```
### HPC Node Monitoring

```json
{
  "cpustat": {},
  "memstat": {},
  "diskstat": {},
  "netstat": {},
  "loadavg": {},
  "tempstat": {},
  "likwid": {
    "access_mode": "direct",
    "liblikwid_path": "/usr/local/lib/liblikwid.so",
    "eventsets": [
      {
        "events": {
          "cpu": ["FLOPS_DP", "CLOCK"]
        }
      }
    ]
  },
  "nvidia": {},
  "ibstat": {}
}
```
### Filesystem-Heavy Workload

```json
{
  "cpustat": {},
  "memstat": {},
  "diskstat": {},
  "lustrestat": {},
  "nfs4stat": {},
  "iostat": {}
}
```
### Minimal Overhead

```json
{
  "cpustat": {
    "exclude_metrics": ["cpu_guest", "cpu_guest_nice", "cpu_steal"]
  },
  "memstat": {
    "exclude_metrics": ["mem_slab", "mem_sreclaimable"]
  }
}
```
## Collector Development

### Creating a Custom Collector

Collectors implement the `MetricCollector` interface. See the collectors README for details.
Basic structure:

```go
type SampleCollector struct {
    metricCollector
    config SampleCollectorConfig
}

func (m *SampleCollector) Init(config json.RawMessage) error
func (m *SampleCollector) Read(interval time.Duration, output chan lp.CCMetric)
func (m *SampleCollector) Close()
```
### Registration

Add your collector to `collectorManager.go`:

```go
var AvailableCollectors = map[string]MetricCollector{
    "sample": &SampleCollector{},
}
```
## Metric Format

All collectors submit metrics in InfluxDB line protocol format via the `CCMetric` type.

Metric components:

- Name: Metric identifier (e.g., `cpu_used`)
- Tags: Index-like key-value pairs (e.g., `type=node,hostname=node01`)
- Fields: Data values (typically just `value`)
- Metadata: Source, group, and unit information
- Timestamp: When the metric was collected
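Put together, a single node-level metric rendered in line protocol looks like the following (the metric name and tag values here are illustrative, and the timestamp is in nanoseconds):

```
cpu_used,type=node,hostname=node01 value=42.5 1691496000000000000
```

The measurement name comes first, followed by the comma-separated tag set, the field set, and the timestamp; metadata travels alongside the metric rather than in the line itself.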
## Performance Considerations

- Collector overhead: Each enabled collector adds CPU overhead
- I/O impact: Some collectors read many files (e.g., per-core statistics)
- Library overhead: GPU and hardware performance collectors can be expensive
- Selective metrics: Use `exclude_metrics` to reduce unnecessary data