cc-metric-store

ClusterCockpit Metric Store References

Reference information regarding the ClusterCockpit component “cc-metric-store” (GitHub Repo).

Query Requests

The metric store provides a flexible API for querying time-series metric data with support for hierarchical selectors, aggregation, and scope transformation.

APIQueryRequest

The main request structure for batch metric queries.

type APIQueryRequest struct {
    Cluster     string     `json:"cluster"`
    Queries     []APIQuery `json:"queries"`
    ForAllNodes []string   `json:"for-all-nodes"`
    From        int64      `json:"from"`
    To          int64      `json:"to"`
    WithStats   bool       `json:"with-stats"`
    WithData    bool       `json:"with-data"`
    WithPadding bool       `json:"with-padding"`
}

Fields:

  • Cluster (string): The cluster name to query
  • Queries ([]APIQuery): List of individual metric queries (see below)
  • ForAllNodes ([]string): Alternative to explicit queries - automatically generates queries for all specified metrics across all nodes in the cluster
  • From (int64): Start timestamp (Unix epoch seconds)
  • To (int64): End timestamp (Unix epoch seconds)
  • WithStats (bool): Include computed statistics (avg, min, max) in response
  • WithData (bool): Include raw time-series data in response
  • WithPadding (bool): Pad data arrays with NaN values to align with requested time range

Query Modes:

  1. Explicit Queries: Specify individual queries via the Queries field for fine-grained control
  2. Batch Mode: Use ForAllNodes to automatically query all specified metrics for all nodes in the cluster

Validation:

  • From must be less than To (returns ErrInvalidTimeRange otherwise)
  • Cluster is required when using ForAllNodes (returns ErrEmptyCluster otherwise)

APIQuery

Represents a single metric query with optional hierarchical selectors.

type APIQuery struct {
    Type        *string      `json:"type,omitempty"`
    SubType     *string      `json:"subtype,omitempty"`
    Metric      string       `json:"metric"`
    Hostname    string       `json:"host"`
    Resolution  int64        `json:"resolution"`
    TypeIds     []string     `json:"type-ids,omitempty"`
    SubTypeIds  []string     `json:"subtype-ids,omitempty"`
    ScaleFactor schema.Float `json:"scale-by,omitempty"`
    Aggregate   bool         `json:"aggreg"`
}

Fields:

  • Metric (string, required): The metric name to query (e.g., “cpu_load”, “mem_used”)
  • Hostname (string, required): The node hostname to query
  • Type (*string, optional): First level of hierarchy (e.g., “hwthread”, “core”, “socket”, “accelerator”, “memorydomain”)
  • TypeIds ([]string, optional): IDs for the Type level (e.g., [“0”, “1”, “2”] for cores 0-2)
  • SubType (*string, optional): Second level of hierarchy (for nested selectors)
  • SubTypeIds ([]string, optional): IDs for the SubType level
  • Resolution (int64): Data resolution in seconds (0 = native resolution)
  • ScaleFactor (float, optional): Multiply all data points by this factor (for unit conversion)
  • Aggregate (bool): If true, aggregate data from multiple TypeIds/SubTypeIds; if false, return separate results for each

Hierarchical Selection:

The query system supports hierarchical data selection:

Cluster → Hostname → Type+TypeIds → SubType+SubTypeIds

Examples:

// Query node-level CPU load
{
  "metric": "cpu_load",
  "host": "node001",
  "resolution": 60
}

// Query per-core CPU load (non-aggregated)
{
  "metric": "cpu_load",
  "host": "node001",
  "type": "core",
  "type-ids": ["0", "1", "2", "3"],
  "aggreg": false,
  "resolution": 60
}

// Query aggregated socket memory bandwidth
{
  "metric": "mem_bw",
  "host": "node001",
  "type": "socket",
  "type-ids": ["0", "1"],
  "aggreg": true,
  "resolution": 60
}

// Query GPU metrics
{
  "metric": "gpu_power",
  "host": "node001",
  "type": "accelerator",
  "type-ids": ["0", "1", "2", "3"],
  "aggreg": false,
  "resolution": 60
}

APIQueryResponse

The response structure containing query results.

type APIQueryResponse struct {
    Queries []APIQuery        `json:"queries,omitempty"`
    Results [][]APIMetricData `json:"results"`
}

Fields:

  • Queries ([]APIQuery, optional): Echo of the queries executed (populated when using ForAllNodes)
  • Results ([][]APIMetricData): 2D array of results where:
    • Outer array: One element per query
    • Inner array: One element per selector (e.g., multiple cores/sockets when Aggregate=false)

APIMetricData

Represents the response data for a single metric query.

type APIMetricData struct {
    Error      *string           `json:"error,omitempty"`
    Data       schema.FloatArray `json:"data,omitempty"`
    From       int64             `json:"from"`
    To         int64             `json:"to"`
    Resolution int64             `json:"resolution"`
    Avg        schema.Float      `json:"avg"`
    Min        schema.Float      `json:"min"`
    Max        schema.Float      `json:"max"`
}

Fields:

  • Data ([]float): Time-series data points (omitted if WithData=false)
  • From (int64): Actual start timestamp of returned data
  • To (int64): Actual end timestamp of returned data
  • Resolution (int64): Actual resolution of returned data in seconds
  • Avg (float): Average value (only if WithStats=true)
  • Min (float): Minimum value (only if WithStats=true)
  • Max (float): Maximum value (only if WithStats=true)
  • Error (*string, optional): Error message if query failed

Notes:

  • NaN values in data are ignored during statistics computation
  • If all values are NaN, statistics will be NaN
  • Missing hosts or metrics result in empty results (not errors) for graceful frontend handling

Metric Scopes

Metrics are collected at different granularities (native scope):

  • HWThread: Per hardware thread
  • Core: Per CPU core
  • Socket: Per CPU socket
  • MemoryDomain: Per memory domain (NUMA)
  • Accelerator: Per GPU/accelerator
  • Node: Per compute node

Scope Transformation

The query system automatically transforms between native metric scope and requested scope:

  • Aggregation (native scope is the same as or finer than the requested scope): Finer-grained data is aggregated to coarser granularity
    • Example: HWThread → Core → Socket → Node
  • Rejection (native scope is coarser than the requested scope): Granularity cannot be increased, so an error is returned
  • Special Cases: Accelerator metrics are independent of CPU hierarchy

Transformation Rules:

Native Scope   Requested Scope   Result
HWThread       HWThread          Direct query
HWThread       Core              Aggregate HWThreads per core
HWThread       Socket            Aggregate HWThreads per socket
HWThread       Node              Aggregate all HWThreads
Core           Core              Direct query
Core           Socket            Aggregate cores per socket
Core           Node              Aggregate all cores
Socket         Socket            Direct query
Socket         Node              Aggregate all sockets
Node           Node              Direct query
Accelerator    Accelerator       Direct query
Accelerator    Node              Aggregate all accelerators
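
One way to reason about these rules is as a lookup keyed by the pair of native and requested scope. The Go sketch below encodes the rows of the table that way. The Scope type, the rules map and the transform function are hypothetical names, and treating every pair that is not listed as a rejection is an assumption.

```go
package main

import "fmt"

type Scope string

const (
	HWThread    Scope = "hwthread"
	Core        Scope = "core"
	Socket      Scope = "socket"
	Node        Scope = "node"
	Accelerator Scope = "accelerator"
)

type pair struct{ native, requested Scope }

// rules holds one entry per row of the transformation table.
var rules = map[pair]string{
	{HWThread, HWThread}:       "direct query",
	{HWThread, Core}:           "aggregate HWThreads per core",
	{HWThread, Socket}:         "aggregate HWThreads per socket",
	{HWThread, Node}:           "aggregate all HWThreads",
	{Core, Core}:               "direct query",
	{Core, Socket}:             "aggregate cores per socket",
	{Core, Node}:               "aggregate all cores",
	{Socket, Socket}:           "direct query",
	{Socket, Node}:             "aggregate all sockets",
	{Node, Node}:               "direct query",
	{Accelerator, Accelerator}: "direct query",
	{Accelerator, Node}:        "aggregate all accelerators",
}

// transform returns the table entry for the pair, or an error for any
// combination that is not listed (assumed to be rejected).
func transform(native, requested Scope) (string, error) {
	if action, ok := rules[pair{native, requested}]; ok {
		return action, nil
	}
	return "", fmt.Errorf("cannot transform %s data to %s scope", native, requested)
}

func main() {
	fmt.Println(transform(HWThread, Socket))
	fmt.Println(transform(Socket, Core))
	fmt.Println(transform(Accelerator, Socket))
}
```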

Error Handling

The API uses a hybrid error model:

  1. Request-level errors: Returned as HTTP errors

    • ErrInvalidTimeRange: From ≥ To
    • ErrEmptyCluster: Missing cluster name with ForAllNodes
    • Uninitialized metric store
  2. Query-level errors: Stored in APIMetricData.Error field

    • Individual query failures don’t fail the entire request
    • Missing hosts/metrics are logged as warnings but return empty results
  3. Partial errors: When some queries succeed and others fail

    • Successful data is returned
    • Error messages are collected and returned as a combined error

Complete Example

{
  "cluster": "fritz",
  "from": 1609459200,
  "to": 1609462800,
  "with-stats": true,
  "with-data": true,
  "queries": [
    {
      "metric": "cpu_load",
      "host": "node001",
      "resolution": 60
    },
    {
      "metric": "mem_used",
      "host": "node001",
      "type": "socket",
      "type-ids": ["0", "1"],
      "aggreg": false,
      "resolution": 60
    }
  ]
}

Response:

{
  "results": [
    [
      {
        "data": [0.5, 0.6, 0.7, ...],
        "from": 1609459200,
        "to": 1609462800,
        "resolution": 60,
        "avg": 0.6,
        "min": 0.5,
        "max": 0.7
      }
    ],
    [
      {
        "data": [1024.0, 1536.0, 2048.0, ...],
        "from": 1609459200,
        "to": 1609462800,
        "resolution": 60,
        "avg": 1536.0,
        "min": 1024.0,
        "max": 2048.0
      },
      {
        "data": [2048.0, 2560.0, 3072.0, ...],
        "from": 1609459200,
        "to": 1609462800,
        "resolution": 60,
        "avg": 2560.0,
        "min": 2048.0,
        "max": 3072.0
      }
    ]
  ]
}
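
The request from this example could be assembled in Go roughly as shown below. This is a sketch under assumptions: the struct types and the newQueryRequest helper are hypothetical, the base URL http://localhost:8080 and the /api/query/ path are taken from the REST API page, and the request is only built, not sent.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Query and QueryRequest carry only the fields used in the example.
type Query struct {
	Metric     string   `json:"metric"`
	Hostname   string   `json:"host"`
	Type       *string  `json:"type,omitempty"`
	TypeIds    []string `json:"type-ids,omitempty"`
	Aggregate  bool     `json:"aggreg"`
	Resolution int64    `json:"resolution"`
}

type QueryRequest struct {
	Cluster   string  `json:"cluster"`
	From      int64   `json:"from"`
	To        int64   `json:"to"`
	WithStats bool    `json:"with-stats"`
	WithData  bool    `json:"with-data"`
	Queries   []Query `json:"queries"`
}

// newQueryRequest encodes the body and prepares a POST to /api/query/.
// The bearer token is optional and only set when non-empty.
func newQueryRequest(baseURL, token string, body QueryRequest) (*http.Request, error) {
	payload, err := json.Marshal(body)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/api/query/", bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	if token != "" {
		req.Header.Set("Authorization", "Bearer "+token)
	}
	return req, nil
}

func main() {
	socket := "socket"
	body := QueryRequest{
		Cluster: "fritz", From: 1609459200, To: 1609462800,
		WithStats: true, WithData: true,
		Queries: []Query{
			{Metric: "cpu_load", Hostname: "node001", Resolution: 60},
			{Metric: "mem_used", Hostname: "node001", Type: &socket,
				TypeIds: []string{"0", "1"}, Resolution: 60},
		},
	}
	req, err := newQueryRequest("http://localhost:8080", "", body)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL)
}
```

Passing the prepared request to http.DefaultClient.Do would send it to a running instance.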

1 - Command Line

ClusterCockpit Metric Store Command Line Options

This page describes the command line options for the cc-metric-store executable.


  -config <path>

Function: Specifies an alternative path to the application configuration file.

Default: ./config.json

Example: -config ./configfiles/configuration.json


  -dev

Function: Enables the Swagger UI REST API documentation and playground at /swagger/.


  -gops

Function: Starts the gops agent (github.com/google/gops/agent) so that the running server can be inspected for debugging.


  -loglevel <level>

Function: Sets the logging level.

Options: debug, info, warn (default), err, crit

Example: -loglevel debug


  -logdate

Function: Adds date and time to log messages.


  -version

Function: Shows version information and exits.


Running

./cc-metric-store                              # Uses ./config.json
./cc-metric-store -config /path/to/config.json # Custom config path
./cc-metric-store -dev                         # Enable Swagger UI at /swagger/
./cc-metric-store -loglevel debug              # Verbose logging

Example Configuration

See Configuration Reference for detailed descriptions of all options.

{
  "main": {
    "addr": "localhost:8080",
    "jwt-public-key": "kzfYrYy+TzpanWZHJ5qSdMj5uKUWgq74BWhQG6copP0="
  },
  "metrics": {
    "clock": {
      "frequency": 60,
      "aggregation": "avg"
    },
    "cpu_idle": {
      "frequency": 60,
      "aggregation": "avg"
    },
    "cpu_iowait": {
      "frequency": 60,
      "aggregation": "avg"
    },
    "cpu_irq": {
      "frequency": 60,
      "aggregation": "avg"
    },
    "cpu_system": {
      "frequency": 60,
      "aggregation": "avg"
    },
    "cpu_user": {
      "frequency": 60,
      "aggregation": "avg"
    },
    "acc_utilization": {
      "frequency": 60,
      "aggregation": "avg"
    },
    "acc_mem_used": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "acc_power": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "flops_any": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "flops_dp": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "flops_sp": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "ib_recv": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "ib_xmit": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "cpu_power": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "mem_power": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "ipc": {
      "frequency": 60,
      "aggregation": "avg"
    },
    "cpu_load": {
      "frequency": 60,
      "aggregation": null
    },
    "mem_bw": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "mem_used": {
      "frequency": 60,
      "aggregation": null
    }
  },
  "metric-store": {
    "checkpoints": {
      "interval": "12h",
      "directory": "./var/checkpoints"
    },
    "memory-cap": 100,
    "retention-in-memory": "48h",
    "cleanup": {
      "mode": "archive",
      "interval": "48h",
      "directory": "./var/archive"
    }
  }
}

2 - Configuration

ClusterCockpit Metric Store Configuration Option References

Configuration options are located in a JSON file. The default path is config.json in the current working directory. An alternative path to the configuration file can be specified using the command line switch -config <filename>.

All durations are specified as strings with a unit suffix (allowed suffixes: s, m, h, …).

The configuration is organized into four main sections: main, metrics, nats, and metric-store.

Main Section

  • main: Server configuration (required)
    • addr: Address to bind to, for example localhost:8080 or 0.0.0.0:443 (required)
    • https-cert-file: Path to the SSL certificate file. HTTPS is used only if https-key-file is also set (optional)
    • https-key-file: Path to the SSL key file. HTTPS is used only if https-cert-file is also set (optional)
    • user: Drop root permissions to this user once the port has been bound. Only applicable when using a privileged port (optional)
    • group: Drop root permissions to this group once the port has been bound. Only applicable when using a privileged port (optional)
    • backend-url: URL of cc-backend for querying job information, e.g., https://localhost:8080 (optional)
    • jwt-public-key: Base64 encoded Ed25519 public key, use this to verify requests to the HTTP API (required)
    • debug: Debug options (optional)
      • dump-to-file: Path to file for dumping internal state (optional)
      • gops: Enable gops agent for debugging (optional)

Metrics Section

  • metrics: Map of metric-name to objects with the following properties (required)
    • frequency: Timestep/Interval/Resolution of this metric in seconds (required)
    • aggregation: Can be "sum", "avg" or null (required)
      • null means aggregation across topology levels is disabled for this metric (use for node-scope-only metrics)
      • "sum" means that values from the child levels are summed up for the parent level
      • "avg" means that values from the child levels are averaged for the parent level

NATS Section

  • nats: NATS server connection configuration (optional)
    • address: URL of NATS.io server, example: nats://localhost:4222 (required if nats section present)
    • username: NATS username for authentication (optional)
    • password: NATS password for authentication (optional)

Metric-Store Section

  • metric-store: Storage engine configuration (required)
    • checkpoints: Checkpoint configuration (required)
      • interval: Create checkpoints every X seconds/minutes/hours (required)
      • directory: Path to checkpoint directory (required)
    • retention-in-memory: Keep all values in memory for at least that amount of time. Should be long enough to cover common job durations (required)
    • memory-cap: Maximum percentage of system memory to use (optional)
    • cleanup: Cleanup/archiving configuration (required)
      • mode: Either "archive" (move and compress old checkpoints) or "delete" (remove old checkpoints) (required)
      • interval: Perform cleanup every X seconds/minutes/hours (required)
      • directory: Path to archive directory (required if mode is "archive")
    • nats-subscriptions: Array of NATS subscription configurations (optional, requires nats section)
      • subscribe-to: NATS subject to subscribe to (required)
      • cluster-tag: Default cluster tag for incoming metrics (required)

3 - Metric Store REST API

ClusterCockpit Metric Store RESTful API Endpoint description

Authentication

JWT tokens

cc-metric-store supports only JWT tokens using the EdDSA/Ed25519 signing method. The token is provided using the Authorization Bearer header.

Example script to test the endpoint:

# Only use the JWT token if JWT authentication has been set up
JWT="eyJ0eXAiOiJKV1QiLCJhbGciOiJFZERTQSJ9.eyJ1c2VyIjoiYWRtaW4iLCJyb2xlcyI6WyJST0xFX0FETUlOIiwiUk9MRV9BTkFMWVNUIiwiUk9MRV9VU0VSIl19.d-3_3FZTsadPjDEdsWrrQ7nS0edMAR4zjl-eK7rJU3HziNBfI9PDHDIpJVHTNN5E5SlLGLFXctWyKAkwhXL-Dw"

curl -X 'GET' 'http://localhost:8080/api/query/' -H "Authorization: Bearer $JWT" \
  -d '{ "cluster": "alex", "from": 1720879275, "to": 1720964715, "queries": [{"metric": "cpu_load","host": "a0124"}] }'

NATS

As an alternative to the REST API, cc-metric-store can receive metrics via NATS messaging. See the NATS configuration for setup details.

Usage of Swagger UI

The Swagger UI is available as part of cc-metric-store if you start it with the -dev option:

./cc-metric-store -dev

You may access it at http://localhost:8080/swagger/ (adjust port to match your main.addr configuration).

API Endpoints

The following REST endpoints are available:

Endpoint            Method     Description
/api/query/         GET/POST   Query metrics with selectors
/api/write/         POST       Write metrics (InfluxDB line protocol)
/api/free/          POST       Free buffers up to timestamp
/api/debug/         GET        Dump internal state (debugging)
/api/healthcheck/   GET        Node health status

Payload format for write endpoint

The data comes in InfluxDB line protocol format.

<metric>,cluster=<cluster>,hostname=<hostname>,type=<node/hwthread/etc> value=<value> <epoch_time_in_ns_or_s>

Real example:

proc_run,cluster=fritz,hostname=f2163,type=node value=4i 1725620476214474893

A more detailed description of the ClusterCockpit-flavored InfluxDB line protocol and its types can be found in the CC specification.

Example script to test endpoint:

# Only use the JWT token if JWT authentication has been set up
JWT="eyJ0eXAiOiJKV1QiLCJhbGciOiJFZERTQSJ9.eyJ1c2VyIjoiYWRtaW4iLCJyb2xlcyI6WyJST0xFX0FETUlOIiwiUk9MRV9BTkFMWVNUIiwiUk9MRV9VU0VSIl19.d-3_3FZTsadPjDEdsWrrQ7nS0edMAR4zjl-eK7rJU3HziNBfI9PDHDIpJVHTNN5E5SlLGLFXctWyKAkwhXL-Dw"

curl -X 'POST' 'http://localhost:8080/api/write/' -H "Authorization: Bearer $JWT" \
  -d "proc_run,cluster=fritz,hostname=f2163,type=node value=4i 1725620476214474893"

Testing with the Metric Generator

For comprehensive testing of the write endpoint, a Metric Generator Script is available. This script simulates high-frequency metric data and supports both REST and NATS transport modes, as well as internal (integrated into cc-backend) and external (standalone) cc-metric-store deployments.

Swagger API Reference