cc-slurm-adapter

ClusterCockpit Slurm Adapter References

Reference information regarding the ClusterCockpit component “cc-slurm-adapter” (GitHub Repo).

Overview

cc-slurm-adapter is a software daemon that feeds cc-backend with job information from Slurm in realtime.

Key Features

  • Fault Tolerant: Handles cc-backend or Slurm downtime gracefully without losing jobs
  • Automatic Recovery: Submits jobs to cc-backend as soon as services are available again
  • Realtime Updates: Supports immediate job notification via Slurm Prolog/Epilog hooks
  • NATS Integration: Optional job notification messaging via NATS
  • Minimal Dependencies: Uses Slurm commands (sacct, squeue, sacctmgr, scontrol) - no slurmrestd required

Architecture

The daemon runs on the same node as slurmctld and operates in two modes:

  1. Daemon Mode: Periodic synchronization (default: every 60 seconds) between Slurm and cc-backend
  2. Prolog/Epilog Mode: Immediate trigger on job start/stop events (optional, reduces latency)

Data is submitted to cc-backend via REST API. Note: Slurm’s slurmdbd is mandatory.

Limitations

Resource Information Availability

Because slurmdbd does not store all job information, some details may be unavailable in certain cases:

  • Resource allocation information is obtained via scontrol --cluster <cluster> show job <jobid> --json
  • This information becomes unavailable a few minutes after job completion
  • If the daemon is stopped for too long, jobs may lack resource information
  • Critical Impact: Without resource information, cc-backend cannot associate jobs with metrics (CPU, GPU, memory)
  • Jobs will still be listed in cc-backend but metric visualization will not work
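
To check whether this allocation data is still retrievable for a given job, you can query scontrol directly. The job ID 12345 is a placeholder and the exact JSON field names may differ between Slurm versions; jq is only used here for readability:

scontrol show job 12345 --json | jq '.jobs[0].job_resources'

If the job finished more than a few minutes ago, scontrol typically reports that the job is no longer available, which is exactly the situation described above.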

Slurm Version Compatibility

Supported Versions

These Slurm versions are known to work:

  • 24.xx.x
  • 25.xx.x

Compatibility Notes

All Slurm-related code is concentrated in slurm.go for easier maintenance. The most common compatibility issue is nil pointer dereference due to missing JSON fields.

Debugging Incompatibilities

If you encounter nil pointer dereferences:

  1. Get a job ID via squeue or sacct

  2. Check JSON layouts from both commands (they differ):

    sacct -j 12345 --json
    scontrol show job 12345 --json
    

SlurmInt and SlurmString Types

Slurm has been transitioning API formats:

  • SlurmInt: Handles both plain integers and Slurm’s “infinite/set” struct format
  • SlurmString: Handles both plain strings and string arrays (uses first element if array, blank if empty)

These custom types maintain backward compatibility across Slurm versions.
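
As an illustration of the two layouts these types absorb, a numeric field such as a job’s time limit may arrive in either of the following shapes depending on the Slurm version (field name and values here are purely illustrative):

# Older layout: a plain integer
"time_limit": 60

# Newer layout: a struct that can also express "unset" or "infinite"
"time_limit": { "set": true, "infinite": false, "number": 60 }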

1 - Installation

Installing and building cc-slurm-adapter

Prerequisites

  • Go 1.24.0 or higher
  • Slurm with slurmdbd configured
  • cc-backend instance with API access
  • Access to the slurmctld node

Building from Source

Requirements

go 1.24.0+

Dependencies

Key dependencies (managed via go.mod):

  • github.com/ClusterCockpit/cc-lib - ClusterCockpit common library
  • github.com/nats-io/nats.go - NATS client

Compilation

make

This creates the cc-slurm-adapter binary.

Build Commands

# Build binary
make

# Format code
make format

# Clean build artifacts
make clean
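
Assuming the source is checked out from the GitHub repository referenced above (github.com/ClusterCockpit/cc-slurm-adapter), a complete build from scratch might look like this:

git clone https://github.com/ClusterCockpit/cc-slurm-adapter.git
cd cc-slurm-adapter
make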

2 - cc-slurm-adapter Configuration

cc-slurm-adapter configuration reference

Configuration File Location

Default: /etc/cc-slurm-adapter/config.json

Example Configuration

{
  "pidFilePath": "/run/cc-slurm-adapter/daemon.pid",
  "prepSockListenPath": "/run/cc-slurm-adapter/daemon.sock",
  "prepSockConnectPath": "/run/cc-slurm-adapter/daemon.sock",
  "lastRunPath": "/var/lib/cc-slurm-adapter/last_run",
  "slurmPollInterval": 60,
  "slurmQueryDelay": 1,
  "slurmQueryMaxSpan": 604800,
  "slurmQueryMaxRetries": 5,
  "ccPollInterval": 21600,
  "ccRestSubmitJobs": true,
  "ccRestUrl": "https://my-cc-backend-instance.example",
  "ccRestJwt": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "gpuPciAddrs": {
    "^nodehostname0[0-9]$": ["00000000:00:10.0", "00000000:00:3F.0"],
    "^nodehostname1[0-9]$": ["00000000:00:10.0", "00000000:00:3F.0"]
  },
  "ignoreHosts": "^nodehostname9\\w+$",
  "natsServer": "mynatsserver.example",
  "natsPort": 4222,
  "natsSubject": "mysubject",
  "natsUser": "myuser",
  "natsPassword": "123456789",
  "natsCredsFile": "/etc/cc-slurm-adapter/nats.creds",
  "natsNKeySeedFile": "/etc/ss-slurm-adapter/nats.nkey"
}

Configuration Reference

Required Settings

| Config Key | Type | Description |
|---|---|---|
| ccRestUrl | string | URL to cc-backend’s REST API (must not contain trailing slash) |
| ccRestJwt | string | JWT token from cc-backend for REST API access |

Daemon Settings

| Config Key | Type | Default | Description |
|---|---|---|---|
| pidFilePath | string | /run/cc-slurm-adapter/daemon.pid | Path to PID file (prevents concurrent execution) |
| lastRunPath | string | /var/lib/cc-slurm-adapter/last_run | Path to file storing last successful sync timestamp (as file mtime) |

Socket Settings

| Config Key | Type | Default | Description |
|---|---|---|---|
| prepSockListenPath | string | /run/cc-slurm-adapter/daemon.sock | Socket for daemon to receive prolog/epilog events. Supports UNIX and TCP formats (see below) |
| prepSockConnectPath | string | /run/cc-slurm-adapter/daemon.sock | Socket for prolog/epilog mode to connect to daemon |

Socket Formats:

  • UNIX: /run/cc-slurm-adapter/daemon.sock or unix:/run/cc-slurm-adapter/daemon.sock
  • TCP IPv4: tcp:127.0.0.1:12345 or tcp:0.0.0.0:12345
  • TCP IPv6: tcp:[::1]:12345, tcp:[::]:12345, tcp::12345

Slurm Polling Settings

| Config Key | Type | Default | Description |
|---|---|---|---|
| slurmPollInterval | int | 60 | Interval (seconds) for periodic sync to cc-backend |
| slurmQueryDelay | int | 1 | Wait time (seconds) after prolog/epilog event before querying Slurm |
| slurmQueryMaxSpan | int | 604800 | Maximum time (seconds) to query jobs from the past (prevents flooding) |
| slurmQueryMaxRetries | int | 10 | Maximum Slurm query attempts on Prolog/Epilog events |

cc-backend Settings

| Config Key | Type | Default | Description |
|---|---|---|---|
| ccPollInterval | int | 21600 | Interval (seconds) to query all jobs from cc-backend (prevents stuck jobs) |
| ccRestSubmitJobs | bool | true | Submit started/stopped jobs to cc-backend via REST (set false if using NATS-only) |

Hardware Mapping

| Config Key | Type | Default | Description |
|---|---|---|---|
| gpuPciAddrs | object | {} | Map of hostname regexes to GPU PCI address arrays (must match NVML/nvidia-smi order) |
| ignoreHosts | string | "" | Regex of hostnames to ignore (jobs that run only on matching hosts are discarded) |
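
To fill in gpuPciAddrs, the PCI addresses of a node’s GPUs can be listed with nvidia-smi, whose default output order follows NVML enumeration. The output lines shown as comments are illustrative and mirror the addresses used in the sample configuration:

nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader
# 0, 00000000:00:10.0
# 1, 00000000:00:3F.0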

NATS Settings

| Config Key | Type | Default | Description |
|---|---|---|---|
| natsServer | string | "" | NATS server hostname (leave blank to disable NATS) |
| natsPort | uint16 | 4222 | NATS server port |
| natsSubject | string | "jobs" | Subject to publish job information to |
| natsUser | string | "" | NATS username (for user auth) |
| natsPassword | string | "" | NATS password |
| natsCredsFile | string | "" | Path to NATS credentials file |
| natsNKeySeedFile | string | "" | Path to NATS NKey seed file (private key) |

Note: The deprecated ipcSockPath option has been removed. Use prepSockListenPath and prepSockConnectPath instead.

3 - Daemon Setup

Setting up cc-slurm-adapter as a daemon

The daemon mode is required for cc-slurm-adapter to function. This page describes how to set up the daemon using systemd.

1. Copy Binary and Configuration

Copy the binary and create a configuration file:

sudo mkdir -p /opt/cc-slurm-adapter
sudo cp cc-slurm-adapter /opt/cc-slurm-adapter/
sudo cp config.json /opt/cc-slurm-adapter/

Security: The config file contains sensitive credentials (JWT, NATS). Set appropriate permissions:

sudo chmod 600 /opt/cc-slurm-adapter/config.json

2. Create System User

sudo useradd -r -s /bin/false cc-slurm-adapter
sudo chown -R cc-slurm-adapter:slurm /opt/cc-slurm-adapter

3. Grant Slurm Permissions

The adapter user needs permission to query Slurm:

sacctmgr add user cc-slurm-adapter Account=root AdminLevel=operator

Critical: If permissions are not set and Slurm is restricted, NO JOBS WILL BE REPORTED.
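
You can verify the result afterwards; the Admin column for the adapter user should show Operator:

sacctmgr show user cc-slurm-adapter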

4. Install systemd Service

Create /etc/systemd/system/cc-slurm-adapter.service:

[Unit]
Description=cc-slurm-adapter
Wants=network.target
After=network.target

[Service]
User=cc-slurm-adapter
Group=slurm
ExecStart=/opt/cc-slurm-adapter/cc-slurm-adapter -daemon -config /opt/cc-slurm-adapter/config.json
WorkingDirectory=/opt/cc-slurm-adapter/
RuntimeDirectory=cc-slurm-adapter
RuntimeDirectoryMode=0750
Restart=on-failure
RestartSec=15s

[Install]
WantedBy=multi-user.target

Notes:

  • RuntimeDirectory creates /run/cc-slurm-adapter for PID and socket files
  • Group=slurm allows Prolog/Epilog (running as slurm user) to access the socket
  • RuntimeDirectoryMode=0750 enables group access

5. Enable and Start Service

sudo systemctl daemon-reload
sudo systemctl enable cc-slurm-adapter
sudo systemctl start cc-slurm-adapter

Verification

Check that the service is running:

sudo systemctl status cc-slurm-adapter

You should see output indicating the service is active and running.

4 - Prolog/Epilog Hooks

Setting up Prolog/Epilog hooks for immediate job notification

Prolog/Epilog hook setup is optional but recommended for immediate job notification, which reduces latency compared to relying solely on periodic polling.

Prerequisites

  • Daemon must be running (see Daemon Setup)
  • Hook script must be accessible from slurmctld
  • Hook script must exit with code 0 to avoid rejecting job allocations

1. Create Hook Script

Create /opt/cc-slurm-adapter/hook.sh:

#!/bin/sh
/opt/cc-slurm-adapter/cc-slurm-adapter
exit 0

Make it executable:

sudo chmod +x /opt/cc-slurm-adapter/hook.sh

Important: Always exit with 0. Non-zero exit codes will reject job allocations.

2. Configure Slurm

Add to slurm.conf:

PrEpPlugins=prep/script
PrologSlurmctld=/opt/cc-slurm-adapter/hook.sh
EpilogSlurmctld=/opt/cc-slurm-adapter/hook.sh

3. Restart slurmctld

sudo systemctl restart slurmctld

Note: If you use a non-default config or socket path, add -config /path/to/config.json to the cc-slurm-adapter call in hook.sh, as shown below. The config file must be readable by the slurm user/group.
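
A hook script that passes an explicit configuration path (using the location chosen in this guide) might then look like this:

#!/bin/sh
/opt/cc-slurm-adapter/cc-slurm-adapter -config /opt/cc-slurm-adapter/config.json
exit 0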

Multi-Cluster Setup

For multiple slurmctld nodes, use TCP sockets instead of UNIX sockets:

{
  "prepSockListenPath": "tcp:0.0.0.0:12345",
  "prepSockConnectPath": "tcp:slurmctld-host:12345"
}

This allows Prolog/Epilog hooks on different nodes to connect to the daemon over the network.

How It Works

  1. Job Event: Slurm triggers Prolog/Epilog hook when a job starts or stops
  2. Socket Message: Hook sends job ID to daemon via socket
  3. Immediate Query: Daemon queries Slurm for that specific job
  4. Fast Submission: Job submitted to cc-backend with minimal delay

This reduces the job notification latency from up to 60 seconds (default poll interval) to just a few seconds.

5 - Usage

Command line usage and operation modes

Command Line Flags

| Flag | Description |
|---|---|
| -config <path> | Specify the path to the config file (default: /etc/cc-slurm-adapter/config.json) |
| -daemon | Run in daemon mode (if omitted, runs in Prolog/Epilog mode) |
| -debug <log-level> | Set the log level (default: 2, max: 5) |
| -help | Show help for all command line flags |

Operation Modes

Daemon Mode

Run the adapter as a persistent daemon that periodically synchronizes job information:

cc-slurm-adapter -daemon -config /path/to/config.json

This mode:

  • Runs continuously in the background
  • Queries Slurm at regular intervals (default: 60 seconds)
  • Submits job information to cc-backend
  • Should be managed by systemd (see Daemon Setup)

Prolog/Epilog Mode

Run the adapter from Slurm’s Prolog/Epilog hooks for immediate job notification:

cc-slurm-adapter

This mode:

  • Only runs when triggered by Slurm (job start/stop)
  • Sends job ID to the running daemon via socket
  • Exits immediately
  • Must be invoked from Slurm hook scripts (see Prolog/Epilog Setup)

Best Practices

Production Deployment

  1. Keep Daemon Running: Resource info expires quickly after job completion
  2. Monitor Logs: Watch for Slurm API changes or nil pointer errors
  3. Secure Credentials: Restrict config file permissions (600 or 640)
  4. Use Prolog/Epilog Carefully: Always exit with 0 to avoid blocking job allocations
  5. Test Before Production: Verify in development environment first

Performance Tuning

  • High Job Volume: Reduce slurmPollInterval if periodic sync causes lag
  • Low Latency Required: Enable Prolog/Epilog hooks
  • Resource Constrained: Increase ccPollInterval (reduces cc-backend queries)
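
For example, a resource-constrained site that relies on Prolog/Epilog hooks for low latency might relax both intervals. The values below are illustrative only and should be tuned for your environment:

{
  "slurmPollInterval": 120,
  "ccPollInterval": 43200
}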

Debug Logging

Enable verbose logging for troubleshooting:

cc-slurm-adapter -daemon -debug 5 -config /path/to/config.json

Log Levels:

  • 2 (default): Errors and warnings
  • 5 (max): Verbose debug output

For systemd services, edit the service file to add -debug 5 to the ExecStart line.

6 - Troubleshooting

Debugging and common issues

Check Service Status

Verify the daemon is running:

sudo systemctl status cc-slurm-adapter

You should see output indicating the service is active (running).

View Logs

cc-slurm-adapter logs to stderr (captured by systemd):

sudo journalctl -u cc-slurm-adapter -f

Use -f to follow logs in real-time, or omit it to view historical logs.

Enable Debug Logging

Edit the systemd service file to add -debug 5:

ExecStart=/opt/cc-slurm-adapter/cc-slurm-adapter -daemon -debug 5 -config /opt/cc-slurm-adapter/config.json

Then reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart cc-slurm-adapter

Log Levels:

  • 2 (default): Errors and warnings
  • 5 (max): Verbose debug output

Common Issues

| Issue | Possible Cause | Solution |
|---|---|---|
| No jobs reported | Missing Slurm permissions | Run sacctmgr add user cc-slurm-adapter Account=root AdminLevel=operator |
| Socket connection errors | Wrong socket path or permissions | Check prepSockListenPath/prepSockConnectPath and RuntimeDirectoryMode |
| Prolog/Epilog failures | Non-zero exit code in hook script | Ensure hook script exits with exit 0 |
| Missing resource info | Daemon stopped too long | Keep daemon running; resource info expires minutes after job completion |
| Job allocation failures | Prolog/Epilog exit code ≠ 0 | Check hook script and ensure cc-slurm-adapter is running |
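
For socket-related problems, it can help to check the daemon’s listening socket directly. The paths and port below are the ones used elsewhere in this guide; nc is only relevant for TCP sockets:

# UNIX socket: should exist and be group-accessible
ls -l /run/cc-slurm-adapter/daemon.sock

# TCP socket: verify the daemon's port is reachable from the node running the hook
nc -z slurmctld-host 12345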

Debugging Slurm Compatibility Issues

If you encounter nil pointer dereferences or unexpected errors:

  1. Get a job ID via squeue or sacct:

    squeue
    # or
    sacct
    
  2. Check JSON layouts from both commands (they differ):

    sacct -j 12345 --json
    scontrol show job 12345 --json
    
  3. Compare the output with what the adapter expects in slurm.go

  4. Report issues to the GitHub repository with:

    • Slurm version
    • JSON output samples
    • Error messages from logs

Verifying Configuration

Check that your configuration is valid:

# Test if config file is readable
cat /opt/cc-slurm-adapter/config.json

# Verify JSON syntax
jq . /opt/cc-slurm-adapter/config.json

Testing Connectivity

Test cc-backend Connection

# Test REST API endpoint (replace with your JWT)
curl -H "Authorization: Bearer YOUR_JWT_TOKEN" \
     https://your-cc-backend-instance.example/api/jobs/

Test NATS Connection

If using NATS, verify connectivity:

# Using nats-cli (if installed)
nats server check connection -s nats://mynatsserver.example:4222

Performance Issues

If the adapter is slow or missing jobs:

  1. Check Slurm Response Times: Run sacct and squeue manually to see if Slurm is responding slowly
  2. Adjust Poll Intervals: Lower slurmPollInterval for more frequent checks (but higher load)
  3. Enable Prolog/Epilog: Reduces dependency on polling for immediate job notification
  4. Check System Resources: Ensure adequate CPU/memory on the slurmctld node
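
For step 1, timing a query similar to what the adapter issues gives a rough feel for Slurm’s responsiveness; the one-hour window here is arbitrary:

time sacct --allusers --starttime now-1hour --json > /dev/null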

7 - Architecture

Technical architecture and internal details

Synchronization Flow

The daemon operates on a periodic synchronization cycle:

  1. Timer Trigger: Periodic timer (default: 60s) triggers sync
  2. Query Slurm: Fetch job data via sacct, squeue, scontrol
  3. Submit to cc-backend: POST job start/stop via REST API
  4. Publish to NATS: Optional notification message (if enabled)

This ensures that all jobs are eventually captured, even if Prolog/Epilog hooks fail or are not configured.

Prolog/Epilog Flow

When Prolog/Epilog hooks are enabled, immediate job notification works as follows:

  1. Job Event: Slurm triggers Prolog/Epilog hook when a job starts or stops
  2. Socket Message: Hook sends job ID to daemon via socket
  3. Immediate Query: Daemon queries Slurm for that specific job
  4. Fast Submission: Job submitted to cc-backend with minimal delay

This reduces latency from up to 60 seconds (default poll interval) to just a few seconds.

Data Sources

The adapter queries multiple Slurm commands to build complete job information:

| Slurm Command | Purpose |
|---|---|
| sacct | Historical job accounting data |
| squeue | Current job queue information |
| scontrol show job | Resource allocation details (JSON format) |
| sacctmgr | User permissions |

Important: scontrol show job provides critical resource allocation information (nodes, CPUs, GPUs) that is only available while the job is in memory. This information typically expires a few minutes after job completion, which is why keeping the daemon running continuously is essential.

State Persistence

The adapter maintains minimal state on disk:

  • Last Run Timestamp: Stored as file modification time in lastRunPath

    • Used to determine which jobs to query on startup
    • Prevents flooding cc-backend with historical jobs after restarts
  • PID File: Stored in pidFilePath

    • Prevents concurrent daemon execution
    • Automatically cleaned up on graceful shutdown
  • Socket: IPC between daemon and Prolog/Epilog instances

    • Created at prepSockListenPath (daemon listens)
    • Connected at prepSockConnectPath (Prolog/Epilog connects)
    • Supports both UNIX domain sockets and TCP sockets
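
This on-disk state can be inspected with standard tools; the paths below are the defaults from the configuration reference:

# When the daemon last completed a successful sync (stored as the file's mtime)
stat -c '%y' /var/lib/cc-slurm-adapter/last_run

# PID of the currently running daemon
cat /run/cc-slurm-adapter/daemon.pid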

Fault Tolerance

The adapter is designed to be fault-tolerant:

Slurm Downtime

  • Retries Slurm queries with exponential backoff
  • Continues operation once Slurm becomes available
  • No job loss during Slurm restarts

cc-backend Downtime

  • Queues jobs internally (up to slurmQueryMaxSpan seconds in the past)
  • Submits queued jobs once cc-backend is available
  • Prevents duplicate submissions

Daemon Restarts

  • Uses lastRunPath timestamp to catch up on missed jobs
  • Limited by slurmQueryMaxSpan to prevent overwhelming the system
  • Resource allocation data may be lost for jobs that completed while daemon was down

Multi-Cluster Considerations

For environments with multiple Slurm clusters:

  • Run one daemon instance per slurmctld node
  • Use cluster-specific configuration files
  • Consider TCP sockets for Prolog/Epilog if slurmctld is not on compute nodes

Performance Characteristics

Resource Usage

  • Memory: Minimal (< 50 MB typical)
  • CPU: Low (periodic bursts during synchronization)
  • Network: Moderate (REST API calls to cc-backend, NATS if enabled)

Scalability

  • Tested with clusters of 1000+ nodes
  • Handles thousands of jobs per day
  • Poll interval can be tuned based on job submission rate

Latency

  • Without Prolog/Epilog: Up to slurmPollInterval seconds (default: 60s)
  • With Prolog/Epilog: Typically < 5 seconds

8 - API Integration

Integration with cc-backend and NATS

cc-backend REST API

The adapter communicates with cc-backend using its REST API to submit job information.

Configuration

Set these required configuration options:

{
  "ccRestUrl": "https://my-cc-backend-instance.example",
  "ccRestJwt": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "ccRestSubmitJobs": true
}

  • ccRestUrl: URL to cc-backend’s REST API (must not contain trailing slash)
  • ccRestJwt: JWT token from cc-backend for REST API access
  • ccRestSubmitJobs: Enable/disable REST API submissions (default: true)

Endpoints Used

The adapter uses the following cc-backend API endpoints:

| Endpoint | Method | Purpose |
|---|---|---|
| /api/jobs/start_job/ | POST | Submit job start event |
| /api/jobs/stop_job/<jobId> | POST | Submit job completion event |

Authentication

All API requests include a JWT bearer token in the Authorization header:

Authorization: Bearer <ccRestJwt>

Job Data Format

Jobs are submitted in ClusterCockpit’s job metadata format, including:

  • Job ID and cluster name
  • User and project information
  • Start and stop times
  • Resource allocation (nodes, CPUs, GPUs)
  • Job state and exit code

Error Handling

  • Connection Errors: Adapter retries with exponential backoff
  • Authentication Errors: Logged as errors; check JWT token validity
  • Validation Errors: Logged with details about invalid fields

NATS Messaging

NATS integration is optional and provides real-time job notifications to other services.

Configuration

{
  "natsServer": "mynatsserver.example",
  "natsPort": 4222,
  "natsSubject": "mysubject",
  "natsUser": "myuser",
  "natsPassword": "123456789"
}

Leave natsServer empty to disable NATS integration.

Authentication Methods

The adapter supports multiple NATS authentication methods:

1. Username/Password

{
  "natsUser": "myuser",
  "natsPassword": "mypassword"
}

See: NATS Username/Password Auth

2. Credentials File

{
  "natsCredsFile": "/etc/cc-slurm-adapter/nats.creds"
}

See: NATS Credentials File

3. NKey Authentication

{
  "natsNKeySeedFile": "/etc/cc-slurm-adapter/nats.nkey"
}

See: NATS NKey Auth

Message Format

Jobs are published as JSON messages to the configured subject:

{
  "jobId": "12345",
  "cluster": "mycluster",
  "user": "username",
  "project": "projectname",
  "startTime": 1234567890,
  "stopTime": 1234567900,
  "numNodes": 4,
  "resources": { ... }
}

Use Cases

NATS integration is useful for:

  • Real-time Monitoring: Other services can subscribe to job events
  • Event-Driven Workflows: Trigger actions when jobs start/stop
  • Alternative to REST: Can disable REST submission and use NATS-only
  • Multi-Component Architecture: Multiple services consuming job events
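
As a quick way to watch the published events, the nats CLI can subscribe to the configured subject. Server, subject, and credentials below mirror the example configuration and should be replaced with your own:

nats sub mysubject -s nats://mynatsserver.example:4222 --user myuser --password 123456789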

Performance Considerations

  • NATS adds minimal latency (typically < 1ms)
  • Messages are fire-and-forget (no delivery guarantees by default)
  • Consider using NATS JetStream for persistent queues if needed

Dual Submission Mode

By default, the adapter submits jobs to both cc-backend REST API and NATS:

{
  "ccRestSubmitJobs": true,
  "natsServer": "mynatsserver.example"
}

This ensures:

  • cc-backend receives authoritative job data
  • Other services can react to job events in real-time

NATS-Only Mode

For specialized deployments, you can disable REST submission:

{
  "ccRestSubmitJobs": false,
  "natsServer": "mynatsserver.example"
}

Warning: In this mode, you must ensure another component (e.g., a NATS subscriber) is forwarding job data to cc-backend, or jobs will not appear in the UI.