cc-slurm-adapter
ClusterCockpit Slurm Adapter References
Reference information regarding the ClusterCockpit component “cc-slurm-adapter” (GitHub Repo).
Overview
cc-slurm-adapter is a software daemon that feeds cc-backend with job information from Slurm in real time.
Key Features
- Fault Tolerant: Handles cc-backend or Slurm downtime gracefully without losing jobs
- Automatic Recovery: Submits jobs to cc-backend as soon as services are available again
- Realtime Updates: Supports immediate job notification via Slurm Prolog/Epilog hooks
- NATS Integration: Optional job notification messaging via NATS
- Minimal Dependencies: Uses Slurm commands (sacct, squeue, sacctmgr, scontrol) - no slurmrestd required
Architecture
The daemon runs on the same node as slurmctld and operates in two modes:
- Daemon Mode: Periodic synchronization (default: every 60 seconds) between Slurm and cc-backend
- Prolog/Epilog Mode: Immediate trigger on job start/stop events (optional, reduces latency)
Data is submitted to cc-backend via REST API. Note: Slurm’s slurmdbd is mandatory.
Limitations
Because slurmdbd does not store all job information, some details may be
unavailable in certain cases:
- Resource allocation information is obtained via scontrol --cluster XYZ show job XYZ --json
- This information becomes unavailable a few minutes after job completion
- If the daemon is stopped for too long, jobs may lack resource information
- Critical Impact: Without resource information, cc-backend cannot associate jobs with metrics (CPU, GPU, memory)
- Jobs will still be listed in cc-backend but metric visualization will not work
Slurm Version Compatibility
Supported Versions
These Slurm versions are known to work:
Compatibility Notes
All Slurm-related code is concentrated in slurm.go for easier maintenance. The
most common compatibility issue is nil pointer dereference due to missing
JSON fields.
Debugging Incompatibilities
If you encounter nil pointer dereferences:
1. Get a job ID via squeue or sacct
2. Check JSON layouts from both commands (they differ):
sacct -j 12345 --json
scontrol show job 12345 --json
SlurmInt and SlurmString Types
Slurm has been transitioning API formats:
- SlurmInt: Handles both plain integers and Slurm’s “infinite/set” struct format
- SlurmString: Handles both plain strings and string arrays (uses first element if array, blank if empty)
These custom types maintain backward compatibility across Slurm versions.
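A minimal sketch of how such dual-format decoding can be implemented with custom UnmarshalJSON methods is shown below. It is illustrative only, not the adapter's actual code from slurm.go; the struct field names (set, infinite, number) follow Slurm's JSON number convention and are an assumption here.
package slurmtypes

import "encoding/json"

// SlurmInt accepts either a plain integer or Slurm's number struct.
type SlurmInt int64

func (v *SlurmInt) UnmarshalJSON(data []byte) error {
	var plain int64
	if err := json.Unmarshal(data, &plain); err == nil { // old plain form
		*v = SlurmInt(plain)
		return nil
	}
	var boxed struct { // newer struct form (field names assumed)
		Set      bool  `json:"set"`
		Infinite bool  `json:"infinite"`
		Number   int64 `json:"number"`
	}
	if err := json.Unmarshal(data, &boxed); err != nil {
		return err
	}
	*v = SlurmInt(boxed.Number)
	return nil
}

// SlurmString accepts either a plain string or a string array.
type SlurmString string

func (v *SlurmString) UnmarshalJSON(data []byte) error {
	var plain string
	if err := json.Unmarshal(data, &plain); err == nil {
		*v = SlurmString(plain)
		return nil
	}
	var list []string
	if err := json.Unmarshal(data, &list); err != nil {
		return err
	}
	*v = "" // blank if the array is empty
	if len(list) > 0 {
		*v = SlurmString(list[0])
	}
	return nil
}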
1 - Installation
Installing and building cc-slurm-adapter
Prerequisites
- Go 1.24.0 or higher
- Slurm with slurmdbd configured
- cc-backend instance with API access
- Access to the slurmctld node
Building from Source
Requirements
go 1.24.0+
Dependencies
Key dependencies (managed via go.mod):
- github.com/ClusterCockpit/cc-lib - ClusterCockpit common library
- github.com/nats-io/nats.go - NATS client
Compilation
make
This creates the cc-slurm-adapter binary.
Build Commands
# Build binary
make
# Format code
make format
# Clean build artifacts
make clean
2 - cc-slurm-adapter Configuration
cc-slurm-adapter configuration reference
Configuration File Location
Default: /etc/cc-slurm-adapter/config.json
Example Configuration
{
"pidFilePath": "/run/cc-slurm-adapter/daemon.pid",
"prepSockListenPath": "/run/cc-slurm-adapter/daemon.sock",
"prepSockConnectPath": "/run/cc-slurm-adapter/daemon.sock",
"lastRunPath": "/var/lib/cc-slurm-adapter/last_run",
"slurmPollInterval": 60,
"slurmQueryDelay": 1,
"slurmQueryMaxSpan": 604800,
"slurmQueryMaxRetries": 5,
"ccPollInterval": 21600,
"ccRestSubmitJobs": true,
"ccRestUrl": "https://my-cc-backend-instance.example",
"ccRestJwt": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
"gpuPciAddrs": {
"^nodehostname0[0-9]$": ["00000000:00:10.0", "00000000:00:3F.0"],
"^nodehostname1[0-9]$": ["00000000:00:10.0", "00000000:00:3F.0"]
},
"ignoreHosts": "^nodehostname9\\w+$",
"natsServer": "mynatsserver.example",
"natsPort": 4222,
"natsSubject": "mysubject",
"natsUser": "myuser",
"natsPassword": "123456789",
"natsCredsFile": "/etc/cc-slurm-adapter/nats.creds",
"natsNKeySeedFile": "/etc/ss-slurm-adapter/nats.nkey"
}
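The keys above map directly onto JSON fields. As a hedged illustration (this is not the adapter's internal configuration type), a small Go program can parse a subset of them, which also doubles as a quick syntax check for the file:
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Illustrative subset of the documented keys; not the adapter's own Config type.
type Config struct {
	SlurmPollInterval int    `json:"slurmPollInterval"`
	CcRestUrl         string `json:"ccRestUrl"`
	CcRestJwt         string `json:"ccRestJwt"`
	CcRestSubmitJobs  bool   `json:"ccRestSubmitJobs"`
	NatsServer        string `json:"natsServer"`
	NatsSubject       string `json:"natsSubject"`
}

func main() {
	data, err := os.ReadFile("/etc/cc-slurm-adapter/config.json")
	if err != nil {
		fmt.Println("cannot read config:", err)
		return
	}
	var cfg Config
	if err := json.Unmarshal(data, &cfg); err != nil {
		fmt.Println("invalid JSON:", err)
		return
	}
	if cfg.CcRestUrl == "" || cfg.CcRestJwt == "" {
		fmt.Println("ccRestUrl and ccRestJwt are required")
		return
	}
	fmt.Printf("would sync to %s every %d seconds\n", cfg.CcRestUrl, cfg.SlurmPollInterval)
}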
Configuration Reference
Required Settings
| Config Key | Type | Description |
|---|---|---|
| ccRestUrl | string | URL to cc-backend’s REST API (must not contain trailing slash) |
| ccRestJwt | string | JWT token from cc-backend for REST API access |
Daemon Settings
| Config Key | Type | Default | Description |
|---|---|---|---|
| pidFilePath | string | /run/cc-slurm-adapter/daemon.pid | Path to PID file (prevents concurrent execution) |
| lastRunPath | string | /var/lib/cc-slurm-adapter/lastrun | Path to file storing last successful sync timestamp (as file mtime) |
Socket Settings
| Config Key | Type | Default | Description |
|---|---|---|---|
| prepSockListenPath | string | /run/cc-slurm-adapter/daemon.sock | Socket for daemon to receive prolog/epilog events. Supports UNIX and TCP formats (see below) |
| prepSockConnectPath | string | /run/cc-slurm-adapter/daemon.sock | Socket for prolog/epilog mode to connect to daemon |
Socket Formats:
- UNIX: /run/cc-slurm-adapter/daemon.sock or unix:/run/cc-slurm-adapter/daemon.sock
- TCP IPv4: tcp:127.0.0.1:12345 or tcp:0.0.0.0:12345
- TCP IPv6: tcp:[::1]:12345, tcp:[::]:12345, tcp::12345
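For illustration, these prefixes can be mapped onto Go's net.Listen address families roughly as in the sketch below. This is only a sketch of the convention described above, not the adapter's actual parsing code.
package main

import (
	"fmt"
	"net"
	"strings"
)

// listen dispatches a socket spec such as "tcp:127.0.0.1:12345" or
// "unix:/run/cc-slurm-adapter/daemon.sock" to the matching listener type.
func listen(spec string) (net.Listener, error) {
	switch {
	case strings.HasPrefix(spec, "tcp:"):
		// e.g. "tcp:127.0.0.1:12345", "tcp:[::1]:12345", "tcp::12345"
		return net.Listen("tcp", strings.TrimPrefix(spec, "tcp:"))
	case strings.HasPrefix(spec, "unix:"):
		return net.Listen("unix", strings.TrimPrefix(spec, "unix:"))
	default:
		// A bare path is treated as a UNIX domain socket.
		return net.Listen("unix", spec)
	}
}

func main() {
	l, err := listen("tcp:[::1]:12345")
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	defer l.Close()
	fmt.Println("listening on", l.Addr())
}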
Slurm Polling Settings
| Config Key | Type | Default | Description |
|---|---|---|---|
| slurmPollInterval | int | 60 | Interval (seconds) for periodic sync to cc-backend |
| slurmQueryDelay | int | 1 | Wait time (seconds) after prolog/epilog event before querying Slurm |
| slurmQueryMaxSpan | int | 604800 | Maximum time (seconds) to query jobs from the past (prevents flooding) |
| slurmQueryMaxRetries | int | 10 | Maximum Slurm query attempts on Prolog/Epilog events |
cc-backend Settings
| Config Key | Type | Default | Description |
|---|---|---|---|
| ccPollInterval | int | 21600 | Interval (seconds) to query all jobs from cc-backend (prevents stuck jobs) |
| ccRestSubmitJobs | bool | true | Submit started/stopped jobs to cc-backend via REST (set false if using NATS-only) |
Hardware Mapping
| Config Key | Type | Default | Description |
|---|---|---|---|
| gpuPciAddrs | object | {} | Map of hostname regexes to GPU PCI address arrays (must match NVML/nvidia-smi order) |
| ignoreHosts | string | "" | Regex of hostnames to ignore (jobs only on matching hosts are discarded) |
NATS Settings
| Config Key | Type | Default | Description |
|---|---|---|---|
| natsServer | string | "" | NATS server hostname (leave blank to disable NATS) |
| natsPort | uint16 | 4222 | NATS server port |
| natsSubject | string | "jobs" | Subject to publish job information to |
| natsUser | string | "" | NATS username (for user auth) |
| natsPassword | string | "" | NATS password |
| natsCredsFile | string | "" | Path to NATS credentials file |
| natsNKeySeedFile | string | "" | Path to NATS NKey seed file (private key) |
Note: The deprecated ipcSockPath option has been removed. Use prepSockListenPath and prepSockConnectPath instead.
3 - Daemon Setup
Setting up cc-slurm-adapter as a daemon
The daemon mode is required for cc-slurm-adapter to function. This page describes how to set up the daemon using systemd.
1. Copy Binary and Configuration
Copy the binary and create a configuration file:
sudo mkdir -p /opt/cc-slurm-adapter
sudo cp cc-slurm-adapter /opt/cc-slurm-adapter/
sudo cp config.json /opt/cc-slurm-adapter/
Security: The config file contains sensitive credentials (JWT, NATS). Set appropriate permissions:
sudo chmod 600 /opt/cc-slurm-adapter/config.json
2. Create System User
sudo useradd -r -s /bin/false cc-slurm-adapter
sudo chown -R cc-slurm-adapter:slurm /opt/cc-slurm-adapter
3. Grant Slurm Permissions
The adapter user needs permission to query Slurm:
sacctmgr add user cc-slurm-adapter Account=root AdminLevel=operator
Critical: If permissions are not set and Slurm is restricted, NO JOBS WILL BE REPORTED.
4. Install systemd Service
Create /etc/systemd/system/cc-slurm-adapter.service:
[Unit]
Description=cc-slurm-adapter
Wants=network.target
After=network.target
[Service]
User=cc-slurm-adapter
Group=slurm
ExecStart=/opt/cc-slurm-adapter/cc-slurm-adapter -daemon -config /opt/cc-slurm-adapter/config.json
WorkingDirectory=/opt/cc-slurm-adapter/
RuntimeDirectory=cc-slurm-adapter
RuntimeDirectoryMode=0750
Restart=on-failure
RestartSec=15s
[Install]
WantedBy=multi-user.target
Notes:
- RuntimeDirectory creates /run/cc-slurm-adapter for PID and socket files
- Group=slurm allows Prolog/Epilog (running as the slurm user) to access the socket
- RuntimeDirectoryMode=0750 enables group access
5. Enable and Start Service
sudo systemctl daemon-reload
sudo systemctl enable cc-slurm-adapter
sudo systemctl start cc-slurm-adapter
Verification
Check that the service is running:
sudo systemctl status cc-slurm-adapter
You should see output indicating the service is active and running.
4 - Prolog/Epilog Hooks
Setting up Prolog/Epilog hooks for immediate job notification
Prolog/Epilog hook setup is optional but recommended for immediate job notification, which reduces latency compared to relying solely on periodic polling.
Prerequisites
- Daemon must be running (see Daemon Setup)
- Hook script must be accessible from slurmctld
- Hook script must exit with code 0 to avoid rejecting job allocations
1. Create Hook Script
Create /opt/cc-slurm-adapter/hook.sh:
#!/bin/sh
/opt/cc-slurm-adapter/cc-slurm-adapter
exit 0
Make it executable:
sudo chmod +x /opt/cc-slurm-adapter/hook.sh
Important: Always exit with 0. Non-zero exit codes will reject job allocations.
2. Configure slurm.conf
Add the following lines to slurm.conf:
PrEpPlugins=prep/script
PrologSlurmctld=/opt/cc-slurm-adapter/hook.sh
EpilogSlurmctld=/opt/cc-slurm-adapter/hook.sh
3. Restart slurmctld
sudo systemctl restart slurmctld
Note: If using a non-default socket path, add -config /path/to/config.json to the cc-slurm-adapter invocation in hook.sh. The config file must be readable by the slurm user/group.
Multi-Cluster Setup
For multiple slurmctld nodes, use TCP sockets instead of UNIX sockets:
{
"prepSockListenPath": "tcp:0.0.0.0:12345",
"prepSockConnectPath": "tcp:slurmctld-host:12345"
}
This allows Prolog/Epilog hooks on different nodes to connect to the daemon over the network.
How It Works
- Job Event: Slurm triggers Prolog/Epilog hook when a job starts or stops
- Socket Message: Hook sends job ID to daemon via socket
- Immediate Query: Daemon queries Slurm for that specific job
- Fast Submission: Job submitted to cc-backend with minimal delay
This reduces the job notification latency from up to 60 seconds (default poll interval) to just a few seconds.
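To make the socket step concrete, the sketch below shows the general pattern of a short-lived hook-side client handing a job ID to a listening daemon over the UNIX socket. The plain-text message format is an assumption made only for this illustration; in a real deployment the hook simply executes the cc-slurm-adapter binary, which handles the daemon communication itself.
// Conceptual sketch only: a prolog/epilog-style client handing a job ID to a
// long-running daemon over a UNIX socket. Not the adapter's actual protocol.
package main

import (
	"fmt"
	"net"
	"os"
)

func main() {
	jobID := os.Getenv("SLURM_JOB_ID") // set by Slurm for prolog/epilog
	if jobID == "" {
		// Never fail the hook: a non-zero exit would reject the allocation.
		os.Exit(0)
	}
	conn, err := net.Dial("unix", "/run/cc-slurm-adapter/daemon.sock")
	if err != nil {
		// Daemon unreachable: rely on the periodic sync instead of failing.
		os.Exit(0)
	}
	defer conn.Close()
	// Assumed illustrative wire format: one job ID per line.
	fmt.Fprintln(conn, jobID)
}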
5 - Usage
Command line usage and operation modes
Command Line Flags
| Flag | Description |
|---|---|
| -config <path> | Specify the path to the config file (default: /etc/cc-slurm-adapter/config.json) |
| -daemon | Run in daemon mode (if omitted, runs in Prolog/Epilog mode) |
| -debug <log-level> | Set the log level (default: 2, max: 5) |
| -help | Show help for all command line flags |
Operation Modes
Daemon Mode
Run the adapter as a persistent daemon that periodically synchronizes job information:
cc-slurm-adapter -daemon -config /path/to/config.json
This mode:
- Runs continuously in the background
- Queries Slurm at regular intervals (default: 60 seconds)
- Submits job information to cc-backend
- Should be managed by systemd (see Daemon Setup)
Prolog/Epilog Mode
Run the adapter from Slurm’s Prolog/Epilog hooks for immediate job notification:
cc-slurm-adapter -config /path/to/config.json
This mode:
- Only runs when triggered by Slurm (job start/stop)
- Sends job ID to the running daemon via socket
- Exits immediately
- Must be invoked from Slurm hook scripts (see Prolog/Epilog Setup)
Best Practices
Production Deployment
- Keep Daemon Running: Resource info expires quickly after job completion
- Monitor Logs: Watch for Slurm API changes or nil pointer errors
- Secure Credentials: Restrict config file permissions (600 or 640)
- Use Prolog/Epilog Carefully: Always exit with 0 to avoid blocking job allocations
- Test Before Production: Verify in development environment first
- High Job Volume: Reduce slurmPollInterval if periodic sync causes lag
- Low Latency Required: Enable Prolog/Epilog hooks
- Resource Constrained: Increase ccPollInterval (reduces cc-backend queries)
Debug Logging
Enable verbose logging for troubleshooting:
cc-slurm-adapter -daemon -debug 5 -config /path/to/config.json
Log Levels:
- 2 (default): Errors and warnings
- 5 (max): Verbose debug output
For systemd services, edit the service file to add -debug 5 to the ExecStart line.
6 - Troubleshooting
Debugging and common issues
Check Service Status
Verify the daemon is running:
sudo systemctl status cc-slurm-adapter
You should see output indicating the service is active (running).
View Logs
cc-slurm-adapter logs to stderr (captured by systemd):
sudo journalctl -u cc-slurm-adapter -f
Use -f to follow logs in real-time, or omit it to view historical logs.
Enable Debug Logging
Edit the systemd service file to add -debug 5:
ExecStart=/opt/cc-slurm-adapter/cc-slurm-adapter -daemon -debug 5 -config /opt/cc-slurm-adapter/config.json
Then reload and restart:
sudo systemctl daemon-reload
sudo systemctl restart cc-slurm-adapter
Log Levels:
- 2 (default): Errors and warnings
- 5 (max): Verbose debug output
Common Issues
| Issue | Possible Cause | Solution |
|---|---|---|
| No jobs reported | Missing Slurm permissions | Run sacctmgr add user cc-slurm-adapter Account=root AdminLevel=operator |
| Socket connection errors | Wrong socket path or permissions | Check prepSockListenPath/prepSockConnectPath and RuntimeDirectoryMode |
| Prolog/Epilog failures | Non-zero exit code in hook script | Ensure hook script exits with exit 0 |
| Missing resource info | Daemon stopped too long | Keep daemon running; resource info expires minutes after job completion |
| Job allocation failures | Prolog/Epilog exit code ≠ 0 | Check hook script and ensure cc-slurm-adapter is running |
Debugging Slurm Compatibility Issues
If you encounter nil pointer dereferences or unexpected errors:
1. Get a job ID via squeue or sacct
2. Check JSON layouts from both commands (they differ):
sacct -j 12345 --json
scontrol show job 12345 --json
3. Compare the output with what the adapter expects in slurm.go
4. Report issues to the GitHub repository with:
- Slurm version
- JSON output samples
- Error messages from logs
Verifying Configuration
Check that your configuration is valid:
# Test if config file is readable
cat /opt/cc-slurm-adapter/config.json
# Verify JSON syntax
jq . /opt/cc-slurm-adapter/config.json
Testing Connectivity
Test cc-backend Connection
# Test REST API endpoint (replace with your JWT)
curl -H "Authorization: Bearer YOUR_JWT_TOKEN" \
https://your-cc-backend-instance.example/api/jobs/
Test NATS Connection
If using NATS, verify connectivity:
# Using nats-cli (if installed)
nats server check -s nats://mynatsserver.example:4222
If the adapter is slow or missing jobs:
- Check Slurm Response Times: Run sacct and squeue manually to see if Slurm is responding slowly
- Adjust Poll Intervals: Lower slurmPollInterval for more frequent checks (but higher load)
- Enable Prolog/Epilog: Reduces dependency on polling for immediate job notification
- Check System Resources: Ensure adequate CPU/memory on the slurmctld node
7 - Architecture
Technical architecture and internal details
Synchronization Flow
The daemon operates on a periodic synchronization cycle:
- Timer Trigger: Periodic timer (default: 60s) triggers sync
- Query Slurm: Fetch job data via sacct, squeue, scontrol
- Submit to cc-backend: POST job start/stop via REST API
- Publish to NATS: Optional notification message (if enabled)
This ensures that all jobs are eventually captured, even if Prolog/Epilog hooks fail or are not configured.
Prolog/Epilog Flow
When Prolog/Epilog hooks are enabled, immediate job notification works as follows:
- Job Event: Slurm triggers Prolog/Epilog hook when a job starts or stops
- Socket Message: Hook sends job ID to daemon via socket
- Immediate Query: Daemon queries Slurm for that specific job
- Fast Submission: Job submitted to cc-backend with minimal delay
This reduces latency from up to 60 seconds (default poll interval) to just a few seconds.
Data Sources
The adapter queries multiple Slurm commands to build complete job information:
| Slurm Command | Purpose |
|---|---|
| sacct | Historical job accounting data |
| squeue | Current job queue information |
| scontrol show job | Resource allocation details (JSON format) |
| sacctmgr | User permissions |
Important: scontrol show job provides critical resource allocation information (nodes, CPUs, GPUs) that is only available while the job is in memory. This information typically expires a few minutes after job completion, which is why keeping the daemon running continuously is essential.
State Persistence
The adapter maintains minimal state on disk:
Last Run Timestamp: Stored as file modification time in lastRunPath
- Used to determine which jobs to query on startup
- Prevents flooding cc-backend with historical jobs after restarts
PID File: Stored in pidFilePath
- Prevents concurrent daemon execution
- Automatically cleaned up on graceful shutdown
Socket: IPC between daemon and Prolog/Epilog instances
- Created at prepSockListenPath (daemon listens)
- Connected at prepSockConnectPath (Prolog/Epilog connects)
- Supports both UNIX domain sockets and TCP sockets
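The timestamp mechanism can be illustrated with a short sketch (not the adapter's actual code): read the last sync time from the file's mtime, clamp it to a look-back window corresponding to slurmQueryMaxSpan, and touch the file after a successful sync.
package main

import (
	"fmt"
	"os"
	"time"
)

const lastRunPath = "/var/lib/cc-slurm-adapter/last_run" // example path

// lastRun returns the last successful sync time, never further back than maxSpan.
func lastRun(maxSpan time.Duration) time.Time {
	earliest := time.Now().Add(-maxSpan)
	info, err := os.Stat(lastRunPath)
	if err != nil {
		return earliest // no state yet
	}
	if t := info.ModTime(); t.After(earliest) {
		return t
	}
	return earliest
}

// markRun creates the marker file if needed and bumps its mtime to "now".
func markRun() error {
	f, err := os.OpenFile(lastRunPath, os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	f.Close()
	now := time.Now()
	return os.Chtimes(lastRunPath, now, now)
}

func main() {
	since := lastRun(604800 * time.Second) // slurmQueryMaxSpan default
	fmt.Println("would query Slurm for jobs since", since)
	if err := markRun(); err != nil {
		fmt.Println("could not update last-run marker:", err)
	}
}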
Fault Tolerance
The adapter is designed to be fault-tolerant:
Slurm Downtime
- Retries Slurm queries with exponential backoff
- Continues operation once Slurm becomes available
- No job loss during Slurm restarts
cc-backend Downtime
- Queues jobs internally (up to slurmQueryMaxSpan seconds in the past)
- Submits queued jobs once cc-backend is available
- Prevents duplicate submissions
Daemon Restarts
- Uses lastRunPath timestamp to catch up on missed jobs
- Limited by slurmQueryMaxSpan to prevent overwhelming the system
- Resource allocation data may be lost for jobs that completed while the daemon was down
Multi-Cluster Considerations
For environments with multiple Slurm clusters:
- Run one daemon instance per slurmctld node
- Use cluster-specific configuration files
- Consider TCP sockets for Prolog/Epilog if slurmctld is not on compute nodes
Resource Usage
- Memory: Minimal (< 50 MB typical)
- CPU: Low (periodic bursts during synchronization)
- Network: Moderate (REST API calls to cc-backend, NATS if enabled)
Scalability
- Tested with clusters of 1000+ nodes
- Handles thousands of jobs per day
- Poll interval can be tuned based on job submission rate
Latency
- Without Prolog/Epilog: Up to slurmPollInterval seconds (default: 60s)
- With Prolog/Epilog: Typically < 5 seconds
8 - API Integration
Integration with cc-backend and NATS
cc-backend REST API
The adapter communicates with cc-backend using its REST API to submit job information.
Configuration
Set these required configuration options:
{
"ccRestUrl": "https://my-cc-backend-instance.example",
"ccRestJwt": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
"ccRestSubmitJobs": true
}
- ccRestUrl: URL to cc-backend’s REST API (must not contain trailing slash)
- ccRestJwt: JWT token from cc-backend for REST API access
- ccRestSubmitJobs: Enable/disable REST API submissions (default: true)
Endpoints Used
The adapter uses the following cc-backend API endpoints:
| Endpoint | Method | Purpose |
|---|---|---|
| /api/jobs/start_job/ | POST | Submit job start event |
| /api/jobs/stop_job/<jobId> | POST | Submit job completion event |
Authentication
All API requests include a JWT bearer token in the Authorization header:
Authorization: Bearer <ccRestJwt>
Jobs are submitted in ClusterCockpit’s job metadata format, including:
- Job ID and cluster name
- User and project information
- Start and stop times
- Resource allocation (nodes, CPUs, GPUs)
- Job state and exit code
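As a rough sketch of this flow (not the adapter's code, and with an intentionally minimal payload rather than the full job metadata schema), a start-job submission with the bearer token looks roughly like this:
// Rough sketch of an authenticated start_job submission. The payload is a
// deliberately minimal, illustrative subset of the job metadata fields.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	const (
		ccRestUrl = "https://my-cc-backend-instance.example" // no trailing slash
		ccRestJwt = "XXXXXXXXXXXXXXXXXXXX"                    // JWT from cc-backend
	)

	// Illustrative subset of the documented job metadata fields.
	job := map[string]any{
		"jobId":     12345,
		"cluster":   "mycluster",
		"user":      "username",
		"project":   "projectname",
		"startTime": 1234567890,
		"numNodes":  4,
	}
	body, _ := json.Marshal(job)

	req, err := http.NewRequest(http.MethodPost, ccRestUrl+"/api/jobs/start_job/", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+ccRestJwt)
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Println("request failed (cc-backend unreachable?):", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("cc-backend responded with", resp.Status)
}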
Error Handling
- Connection Errors: Adapter retries with exponential backoff
- Authentication Errors: Logged as errors; check JWT token validity
- Validation Errors: Logged with details about invalid fields
NATS Messaging
NATS integration is optional and provides real-time job notifications to other services.
Configuration
{
"natsServer": "mynatsserver.example",
"natsPort": 4222,
"natsSubject": "mysubject",
"natsUser": "myuser",
"natsPassword": "123456789"
}
Leave natsServer empty to disable NATS integration.
Authentication Methods
The adapter supports multiple NATS authentication methods:
1. Username/Password
{
"natsUser": "myuser",
"natsPassword": "mypassword"
}
See: NATS Username/Password Auth
2. Credentials File
{
"natsCredsFile": "/etc/cc-slurm-adapter/nats.creds"
}
See: NATS Credentials File
3. NKey Authentication
{
"natsNKeySeedFile": "/etc/cc-slurm-adapter/nats.nkey"
}
See: NATS NKey Auth
Jobs are published as JSON messages to the configured subject:
{
"jobId": "12345",
"cluster": "mycluster",
"user": "username",
"project": "projectname",
"startTime": 1234567890,
"stopTime": 1234567900,
"numNodes": 4,
"resources": { ... }
}
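Another service can consume these messages with the nats.go client that the adapter itself depends on. The sketch below uses the server, credentials, and subject from the example configuration as placeholders:
// Minimal NATS subscriber for the job messages published by cc-slurm-adapter.
// Server, credentials, and subject are placeholders from the example config.
package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://mynatsserver.example:4222",
		nats.UserInfo("myuser", "123456789"))
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// Print every job event published on the configured subject.
	_, err = nc.Subscribe("mysubject", func(msg *nats.Msg) {
		fmt.Printf("job event: %s\n", msg.Data)
	})
	if err != nil {
		log.Fatal(err)
	}

	select {} // keep the subscriber running
}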
Use Cases
NATS integration is useful for:
- Real-time Monitoring: Other services can subscribe to job events
- Event-Driven Workflows: Trigger actions when jobs start/stop
- Alternative to REST: Can disable REST submission and use NATS-only
- Multi-Component Architecture: Multiple services consuming job events
Notes:
- NATS adds minimal latency (typically < 1ms)
- Messages are fire-and-forget (no delivery guarantees by default)
- Consider using NATS JetStream for persistent queues if needed
Dual Submission Mode
By default, the adapter submits jobs to both cc-backend REST API and NATS:
{
"ccRestSubmitJobs": true,
"natsServer": "mynatsserver.example"
}
This ensures:
- cc-backend receives authoritative job data
- Other services can react to job events in real-time
NATS-Only Mode
For specialized deployments, you can disable REST submission:
{
"ccRestSubmitJobs": false,
"natsServer": "mynatsserver.example"
}
Warning: In this mode, you must ensure another component (e.g., a NATS subscriber) is forwarding job data to cc-backend, or jobs will not appear in the UI.