cc-slurm-adapter

ClusterCockpit Slurm Adapter References

Reference information regarding the ClusterCockpit component “cc-slurm-adapter” (GitHub Repo).

Overview

cc-slurm-adapter is a software daemon that feeds cc-backend with job information from Slurm in realtime.

Key Features

  • Fault Tolerant: Handles cc-backend or Slurm downtime gracefully without losing jobs
  • Automatic Recovery: Submits jobs to cc-backend as soon as services are available again
  • Realtime Updates: Supports immediate job notification via Slurm Prolog/Epilog hooks
  • NATS Integration: Optional job notification messaging via NATS
  • Minimal Dependencies: Uses Slurm commands (sacct, squeue, sacctmgr, scontrol) - no slurmrestd required

Architecture

The daemon runs on the same node as slurmctld and operates in two modes:

  1. Daemon Mode: Periodic synchronization (default: every 60 seconds) between Slurm and cc-backend
  2. Prolog/Epilog Mode: Immediate trigger on job start/stop events (optional, reduces latency)

Data is submitted to cc-backend via REST API. Note: Slurm’s slurmdbd is mandatory.

Limitations

Resource Information Availability

Because slurmdbd does not store all job information, some details may be unavailable in certain cases:

  • Resource allocation information is obtained via scontrol --cluster <cluster> show job <jobid> --json
  • This information becomes unavailable a few minutes after job completion
  • If the daemon is stopped for too long, jobs may lack resource information
  • Critical Impact: Without resource information, cc-backend cannot associate jobs with metrics (CPU, GPU, memory)
  • Jobs will still be listed in cc-backend but metric visualization will not work
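
To check whether this allocation data is still retrievable for a given job, you can query scontrol directly. The job ID 12345 is a placeholder and the exact JSON field names may differ between Slurm versions; jq is only used here for readability:

scontrol show job 12345 --json | jq '.jobs[0].job_resources'

If the job finished more than a few minutes ago, scontrol typically reports that the job is no longer available, which is exactly the situation described above.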

Slurm Version Compatibility

Supported Versions

These Slurm versions are known to work:

  • 24.xx.x
  • 25.xx.x

Compatibility Notes

All Slurm-related code is concentrated in slurm.go for easier maintenance. The most common compatibility issue is nil pointer dereference due to missing JSON fields.

Debugging Incompatibilities

If you encounter nil pointer dereferences:

  1. Get a job ID via squeue or sacct

  2. Check JSON layouts from both commands (they differ):

    sacct -j 12345 --json
    scontrol show job 12345 --json
    

SlurmInt and SlurmString Types

Slurm has been transitioning API formats:

  • SlurmInt: Handles both plain integers and Slurm’s “infinite/set” struct format
  • SlurmString: Handles both plain strings and string arrays (uses first element if array, blank if empty)

These custom types maintain backward compatibility across Slurm versions.
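
As an illustration of the two layouts these types absorb, a numeric field such as a job’s time limit may arrive in either of the following shapes depending on the Slurm version (field name and values here are purely illustrative):

# Older layout: a plain integer
"time_limit": 60

# Newer layout: a struct that can also express "unset" or "infinite"
"time_limit": { "set": true, "infinite": false, "number": 60 }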

1 - Installation

Installing and building cc-slurm-adapter

Prerequisites

  • Go 1.24.0 or higher
  • Slurm with slurmdbd configured
  • cc-backend instance with API access
  • Access to the slurmctld node

Building from Source

Requirements

go 1.24.0+

Dependencies

Key dependencies (managed via go.mod):

  • github.com/ClusterCockpit/cc-lib - ClusterCockpit common library
  • github.com/nats-io/nats.go - NATS client

Compilation

make

This creates the cc-slurm-adapter binary.

Build Commands

# Build binary
make

# Format code
make format

# Clean build artifacts
make clean
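
Assuming the source is checked out from the GitHub repository referenced above (github.com/ClusterCockpit/cc-slurm-adapter), a complete build from scratch might look like this:

git clone https://github.com/ClusterCockpit/cc-slurm-adapter.git
cd cc-slurm-adapter
make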

2 - cc-slurm-adapter Configuration

cc-slurm-adapter configuration reference

Configuration File Location

Default: /etc/cc-slurm-adapter/config.json

Example Configuration

{
  "pidFilePath": "/run/cc-slurm-adapter/daemon.pid",
  "prepSockListenPath": "/run/cc-slurm-adapter/daemon.sock",
  "prepSockConnectPath": "/run/cc-slurm-adapter/daemon.sock",
  "lastRunPath": "/var/lib/cc-slurm-adapter/last_run",
  "slurmPollInterval": 60,
  "slurmQueryDelay": 1,
  "slurmQueryMaxSpan": 604800,
  "slurmQueryMaxRetries": 5,
  "ccPollInterval": 21600,
  "ccRestSubmitJobs": true,
  "ccRestUrl": "https://my-cc-backend-instance.example",
  "ccRestJwt": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "gpuPciAddrs": {
    "^nodehostname0[0-9]$": ["00000000:00:10.0", "00000000:00:3F.0"],
    "^nodehostname1[0-9]$": ["00000000:00:10.0", "00000000:00:3F.0"]
  },
  "ignoreHosts": "^nodehostname9\\w+$",
  "natsServer": "mynatsserver.example",
  "natsPort": 4222,
  "natsSubject": "mysubject",
  "natsUser": "myuser",
  "natsPassword": "123456789",
  "natsCredsFile": "/etc/cc-slurm-adapter/nats.creds",
  "natsNKeySeedFile": "/etc/ss-slurm-adapter/nats.nkey"
}

Configuration Reference

Required Settings

| Config Key | Type | Description |
|---|---|---|
| ccRestUrl | string | URL to cc-backend’s REST API (must not contain trailing slash) |
| ccRestJwt | string | JWT token from cc-backend for REST API access |

Daemon Settings

| Config Key | Type | Default | Description |
|---|---|---|---|
| pidFilePath | string | /run/cc-slurm-adapter/daemon.pid | Path to PID file (prevents concurrent execution) |
| lastRunPath | string | /var/lib/cc-slurm-adapter/last_run | Path to file storing last successful sync timestamp (as file mtime) |

Socket Settings

| Config Key | Type | Default | Description |
|---|---|---|---|
| prepSockListenPath | string | /run/cc-slurm-adapter/daemon.sock | Socket for daemon to receive prolog/epilog events. Supports UNIX and TCP formats (see below) |
| prepSockConnectPath | string | /run/cc-slurm-adapter/daemon.sock | Socket for prolog/epilog mode to connect to daemon |

Socket Formats:

  • UNIX: /run/cc-slurm-adapter/daemon.sock or unix:/run/cc-slurm-adapter/daemon.sock
  • TCP IPv4: tcp:127.0.0.1:12345 or tcp:0.0.0.0:12345
  • TCP IPv6: tcp:[::1]:12345, tcp:[::]:12345, tcp::12345

Slurm Polling Settings

| Config Key | Type | Default | Description |
|---|---|---|---|
| slurmPollInterval | int | 60 | Interval (seconds) for periodic sync to cc-backend |
| slurmQueryDelay | int | 1 | Wait time (seconds) after prolog/epilog event before querying Slurm |
| slurmQueryMaxSpan | int | 604800 | Maximum time (seconds) to query jobs from the past (prevents flooding) |
| slurmQueryMaxRetries | int | 10 | Maximum Slurm query attempts on Prolog/Epilog events |

cc-backend Settings

| Config Key | Type | Default | Description |
|---|---|---|---|
| ccPollInterval | int | 21600 | Interval (seconds) to query all jobs from cc-backend (prevents stuck jobs) |
| ccRestSubmitJobs | bool | true | Submit started/stopped jobs to cc-backend via REST (set false if using NATS-only) |

Hardware Mapping

| Config Key | Type | Default | Description |
|---|---|---|---|
| gpuPciAddrs | object | {} | Map of hostname regexes to GPU PCI address arrays (must match NVML/nvidia-smi order) |
| ignoreHosts | string | "" | Regex of hostnames to ignore (jobs that run only on matching hosts are discarded) |
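
To fill in gpuPciAddrs, the PCI addresses of a node’s GPUs can be listed with nvidia-smi, whose default output order follows NVML enumeration. The output lines shown as comments are illustrative and mirror the addresses used in the sample configuration:

nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader
# 0, 00000000:00:10.0
# 1, 00000000:00:3F.0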

NATS Settings

| Config Key | Type | Default | Description |
|---|---|---|---|
| natsServer | string | "" | NATS server hostname (leave blank to disable NATS) |
| natsPort | uint16 | 4222 | NATS server port |
| natsSubject | string | "jobs" | Subject to publish job information to |
| natsUser | string | "" | NATS username (for user auth) |
| natsPassword | string | "" | NATS password |
| natsCredsFile | string | "" | Path to NATS credentials file |
| natsNKeySeedFile | string | "" | Path to NATS NKey seed file (private key) |

Note: The deprecated ipcSockPath option has been removed. Use prepSockListenPath and prepSockConnectPath instead.

3 - Daemon Setup

Setting up cc-slurm-adapter as a daemon

The daemon mode is required for cc-slurm-adapter to function. This page describes how to set up the daemon using systemd.

1. Copy Binary and Configuration

Copy the binary and create a configuration file:

sudo mkdir -p /opt/cc-slurm-adapter
sudo cp cc-slurm-adapter /opt/cc-slurm-adapter/
sudo cp config.json /opt/cc-slurm-adapter/

Security: The config file contains sensitive credentials (JWT, NATS). Set appropriate permissions:

sudo chmod 600 /opt/cc-slurm-adapter/config.json

2. Create System User

sudo useradd -r -s /bin/false cc-slurm-adapter
sudo chown -R cc-slurm-adapter:slurm /opt/cc-slurm-adapter

3. Grant Slurm Permissions

The adapter user needs permission to query Slurm:

sacctmgr add user cc-slurm-adapter Account=root AdminLevel=operator

Critical: If permissions are not set and Slurm is restricted, NO JOBS WILL BE REPORTED.
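
You can verify the result afterwards; the Admin column for the adapter user should show Operator:

sacctmgr show user cc-slurm-adapter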

4. Install systemd Service

Create /etc/systemd/system/cc-slurm-adapter.service:

[Unit]
Description=cc-slurm-adapter
Wants=network.target
After=network.target

[Service]
User=cc-slurm-adapter
Group=slurm
ExecStart=/opt/cc-slurm-adapter/cc-slurm-adapter -daemon -config /opt/cc-slurm-adapter/config.json
WorkingDirectory=/opt/cc-slurm-adapter/
RuntimeDirectory=cc-slurm-adapter
RuntimeDirectoryMode=0750
Restart=on-failure
RestartSec=15s

[Install]
WantedBy=multi-user.target

Notes:

  • RuntimeDirectory creates /run/cc-slurm-adapter for PID and socket files
  • Group=slurm allows Prolog/Epilog (running as slurm user) to access the socket
  • RuntimeDirectoryMode=0750 enables group access

5. Enable and Start Service

sudo systemctl daemon-reload
sudo systemctl enable cc-slurm-adapter
sudo systemctl start cc-slurm-adapter

Verification

Check that the service is running:

sudo systemctl status cc-slurm-adapter

You should see output indicating the service is active and running.

4 - Prolog/Epilog Hooks

Setting up Prolog/Epilog hooks for immediate job notification

Prolog/Epilog hook setup is optional but recommended for immediate job notification, which reduces latency compared to relying solely on periodic polling.

Prerequisites

  • Daemon must be running (see Daemon Setup)
  • Hook script must be accessible from slurmctld
  • Hook script must exit with code 0 to avoid rejecting job allocations

1. Create Hook Script

Create /opt/cc-slurm-adapter/hook.sh:

#!/bin/sh
/opt/cc-slurm-adapter/cc-slurm-adapter
exit 0

Make it executable:

sudo chmod +x /opt/cc-slurm-adapter/hook.sh

Important: Always exit with 0. Non-zero exit codes will reject job allocations.

2. Configure Slurm

Add to slurm.conf:

PrEpPlugins=prep/script
PrologSlurmctld=/opt/cc-slurm-adapter/hook.sh
EpilogSlurmctld=/opt/cc-slurm-adapter/hook.sh

3. Restart slurmctld

sudo systemctl restart slurmctld

Note: If you use a non-default config or socket path, add -config /path/to/config.json to the cc-slurm-adapter call in hook.sh, as shown below. The config file must be readable by the slurm user/group.
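
A hook script that passes an explicit configuration path (using the location chosen in this guide) might then look like this:

#!/bin/sh
/opt/cc-slurm-adapter/cc-slurm-adapter -config /opt/cc-slurm-adapter/config.json
exit 0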

Multi-Cluster Setup

For multiple slurmctld nodes, use TCP sockets instead of UNIX sockets:

{
  "prepSockListenPath": "tcp:0.0.0.0:12345",
  "prepSockConnectPath": "tcp:slurmctld-host:12345"
}

This allows Prolog/Epilog hooks on different nodes to connect to the daemon over the network.

How It Works

  1. Job Event: Slurm triggers Prolog/Epilog hook when a job starts or stops
  2. Socket Message: Hook sends job ID to daemon via socket
  3. Immediate Query: Daemon queries Slurm for that specific job
  4. Fast Submission: Job submitted to cc-backend with minimal delay

This reduces the job notification latency from up to 60 seconds (default poll interval) to just a few seconds.

5 - Usage

Command line usage and operation modes

Command Line Flags

| Flag | Description |
|---|---|
| -config <path> | Specify the path to the config file (default: /etc/cc-slurm-adapter/config.json) |
| -daemon | Run in daemon mode (if omitted, runs in Prolog/Epilog mode) |
| -debug <log-level> | Set the log level (default: 2, max: 5) |
| -help | Show help for all command line flags |

Operation Modes

Daemon Mode

Run the adapter as a persistent daemon that periodically synchronizes job information:

cc-slurm-adapter -daemon -config /path/to/config.json

This mode:

  • Runs continuously in the background
  • Queries Slurm at regular intervals (default: 60 seconds)
  • Submits job information to cc-backend
  • Should be managed by systemd (see Daemon Setup)

Prolog/Epilog Mode

Run the adapter from Slurm’s Prolog/Epilog hooks for immediate job notification:

cc-slurm-adapter

This mode:

  • Only runs when triggered by Slurm (job start/stop)
  • Sends job ID to the running daemon via socket
  • Exits immediately
  • Must be invoked from Slurm hook scripts (see Prolog/Epilog Setup)

Best Practices

Production Deployment

  1. Keep Daemon Running: Resource info expires quickly after job completion
  2. Monitor Logs: Watch for Slurm API changes or nil pointer errors
  3. Secure Credentials: Restrict config file permissions (600 or 640)
  4. Use Prolog/Epilog Carefully: Always exit with 0 to avoid blocking job allocations
  5. Test Before Production: Verify in development environment first

Performance Tuning

  • High Job Volume: Reduce slurmPollInterval if periodic sync causes lag
  • Low Latency Required: Enable Prolog/Epilog hooks
  • Resource Constrained: Increase ccPollInterval (reduces cc-backend queries)
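
For example, a resource-constrained site that relies on Prolog/Epilog hooks for low latency might relax both intervals. The values below are illustrative only and should be tuned for your environment:

{
  "slurmPollInterval": 120,
  "ccPollInterval": 43200
}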

Debug Logging

Enable verbose logging for troubleshooting:

cc-slurm-adapter -daemon -debug 5 -config /path/to/config.json

Log Levels:

  • 2 (default): Errors and warnings
  • 5 (max): Verbose debug output

For systemd services, edit the service file to add -debug 5 to the ExecStart line.

6 - Troubleshooting

Debugging and common issues

Check Service Status

Verify the daemon is running:

sudo systemctl status cc-slurm-adapter

You should see output indicating the service is active (running).

View Logs

cc-slurm-adapter logs to stderr (captured by systemd):

sudo journalctl -u cc-slurm-adapter -f

Use -f to follow logs in real-time, or omit it to view historical logs.

Enable Debug Logging

Edit the systemd service file to add -debug 5:

ExecStart=/opt/cc-slurm-adapter/cc-slurm-adapter -daemon -debug 5 -config /opt/cc-slurm-adapter/config.json

Then reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart cc-slurm-adapter

Log Levels:

  • 2 (default): Errors and warnings
  • 5 (max): Verbose debug output

Common Issues

| Issue | Possible Cause | Solution |
|---|---|---|
| No jobs reported | Missing Slurm permissions | Run sacctmgr add user cc-slurm-adapter Account=root AdminLevel=operator |
| Socket connection errors | Wrong socket path or permissions | Check prepSockListenPath/prepSockConnectPath and RuntimeDirectoryMode |
| Prolog/Epilog failures | Non-zero exit code in hook script | Ensure hook script exits with exit 0 |
| Missing resource info | Daemon stopped too long | Keep daemon running; resource info expires minutes after job completion |
| Job allocation failures | Prolog/Epilog exit code ≠ 0 | Check hook script and ensure cc-slurm-adapter is running |
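
For socket-related problems, it can help to check the daemon’s listening socket directly. The paths and port below are the ones used elsewhere in this guide; nc is only relevant for TCP sockets:

# UNIX socket: should exist and be group-accessible
ls -l /run/cc-slurm-adapter/daemon.sock

# TCP socket: verify the daemon's port is reachable from the node running the hook
nc -z slurmctld-host 12345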

Debugging Slurm Compatibility Issues

If you encounter nil pointer dereferences or unexpected errors:

  1. Get a job ID via squeue or sacct:

    squeue
    # or
    sacct
    
  2. Check JSON layouts from both commands (they differ):

    sacct -j 12345 --json
    scontrol show job 12345 --json
    
  3. Compare the output with what the adapter expects in slurm.go

  4. Report issues to the GitHub repository with:

    • Slurm version
    • JSON output samples
    • Error messages from logs

Verifying Configuration

Check that your configuration is valid:

# Test if config file is readable
cat /opt/cc-slurm-adapter/config.json

# Verify JSON syntax
jq . /opt/cc-slurm-adapter/config.json

Testing Connectivity

Test cc-backend Connection

# Test REST API endpoint (replace with your JWT)
curl -H "Authorization: Bearer YOUR_JWT_TOKEN" \
     https://your-cc-backend-instance.example/api/jobs/

Test NATS Connection

If using NATS, verify connectivity:

# Using nats-cli (if installed)
nats server check connection -s nats://mynatsserver.example:4222

Performance Issues

If the adapter is slow or missing jobs:

  1. Check Slurm Response Times: Run sacct and squeue manually to see if Slurm is responding slowly
  2. Adjust Poll Intervals: Lower slurmPollInterval for more frequent checks (but higher load)
  3. Enable Prolog/Epilog: Reduces dependency on polling for immediate job notification
  4. Check System Resources: Ensure adequate CPU/memory on the slurmctld node
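
For step 1, timing a query similar to what the adapter issues gives a rough feel for Slurm’s responsiveness; the one-hour window here is arbitrary:

time sacct --allusers --starttime now-1hour --json > /dev/null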

7 - Architecture

Technical architecture and internal details

Synchronization Flow

The daemon operates on a periodic synchronization cycle:

  1. Timer Trigger: Periodic timer (default: 60s) triggers sync
  2. Query Slurm: Fetch job data via sacct, squeue, scontrol
  3. Submit to cc-backend: POST job start/stop via REST API
  4. Publish to NATS: Optional notification message (if enabled)

This ensures that all jobs are eventually captured, even if Prolog/Epilog hooks fail or are not configured.

Prolog/Epilog Flow

When Prolog/Epilog hooks are enabled, immediate job notification works as follows:

  1. Job Event: Slurm triggers Prolog/Epilog hook when a job starts or stops
  2. Socket Message: Hook sends job ID to daemon via socket
  3. Immediate Query: Daemon queries Slurm for that specific job
  4. Fast Submission: Job submitted to cc-backend with minimal delay

This reduces latency from up to 60 seconds (default poll interval) to just a few seconds.

Data Sources

The adapter queries multiple Slurm commands to build complete job information:

| Slurm Command | Purpose |
|---|---|
| sacct | Historical job accounting data |
| squeue | Current job queue information |
| scontrol show job | Resource allocation details (JSON format) |
| sacctmgr | User permissions |

Important: scontrol show job provides critical resource allocation information (nodes, CPUs, GPUs) that is only available while the job is in memory. This information typically expires a few minutes after job completion, which is why keeping the daemon running continuously is essential.

State Persistence

The adapter maintains minimal state on disk:

  • Last Run Timestamp: Stored as file modification time in lastRunPath

    • Used to determine which jobs to query on startup
    • Prevents flooding cc-backend with historical jobs after restarts
  • PID File: Stored in pidFilePath

    • Prevents concurrent daemon execution
    • Automatically cleaned up on graceful shutdown
  • Socket: IPC between daemon and Prolog/Epilog instances

    • Created at prepSockListenPath (daemon listens)
    • Connected at prepSockConnectPath (Prolog/Epilog connects)
    • Supports both UNIX domain sockets and TCP sockets
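
This on-disk state can be inspected with standard tools; the paths below are the defaults from the configuration reference:

# When the daemon last completed a successful sync (stored as the file's mtime)
stat -c '%y' /var/lib/cc-slurm-adapter/last_run

# PID of the currently running daemon
cat /run/cc-slurm-adapter/daemon.pid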

Fault Tolerance

The adapter is designed to be fault-tolerant:

Slurm Downtime

  • Retries Slurm queries with exponential backoff
  • Continues operation once Slurm becomes available
  • No job loss during Slurm restarts

cc-backend Downtime

  • Queues jobs internally (up to slurmQueryMaxSpan seconds in the past)
  • Submits queued jobs once cc-backend is available
  • Prevents duplicate submissions

Daemon Restarts

  • Uses lastRunPath timestamp to catch up on missed jobs
  • Limited by slurmQueryMaxSpan to prevent overwhelming the system
  • Resource allocation data may be lost for jobs that completed while daemon was down

Multi-Cluster Considerations

For environments with multiple Slurm clusters:

  • Run one daemon instance per slurmctld node
  • Use cluster-specific configuration files
  • Consider TCP sockets for Prolog/Epilog if slurmctld is not on compute nodes

Performance Characteristics

Resource Usage

  • Memory: Minimal (< 50 MB typical)
  • CPU: Low (periodic bursts during synchronization)
  • Network: Moderate (REST API calls to cc-backend, NATS if enabled)

Scalability

  • Tested with clusters of 1000+ nodes
  • Handles thousands of jobs per day
  • Poll interval can be tuned based on job submission rate

Latency

  • Without Prolog/Epilog: Up to slurmPollInterval seconds (default: 60s)
  • With Prolog/Epilog: Typically < 5 seconds

8 - API Integration

Integration with cc-backend and NATS

cc-backend REST API

The adapter communicates with cc-backend using its REST API to submit job information.

Configuration

Set these required configuration options:

{
  "ccRestUrl": "https://my-cc-backend-instance.example",
  "ccRestJwt": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "ccRestSubmitJobs": true
}

  • ccRestUrl: URL to cc-backend’s REST API (must not contain trailing slash)
  • ccRestJwt: JWT token from cc-backend for REST API access
  • ccRestSubmitJobs: Enable/disable REST API submissions (default: true)

Endpoints Used

The adapter uses the following cc-backend API endpoints:

| Endpoint | Method | Purpose |
|---|---|---|
| /api/jobs/start_job/ | POST | Submit job start event |
| /api/jobs/stop_job/<jobId> | POST | Submit job completion event |

Authentication

All API requests include a JWT bearer token in the Authorization header:

Authorization: Bearer <ccRestJwt>

Job Data Format

Jobs are submitted in ClusterCockpit’s job metadata format, including:

  • Job ID and cluster name
  • User and project information
  • Start and stop times
  • Resource allocation (nodes, CPUs, GPUs)
  • Job state and exit code

Error Handling

  • Connection Errors: Adapter retries with exponential backoff
  • Authentication Errors: Logged as errors; check JWT token validity
  • Validation Errors: Logged with details about invalid fields

NATS Messaging

NATS integration is optional and provides real-time job notifications to other services.

Configuration

{
  "natsServer": "mynatsserver.example",
  "natsPort": 4222,
  "natsSubject": "mysubject",
  "natsUser": "myuser",
  "natsPassword": "123456789"
}

Leave natsServer empty to disable NATS integration.

Authentication Methods

The adapter supports multiple NATS authentication methods:

1. Username/Password

{
  "natsUser": "myuser",
  "natsPassword": "mypassword"
}

See: NATS Username/Password Auth

2. Credentials File

{
  "natsCredsFile": "/etc/cc-slurm-adapter/nats.creds"
}

See: NATS Credentials File

3. NKey Authentication

{
  "natsNKeySeedFile": "/etc/cc-slurm-adapter/nats.nkey"
}

See: NATS NKey Auth

Message Format

Jobs are published as JSON messages to the configured subject:

{
  "jobId": "12345",
  "cluster": "mycluster",
  "user": "username",
  "project": "projectname",
  "startTime": 1234567890,
  "stopTime": 1234567900,
  "numNodes": 4,
  "resources": { ... }
}

Use Cases

NATS integration is useful for:

  • Real-time Monitoring: Other services can subscribe to job events
  • Event-Driven Workflows: Trigger actions when jobs start/stop
  • Alternative to REST: Can disable REST submission and use NATS-only
  • Multi-Component Architecture: Multiple services consuming job events
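
As a quick way to watch the published events, the nats CLI can subscribe to the configured subject. Server, subject, and credentials below mirror the example configuration and should be replaced with your own:

nats sub mysubject -s nats://mynatsserver.example:4222 --user myuser --password 123456789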

Performance Considerations

  • NATS adds minimal latency (typically < 1ms)
  • Messages are fire-and-forget (no delivery guarantees by default)
  • Consider using NATS JetStream for persistent queues if needed

Dual Submission Mode

By default, the adapter submits jobs to both cc-backend REST API and NATS:

{
  "ccRestSubmitJobs": true,
  "natsServer": "mynatsserver.example"
}

This ensures:

  • cc-backend receives authoritative job data
  • Other services can react to job events in real-time

NATS-Only Mode

For specialized deployments, you can disable REST submission:

{
  "ccRestSubmitJobs": false,
  "natsServer": "mynatsserver.example"
}

Warning: In this mode, you must ensure another component (e.g., a NATS subscriber) is forwarding job data to cc-backend, or jobs will not appear in the UI.