ClusterCockpit

Job-specific Performance and Energy Monitoring for HPC Clusters

ClusterCockpit is a framework for job-specific performance and energy monitoring on HPC clusters with a focus on simple installation and maintenance, high security and intuitive usage.

Scalable

Supports multiple clusters in one web interface. Scales to thousands of nodes and millions of jobs. Supports heterogeneous clusters with and without node sharing.

Slurm integration

Comes with a ready to use Slurm integration. Can be integrated with any batch job scheduler via a REST API.

Modern Web UI

Responsive web interface with job-specific, node, user, and system views. Provides metric plots, aggregate statistics, roofline plots, resource table, and access to job scripts. Fully user configurable. A public dashboard is available for unauthenticated external users.

Global access to metric data

Access to configurable set of metrics including hardware performance counter data. Comes with powerful HPC centric node agent, but can also be integrated with other node agent solutions.

Authentication methods

Supports local accounts, LDAP, and KeyCloak OpenID Connect. Can be integrated with existing user portals using JWT based authentication.

User roles

Supports roles for users, project managers, support personnel, and administrators. Users can only see their own jobs. Metrics can be restricted to specific roles for fine-grained access control.

Job sorting and filtering

Powerful sorting of jobs according to all job metadata attributes. Filter for job metadata and aggregate metric data attributes.

Unified search bar

A unified search bar allows to search for job ids, job names, project ids, usernames, and names.

Job tagging

Jobs can be tagged manually or automatically. The built-in job tagger detects known applications (e.g. MATLAB, GROMACS) and flags pathological jobs automatically. Tags are grouped by type and have a configurable scope attribute for visibility control.

Node state monitoring

Live node health tracking with per-metric status and historical state retention. The systems view provides node lists with filtering, paging, and a health status overview per cluster and subcluster.

Flexible job archive

Supports multiple archive backends: JSON files, columnar Parquet files (with zstd compression), SQLite blob storage, and S3-compatible object storage. Cluster- and subcluster-specific retention policies are supported.

Real-time NATS API

Publish and subscribe to real-time job start/stop events and node state changes via NATS messaging. Enables event-driven integrations and live dashboards without polling.

HPC Centers using ClusterCockpit

TU Darmstadt, Ruhr University Bochum, University Duisburg-Essen