ClusterCockpit

Job-specific Performance and Energy Monitoring for HPC Clusters

ClusterCockpit is a framework for job-specific performance and energy monitoring on HPC clusters with a focus on simple installation and maintenance, high security and intuitive usage.

Scalable

Supports multiple clusters in one web interface. Scales to thousands of nodes and millions of jobs. Supports heterogeneous clusters with and without node sharing.

Slurm integration

Comes with a ready-to-use Slurm integration. Can be integrated with any other batch job scheduler via a REST API.
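
As a rough illustration of what a scheduler-side integration over REST might look like, the sketch below builds (but does not send) a job-start notification. The base URL, token, endpoint path, and payload field names are placeholders chosen for this example and are assumptions, not the documented API; consult the REST API reference for the actual schema.

```python
import json
import urllib.request

# Hypothetical deployment values, replace with your own.
BASE_URL = "https://monitoring.example.org"
API_TOKEN = "replace-with-a-real-token"


def build_job_start_request(job_id: int, user: str, cluster: str,
                            nodes: list[str], start_time: int) -> urllib.request.Request:
    """Build (but do not send) a POST request announcing a job start."""
    payload = {
        # Field names are illustrative assumptions.
        "jobId": job_id,
        "user": user,
        "cluster": cluster,
        "startTime": start_time,  # Unix epoch seconds
        "resources": [{"hostname": n} for n in nodes],
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/api/jobs/start_job/",  # assumed endpoint path
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_TOKEN}",
        },
    )


request = build_job_start_request(1234, "alice", "alpha", ["node01", "node02"], 1700000000)
# Sending would be: urllib.request.urlopen(request)
```

A scheduler prolog script could call such a helper at job start, and an epilog script could send the matching stop notification.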

Modern Web UI

Responsive web interface with job-specific, node, user, and system views. Provides metric plots, aggregate statistics, roofline plots, a resource table, and access to job scripts. Fully user-configurable. A public dashboard is available for unauthenticated external users.
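
For readers unfamiliar with roofline plots, the standard roofline model bounds attainable performance by the lower of peak compute throughput and memory bandwidth times arithmetic intensity. The sketch below evaluates that bound; the hardware numbers are made-up example values, and the function name is hypothetical.

```python
def attainable_gflops(peak_gflops: float, peak_bandwidth_gbs: float,
                      intensity_flop_per_byte: float) -> float:
    """Roofline bound: min(peak compute, memory bandwidth * arithmetic intensity)."""
    return min(peak_gflops, peak_bandwidth_gbs * intensity_flop_per_byte)


# Example node (invented values): 2000 GFLOP/s peak, 200 GB/s memory bandwidth.
PEAK_GFLOPS = 2000.0
PEAK_BW_GBS = 200.0

# Intensity at which a kernel stops being memory-bound (the "ridge point").
ridge_point = PEAK_GFLOPS / PEAK_BW_GBS

memory_bound = attainable_gflops(PEAK_GFLOPS, PEAK_BW_GBS, 0.5)    # below the ridge
compute_bound = attainable_gflops(PEAK_GFLOPS, PEAK_BW_GBS, 20.0)  # above the ridge
```

A job whose measured point lies far below this bound at its arithmetic intensity is a candidate for closer inspection.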

Global access to metric data

Access to a configurable set of metrics, including hardware performance counter data. Comes with a powerful HPC-centric node agent, but can also be integrated with other node agent solutions.

Authentication methods

Supports local accounts, LDAP, and Keycloak OpenID Connect. Can be integrated with existing user portals using JWT-based authentication.

User roles

Supports roles for users, project managers, support personnel, and administrators. Users can only see their own jobs. Metrics can be restricted to specific roles for fine-grained access control.
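
The visibility rule can be pictured with a small sketch. Only the rule that users see their own jobs comes from the description above; the role identifiers, what managers and support staff can see, and the per-metric role sets are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    job_id: int
    user: str
    project: str


def visible_jobs(jobs, username, role, managed_projects=frozenset()):
    """Return the jobs a given account may see (role names are hypothetical)."""
    if role in ("admin", "support"):
        # Assumption: staff roles see every job.
        return list(jobs)
    if role == "manager":
        # Assumption: managers see their own jobs plus jobs of managed projects.
        return [j for j in jobs if j.user == username or j.project in managed_projects]
    # Plain users only see their own jobs.
    return [j for j in jobs if j.user == username]


# Hypothetical per-metric restriction: metric name -> roles allowed to view it.
METRIC_ROLES = {
    "cpu_load": {"user", "manager", "support", "admin"},
    "cpu_power": {"support", "admin"},
}


def visible_metrics(role):
    return sorted(name for name, roles in METRIC_ROLES.items() if role in roles)


JOBS = [Job(1, "alice", "proj-a"), Job(2, "bob", "proj-a"), Job(3, "carol", "proj-b")]
```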

Job sorting and filtering

Sort jobs by any job metadata attribute. Filter by job metadata and aggregated metric data attributes.
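
Conceptually, this combines a filter over metadata and aggregated metric values with a sort on a chosen attribute. The sketch below shows the idea on in-memory records; the record layout, the `mem_bw_avg` metric name, and the helper functions are hypothetical and do not reflect how the web interface implements it.

```python
# Invented example records: metadata plus one aggregated metric value per job.
JOBS = [
    {"job_id": 101, "user": "alice", "num_nodes": 4, "duration_s": 7200, "mem_bw_avg": 85.0},
    {"job_id": 102, "user": "bob", "num_nodes": 1, "duration_s": 600, "mem_bw_avg": 12.5},
    {"job_id": 103, "user": "alice", "num_nodes": 16, "duration_s": 3600, "mem_bw_avg": 140.0},
]


def filter_jobs(jobs, user=None, min_nodes=1, min_mem_bw_avg=0.0):
    """Keep jobs matching the metadata and aggregated-metric constraints."""
    return [
        j for j in jobs
        if (user is None or j["user"] == user)
        and j["num_nodes"] >= min_nodes
        and j["mem_bw_avg"] >= min_mem_bw_avg
    ]


def sort_jobs(jobs, key, descending=True):
    """Sort jobs by any attribute present in the records."""
    return sorted(jobs, key=lambda j: j[key], reverse=descending)


# Alice's jobs with at least 50 GB/s average memory bandwidth, longest first.
selected = sort_jobs(filter_jobs(JOBS, user="alice", min_mem_bw_avg=50.0), "duration_s")
```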

Unified search bar

A unified search bar lets you search for job IDs, job names, project IDs, usernames, and names.

Job tagging

Jobs can be tagged manually or automatically. The built-in job tagger detects known applications (e.g. MATLAB, GROMACS) and flags pathological jobs automatically. Tags are grouped by type and have a configurable scope attribute for visibility control.
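
To make the idea of automatic tagging concrete, here is a minimal rule-based sketch: it matches application names in a job script and flags a job with very low CPU load. The patterns, the 10% load threshold, and the tag type, name, and scope values are all assumptions for illustration; the built-in tagger's actual rules may differ.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    type: str   # tags are grouped by type
    name: str
    scope: str  # controls who can see the tag


# Hypothetical detection patterns for known applications.
APP_PATTERNS = {
    "matlab": re.compile(r"\bmatlab\b", re.IGNORECASE),
    "gromacs": re.compile(r"\bgmx(_mpi)?\b|\bgromacs\b", re.IGNORECASE),
}


def tag_job(job_script: str, avg_cpu_load: float, num_cores: int) -> list[Tag]:
    """Return tags derived from the job script and one aggregate metric."""
    tags = []
    for app, pattern in APP_PATTERNS.items():
        if pattern.search(job_script):
            tags.append(Tag("app", app, "global"))
    # Assumed rule: a job using under 10% of its cores is flagged as pathological.
    if avg_cpu_load < 0.1 * num_cores:
        tags.append(Tag("jobClass", "lowload", "global"))
    return tags


script = "#!/bin/bash\nmodule load gromacs\nsrun gmx_mpi mdrun -s topol.tpr\n"
tags = tag_job(script, avg_cpu_load=2.0, num_cores=64)
```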

Node state monitoring

Live node health tracking with per-metric status and historical state retention. The systems view provides node lists with filtering, paging, and a health status overview per cluster and subcluster.

Flexible job archive

Supports multiple archive backends: JSON files, columnar Parquet files (with zstd compression), SQLite blob storage, and S3-compatible object storage. Cluster- and subcluster-specific retention policies are supported.
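
One way to think about cluster- and subcluster-specific retention is a lookup that falls back from the most specific policy to a default. The sketch below illustrates that resolution order; the cluster names, day counts, and the fallback order itself are assumptions, not the actual configuration format.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical policies keyed by (cluster, subcluster); None means "whole cluster".
RETENTION_DAYS = {
    ("alpha", None): 365,
    ("alpha", "gpu"): 180,
}
DEFAULT_RETENTION_DAYS = 730


def retention_days(cluster, subcluster):
    """Resolve subcluster policy first, then cluster policy, then the default."""
    cluster_default = RETENTION_DAYS.get((cluster, None), DEFAULT_RETENTION_DAYS)
    return RETENTION_DAYS.get((cluster, subcluster), cluster_default)


def is_expired(job_end, cluster, subcluster, now):
    """True if the job ended longer ago than its retention period."""
    return now - job_end > timedelta(days=retention_days(cluster, subcluster))


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
job_end = now - timedelta(days=200)
```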

Real-time NATS API

Publish and subscribe to real-time job start/stop events and node state changes via NATS messaging. Enables event-driven integrations and live dashboards without polling.
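
NATS routes messages by dot-separated subjects, where `*` matches exactly one token and a trailing `>` matches one or more tokens. The sketch below reimplements that matching rule with an in-process stand-in for a broker, purely to show how event-driven consumers could select events. It does not use a NATS client, and the subject names (`cc.jobs.start`, `cc.nodes.state`) are hypothetical.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """NATS-style subject matching: '*' is one token, trailing '>' is one or more."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' is only valid as the last token and needs at least one token left.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


class InProcessBus:
    """Tiny stand-in for a message broker, for illustration only."""

    def __init__(self):
        self._subscriptions = []

    def subscribe(self, pattern, handler):
        self._subscriptions.append((pattern, handler))

    def publish(self, subject, payload):
        """Deliver to every matching handler and return the delivery count."""
        delivered = 0
        for pattern, handler in self._subscriptions:
            if subject_matches(pattern, subject):
                handler(subject, payload)
                delivered += 1
        return delivered


events = []
bus = InProcessBus()
bus.subscribe("cc.jobs.*", lambda subject, payload: events.append((subject, payload)))
bus.publish("cc.jobs.start", {"jobId": 1234})
bus.publish("cc.nodes.state", {"hostname": "node01", "state": "down"})
```

With a real NATS deployment, the same subscription patterns would be passed to a NATS client library instead of this stand-in.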
