What is it good for?

ClusterCockpit is a framework for job-specific performance and power monitoring on distributed HPC clusters. The focus is on simple installation and maintenance, high security and intuitive usage.

Target group: HPC users

Overview of running and past batch jobs. Access to various performance metrics, including hardware performance counter data. Powerful filtering and sorting of batch jobs. Ability to group jobs using job tags.

Target group: Support staff

Overview of all clusters. Powerful filtering and sorting of batch job lists. Customisable statistical analysis of jobs. User list with usage statistics. Status view for all clusters.

Target group: Administrators

Single-file deployment of web backend and metric store. Systemd setup for easy control. RPM and DEB packages for the node agent. Authentication mechanisms: local account, LDAP directory, and JWT. REST API for integration into existing monitoring and batch job scheduler infrastructure.

Figure: ClusterCockpit overview (images/cc-arch.png)

ClusterCockpit consists of

All components can also be used individually.

Node metrics are collected continuously and sent to the metric store at fixed intervals. Job details are provided by an external adapter for the batch job scheduler and sent to cc-backend via a REST API. For running jobs, cc-backend queries the metric store to collect all required time series data. Once a job has finished, it is persisted to a JSON file-based job archive that contains all job metadata and metric data. Finished jobs are loaded from the job archive. The metric store uses cyclic buffers and keeps data only for a limited period of time.
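The data flow described above can be illustrated with a minimal Python sketch. It is not ClusterCockpit code: the class `MetricRingBuffer`, the function `build_archive_record`, the metric name `flops_any`, and the record fields `meta`, `metrics`, `start` and `stop` are all hypothetical names chosen for this example, and the real job archive schema may differ. The sketch only shows why a finished job has to be copied out of a cyclic buffer before its samples are overwritten.

```python
import json
from collections import deque


class MetricRingBuffer:
    """Fixed-capacity buffer: once full, each new sample evicts the oldest one."""

    def __init__(self, capacity: int) -> None:
        self._samples = deque(maxlen=capacity)

    def write(self, timestamp: int, value: float) -> None:
        self._samples.append((timestamp, value))

    def read(self, start: int, stop: int) -> list:
        """Return retained samples with start <= timestamp <= stop."""
        return [(t, v) for t, v in self._samples if start <= t <= stop]


def build_archive_record(meta: dict, store: dict) -> dict:
    """Combine job metadata with the job's time window from every metric buffer."""
    series = {
        name: buf.read(meta["start"], meta["stop"]) for name, buf in store.items()
    }
    return {"meta": meta, "metrics": series}


# Hypothetical store holding one metric with room for only four samples.
store = {"flops_any": MetricRingBuffer(capacity=4)}

# Six samples arrive at a fixed 10 s interval; the first two get evicted.
for t in range(0, 60, 10):
    store["flops_any"].write(t, float(t))

# A job that ran from t=10 to t=40 finishes and is written out as JSON.
job_meta = {"job_id": 1234, "start": 10, "stop": 40}
record = build_archive_record(job_meta, store)
archived_json = json.dumps(record)
```

In this example the sample at `t=10` has already been overwritten by the time the job is archived, so the record holds only the samples at 20, 30 and 40. Anything that is not copied into the archive while it is still in the buffer is lost, which is why the archive, not the metric store, serves finished jobs.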

ClusterCockpit is used at