Tutorials
1 - ClusterCockpit installation manual
Introduction
ClusterCockpit requires the following components:
- A node agent running on all compute nodes that measures the required metrics and forwards all data to a time series metrics database. ClusterCockpit provides its own node agent, `cc-metric-collector`. This is the recommended setup, but ClusterCockpit can also be integrated with other node agents, e.g. collectd, Prometheus or Telegraf. In this case you have to use them with their accompanying time series database.
- A metric time series database. ClusterCockpit provides its own solution, `cc-metric-store`, which is the recommended solution. There is also metric store support for Prometheus and InfluxDB; InfluxDB is currently barely tested. Usually only one instance of the time series database is required.
- The API and web interface backend `cc-backend`. Only one instance of `cc-backend` is required. It provides the HTTP server at the desired monitoring URL for serving the web interface.
- A SQL database. It is recommended to use the built-in sqlite database of ClusterCockpit. You can set up LiteStream as a service, which performs continuous replication of the sqlite database to multiple storage backends (a sketch follows after this list). Optionally `cc-backend` also supports MariaDB/MySQL as SQL database backend.
- A batch job scheduler adapter that provides the job meta information to `cc-backend`. This is done by using the provided REST API for starting and stopping jobs. For Slurm, a Python based solution (cc-slurm-sync), maintained by PC2 Paderborn, is available. For HTCondor, cc-condor-sync exists.
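As an illustration, a minimal LiteStream setup could look like the following sketch. The database path, the replica URL and the name of the LiteStream service unit are assumptions for this example and have to be adapted to your installation.
```bash
# Hypothetical LiteStream configuration replicating the cc-backend sqlite
# database to an S3 bucket; the path and URL are placeholders, not defaults.
sudo tee /etc/litestream.yml > /dev/null <<'EOF'
dbs:
  - path: /opt/monitoring/cc-backend/var/job.db
    replicas:
      - url: s3://my-backup-bucket/clustercockpit
EOF

# Run LiteStream continuously as a systemd service (unit name as shipped by
# the LiteStream packages; verify it on your distribution).
sudo systemctl enable --now litestream
```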
Server Hardware
`cc-backend` is threaded and therefore profits from multiple cores. It does not require a lot of memory. It is possible to run it in a virtual machine. For best performance the `./var` folder of `cc-backend`, which contains the sqlite database file and the file based job archive, should be located on a fast storage device, ideally an NVMe SSD. The sqlite database file and the job archive will grow over time (if you are not removing old jobs using a retention policy). Our setup, covering five clusters over four years, takes 50GB for the sqlite database and around 700GB for the job archive.
`cc-metric-store` is also threaded and requires a fixed amount of main memory. How much depends on your configuration, but 128GB should be enough for most setups. We run `cc-backend` and `cc-metric-store` on the same server as systemd services.
Planning and initial configuration
We recommend the following order for planning and configuring a ClusterCockpit installation:
- Set up your metric list: With two exceptions you are in general free to choose which metrics you want. Those exceptions are `mem_bw` for main memory bandwidth and `flops_any` for flop throughput (double precision flops are upscaled to single precision rates). You can find a discussion of useful metrics and their naming here. This metric list is an integral component for the configuration of all ClusterCockpit components.
- Configure and deploy `cc-metric-store`.
- Configure and deploy `cc-metric-collector`. For a detailed description of how to set up cc-metric-collector have a look at /docs/tutorials/prod-ccmc/.
- Configure and deploy `cc-backend`.
- Set up the batch job scheduler adapter.
Common problems
Up front, here is a list of common issues people face when installing ClusterCockpit for the first time.
Inconsistent metric names across components
At the moment you need to configure the metric list in every component separately. In `cc-metric-collector` the metrics that are sent to `cc-metric-store` are determined by the collector configuration and possible renaming in the router configuration. For `cc-metric-store` you need to specify a metric list in `config.json` in order to configure the native metric frequency and how a metric is aggregated. Metrics that are sent to `cc-metric-store` and do not appear in its configuration are silently dropped!
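For illustration, such a metric list entry in the `cc-metric-store` `config.json` might look roughly like the sketch below. The frequencies and aggregation values are assumptions for this example; consult the cc-metric-store documentation for the authoritative format.
```bash
# Hypothetical excerpt of the metric list in cc-metric-store's config.json,
# written here to an example file. Values are illustrative, not shipped defaults.
cat > config.json.example <<'EOF'
{
  "metrics": {
    "flops_any": { "frequency": 60, "aggregation": "sum" },
    "mem_bw":    { "frequency": 60, "aggregation": "sum" },
    "cpu_load":  { "frequency": 60, "aggregation": null }
  }
}
EOF
```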
In `cc-backend`, for every cluster you need to create a `cluster.json` configuration in the job archive. There you set up which metrics are shown in the web frontend, including many additional properties for the metrics. For running jobs `cc-backend` will query `cc-metric-store` for exactly those metric names, and if there is no match there will be an error.
We provide a JSON schema based specification as part of the job meta and metric schema. This specification recommends a minimal set of metrics and we suggest using the metric names provided there. While it is up to you whether you want to adhere to the metric names suggested in the schema, there are two exceptions: `mem_bw` (main memory bandwidth) and `flops_any` (total flop rate with DP flops scaled to SP flops) are required for the roofline plots to work.
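As a rough sketch, a metric entry in such a `cluster.json` might look like the example below. The field names, units and thresholds shown here are assumptions for illustration only; the how-to on creating a `cluster.json` file describes the exact schema.
```bash
# Hypothetical excerpt of a metricConfig entry in cluster.json, written to an
# example file. Unit, scope, timestep and thresholds are placeholder values.
cat > cluster.json.example <<'EOF'
{
  "name": "mycluster",
  "metricConfig": [
    {
      "name": "flops_any",
      "unit": { "base": "Flops/s", "prefix": "G" },
      "scope": "node",
      "timestep": 60,
      "peak": 5000,
      "normal": 1000,
      "caution": 100,
      "alert": 50
    }
  ]
}
EOF
```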
Inconsistent device naming between cc-metric-collector and batch job scheduler adapter
The batch job scheduler adapter (e.g. `cc-slurm-sync`) provides a list of resources that are used by the job. `cc-backend` will query `cc-metric-store` with exactly those resource ids to get all metrics for a job. As a consequence, if `cc-metric-collector` uses a different naming scheme, the metrics will not be found.
If you have GPU accelerators, `cc-slurm-sync` should use the PCIe device addresses as ids. The option `use_pci_info_as_type_id` for the nvidia and rocm-smi collectors in the collector configuration must be set to true.
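For illustration, the relevant part of the collector configuration could look like the sketch below. The file name and the surrounding structure are assumptions for this example; only the `use_pci_info_as_type_id` option itself is taken from this manual.
```bash
# Hypothetical excerpt of the cc-metric-collector collector configuration
# enabling PCIe addresses as type ids for the GPU collectors.
cat > collectors.json.example <<'EOF'
{
  "nvidia": {
    "use_pci_info_as_type_id": true
  },
  "rocm_smi": {
    "use_pci_info_as_type_id": true
  }
}
EOF
```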
To validate and debug problems you can use the `cc-metric-store` debug endpoint:
curl -H "Authorization: Bearer $JWT" -D - "http://localhost:8080/api/debug"
This will return the current state of `cc-metric-store`. You can search for a hostname and then scroll through all topology leaf nodes that are available for it.
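If the output is large, it can help to page through it or to filter for a specific host; the host name below is a placeholder.
```bash
# Page through the debug dump interactively and search for a host name.
curl -H "Authorization: Bearer $JWT" "http://localhost:8080/api/debug" | less

# Or filter the dump for a specific (hypothetical) host non-interactively.
curl -H "Authorization: Bearer $JWT" "http://localhost:8080/api/debug" | grep -A 20 'node001'
```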
Missing nodes in subcluster node lists
ClusterCockpit supports multiple subclusters as part of a cluster. A subcluster in this context is a homogeneous hardware partition with a dedicated metric and device configuration. `cc-backend` dynamically matches the nodes a job runs on against the subcluster node lists to figure out on which subcluster a job is running. If nodes are missing in a subcluster node list, this matching fails and the metric list used may be wrong.
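As an illustration, a subcluster node list in `cluster.json` might be specified roughly as sketched below; the field names and the node range syntax are assumptions for this example, see the `cluster.json` how-to for the exact format.
```bash
# Hypothetical excerpt of a subClusters section with node lists, written to an
# example file. Partition names and node range expressions are placeholders.
cat > cluster.json.example <<'EOF'
{
  "subClusters": [
    { "name": "cpu", "nodes": "node[001-100]" },
    { "name": "gpu", "nodes": "gpunode[01-10]" }
  ]
}
EOF
```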
2 - Decide on metric list
Introduction
Deciding on a sensible and meaningful set of metrics is the deciding factor for how useful the monitoring will be. As part of a collaborative project, several academic HPC centers came up with a minimal set of metrics including their naming. Using consistent naming is crucial for establishing what metrics mean, and we urge you to adhere to the metric names suggested there. You can find this list as part of the ClusterCockpit job data structure JSON schemas.
ClusterCockpit supports multiple clusters within one instance of `cc-backend`. You have to create separate metric lists for each of them. In `cc-backend` the metric lists are provided as part of the cluster configuration. Every cluster is configured as part of the job archive using one `cluster.json` file per cluster. This how-to describes in detail how to create a `cluster.json` file.
Required Metrics
- Flop throughput rate: `flops_any`
- Memory bandwidth: `mem_bw`
- Memory capacity used: `mem_used`
- Requested cpu core utilization: `cpu_load`
- Total fast network bandwidth: `net_bw`
- Total file IO bandwidth: `file_bw`
Recommended CPU Metrics
- Instructions throughput in cycles: `ipc`
- User active CPU core utilization: `cpu_user`
- Double precision flop throughput rate: `flops_dp`
- Single precision flop throughput rate: `flops_sp`
- Average core frequency: `clock`
- CPU power consumption: `rapl_power`
Recommended GPU Metrics
- GPU utilization: `acc_used`
- GPU memory capacity used: `acc_mem_used`
- GPU power consumption: `acc_power`
Recommended node level metrics
- Ethernet read bandwidth: `eth_read_bw`
- Ethernet write bandwidth: `eth_write_bw`
- Fast network read bandwidth: `ic_read_bw`
- Fast network write bandwidth: `ic_write_bw`
File system metrics
Warning
A file system metric tree is currently not yet supported in `cc-backend`.
In the schema a tree of file system metrics is suggested. This allows providing a similar set of metrics for the different file systems used in a cluster. The file system type names suggested are:
- nfs
- lustre
- gpfs
- nvme
- ssd
- hdd
- beegfs
- File system read bandwidth: `read_bw`
- File system write bandwidth: `write_bw`
- File system read requests: `read_req`
- File system write requests: `write_req`
- File system inodes used: `inodes`
- File system open and close: `accesses`
- File system file syncs: `fsync`
- File system file creates: `create`
- File system file open: `open`
- File system file close: `close`
- File system file seeks: `seek`
3 - Setup of cc-metric-store
Introduction
4 - Setup of cc-metric-collector
Introduction
5 - Setup of cc-backend
Introduction
Recommended workflow for deployment
Why we do not provide a docker container
The ClusterCockpit web backend binary has no external dependencies; everything is included in the binary. The external assets, the SQL database and the job archive, would also be external in a docker setup. The only advantage of a docker setup would be that the initial configuration is automated, but this only needs to be done one time. We therefore think that setting up docker, securing it and maintaining it is not worth the effort.
It is recommended to install all ClusterCockpit components in a common directory, e.g. /opt/monitoring, /var/monitoring or /var/clustercockpit. In the following we use /opt/monitoring.
Two systemd services run on the central monitoring server:
- clustercockpit: binary cc-backend in /opt/monitoring/cc-backend.
- cc-metric-store: binary cc-metric-store in /opt/monitoring/cc-metric-store.
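A unit file for the clustercockpit service could look roughly like the sketch below. The paths, the command line flag and the unit options are assumptions for this example and must be adapted to your installation.
```bash
# Hypothetical systemd unit for cc-backend; paths, flags and options are
# placeholders, not shipped defaults.
sudo tee /etc/systemd/system/clustercockpit.service > /dev/null <<'EOF'
[Unit]
Description=ClusterCockpit web backend (cc-backend)
After=network.target

[Service]
WorkingDirectory=/opt/monitoring/cc-backend
ExecStart=/opt/monitoring/cc-backend/cc-backend --server
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now clustercockpit.service
```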
ClusterCockpit is deployed as a single binary that embeds all static assets. We recommend keeping all `cc-backend` binary versions in a folder `archive` and linking the currently active one from the `cc-backend` root. This allows for easy roll-back in case something doesn’t work.
Please Note
`cc-backend` is started with root rights to open the privileged ports (80 and 443). It is recommended to set the configuration options `user` and `group`, in which case `cc-backend` will drop root permissions once the ports are taken.
You have to take care that the ownership of the `./var` folder and its contents is set accordingly, for example as sketched below.
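A minimal sketch, assuming a dedicated service account named clustercockpit; the user and group names are assumptions and must match your `user` and `group` configuration options.
```bash
# Hand ownership of ./var to the (hypothetical) clustercockpit service account.
sudo chown -R clustercockpit:clustercockpit /opt/monitoring/cc-backend/var
```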
Workflow to deploy new version
This example assumes the DB and job archive versions did not change.
- Stop systemd service:
sudo systemctl stop clustercockpit.service
- Backup the sqlite DB file! This is as simple as copying it (see the example below).
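For example, assuming the default sqlite database file `./var/job.db` (the file name is an assumption; adapt it to your configuration):
```bash
# Create a dated copy of the sqlite database file before the update.
cp /opt/monitoring/cc-backend/var/job.db /opt/monitoring/cc-backend/var/job.db.$(date +%Y%m%d).bak
```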
- Copy the new `cc-backend` binary to /opt/monitoring/cc-backend/archive (Tip: use a date tag like `YYYYMMDD-cc-backend`). Here is an example:
cp ~/cc-backend /opt/monitoring/cc-backend/archive/20231124-cc-backend
- Link from the `cc-backend` root to the current version:
ln -s /opt/monitoring/cc-backend/archive/20231124-cc-backend /opt/monitoring/cc-backend/cc-backend
- Start systemd service:
sudo systemctl start clustercockpit.service
- Check if everything is ok:
sudo systemctl status clustercockpit.service
- Check log for issues:
sudo journalctl -u clustercockpit.service
- Check the ClusterCockpit web frontend and your Slurm adapters if anything is broken!
6 - Contribute documentation
We use Hugo to format and generate our website, and the Docsy theme for styling and site structure. Hugo is an open-source static site generator that provides us with templates, content organisation in a standard directory structure, and a website generation engine. You write the pages in Markdown (or HTML if you want), and Hugo wraps them up into a website.
All submissions, including submissions by project members, require review. We use GitHub pull requests for this purpose. Consult GitHub Help for more information on using pull requests.
Quick start
Here’s a quick guide to updating the docs. It assumes you’re familiar with the GitHub workflow and you’re happy to use the automated preview of your doc updates:
- Fork the cc-docs repo on GitHub.
- Make your changes and send a pull request (PR).
- If you’re not yet ready for a review, add “WIP” to the PR name to indicate it’s a work in progress.
- Preview the website locally as described below.
- Continue updating your doc and pushing your changes until you’re happy with the content.
- When you’re ready for a review, add a comment to the PR, and remove any “WIP” markers.
Updating a single page
If you’ve just spotted something you’d like to change while using the docs, Docsy has a shortcut for you:
- Click Edit this page in the top right hand corner of the page.
- If you don’t already have an up to date fork of the project repo, you are prompted to get one - click Fork this repository and propose changes or Update your Fork to get an up to date version of the project to edit. The appropriate page in your fork is displayed in edit mode.
Previewing your changes locally
If you want to run your own local Hugo server to preview your changes as you work:
- Follow the instructions in Getting started to install Hugo and any other tools you need. You’ll need at least Hugo version 0.45 (we recommend using the most recent available version), and it must be the extended version, which supports SCSS.
- Fork the cc-docs repo into your own project, then create a local copy using `git clone`. Don’t forget to use `--recurse-submodules` or you won’t pull down some of the code you need to generate a working site.
git clone --recurse-submodules --depth 1 https://github.com/ClusterCockpit/cc-doc.git
- Run `hugo server` in the site root directory. By default your site will be available at http://localhost:1313/. Now that you’re serving your site locally, Hugo will watch for changes to the content and automatically refresh your site.
- Continue with the usual GitHub workflow to edit files, commit them, push the changes up to your fork, and create a pull request.
Creating an issue
If you’ve found a problem in the docs, but you’re not sure how to fix it yourself, please create an issue in the cc-docs repository. You can also create an issue about a specific page by clicking the Create Issue button in the top right hand corner of the page.
Useful resources
- Docsy user guide: All about Docsy, including how it manages navigation, look and feel, and multi-language support.
- Hugo documentation: Comprehensive reference for Hugo.
- Github Hello World!: A basic introduction to GitHub concepts and workflow.