Documentation
- 1: Overview
- 2: Release specific infos
- 3: Getting Started
- 4: Tutorials
- 4.1: ClusterCockpit installation manual
- 4.2: Decide on metric list
- 4.3: Setup of cc-metric-store
- 4.4: Setup of cc-metric-collector
- 4.5: Setup of cc-backend
- 4.6: Contribute documentation
- 5: How-to Guides
- 5.1: Tips for cc-backend frontend development
- 5.2: Migrations
- 5.3:
- 5.4: Hands-On Demo
- 5.5: How to add a notification banner
- 5.6: How to create a `cluster.json` file
- 5.7: How to customize cc-backend
- 5.8: How to deploy and update cc-backend
- 5.9: How to generate JWT tokens
- 5.10: How to prepare a new release
- 5.11: How to regenerate the Swagger UI documentation
- 5.12: How to setup a systemd service
- 5.13: How to use the Swagger UI documentation
- 5.14: Unit tests
- 6: Explanation
- 6.1: Authentication
- 6.2: Configuration Management
- 6.3: Job Archive
- 6.4: JSON Web Token
- 6.5: Metric Store
- 6.6: Roles
- 7: Reference
- 7.1: Backend
- 7.1.1: Command Line
- 7.1.2: Configuration
- 7.1.3: Environment
- 7.1.4: REST API
- 7.1.5: Authentication Handbook
- 7.1.6: Job Archive Handbook
- 7.1.7: Schemas
- 7.1.7.1: Application Config Schema
- 7.1.7.2: Cluster Schema
- 7.1.7.3: Job Data Schema
- 7.1.7.4: Job Statistics Schema
- 7.1.7.5: Unit Schema
- 7.1.7.6: Job Archive Metadata Schema
- 7.1.7.7: Job Archive Metrics Data Schema
- 7.2: Metric Store
- 7.2.1: Command Line
- 7.2.2: Configuration
- 7.2.3: Metric Store REST API
- 7.3: cc-event-store
- 7.3.1: cc-event-store's REST API
- 7.3.2: cc-event-store's storage backends
- 7.3.2.1: Storage backend for Postgres
- 7.3.2.2: Storage backend for SQLite3
- 7.4: cc-metric-collector
- 7.4.1: cc-metric-collector's collectors
- 7.4.1.1: BeeGFS on Demand collector
- 7.4.1.2: BeeGFS on Demand collector
- 7.4.1.3: cpufreq_cpuinfo collector
- 7.4.1.4: cpufreq_cpuinfo collector
- 7.4.1.5: cpustat collector
- 7.4.1.6: customcmd collector
- 7.4.1.7: diskstat collector
- 7.4.1.8: gpfs collector
- 7.4.1.9: ibstat collector
- 7.4.1.10: iostat collector
- 7.4.1.11: ipmistat collector
- 7.4.1.12: likwid collector
- 7.4.1.13: loadavg collector
- 7.4.1.14: lustrestat collector
- 7.4.1.15: memstat collector
- 7.4.1.16: netstat collector
- 7.4.1.17: nfs3stat collector
- 7.4.1.18: nfs4stat collector
- 7.4.1.19: nfsiostat collector
- 7.4.1.20: numastat collector
- 7.4.1.21: nvidia collector
- 7.4.1.22: rapl collector
- 7.4.1.23: rocm_smi collector
- 7.4.1.24: schedstat collector
- 7.4.1.25: self collector
- 7.4.1.26: tempstat collector
- 7.4.1.27: topprocs collector
- 7.4.2: cc-metric-collector's message processor
- 7.4.3: cc-metric-collector's receivers
- 7.4.3.1: http receiver
- 7.4.3.2: IPMI Receiver
- 7.4.3.3: nats receiver
- 7.4.3.4: prometheus receiver
- 7.4.3.5: Redfish receiver
- 7.4.4: cc-metric-collector's router
- 7.4.5: cc-metric-collector's sinks
- 7.4.5.1: ganglia sink
- 7.4.5.2: http sink
- 7.4.5.3: influxasync sink
- 7.4.5.4: influxdb sink
- 7.4.5.5: libganglia sink
- 7.4.5.6: nats sink
- 7.4.5.7: prometheus sink
- 7.4.5.8: stdout sink
- 7.5: Commit message naming conventions
- 7.6: Docsy example page
- 8: Web Interface
1 - Overview
What is it?
ClusterCockpit is a monitoring framework for job-specific performance and power monitoring on distributed HPC clusters. It focuses on simple installation and maintenance, high security, and intuitive usage. ClusterCockpit provides a modern web interface that offers:
- HPC users an overview of their running and past batch jobs, with access to various metrics including hardware performance counter data. Jobs can be sorted, filtered, and tagged.
- Support staff easy access to all job data on multiple clusters. Jobs and users can be sorted and filtered using a very flexible interface. Job and user data can be aggregated using a customisable statistical analysis. A status view provides an overview of all clusters.
- Administrators a single-file deployment for the ClusterCockpit web backend, a systemd setup for easy control, and RPM and DEB packages for the node agent. For authentication, local accounts, LDAP, and JWT tokens are supported. An extensive REST API allows integration into an existing monitoring and batch job scheduler infrastructure.
ClusterCockpit is used in production at several HPC computing centers; you can find a list here.
How does it work?
ClusterCockpit consists of
- the web user interface and API backend cc-backend
- the node agent cc-metric-collector
- and the in-memory metric cache cc-metric-store
All components can also be used individually.
Node metrics are collected continuously and sent to the metrics store at fixed intervals. Job details are provided by an external adapter for the batch job scheduler and sent to cc-backend via a REST API. For running jobs, cc-backend queries the metrics store to collect all required time series data. Once a job is finished, it is persisted to a JSON file-based job archive that contains all job metadata and metrics data. Finished jobs are loaded from the job archive. The metrics store uses cyclic buffers and stores data only for a limited period of time.
Where should I go next?
- Getting Started: Get started with ClusterCockpit
- User guide: A user guide for the ClusterCockpit web interface
2 - Release specific infos
New performance and energy footprint configuration
In previous versions, `cc-backend` used a set of hard-coded metrics for the performance footprint. The database had dedicated columns for each of these metric stats in order to filter jobs using those performance metrics. Because you may want to use different footprints on an accelerated cluster compared to a standard multi-core system, this is a severe restriction. Version 1.4.0 of `cc-backend` introduces a new string attribute `footprint` for metrics in the `cluster.json` configuration of the job archive. This allows you to define your individual performance footprint for every cluster and optionally subcluster. This also enables you to change the footprint configuration if required.
The footprint metrics will be used in the footprint UI component
shown in job views and optionally job lists. They are also used for the metrics
shown in the polar plot and are available for sorting and filtering jobs.
Metrics configured as footprints are collected as aggregated key:value pairs in one JSON object for every job, either on job completion or during runtime at configurable intervals. The JSON object itself is written to the database in a single dedicated column named `footprint`.
Details on adapting `cluster.json` and migrating the database follow on this page.
Without a configured `footprint` attribute, only existing jobs will show footprint data after the update and database migration. Subsequently completed jobs will not be updated due to the missing information and will therefore show no footprint data.
Moreover, `cc-backend` now also provides an energy footprint configuration. This is a set of metrics that are used to calculate the total energy used by a job. The metrics used for the energy footprint are marked using a new attribute `energy` in the cluster metric configurations.
What you need to do
You need to adapt all of your `cluster.json` files in the job archive, marking all footprint or energy metrics.
Here is an example of how to mark a footprint metric:
{
  "name": "fritz",
  "metricConfig": [
    {
      "name": "mem_used",
      "unit": {
        "base": "B",
        "prefix": "G"
      },
      "scope": "node",
      "aggregation": "sum",
      "footprint": "max",
      "timestep": 60,
      "peak": 256,
      "normal": 128,
      "caution": 200,
      "alert": 240,
      "lowerIsBetter": true,
      "subClusters": [
        {
          "name": "spr1tb",
          "peak": 1024,
          "normal": 512,
          "caution": 900,
          "footprint": "max",
          "lowerIsBetter": true,
          "alert": 1000
        },
        {
          "name": "spr2tb",
          "peak": 2048,
          "normal": 1024,
          "caution": 1800,
          "footprint": "max",
          "lowerIsBetter": true,
          "alert": 2000
        }
      ]
    }
  ]
}
If a metric has subcluster overrides, you currently have to add the attributes there as well. The new attribute `footprint` can have `avg`, `min`, or `max` as its value, indicating which basic statistic over all nodes or cores of a job is used for this metric. In the above example the footprint is the maximum allocated memory. Because this is (for us) a lower-is-better metric, it is marked accordingly using the attribute `lowerIsBetter`.
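The requirement to repeat the attribute in every subcluster override is easy to miss. The following is a minimal sketch of a check, assuming a `cluster.json` shaped like the example above. The function name `missing_footprint_overrides` is hypothetical and not part of `cc-backend`.

```python
import json

VALID_FOOTPRINTS = {"avg", "min", "max"}


def missing_footprint_overrides(cluster):
    """Return (metric, subcluster) pairs whose override lacks a valid footprint.

    Only metrics that are themselves marked with "footprint" are checked.
    """
    problems = []
    for metric in cluster.get("metricConfig", []):
        if metric.get("footprint") not in VALID_FOOTPRINTS:
            continue  # not a footprint metric (or an invalid value), skip it
        for sub in metric.get("subClusters", []):
            if sub.get("footprint") not in VALID_FOOTPRINTS:
                problems.append((metric["name"], sub["name"]))
    return problems


# Reduced, hypothetical configuration: the second override forgot the attribute.
example = json.loads("""
{
  "name": "fritz",
  "metricConfig": [
    {
      "name": "mem_used",
      "footprint": "max",
      "subClusters": [
        {"name": "spr1tb", "footprint": "max"},
        {"name": "spr2tb"}
      ]
    }
  ]
}
""")

print(missing_footprint_overrides(example))  # [('mem_used', 'spr2tb')]
```

For a real check you would load each `cluster.json` from the job archive with `json.load` instead of the inline string.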
To mark a metric to be used for calculating the total energy, you need to add the `energy` attribute.
Example for marking an energy footprint metric:
{
  "name": "fritz",
  "metricConfig": [
    {
      "name": "cpu_power",
      "unit": {
        "base": "W"
      },
      "scope": "socket",
      "aggregation": "sum",
      "timestep": 60,
      "peak": 500,
      "normal": 250,
      "caution": 100,
      "alert": 50,
      "energy": "power"
    },
    {
      "name": "mem_power",
      "unit": {
        "base": "W"
      },
      "scope": "socket",
      "aggregation": "sum",
      "timestep": 60,
      "peak": 100,
      "normal": 50,
      "caution": 20,
      "alert": 10,
      "energy": "power"
    }
  ]
}
Again, you also need to add the attribute to subcluster overrides in case you have some. The `energy` attribute can have `power` or `energy` as values. `power` indicates that this metric has Watt as unit, and `energy` is used for metrics that have Joules as unit. We are aware that we could also get this information from the existing metric configuration, but that is the way it is currently implemented. Power metrics are converted to Joules by multiplying the average job power by the job duration. The total job energy is then the sum over all energy footprint metrics.
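The conversion described above is simple arithmetic. The following sketch illustrates it with made-up numbers; the function name and the input values are assumptions for illustration, and how `cc-backend` obtains the average job power is not shown here.

```python
def job_energy_joules(avg_power_watts, duration_seconds):
    """Sum the energy of all metrics marked with "energy": "power".

    avg_power_watts maps a metric name to its average job power in Watt.
    Energy in Joules is average power (W) times duration (s).
    """
    return sum(power * duration_seconds for power in avg_power_watts.values())


# Hypothetical one-hour job with two power metrics.
avg_power = {"cpu_power": 400.0, "mem_power": 50.0}
energy_j = job_energy_joules(avg_power, duration_seconds=3600)

print(energy_j)          # 1620000.0 (Joules)
print(energy_j / 3.6e6)  # 0.45 (kWh)
```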
The web frontend can also show the CO2 footprint for a job. To enable this, you need to add a new top-level configuration key `emission-constant` (in g/kWh) to the `cc-backend` configuration:
{
"emission-constant": 317
}
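Since the constant is given in g/kWh, a CO2 estimate follows from the job energy by a unit conversion. The sketch below assumes the footprint is simply energy in kWh multiplied by the emission constant. This is an assumption about the calculation, not a description of the frontend code, and `co2_grams` is a hypothetical name.

```python
JOULES_PER_KWH = 3.6e6


def co2_grams(energy_joules, emission_constant_g_per_kwh):
    """Convert job energy to grams of CO2 using a g/kWh emission constant."""
    return energy_joules / JOULES_PER_KWH * emission_constant_g_per_kwh


# 1.62 MJ corresponds to 0.45 kWh; at 317 g/kWh that is about 142.65 g.
print(round(co2_grams(1_620_000, 317), 2))
```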
After you have marked all metrics, you need to raise the job archive version manually to 2 by editing `./var/job-archive/version.txt`.
Database migration
This release requires migrating your database to version 8. Back up your database before migration! Depending on your database size this may take a long time. In our case, with a database file size of 50GB, it took more than eight hours.
To migrate the database run the following command:
cc-backend -migrate-db
The migration creates the new footprint column and updates its JSON object for existing jobs using the old footprint columns. Moreover it sets the global scope for all existing tags and creates additional indices to speed up common queries.
Configuration changes
You can find a complete configuration example here.
Enable timeseries resampling
ClusterCockpit now supports resampling of time series data to a lower frequency. This dramatically improves load times for very large or very long jobs and we recommend enabling it. Resampling is supported for running as well as for finished jobs. For running jobs this currently only works with the newest version of `cc-metric-store`. Resampling support for the Prometheus time series database will be added in the future.
To enable resampling you have to add the following top-level configuration key:
"enable-resampling": {
"trigger": 30,
"resolutions": [
600,
300,
120,
60
]
},
`trigger` configures the minimum number of points in a timeseries plot window at which the next finer level is loaded. `resolutions` defines the resolution steps in seconds. The finest resolution must be the native resolution. In case you have different native resolutions in your metric configuration, you should use the finest. The implementation will fall back to the finest available resolution in this case.
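To get a feeling for how the two settings interact, the following sketch picks a resolution for a given plot window. It is one possible reading of the description above, not the `cc-backend` implementation: it chooses the coarsest step that still yields at least `trigger` points and otherwise falls back to the finest step. The function name `pick_resolution` is hypothetical.

```python
def pick_resolution(window_seconds, resolutions, trigger):
    """Pick the coarsest resolution that still yields at least `trigger` points.

    Falls back to the finest resolution if no step yields enough points.
    """
    for res in sorted(resolutions, reverse=True):  # coarsest step first
        if window_seconds // res >= trigger:
            return res
    return min(resolutions)


resolutions = [600, 300, 120, 60]
for window in (24 * 3600, 3 * 3600, 3600, 600):
    print(window, pick_resolution(window, resolutions, trigger=30))
```

With the example configuration, a 24 hour window stays at 600 s, a 3 hour window uses 300 s, a 1 hour window uses 120 s, and a 10 minute window falls back to 60 s.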
Continuous scroll is default now
This release includes support for continuous scroll for job lists, replacing the previous paging UI. Continuous scroll is now the default, and you can remove the `ui-defaults` block in case you added it just for enabling continuous scroll. Every user can override the scrolling option in their configuration.
Known issues
- Currently energy footprint metrics of type energy are ignored for calculating total energy.
- Resampling for running jobs only works with `cc-metric-store`.
- With energy footprint metrics of type power the unit is ignored and it is assumed the metric has the unit Watt.
3 - Getting Started
The central component of ClusterCockpit is the web and API backend `cc-backend`. We provide a demo setup that allows you to get an impression of the web interface. If you just want to try the demo and you have a Linux OS, you can do so using the `cc-backend` release binary.
You find detailed instructions on how to set up the demo with the release binary here.
If you have a different OS or want to build `cc-backend` yourself, follow the instructions below.
Prerequisites
To build `cc-backend` you need:
- A Go compiler, version 1.20 or newer. Most recent OS environments should have a package with a recent enough version. On macOS we recommend using Homebrew to install one.
- A node.js environment including the `npm` package manager.
- A git revision control client.
- For the demo shell script you need `wget` to download the example job archive.
Try it out!
All ClusterCockpit components are available within the GitHub ClusterCockpit project.
Clone `cc-backend` and change directory into the repository:
git clone https://github.com/ClusterCockpit/cc-backend.git && cd cc-backend
Note
The startDemo script will download a tar file with 38MB (223MB on disk)!
Execute the demo start script:
./startDemo.sh
What follows is output from building `cc-backend` and downloading the job archive:
HTTP server listening at 127.0.0.1:8080...
Open a web browser and access http://localhost:8080. You should see the ClusterCockpit login page:
Enter `demo` for the Username and `demo` for the Password and press the Submit button. After that the ClusterCockpit index page should be displayed:
The demo user has the admin role and therefore can see all views.
Note
Because the demo only loads data from the job archive, some views such as the status and systems view do not work!
For details about the features of the web interface have a look at the user guide.
3.1 - Demo with release binary
The demo setup with the release binary only works with a Linux system running on an x86-64 processor.
Grab the release binary at GitHub. The following description assumes you perform all tasks from your home folder. Extract the tar archive:
tar xzf cc-backend_Linux_x86_64.tar.gz
Create an empty folder and copy the binary `cc-backend` from the extracted archive folder to this folder:
mkdir ./demo
cp cc-backend ./demo
Change to the demo folder and run the following command to set up the required `var` directory and initialize the sqlite database, `config.json`, and `.env` files:
./cc-backend -init
Open `config.json` in an editor of your choice to edit the existing cluster's name and add a second cluster. Name the clusters `fritz` and `alex`. The file should look as below afterwards:
[config.json listing]
Download the demo job archive:
wget https://hpc-mover.rrze.uni-erlangen.de/HPC-Data/0x7b58aefb/eig7ahyo6fo2bais0ephuf2aitohv1ai/job-archive-demo.tar
Extract the job archive:
tar xf job-archive-demo.tar
Initialize the database using the data from the job archive and create the demo user:
./cc-backend -init-db -add-user demo:admin:demo -loglevel info
Start the web server:
./cc-backend -server -dev -loglevel info
Open a web browser and access http://localhost:8080. You should see the ClusterCockpit login page:
Enter `demo` for the Username and `demo` for the Password and press the Submit button. After that the ClusterCockpit index page should be displayed:
The demo user has the admin role and therefore can see all views.
Note
Because the demo only loads data from the job archive, some views such as the status and systems view do not work!
For details about the features of the web interface have a look at the user guide.
4 - Tutorials
4.1 - ClusterCockpit installation manual
Introduction
ClusterCockpit requires the following components:
- A node agent running on all compute nodes that measures required metrics and forwards all data to a time series metrics database. ClusterCockpit provides its own node agent `cc-metric-collector`. This is the recommended setup, but ClusterCockpit can also be integrated with other node agents, e.g. `collectd`, `prometheus`, or `telegraf`. In this case you have to use it with the accompanying time series database.
- A metric time series database. ClusterCockpit provides its own solution, `cc-metric-store`, which is the recommended solution. There is also metric store support for Prometheus and InfluxDB. InfluxDB is currently barely tested. Usually only one instance of the time series database is required.
- The API and web interface backend `cc-backend`. Only one instance of `cc-backend` is required. This will provide the HTTP server at the desired monitoring URL for serving the web interface.
- A SQL database. It is recommended to use the builtin sqlite database for ClusterCockpit. You can set up LiteStream as a service which performs a continuous replication of the sqlite database to multiple storage backends. Optionally `cc-backend` also supports MariaDB/MySQL as SQL database backends.
- A batch job scheduler adapter that provides the job meta information to `cc-backend`. This is done by using the provided REST API for starting and stopping jobs. For Slurm, a Python-based solution (cc-slurm-sync) maintained by PC2 Paderborn is available. For HTCondor there also exists cc-condor-sync.
Server Hardware
`cc-backend` is threaded and therefore profits from multiple cores. It does not require a lot of memory. It is possible to run it in a virtual machine. For best performance the `./var` folder of `cc-backend`, which contains the sqlite database file and the file-based job archive, should be located on a fast storage device, ideally an NVMe SSD. The sqlite database file and the job archive will grow over time (if you are not removing old jobs using a retention policy). Our setup covering five clusters over 4 years takes 50GB for the sqlite database and around 700GB for the job archive.
`cc-metric-store` is also threaded and requires a fixed amount of main memory. How much depends on your configuration, but 128GB should be enough for most setups. We run `cc-backend` and `cc-metric-store` on the same server as systemd services.
Planning and initial configuration
We recommend the following order for planning and configuring a ClusterCockpit installation:
- Set up your metric list: With two exceptions you are in general free to choose which metrics you want. Those exceptions are `mem_bw` for main memory bandwidth and `flops_any` for flop throughput (double precision flops are upscaled to single precision rates). You can find a discussion of useful metrics and their naming here. This metric list is an integral component for the configuration of all ClusterCockpit components.
- Configure and deploy `cc-metric-store`.
- Configure and deploy `cc-metric-collector`. For a detailed description on how to set up cc-metric-collector have a look at /docs/tutorials/prod-ccmc/
- Configure and deploy `cc-backend`.
- Set up the batch job scheduler adapter.
Common problems
Here is a list of common issues people face when installing ClusterCockpit for the first time.
Inconsistent metric names across components
At the moment you need to configure the metric list in every component separately. In `cc-metric-collector` the metrics that are sent to the `cc-metric-store` are determined by the collector configuration and possible renaming in the router configuration. For `cc-metric-store` you need to specify a metric list in `config.json` in order to configure the native metric frequency and how a metric is aggregated. Metrics that are sent to `cc-metric-store` but do not appear in its configuration are silently dropped!
In `cc-backend` you need to create a `cluster.json` configuration in the job archive for every cluster. There you set up which metrics are shown in the web frontend, including many additional properties for the metrics. For running jobs `cc-backend` will query `cc-metric-store` for exactly those metric names, and if there is no match there will be an error.
We provide a JSON schema based specification as part of the job meta and metric schema. This specification recommends a minimal set of metrics and we suggest using the metric names provided there. While it is up to you if you want to adhere to the metric names suggested in the schema, there are two exceptions: `mem_bw` (main memory bandwidth) and `flops_any` (total flop rate with DP flops scaled to SP flops) are required for the roofline plots to work.
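Because the metric lists live in separate files, a small cross-check can catch mismatches before they show up as dropped data or query errors. The sketch below compares two sets of names. It assumes, for illustration only, that the `cc-metric-store` configuration keeps its metrics in an object keyed by metric name and that `cluster.json` lists them under `metricConfig`. The function name is hypothetical.

```python
def compare_metric_names(store_metrics, cluster_metrics):
    """Compare the metric names configured in two components.

    Returns (missing_in_store, unused_in_store) as sorted lists.
    """
    store, cluster = set(store_metrics), set(cluster_metrics)
    return sorted(cluster - store), sorted(store - cluster)


# Hypothetical excerpts of a cc-metric-store config.json and a cluster.json.
store_cfg = {"metrics": {"flops_any": {}, "mem_bw": {}, "cpu_load": {}}}
cluster_cfg = {
    "metricConfig": [{"name": "flops_any"}, {"name": "mem_bw"}, {"name": "mem_used"}]
}

missing, unused = compare_metric_names(
    store_cfg["metrics"],
    [m["name"] for m in cluster_cfg["metricConfig"]],
)
print("queried by cc-backend but unknown to cc-metric-store:", missing)
print("stored but not shown in cc-backend:", unused)
```

Names in the first list would lead to query errors for running jobs, while names in the second list are stored but never displayed.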
Inconsistent device naming between cc-metric-collector and batch job scheduler adapter
The batch job scheduler adapter (e.g. `cc-slurm-sync`) provides a list of resources that are used by the job. `cc-backend` will query `cc-metric-store` with exactly those resource ids to get all metrics for a job. As a consequence, if `cc-metric-collector` uses a different naming scheme, the metrics will not be found.
If you have GPU accelerators, `cc-slurm-sync` should use the PCI-E device addresses as ids. The option `use_pci_info_as_type_id` for the nvidia and rocm-smi collectors in the collector configuration must be set to true.
To validate and debug problems you can use the `cc-metric-store` debug endpoint:
curl -H "Authorization: Bearer $JWT" -D - "http://localhost:8080/api/debug"
This will return the current state of `cc-metric-store`. You can search for a hostname and from there scroll through all topology leaf nodes that are available.
Missing nodes in subcluster node lists
ClusterCockpit supports multiple subclusters as part of a cluster. A subcluster in this context is a homogeneous hardware partition with a dedicated metric and device configuration. `cc-backend` dynamically matches the nodes a job runs on to the subcluster node lists to figure out on which subcluster a job is running. If nodes are missing in a subcluster node list, this fails and the metric list used may be wrong.
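The effect of a missing node can be illustrated with a simplified matching routine. This is a sketch under stated assumptions, not the `cc-backend` algorithm: node lists are given as already expanded sets of hostnames, a job matches a subcluster only if all of its nodes are listed there, and all names are made up.

```python
def find_subcluster(job_nodes, subclusters):
    """Return the name of the subcluster containing all job nodes, else None.

    subclusters maps a subcluster name to an iterable of hostnames.
    """
    nodes = set(job_nodes)
    for name, members in subclusters.items():
        if nodes <= set(members):
            return name
    return None


# Hypothetical hostnames; real node lists come from cluster.json.
subclusters = {
    "main": {"f0101", "f0102", "f0103"},
    "spr1tb": {"f2101", "f2102"},
}

print(find_subcluster(["f0101", "f0102"], subclusters))  # main
print(find_subcluster(["f0101", "f0199"], subclusters))  # None, f0199 is not listed
```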
4.2 - Decide on metric list
Introduction
Deciding on a sensible and meaningful set of metrics is a deciding factor for how useful the monitoring will be. As part of a collaborative project several academic HPC centers came up with a minimal set of metrics including their naming. Consistent naming is crucial for establishing what metrics mean, and we urge you to adhere to the metric names suggested there. You can find this list as part of the ClusterCockpit job data structure JSON schemas.
ClusterCockpit supports multiple clusters within one instance of `cc-backend`. You have to create separate metric lists for each of them. In `cc-backend` the metric lists are provided as part of the cluster configuration. Every cluster is configured as part of the job archive using one `cluster.json` file per cluster. This how-to describes in detail how to create a `cluster.json` file.
Required Metrics
Flop throughput rate: flops_any
Memory bandwidth: mem_bw
Memory capacity used: mem_used
Requested cpu core utilization: cpu_load
Total fast network bandwidth: net_bw
Total file IO bandwidth: file_bw
Recommended CPU Metrics
Instructions throughput in cycles: ipc
User active CPU core utilization: cpu_user
Double precision flop throughput rate: flops_dp
Single precision flop throughput rate: flops_sp
Average core frequency: clock
CPU power consumption: rapl_power
Recommended GPU Metrics
GPU utilization: acc_used
GPU memory capacity used: acc_mem_used
GPU power consumption: acc_power
Recommended node level metrics
Ethernet read bandwidth: eth_read_bw
Ethernet write bandwidth: eth_write_bw
Fast network read bandwidth: ic_read_bw
Fast network write bandwidth: ic_write_bw
File system metrics
Warning
A file system metric tree is currently not yet supported in `cc-backend`.
In the schema a tree of file system metrics is suggested. This makes it possible to provide a similar set of metrics for different file systems used in a cluster. The file system type names suggested are:
- nfs
- lustre
- gpfs
- nvme
- ssd
- hdd
- beegfs
File system read bandwidth: read_bw
File system write bandwidth: write_bw
File system read requests: read_req
File system write requests: write_req
File system inodes used: inodes
File system open and close: accesses
File system file syncs: fsync
File system file creates: create
File system file open: open
File system file close: close
File system file seeks: seek
4.3 - Setup of cc-metric-store
Introduction
4.4 - Setup of cc-metric-collector
Introduction
4.5 - Setup of cc-backend
Introduction
Recommended workflow for deployment
Why we do not provide a docker container
The ClusterCockpit web backend binary has no external dependencies; everything is included in the binary. The external assets, SQL database and job archive, would also be external in a docker setup. The only advantage of a docker setup would be that the initial configuration is automated. But this only needs to be done one time. We therefore think that setting up docker, securing and maintaining it is not worth the effort.
It is recommended to install all ClusterCockpit components in a common directory, e.g. `/opt/monitoring`, `var/monitoring` or `var/clustercockpit`. In the following we use `/opt/monitoring`.
Two systemd services run on the central monitoring server:
- clustercockpit: binary cc-backend in `/opt/monitoring/cc-backend`.
- cc-metric-store: binary cc-metric-store in `/opt/monitoring/cc-metric-store`.
ClusterCockpit is deployed as a single binary that embeds all static assets. We recommend keeping all `cc-backend` binary versions in a folder `archive` and linking the currently active one from the `cc-backend` root. This allows for easy roll-back in case something doesn’t work.
Please Note
`cc-backend` is started with root rights to open the privileged ports (80 and 443). It is recommended to set the configuration options `user` and `group`, in which case `cc-backend` will drop root permissions once the ports are taken. You have to take care that the ownership of the `./var` folder and its contents is set accordingly.
Workflow to deploy new version
This example assumes the DB and job archive versions did not change.
- Stop systemd service:
sudo systemctl stop clustercockpit.service
- Back up the sqlite DB file! This is as simple as copying it.
- Copy the new `cc-backend` binary to `/opt/monitoring/cc-backend/archive` (Tip: Use a date tag like `YYYYMMDD-cc-backend`). Here is an example:
cp ~/cc-backend /opt/monitoring/cc-backend/archive/20231124-cc-backend
- Link from the `cc-backend` root to the current version (`-f` replaces the link left over from the previous deployment):
ln -sf /opt/monitoring/cc-backend/archive/20231124-cc-backend /opt/monitoring/cc-backend/cc-backend
- Start systemd service:
sudo systemctl start clustercockpit.service
- Check if everything is ok:
sudo systemctl status clustercockpit.service
- Check log for issues:
sudo journalctl -u clustercockpit.service
- Check the ClusterCockpit web frontend and your Slurm adapters if anything is broken!
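The link step of the workflow above can also be scripted, which makes a roll-back a one-line call. The following is a minimal sketch, assuming the directory layout used in this section. `activate_version` and `rollback_candidates` are hypothetical helper names and not part of ClusterCockpit.

```python
import os


def activate_version(archive_binary, link_path):
    """Point link_path at archive_binary, replacing an existing link atomically."""
    tmp_link = link_path + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clean up a leftover from an interrupted run
    os.symlink(archive_binary, tmp_link)
    os.replace(tmp_link, link_path)  # rename is atomic on POSIX filesystems


def rollback_candidates(archive_dir):
    """List archived binaries, newest date tag first (YYYYMMDD-cc-backend)."""
    return sorted(os.listdir(archive_dir), reverse=True)


# Example call with the paths from this section (run while the service is stopped):
# activate_version(
#     "/opt/monitoring/cc-backend/archive/20231124-cc-backend",
#     "/opt/monitoring/cc-backend/cc-backend",
# )
```

Creating the link under a temporary name and renaming it avoids a moment in which no `cc-backend` link exists. Sorting by name works because the date tag is the file name prefix.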
4.6 - Contribute documentation
We use Hugo to format and generate our website, the Docsy theme for styling and site structure. Hugo is an open-source static site generator that provides us with templates, content organisation in a standard directory structure, and a website generation engine. You write the pages in Markdown (or HTML if you want), and Hugo wraps them up into a website.
All submissions, including submissions by project members, require review. We use GitHub pull requests for this purpose. Consult GitHub Help for more information on using pull requests.
Quick start
Here’s a quick guide to updating the docs. It assumes you’re familiar with the GitHub workflow and you’re happy to use the automated preview of your doc updates:
- Fork the cc-docs repo on GitHub.
- Make your changes and send a pull request (PR).
- If you’re not yet ready for a review, add “WIP” to the PR name to indicate it’s a work in progress.
- Preview the website locally as described below.
- Continue updating your doc and pushing your changes until you’re happy with the content.
- When you’re ready for a review, add a comment to the PR, and remove any “WIP” markers.
Updating a single page
If you’ve just spotted something you’d like to change while using the docs, Docsy has a shortcut for you:
- Click Edit this page in the top right hand corner of the page.
- If you don’t already have an up to date fork of the project repo, you are prompted to get one - click Fork this repository and propose changes or Update your Fork to get an up to date version of the project to edit. The appropriate page in your fork is displayed in edit mode.
Previewing your changes locally
If you want to run your own local Hugo server to preview your changes as you work:
- Follow the instructions in Getting started to install Hugo and any other tools you need. You’ll need at least Hugo version 0.45 (we recommend using the most recent available version), and it must be the extended version, which supports SCSS.
- Fork the cc-docs repo into your own project, then create a local copy using `git clone`. Don’t forget to use `--recurse-submodules` or you won’t pull down some of the code you need to generate a working site.
git clone --recurse-submodules --depth 1 https://github.com/ClusterCockpit/cc-doc.git
- Run `hugo server` in the site root directory. By default your site will be available at http://localhost:1313/. Now that you’re serving your site locally, Hugo will watch for changes to the content and automatically refresh your site.
- Continue with the usual GitHub workflow to edit files, commit them, push the changes up to your fork, and create a pull request.
Creating an issue
If you’ve found a problem in the docs, but you’re not sure how to fix it yourself, please create an issue in the cc-docs repo. You can also create an issue about a specific page by clicking the Create Issue button in the top right hand corner of the page.
Useful resources
- Docsy user guide: All about Docsy, including how it manages navigation, look and feel, and multi-language support.
- Hugo documentation: Comprehensive reference for Hugo.
- GitHub Hello World!: A basic introduction to GitHub concepts and workflow.
5 - How-to Guides
5.1 - Tips for cc-backend frontend development
ClusterCockpit web frontend
The frontend assets including the Svelte js files are per default embedded in the go binary. To enable a quick turnaround cycle for web development of the frontend, disable embedding of static assets in `config.json`:
"embed-static-files": false,
"static-files": "./web/frontend/public/",
Start the node build process (in directory `./web/frontend`) in development mode:
> npm run dev
This will start the build process in listen mode. Whenever you change a source file, the dependent javascript targets will be rebuilt automatically. In case the javascript files are minified, you may need to set the production flag by hand to false in `./web/frontend/rollup.config.mjs`:
const production = false
Usually this should work automatically.
Because the files are still served by `./cc-backend`, you have to reload the view explicitly in your browser.
A common setup is to have three terminals open:
- One running cc-backend (working directory repository root): `./cc-backend -server -dev`
- Another running npm in developer mode (working directory `./web/frontend`): `npm run dev`
- And the last one editing the frontend source files
5.2 - Migrations
Introduction
In general, an upgrade is nothing more than a replacement of the binary file. All the necessary files, except the database file, the configuration file and the job archive, are embedded in the binary file. It is recommended to use a directory where the file names of the binary files are named with a version indicator. This can be, for example, the date or the Unix epoch time. A symbolic link points to the version to be used. This makes it easier to switch to earlier versions.
The database and the job archive are versioned. Each release binary supports specific versions of the database and job archive. If a version mismatch is detected, the application is terminated and migration is required.
IMPORTANT NOTE
It is recommended to make a backup copy of the database before each update. This is mandatory in case the database needs to be migrated. In the case of sqlite, this means stopping `cc-backend` and copying the sqlite database file somewhere.
Migrating the database
After you have backed up the database, run the following command to migrate the database to the latest version:
> ./cc-backend -migrate-db
The migration files are embedded in the binary and can also be viewed in the cc-backend source tree. There are separate migration files for both supported database backends. We use the migrate library.
If something goes wrong, you can check the status and get the current schema (here for sqlite):
> sqlite3 var/job.db
In the sqlite console execute:
.schema
to get the current database schema. You can query the current version and whether the migration failed with:
SELECT * FROM schema_migrations;
The first column indicates the current database version and the second column is a dirty flag, which is set if a migration did not complete successfully.
Migrating the job archive
Job archive migration requires a separate tool (archive-migration), which is
part of the cc-backend source tree (build with go build ./tools/archive-migration)
and is also provided as part of the releases.
Migration is supported only between two successive releases. The migration tool migrates the existing job archive to a new job archive. This means that there must be enough disk space for two complete job archives. If the tool is called without options:
> ./archive-migration
it is assumed that a job archive exists in ./var/job-archive. The new job
archive is written to ./var/job-archive-new. Since execution is threaded, it is
impossible to determine in which job the error occurred in case of a fatal error.
In this case, you can run the tool in debug mode (with the -debug flag). In
debug mode, threading is disabled and the job ID of each migrated job is output.
Jobs with empty files will be skipped. Between multiple runs of the tool, the
job-archive-new directory must be moved or deleted.
The cluster.json files in job-archive-new must be checked for errors, especially
whether the aggregation attribute is set correctly for all metrics.
Migration takes several hours for relatively large job archives (several hundred GB). A versioned job archive contains a version.txt file in the root directory of the job archive. This file contains the version as an unsigned integer.
5.3 -
5.4 - Hands-On Demo
Prerequisites
- perl
- go
- npm
- Optional: curl
- Script migrateTimestamps.pl
Documentation
You find READMEs or api docs in
- ./cc-backend/configs
- ./cc-backend/init
- ./cc-backend/api
ClusterCockpit configuration files
cc-backend
- ./.env: Passwords and tokens set in the environment
- ./config.json: Configuration options for cc-backend
cc-metric-store
- ./config.json: Optional, to overwrite configuration options
cc-metric-collector
Not yet included in the hands-on setup.
Setup Components
Start by creating a base folder for all of the following steps.
mkdir clustercockpit
cd clustercockpit
Setup cc-backend
- Clone Repository
git clone https://github.com/ClusterCockpit/cc-backend.git
cd cc-backend
- Build
make
- Activate & configure environment for cc-backend
cp configs/env-template.txt .env
- Optional: Have a look via
vim .env
- Copy the config.json file included in this tarball into the root directory of cc-backend:
cp ../../config.json ./
- Back to toplevel clustercockpit
cd ..
- Prepare Datafolder and Database file
mkdir var
./cc-backend -migrate-db
Setup cc-metric-store
- Clone Repository
git clone https://github.com/ClusterCockpit/cc-metric-store.git
cd cc-metric-store
- Build Go Executable
go get
go build
- Prepare Datafolders
mkdir -p var/checkpoints
mkdir -p var/archive
- Update Config
vim config.json
- Exchange existing setting in metrics with the following:
"clock": { "frequency": 60, "aggregation": null },
"cpi": { "frequency": 60, "aggregation": null },
"cpu_load": { "frequency": 60, "aggregation": null },
"flops_any": { "frequency": 60, "aggregation": null },
"flops_dp": { "frequency": 60, "aggregation": null },
"flops_sp": { "frequency": 60, "aggregation": null },
"ib_bw": { "frequency": 60, "aggregation": null },
"lustre_bw": { "frequency": 60, "aggregation": null },
"mem_bw": { "frequency": 60, "aggregation": null },
"mem_used": { "frequency": 60, "aggregation": null },
"rapl_power": { "frequency": 60, "aggregation": null }
- Back to toplevel clustercockpit
cd ..
Setup Demo Data
mkdir source-data
cd source-data
- Download JobArchive-Source:
wget https://hpc-mover.rrze.uni-erlangen.de/HPC-Data/0x7b58aefb/eig7ahyo6fo2bais0ephuf2aitohv1ai/job-archive-dev.tar.xz
tar xJf job-archive-dev.tar.xz
mv ./job-archive ./job-archive-source
rm ./job-archive-dev.tar.xz
- Download CC-Metric-Store Checkpoints:
mkdir -p cc-metric-store-source/checkpoints
cd cc-metric-store-source/checkpoints
wget https://hpc-mover.rrze.uni-erlangen.de/HPC-Data/0x7b58aefb/eig7ahyo6fo2bais0ephuf2aitohv1ai/cc-metric-store-checkpoints.tar.xz
tar xf cc-metric-store-checkpoints.tar.xz
rm cc-metric-store-checkpoints.tar.xz
- Back to source-data
cd ../..
- Run timestamp migration script. This may take tens of minutes!
cp ../migrateTimestamps.pl .
./migrateTimestamps.pl
- Expected output:
Starting to update start- and stoptimes in job-archive for emmy
Starting to update start- and stoptimes in job-archive for woody
Done for job-archive
Starting to update checkpoint filenames and data starttimes for emmy
Starting to update checkpoint filenames and data starttimes for woody
Done for checkpoints
- Copy cluster.json files from source to migrated folders
cp source-data/job-archive-source/emmy/cluster.json cc-backend/var/job-archive/emmy/
cp source-data/job-archive-source/woody/cluster.json cc-backend/var/job-archive/woody/
- Initialize Job-Archive in SQLite3 job.db and add demo user
cd cc-backend
./cc-backend -init-db -add-user demo:admin:demo
- Expected output:
<6>[INFO] new user "demo" created (roles: ["admin"], auth-source: 0)
<6>[INFO] Building job table...
<6>[INFO] A total of 3936 jobs have been registered in 1.791 seconds.
- Back to toplevel clustercockpit
cd ..
Startup both Apps
- In cc-backend root:
$ ./cc-backend -server -dev
- Starts ClusterCockpit at http://localhost:8080
- Log:
<6>[INFO] HTTP server listening at :8080...
- Use local internet browser to access interface
- You should see and be able to browse finished Jobs
- Metadata is read from SQLite3 database
- Metricdata is read from job-archive/JSON-Files
- Create User in settings (top-right corner)
  - Name: apiuser
  - Username: apiuser
  - Role: API
  - Submit & Refresh Page
- Create JWT for apiuser
  - In Userlist, press Gen. JWT for apiuser
  - Save JWT for later use
- In cc-metric-store root:
$ ./cc-metric-store
- Starts the cc-metric-store on http://localhost:8081. Log:
2022/07/15 17:17:42 Loading checkpoints newer than 2022-07-13T17:17:42+02:00
2022/07/15 17:17:45 Checkpoints loaded (5621 files, 319 MB, that took 3.034652s)
2022/07/15 17:17:45 API http endpoint listening on '0.0.0.0:8081'
- Does not have a graphical interface
- Optional: Test function by executing:
$ curl -H "Authorization: Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJFZERTQSJ9.eyJ1c2VyIjoiYWRtaW4iLCJyb2xlcyI6WyJST0xFX0FETUlOIiwiUk9MRV9BTkFMWVNUIiwiUk9MRV9VU0VSIl19.d-3_3FZTsadPjDEdsWrrQ7nS0edMAR4zjl-eK7rJU3HziNBfI9PDHDIpJVHTNN5E5SlLGLFXctWyKAkwhXL-Dw" -D - "http://localhost:8081/api/query" -d "{ \"cluster\": \"emmy\", \"from\": $(expr $(date +%s) - 60), \"to\": $(date +%s), \"queries\": [{
\"metric\": \"flops_any\",
\"host\": \"e1111\"
}] }"
HTTP/1.1 200 OK
Content-Type: application/json
Date: Fri, 15 Jul 2022 13:57:22 GMT
Content-Length: 119
{"results":[[JSON-DATA-ARRAY]]}
Development API web interfaces
The -dev flag enables web interfaces to document and test the APIs:
- Local GQL Playground - A GraphQL playground. To use it you must have an authenticated session in the same browser.
- Local Swagger Docs - A Swagger UI. To use it you have to be logged out, so no user session in the same browser. Use the JWT token with role API generated previously to authenticate via HTTP header.
Use cc-backend API to start job
- Enter the URL http://localhost:8080/swagger/index.html in your browser.
- Enter the JWT token you generated for the API user by clicking the green Authorize button in the upper right part of the window.
- Click the /job/start_job endpoint and click the Try it out button.
- Enter the following JSON into the request body text area and fill in a recent start timestamp by executing date +%s:
{
"jobId": 100000,
"arrayJobId": 0,
"user": "ccdemouser",
"subCluster": "main",
"cluster": "emmy",
"startTime": <date +%s>,
"project": "ccdemoproject",
"resources": [
{"hostname": "e0601"},
{"hostname": "e0823"},
{"hostname": "e0337"},
{"hostname": "e1111"}],
"numNodes": 4,
"numHwthreads": 80,
"walltime": 86400
}
- The response body should be the database id of the started job, for example:
{
"id": 3937
}
- Check in ClusterCockpit
  - User ccdemouser should appear in Users-Tab with one running job
  - It could take up to 5 Minutes until the Job is displayed with some current data (5 Min Short-Job Filter)
  - Job then is marked with a green running tag
  - Metricdata displayed is read from cc-metric-store!
Use cc-backend API to stop job
- Enter the URL http://localhost:8080/swagger/index.html in your browser.
- Enter the JWT token you generated for the API user by clicking the green Authorize button in the upper right part of the window.
- Click the /job/stop_job/{id} endpoint and click the Try it out button.
- Enter the database id at id that was returned by start_job and copy the following into the request body. Replace the timestamp with a recent one:
{
"cluster": "emmy",
"jobState": "completed",
"stopTime": <RECENT TS>
}
On success a json document with the job meta data is returned.
Check in ClusterCockpit
- User ccdemouser should appear in Users-Tab with one completed job
- Job is no longer marked with a green running tag -> Completed!
- Metricdata displayed is now read from job-archive!
Check in job-archive
cd ./cc-backend/var/job-archive/emmy/100/000
cd $STARTTIME
- Inspect meta.json and data.json
Helper scripts
- In this tarball you can find the perl script generate_subcluster.pl that helps to generate the subcluster section for your system. Usage:
  - Log into an exclusive cluster node.
  - The LIKWID tools likwid-topology and likwid-bench must be in the PATH!
  - $ ./generate_subcluster.pl outputs the subcluster section on stdout
Please be aware that
- You have to enter the name and node list for the subCluster manually.
- GPU detection only works if LIKWID was built with CUDA available and you also run likwid-topology with CUDA loaded.
- Do not blindly trust the measured peakflops values.
- Because the script blindly relies on the CSV format output by likwid-topology this is a fragile undertaking!
5.5 - How to add a notification banner
Overview
To add a notification banner you can add a file notice.txt to the ./var
directory of the cc-backend server. As long as this file is present, all text
in this file is shown in an info banner on the homepage.
Add notification banner in web interface
As an alternative, the admin role can also add and edit the notification banner
from the settings view.
5.6 - How to create a `cluster.json` file
Overview
Every cluster is configured using a dedicated cluster.json file that is part of
the job archive. You can find the JSON schema for it here.
This file provides information about the homogeneous hardware
partitions within the cluster including the node topology and the metric list.
A real production configuration is provided as part of cc-examples.
Structure
There are the following main parts:
- name: The name of the cluster
- metricConfig: The metric list configuration
- subClusters: Homogeneous hardware partitions in the cluster
The metric configuration
There is one metric list per cluster. You can find a list of recommended metrics and their naming here.
5.7 - How to customize cc-backend
Overview
Customizing cc-backend means replacing the placeholder logo, legal texts, and
login template with your own. You can also place a text file in ./var
to add dynamic status or notification messages to the ClusterCockpit homepage.
Replace legal texts
To replace the imprint.tmpl and privacy.tmpl legal texts, you can place your
version in ./var/. At startup cc-backend will check if ./var/imprint.tmpl and/or
./var/privacy.tmpl exist and use them instead of the built-in placeholders.
You can use the placeholders in web/templates as a blueprint.
Replace login template
To replace the default login layout and styling, you can place your version in
./var/. At startup cc-backend will check if ./var/login.tmpl exists and use
it instead of the built-in placeholder. You can use the default template
web/templates/login.tmpl as a blueprint.
Replace logo
To change the logo displayed in the navigation bar, you can provide the file
logo.png in the folder ./var/img/. On startup cc-backend will check if the
folder exists and use the images provided there instead of the built-in images.
You may also place additional images there that you use in a custom login template.
Add notification banner on homepage
To add a notification banner you can add a file notice.txt to ./var. As long
as this file is present, all text in this file is shown in an info banner on the
homepage.
5.8 - How to deploy and update cc-backend
Workflow for deployment
Why we do not provide a docker container
The ClusterCockpit web backend binary has no external dependencies, everything is included in the binary. The external assets, SQL database and job archive, would also be external in a docker setup. The only advantage of a docker setup would be that the initial configuration is automated. But this only needs to be done one time. We therefore think that setting up docker, securing and maintaining it is not worth the effort.
It is recommended to install all ClusterCockpit components in a common directory, e.g. /opt/monitoring, var/monitoring or var/clustercockpit.
In the following we use /opt/monitoring.
Two systemd services run on the central monitoring server:
- clustercockpit: Binary cc-backend in /opt/monitoring/cc-backend.
- cc-metric-store: Binary cc-metric-store in /opt/monitoring/cc-metric-store.
ClusterCockpit is deployed as a single binary that embeds all static assets.
We recommend keeping all cc-backend binary versions in a folder archive and
linking the currently active one from the cc-backend root.
This allows for easy roll-back in case something doesn’t work.
Please Note
cc-backend is started with root rights to open the privileged ports (80 and
443). It is recommended to set the configuration options user and group, in
which case cc-backend will drop root permissions once the ports are taken.
You have to take care that the ownership of the ./var folder and
its contents is set accordingly.
Workflow to update
This example assumes the DB and job archive versions did not change. In case the new binary requires a newer database or job archive version read here how to migrate to newer versions.
- Stop systemd service:
sudo systemctl stop clustercockpit.service
- Backup the sqlite DB file! This is as simple as copying it.
- Copy new cc-backend binary to /opt/monitoring/cc-backend/archive (Tip: Use a date tag like YYYYMMDD-cc-backend). Here is an example:
cp ~/cc-backend /opt/monitoring/cc-backend/archive/20231124-cc-backend
- Link from cc-backend root to current version
ln -s /opt/monitoring/cc-backend/archive/20231124-cc-backend /opt/monitoring/cc-backend/cc-backend
- Start systemd service:
sudo systemctl start clustercockpit.service
- Check if everything is ok:
sudo systemctl status clustercockpit.service
- Check log for issues:
sudo journalctl -u clustercockpit.service
- Check the ClusterCockpit web frontend and your Slurm adapters if anything is broken!
5.9 - How to generate JWT tokens
Overview
ClusterCockpit uses JSON Web Tokens (JWT) for authorization of its APIs. JWTs are the industry standard for securing APIs and are also used, for example, in OAuth2. For details on JWTs refer to the JWT article in the Concepts section.
When a user logs in via the /login page using a browser, a session cookie
(secured using the random bytes in the SESSION_KEY env variable, which you
should also change in production) is used for all requests after the successful
login. The JWTs make it easier to use the APIs of ClusterCockpit using scripts
or other external programs. The token is specified in the Authorization HTTP
header using the Bearer schema (there is an example below). Tokens can be issued
to users from the configuration view in the Web-UI or the command line (using
the -jwt <username> option). In order to use the token for API endpoints such as
/api/jobs/start_job/, the user that executes it needs to have the api role.
Regular users can only perform read-only queries and only look at data connected
to jobs they started themselves.
There are two usage scenarios:
- The APIs are used during a browser session. API accesses are authorized with the active session.
- The REST API is used outside a browser session, e.g. by scripts. In this case
you have to issue a token manually. This is possible from within the
configuration view or on the command line. It is recommended to issue a JWT
token in this case for a special user that only has the api role. By using different users for different purposes, fine-grained access control and access revocation management are possible.
The token is commonly specified in the Authorization HTTP header using the
Bearer schema. ClusterCockpit uses an Ed25519 private/public key pair to sign and
verify its tokens. You can use cc-backend to generate new JWT tokens.
Workflow
Create a new Ed25519 public/private key pair for signing and validating tokens
We provide a small utility tool as part of cc-backend:
go build ./cmd/gen-keypair/
./gen-keypair
Add key pair in your .env file for cc-backend
An env file template can be found in ./configs.
cc-backend requires the private key to sign newly generated JWT tokens and the
public key to validate tokens used to authenticate in its REST APIs.
Generate new JWT token
Every user with the admin role can create or change a user in the configuration view of the web interface. To generate a new JWT for a user just press the GenJWT button behind the user name in the user list.
A new api user and corresponding JWT keys can also be generated from the command line.
Create new API user with admin and api role:
./cc-backend -add-user myapiuser:admin,api:<password>
Create a new JWT token for this user:
./cc-backend -jwt myapiuser
Use issued token on client side
curl -X GET "<API ENDPOINT>" -H "accept: application/json" -H "Content-Type: application/json" -H "Authorization: Bearer <JWT TOKEN>"
This token can be used for the cc-backend REST API as well as for the
cc-metric-store. If you use the token for cc-metric-store you have to
configure it to use the corresponding public key for validation in its
config.json.
Note
Per default the JWT tokens generated by cc-backend will not expire! To set an expiration date you have to configure an expiration duration in config.json.
You find details here, use keys jwts: max-age.
Of course the JWT token can also be generated by other means as long as it is
signed with an Ed25519 private key and the corresponding public key is configured
in cc-backend or cc-metric-store. For the claims that are set and used by
ClusterCockpit refer to the JWT article.
cc-metric-store
The cc-metric-store also
uses JWTs for authentication. As it does not issue new tokens, it does not need
to know the private key. The public key of the keypair that is used to generate
the JWTs that grant access to the cc-metric-store can be specified in its
config.json. When configuring the metricDataRepository object in the
cluster.json file of the job-archive, you can put a token issued by
cc-backend itself.
5.10 - How to prepare a new release
Steps to prepare a release
On hotfix branch:
- Update ReleaseNotes.md
- Update version in Makefile
- Commit, push, and pull request
- Merge in master
On Linux host:
- Pull master
- Ensure that GitHub Token environment variable GITHUB_TOKEN is set
- Create release tag:
git tag v1.1.0 -m release
- Execute goreleaser release
5.11 - How to regenerate the Swagger UI documentation
Overview
This project integrates swagger ui to
document and test its REST API. The swagger documentation files can be found in
./api/.
Note
Regenerating the Swagger UI files is only required if you change the file ./internal/api/rest.go.
Otherwise the Swagger UI is already built correctly and ready to use.
Generate Swagger UI files
You can generate the swagger-ui configuration by running the following command from the cc-backend root directory:
go run github.com/swaggo/swag/cmd/swag init -d ./internal/api,./pkg/schema -g rest.go -o ./api
You need to move one generated file:
mv ./api/docs.go ./internal/api/docs.go
Finally rebuild cc-backend:
make
Use the Swagger UI web interface
If you start cc-backend with the -dev flag, the Swagger web interface is available
at http://localhost:8080/swagger/.
To use the Try Out functionality, e.g. to test the REST API, you must enter a JWT
key for a user with the API role.
Info
The user who owns the JWT key must not be logged into the same browser (have a valid session), or the Swagger requests will not work. It is recommended to create a separate user that has only the API role.
5.12 - How to setup a systemd service
How to run as a systemd service.
The files in this directory assume that you install ClusterCockpit to
/opt/monitoring/cc-backend.
Of course you can choose any other location, but make sure you replace all paths
starting with /opt/monitoring/cc-backend in the clustercockpit.service file!
The config.json may contain the optional fields user and group. If
specified, the application will call setuid and setgid after reading the
config file and binding to a TCP port (so it can take a privileged port), but
before it starts accepting any connections. This is good for security, but also
means that the var/ directory must be readable and writeable by this user.
The .env and config.json files may contain secrets and should not be
readable by this user. If these files are changed, the server must be restarted.
- Clone this repository somewhere in your home
git clone git@github.com:ClusterCockpit/cc-backend.git
- (Optional) Install dependencies and build. In general it is recommended to use the provided release binaries.
cd cc-backend && make
Copy the binary to the target folder (adapt if necessary):
sudo mkdir -p /opt/monitoring/cc-backend/
cp ./cc-backend /opt/monitoring/cc-backend/
- Modify the config.json and env-template.txt files from the configs directory to your liking and put them in the target directory
cp ./configs/config.json /opt/monitoring/config.json && cp ./configs/env-template.txt /opt/monitoring/.env
vim /opt/monitoring/config.json # do your thing...
vim /opt/monitoring/.env # do your thing...
- (Optional) Customization: Add your versions of the login view, legal texts, and logo image. You may use the templates in ./web/templates as blueprint. Every overwrite is separate.
cp login.tmpl /opt/monitoring/cc-backend/var/
cp imprint.tmpl /opt/monitoring/cc-backend/var/
cp privacy.tmpl /opt/monitoring/cc-backend/var/
# Ensure your logo and any images you use in your login template have a suitable size.
cp -R img /opt/monitoring/cc-backend/img
- Copy the systemd service unit file. You may adapt it to your needs.
sudo cp ./init/clustercockpit.service /etc/systemd/system/clustercockpit.service
- Enable and start the server
sudo systemctl enable clustercockpit.service # optional (if done, (re-)starts automatically)
sudo systemctl start clustercockpit.service
Check what's going on:
sudo systemctl status clustercockpit.service
sudo journalctl -u clustercockpit.service
5.13 - How to use the Swagger UI documentation
Overview
This project integrates swagger ui to
document and test its REST API. The swagger documentation files can be found in
./api/.
Access the Swagger UI web interface
If you start cc-backend with the -dev flag, the Swagger web interface is available
at http://localhost:8080/swagger/.
To use the Try Out functionality, e.g. to test the REST API, you must enter a JWT
key for a user with the API role.
Info
The user who owns the JWT key must not be logged into the same browser (have a valid session), or the Swagger requests will not work. It is recommended to create a separate user that has only the API role.
5.14 - Unit tests
Overview
We use the standard golang testing environment.
The following conventions are used:
- White box unit tests: Tests for internal functionality are placed in files
- Black box unit tests: Tests for public interfaces are placed in files
with <package name>_test.go and belong to the package <package_name>_test.
There only exists one package test file per package.
- Integration tests: Tests that use multiple components are placed in a
package test file. These are named <package name>_test.go and belong to the package <package_name>_test.
- Test assets: Any required files are placed in a directory ./testdata
within each package directory.
Executing tests
Visual Studio Code has a very good golang test integration. For debugging a test this is the recommended solution.
The Makefile provided by us has a test target that executes:
> go clean -testcache
> go build ./...
> go vet ./...
> go test ./...
Of course the commands can also be used on the command line. For details about golang testing refer to the standard documentation:
6 - Explanation
6.1 - Authentication
Overview
The authentication is implemented in internal/auth/. In auth.go
an interface is defined that any authentication provider must fulfill. It also
acts as a dispatcher to delegate the calls to the available authentication
providers.
Two authentication types are available:
- JWT authentication for the REST API that does not create a session cookie
- Session based authentication using a session cookie
The most important routines in auth are:
- Login(): Handle POST request to login user and start a new session
- Auth(): Authenticate user and put User Object in context of the request
The http router calls auth in the following cases:
- r.Handle("/login", authentication.Login( ... )).Methods(http.MethodPost): The POST request on the /login route will call the Login callback.
- r.Handle("/jwt-login", authentication.Login( ... )): Any request on the /jwt-login route will call the Login callback. Intended for use for the JWT token based authenticators.
- Any route in the secured subrouter will always call Auth(), on success it will call the next handler in the chain, on failure it will render the login template.
secured.Use(func(next http.Handler) http.Handler {
return authentication.Auth(
// On success:
next,
// On failure:
func(rw http.ResponseWriter, r *http.Request, err error) {
// Render login form
})
})
A JWT token can be used to initiate an authenticated user session. This can either happen by calling the login route with a token provided in a header or via a special cookie containing the JWT token. For API routes the access is authenticated on every request using the JWT token and no session is initiated.
Login
The Login function (located in auth.go):
- Extracts the user name and gets the user from the user database table. In case the user is not found the user object is set to nil.
- Iterates over all authenticators and:
  - Calls its CanLogin function which checks if the authentication method is supported for this user.
  - Calls its Login function to authenticate the user. On success a valid user object is returned.
  - Creates a new session object, stores the user attributes in the session and saves the session.
  - Starts the onSuccess http handler
Local authenticator
This authenticator is applied if
return user != nil && user.AuthSource == AuthViaLocalPassword
Compares the password provided by the login form to the password hash stored in the user database table:
if e := bcrypt.CompareHashAndPassword([]byte(user.Password), []byte(r.FormValue("password"))); e != nil {
log.Errorf("AUTH/LOCAL > Authentication for user %s failed!", user.Username)
return nil, fmt.Errorf("Authentication failed")
}
LDAP authenticator
This authenticator is applied if the user was found in the database and its AuthSource is LDAP:
if user != nil {
if user.AuthSource == schema.AuthViaLDAP {
return user, true
}
}
If the option SyncUserOnLogin is set it tries to sync the user from the LDAP
directory. In case this succeeds the user is persisted to the database and can
log in.
Gets the LDAP connection and tries a bind with the provided credentials:
if err := l.Bind(userDn, r.FormValue("password")); err != nil {
log.Errorf("AUTH/LDAP > Authentication for user %s failed: %v", user.Username, err)
return nil, fmt.Errorf("Authentication failed")
}
JWT Session authenticator
Login via JWT token will create a session without password.
For login the X-Auth-Token header is not supported. This authenticator is
applied if the Authorization header or query parameter login-token is present:
return user, r.Header.Get("Authorization") != "" ||
r.URL.Query().Get("login-token") != ""
The Login function:
- Parses the token and checks if it is expired
- Checks if the signing method is EdDSA or HS256 or HS512
- Checks if claims are valid and extracts the claims
- The following claims have to be present:
  - sub: The subject, in this case this is the username
  - exp: Expiration in Unix epoch time
  - roles: String array with roles of user
- In case user does not exist in the database and the option SyncUserOnLogin is set add user to user database table with AuthViaToken AuthSource.
- Return valid user object
JWT Cookie Session authenticator
Login via JWT cookie token will create a session without password. It is first checked if the required configuration options are set:
- trustedIssuer
- CookieName
and optionally the environment variable CROSS_LOGIN_JWT_PUBLIC_KEY is set.
This authenticator is applied if the configured cookie is present:
jwtCookie, err := r.Cookie(cookieName)
if err == nil && jwtCookie.Value != "" {
return true
}
The Login function:
- Extracts and parses the token
- Checks if signing method is Ed25519/EdDSA
- In case publicKeyCrossLogin is configured:
  - Check if iss issuer claim matched trusted issuer from configuration
  - Return public cross login key
  - Otherwise return standard public key
- Check if claims are valid
- Depending on the option validateUser the roles are extracted from JWT token or taken from user object fetched from database
- Ask browser to delete the JWT cookie
- In case user does not exist in the database and the option SyncUserOnLogin is set add user to user database table with AuthViaToken AuthSource.
- Return valid user object
Auth
The Auth function (located in auth.go):
- Returns a new http handler function that is defined right away
- This handler tries two methods to authenticate a user:
  - Via a JWT API token in AuthViaJWT()
  - Via a valid session in AuthViaSession()
- If err is nil and the user object is valid it puts the user object in the request context and starts the onSuccess http handler
- Otherwise it calls the onFailure handler
AuthViaJWT
Implemented in JWTAuthenticator:
- Extract token either from header X-Auth-Token or Authorization with Bearer prefix
- Parse token and check if it is valid. The Parse routine will also check if the token is expired.
- If the option validateUser is set it will ensure the user object exists in the database and takes the roles from the database user
- Otherwise the roles are extracted from the roles claim
- Returns a valid user object with AuthType set to AuthToken
AuthViaSession
- Extracts session
- Get values username, projects, and roles from session
- Returns a valid user object with AuthType set to AuthSession
6.2 - Configuration Management
Release versions
Versions are marked according to semantic versioning. Each version embeds the following static assets in the binary:
- Web frontend with javascript files and all static assets
- Golang template files for server-side rendering
- JSON schema files for validation
- Database migration files
The remaining external assets are:
- The SQL database used
- The job archive
- The configuration files config.json and .env
The external assets are versioned with integer IDs.
This means that each release binary is bound to specific versions of the SQL
database and the job archive.
The configuration file is checked against the current schema at startup.
The -migrate-db command line switch can be used to migrate the SQL database
from a previous version to the latest one.
We offer a separate tool archive-migration to migrate an existing job archive
from the previous to the latest version.
Versioning of APIs
cc-backend provides two API backends:
- A REST API for querying jobs.
- A GraphQL API for data exchange between web frontend and cc-backend.
The REST API will also be versioned. We still have to decide whether we will also support older REST API versions by versioning the endpoint URLs. The GraphQL API is for internal use and will not be versioned.
How to build
In general it is recommended to use the provided release binary.
In case you want to build cc-backend yourself, please always use the provided makefile. This will ensure
that the frontend is also built correctly and that the version is encoded in the binary.
6.3 - Job Archive
The job archive specifies an exchange format for job meta and performance metric data. It consists of two parts:
- a SQLite database schema for job meta data and performance statistics
- a JSON file format together with a directory hierarchy specification
By using an open, portable and simple specification based on files it is possible to exchange job performance data for research and analysis purposes as well as use it as a robust way for archiving job performance data to disk.
SQLite database schema
Introduction
A SQLite 3 database schema is provided to standardize the job meta data information in a portable way. The schema also includes optional columns for job performance statistics (called a job performance footprint). The database acts as a front end to filter and select subsets of job IDs, that are the keys to get the full job performance data in the job performance tree hierarchy.
Database schema
The schema includes 3 tables: the job table, a tag table and a jobtag table representing the MANY-TO-MANY relation between jobs and tags. The SQL schema is specified here. Explanation of the various columns including the JSON datatypes is documented here.
Directory hierarchy specification
Specification
To limit the number of directories within a single directory, a tree approach is used that splits the integer job ID. The job ID is split into chunks of 1000 each. Usually 2 layers of directories are sufficient, but the concept can be used for an arbitrary number of layers.
For a 2 layer schema this can be achieved with (code example in Perl):
$level1 = $jobID/1000;
$level2 = $jobID%1000;
$dstPath = sprintf("%s/%s/%d/%03d", $trunk, $destdir, $level1, $level2);
Example
For the job ID 1034871 the directory path is ./1034/871/.
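The same 2 layer split can be written in Go. This is a minimal sketch; the function name jobDir and the root directory name archive are placeholders for this example and are not taken from cc-backend.

```go
package main

import (
	"fmt"
	"path"
)

// jobDir splits the job ID into two directory levels of 1000 entries each.
func jobDir(root string, jobID int64) string {
	level1 := jobID / 1000 // integer division, e.g. 1034871 -> 1034
	level2 := jobID % 1000 // remainder, e.g. 1034871 -> 871
	return path.Join(root, fmt.Sprintf("%d", level1), fmt.Sprintf("%03d", level2))
}

func main() {
	fmt.Println(jobDir("archive", 1034871)) // archive/1034/871
	fmt.Println(jobDir("archive", 1034005)) // archive/1034/005
}
```

The second level is zero-padded to three digits, matching the %03d format of the Perl example.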
Json file format
Overview
Every cluster must be configured in a cluster.json
file.
The job data consists of two files:
- meta.json: Contains job meta information and job statistics.
- data.json: Contains complete job data with time series.
The description of the JSON format specification is available as a JSON schema
(https://json-schema.org/) format file. The latest version of the JSON
schema is part of the cc-backend source tree. For external reference it is
also available in a separate repository.
Specification cluster.json
The json schema specification in its raw format is available at the GitHub repository. A variant rendered for better readability is found in the references.
Specification meta.json
The json schema specification in its raw format is available at the GitHub repository. A variant rendered for better readability is found in the references.
Specification data.json
The json schema specification in its raw format is available at the GitHub repository. A variant rendered for better readability is found in the references.
Metric time series data is stored with a fixed time step. The time step is set
per metric. If no value is available for a timestamp of a metric time series,
null is entered.
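To illustrate how a missing sample ends up as null, the following sketch encodes a short series with the Go standard library. Using nil pointers for missing values is an assumption of this example and not necessarily how cc-backend represents them internally.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// encodeSeries marshals a fixed-step series. A nil entry stands for a
// timestamp without a value and is written as JSON null.
func encodeSeries(samples []*float64) (string, error) {
	out, err := json.Marshal(samples)
	return string(out), err
}

// demoSeries encodes three time steps, the second one without a value.
func demoSeries() string {
	first, third := 1.5, 2.0
	encoded, err := encodeSeries([]*float64{&first, nil, &third})
	if err != nil {
		return ""
	}
	return encoded
}

func main() {
	fmt.Println(demoSeries()) // [1.5,null,2]
}
```

Because the time step is fixed, the position in the array identifies the timestamp, so a gap has to be kept as an explicit null instead of being dropped.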
6.4 - JSON Web Token
Introduction
ClusterCockpit uses JSON Web Tokens (JWT) for authorization of its APIs.
JSON Web Token (JWT) is an open standard (RFC 7519) that defines a compact and self-contained way for securely transmitting information between parties as a JSON object.
This information can be verified and trusted because it is digitally signed.
In ClusterCockpit JWTs are signed using an Ed25519 (EdDSA) public/private key pair.
Because tokens are signed using public/private key pairs, the signature also certifies that only the party holding the private key is the one that signed it.
Expiration of the generated tokens as well as the maximum length of a browser session can be configured in the config.json
file described here.
The Ed25519 algorithm for signatures was chosen because it is compatible with other tools that require authentication, such as NATS.io, and because these elliptic-curve methods provide similar security with smaller keys compared to something like RSA. They are slightly more expensive to validate, but that effect is negligible.
JWT Payload
You may view the payload of a JWT token at https://jwt.io/#debugger-io. Currently ClusterCockpit sets the following claims:
- iat: Issued at claim. The “iat” claim is used to identify the time at which the JWT was issued. This claim can be used to determine the age of the JWT.
- sub: Subject claim. Identifies the subject of the JWT, in our case this is the username.
- roles: An array of strings specifying the roles set for the subject.
- exp: Expiration date of the token (only if explicitly configured)
It is important to know that JWTs are not encrypted, only signed. This means that outsiders cannot create new JWTs or modify existing ones, but they are able to read out the username.
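That the payload is readable by anyone can be demonstrated by decoding the middle part of a token without any key. This is a sketch using only the standard library; the token is built inside the example and its signature part is a dummy string.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// payloadClaims decodes the payload of a JWT WITHOUT verifying the
// signature. It only shows that the content is not encrypted.
func payloadClaims(token string) (map[string]any, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return nil, errors.New("token must have three dot-separated parts")
	}
	raw, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		return nil, err
	}
	var claims map[string]any
	if err := json.Unmarshal(raw, &claims); err != nil {
		return nil, err
	}
	return claims, nil
}

// demoSubject builds an example token and reads the username from it.
func demoSubject() string {
	enc := base64.RawURLEncoding.EncodeToString
	token := enc([]byte(`{"alg":"EdDSA","typ":"JWT"}`)) + "." +
		enc([]byte(`{"sub":"alice","roles":["user"]}`)) + ".signature-not-checked"
	claims, err := payloadClaims(token)
	if err != nil {
		return ""
	}
	sub, _ := claims["sub"].(string)
	return sub
}

func main() {
	fmt.Println(demoSubject()) // alice
}
```

Decoding like this must never be used for authentication, because the signature is not checked.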
Accept externally generated JWTs provided via cookie
If there is an external service like an AuthAPI that can generate JWTs and hand them over to ClusterCockpit via cookies, CC can be configured to accept them:
- .env: CC needs a public Ed25519 key to verify foreign JWT signatures. Public keys in PEM format can be converted with the instructions in /tools/convert-pem-pubkey-for-cc .
CROSS_LOGIN_JWT_PUBLIC_KEY="+51iXX8BdLFocrppRxIw52xCOf8xFSH/eNilN5IHVGc="
- config.json: Insert a name for the cookie (set by the external service) containing the JWT so that CC knows where to look. Define a trusted issuer (JWT claim ‘iss’), otherwise the JWT will be rejected. If you want usernames and user roles from JWTs (‘sub’ and ‘roles’ claim) to be validated against CC’s internal database, you need to enable it here. Unknown users will then be rejected and roles set via JWT will be ignored.
"jwts": {
"cookieName": "access_cc",
"forceJWTValidationViaDatabase": true,
"trustedExternalIssuer": "auth.example.com"
}
- Make sure your external service includes the same issuer (iss) in its JWTs. Example JWT payload:
{
"iat": 1668161471,
"nbf": 1668161471,
"exp": 1668161531,
"sub": "alice",
"roles": [
"user"
],
"jti": "a1b2c3d4-1234-5678-abcd-a1b2c3d4e5f6",
"iss": "auth.example.com"
}
6.5 - Metric Store
Introduction
CCMS (Cluster Cockpit Metric Store) is a simple in-memory time series database. It stores the data about the nodes in your cluster for a specific interval of days. Data about your nodes can be collected with various instrumentation tools like RAPL, LIKWID, PAPI etc. Instrumentation tools can collect data like memory bandwidth, flops, clock frequency, CPU usage etc. After a specified number of days, the data from the in-memory database will be written to disk, archived and released from the in-memory database. This documentation explains in detail how the CCMS components work. The outline is as follows:
- Present the structure of the metric store.
- Explain background workers.
Let us get started with the very basic understanding of how CCMS is structured and how it manages data over time.
General tree structure can be as follows:
root
|-----cluster
| |------node -> [node-metrics]
| | |--components -> [node-level-metrics]
| | |--components -> [node-level-metrics]
| |
| |------node -> [node-metrics]
| |--components -> [node-level-metrics]
| |--components -> [node-level-metrics]
|
|-----cluster
|-----node -> [node-metrics]
| |--components -> [node-level-metrics]
| |--components -> [node-level-metrics]
|
|-----node -> [node-metrics]
|--components -> [node-level-metrics]
|--components -> [node-level-metrics]
A simple tree representation with example:
root
|-----alex
| |------a903 -> [mem_cached,cpu_idle,nfs4_read]
| | |--hwthread01 -> [cpu_load,cpu_user,flops_any]
| | |--accelerator01 -> [mem_bw,mem_used,flops_any]
| |
| |------a322 -> [mem_cached,cpu_idle,nfs4_read]
| |--hwthread42 -> [cpu_load,cpu_user,flops_any]
| |--accelerator05 -> [mem_bw,mem_used,flops_any]
|
|-----fritz
|-----f104 -> [mem_cached,cpu_idle,nfs4_read]
| |--hwthread35 -> [cpu_load,cpu_user,flops_any]
| |--socket02 -> [cpu_load,cpu_user,flops_any]
|
|-----f576 -> [mem_cached,cpu_idle,nfs4_read]
|--hwthread47 -> [cpu_load,cpu_user,flops_any]
|--cpu01 -> [cpu_load,cpu_user,flops_any]
Example tree structure of CCMS containing 2 clusters, ‘alex’ and ‘fritz’. Each cluster contains its own nodes and each node contains its components. Each node and each of its components holds metrics. a903 is an example of a node, and hwthread01 and accelerator01 are node-level components. Each node has its own metrics, and the node-level components have their own metrics as well, i.e. node-level-metrics.
Internal data structures used in cc-metric-store
A representation of the Level and Buffer data structure with the buffer chain.
From our previous example, we move from a simplistic view to a more realistic view. Each buffer for a given metric holds up to BUFFER_CAP elements in its data array. BUFFER_CAP is usually 512 elements, so with float64 elements (8 bytes each) the data of a buffer occupies 4 KB, which is also the typical size of a memory page. Below you can find all the data structures and their member variables. In our example, the start times of consecutive buffers are exactly 512 epoch seconds apart. Older buffers are linked as the previous buffer of the newer buffer. This creates a chain of buffers for every level.
Data structure used to hold the data in memory:
- MemoryStore
MemoryStore struct {
// Parses and stores the metrics from config.json
Metrics HashMap[string][MetricConfig]
// Initial root level.
root Level
}
- Level
// From our example, alex, fritz, a903, a322, hwthread01 are all of Level data structure.
Level struct {
// Stores the metrics for the level.
// From our example, mem_cached, flops_any are of Buffer data structure.
metrics []*buffer
// Stores the child levels, addressed by their name.
children HashMap[string][*Level]
}
- Buffer
buffer struct {
// Pointer to previous buffer
prev *buffer
// Pointer to next buffer
next *buffer
// Array of floats to store the measurements (up to BUFFER_CAP elements)
data []float64
// Interval in seconds at which measurements will arrive.
frequency int64
// Buffer's start time stored in epoch seconds
start int
// If true, this buffer will be skipped for file checkpointing
archived bool
closed bool
}
- MetricConfig
MetricConfig struct {
// Interval in seconds at which measurements will arrive.
// frequency of 60 means the timestep/resolution is 60 seconds.
Frequency int
// Can be 'sum', 'avg' or null. Describes how to aggregate metrics from the same timestep over the hierarchy.
Aggregation String
// Private, used internally...
Offset int
}
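How levels and the buffer chain interact can be sketched in runnable Go. This is a simplified illustration and not the cc-metric-store implementation: metrics are kept in a map instead of a slice, checkpoint flags are omitted, and the names findOrCreate, write and chainLength are chosen for this example.

```go
package main

import "fmt"

const bufferCap = 512 // BUFFER_CAP from the text

type buffer struct {
	prev      *buffer   // older buffer in the chain
	data      []float64 // up to bufferCap measurements
	frequency int64     // seconds between measurements
	start     int64     // epoch seconds of the first measurement
}

type level struct {
	metrics  map[string]*buffer // newest buffer per metric (simplified)
	children map[string]*level
}

func newLevel() *level {
	return &level{metrics: map[string]*buffer{}, children: map[string]*level{}}
}

// findOrCreate walks a selector such as ["alex", "a903", "hwthread01"]
// and creates missing levels on the way.
func (l *level) findOrCreate(selector []string) *level {
	for _, name := range selector {
		child, ok := l.children[name]
		if !ok {
			child = newLevel()
			l.children[name] = child
		}
		l = child
	}
	return l
}

// write appends a value. A full buffer becomes the prev of a new buffer.
func (l *level) write(metric string, ts, frequency int64, value float64) {
	b := l.metrics[metric]
	if b == nil || len(b.data) == bufferCap {
		b = &buffer{prev: b, start: ts, frequency: frequency,
			data: make([]float64, 0, bufferCap)}
		l.metrics[metric] = b
	}
	b.data = append(b.data, value)
}

// chainLength counts the buffers reachable via prev.
func chainLength(b *buffer) int {
	n := 0
	for ; b != nil; b = b.prev {
		n++
	}
	return n
}

// demoChain writes 1025 one-second samples and reports the chain length
// and the distance between the start times of the two newest buffers.
func demoChain() (int, int64) {
	root := newLevel()
	node := root.findOrCreate([]string{"alex", "a903"})
	for ts := int64(0); ts < 1025; ts++ {
		node.write("mem_cached", ts, 1, 42.0)
	}
	newest := node.metrics["mem_cached"]
	return chainLength(newest), newest.start - newest.prev.start
}

func main() {
	fmt.Println(demoChain()) // 3 512
}
```

With a frequency of one second, 1025 samples fill two buffers completely and start a third one, and the start times of neighbouring buffers differ by 512 seconds.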
Background workers
Background workers are separate threads spawned for each background task:
- Data retention -> This background worker uses the retention-on-memory parameter in the config.json and sets a looping interval for the user-given time. It ticks until the given interval is reached and then releases all the Buffers in CCMS which are older than the user-given time.
  In this example, we assume that we insert data continuously in CCMS with a retention period of 48 hrs. The background worker will always check with an interval of retention-period/2. In the example, it is necessary to check every 24 hrs so that the CCMS can retain data of 48 hrs overall. Once it reaches 72 hrs, the background worker releases the first 24 hours of data from the in-memory database.
- Data check pointing -> This background worker uses interval from the checkpoints parameter in the config.json and sets a looping interval for the user-given time. It ticks until the given interval is reached and creates local backups of the data from the CCMS to the disk. The check pointed files can be found at the user-defined directory sub-parameter from the checkpoints parameter in the config.json file. Check pointing does not mean removing the data from the in-memory database. The data in memory will only be released once the retention period is reached.
- Data archiving -> This background worker uses interval from the archive parameter in the config.json and sets a looping interval for the user-given time. It ticks until the given interval is reached and zips all the checkpointed files which are older than the user-given time in the interval sub-parameter. Once the checkpointed files are zipped, they are deleted from the checkpointing directory.
- Graceful shutdown handler -> This is a special background worker that detects system or keyboard interrupts like Ctrl+C or Ctrl+Z. In case of an interrupt, it is essential to save the data from the in-memory database. There can be a case when the CCMS contains data just in memory that has not been checkpointed. So this background worker scans for the Buffers that have not been checkpointed and writes them to the checkpoint files before shutting down the CCMS.
Reusing the buffers in cc-metric-store
This section explains how CCMS reuses buffers once they are released by the retention background worker.
In this example, we extend the previous example and assume that the retention background worker releases the last buffer of every level, i.e. of node and node-level metrics. A buffer that is about to be unlinked from the buffer chain is not freed from memory. Instead, it is unlinked and stored in the memory pool as shown. This allows buffers to be reused whenever a buffer reaches the BUFFER_CAP limit and a metric requests a new buffer.
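One way to model such a memory pool in Go is sync.Pool from the standard library. Whether cc-metric-store uses sync.Pool or its own pool is not stated here, so the following is only a sketch under that assumption; the names newBuffer and releaseBuffer are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

const bufferCap = 512 // BUFFER_CAP from the text

type buffer struct {
	prev  *buffer
	start int64
	data  []float64
}

// bufferPool hands out released buffers again and only allocates a new
// one if no released buffer is available.
var bufferPool = sync.Pool{
	New: func() any {
		return &buffer{data: make([]float64, 0, bufferCap)}
	},
}

// newBuffer takes a buffer from the pool and resets it for reuse.
func newBuffer(start int64) *buffer {
	b := bufferPool.Get().(*buffer)
	b.prev, b.start, b.data = nil, start, b.data[:0]
	return b
}

// releaseBuffer is called for a buffer that was unlinked from its chain.
func releaseBuffer(b *buffer) {
	b.prev = nil
	bufferPool.Put(b)
}

// demoPool fills a buffer, releases it and requests a new one.
func demoPool() string {
	b := newBuffer(0)
	for i := 0; i < bufferCap; i++ {
		b.data = append(b.data, float64(i))
	}
	releaseBuffer(b)

	next := newBuffer(512)
	return fmt.Sprintf("%d/%d", len(next.data), cap(next.data))
}

func main() {
	fmt.Println(demoPool()) // 0/512
}
```

The buffer returned by newBuffer is always empty and has the full capacity, regardless of whether it is a reused or a freshly allocated one.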
6.6 - Roles
ClusterCockpit uses a fixed set of user roles to control data access and to distinguish authorizations. The roles are primarily used in the web interface to display different views, but they also limit data access when requests are answered by the server backend.
The roles currently implemented are:
User Role
The standard role for all users. By default, granted to all users imported from LDAP. It is also the default selection for the administrative “Create User” form.
Use Case: View and list personal jobs, view personal job detail, inspect metrics of personal jobs.
Access: Jobs started from the user's account only.
Manager Role
A privileged role for project supervisors. This role has to be granted manually by administrators. If ClusterCockpit is configured to accept JWT logins from external management applications, it is possible to retain roles granted in the respective application, see JWT docs.
In addition to the role itself, one or more projects need to be assigned to the user by administrators.
Use Case: In addition to personal job access, this role is intended to view and inspect all jobs of all users of the assigned projects (usergroups), in order to self-manage and identify problems of the subordinate user group.
Access: Personally started jobs, regardless of project. Additionally, all jobs started from all users of the assigned projects (usergroups).
Support Role
A privileged role for support staff. This role has to be granted manually by administrators. If ClusterCockpit is configured to accept JWT logins from external management applications, it is possible to retain roles granted in the respective application, see JWT docs.
In regard to job view access, this role is identical to administrators. However, web interface view access differs and, most importantly, access to administrative options is prohibited.
Use Case: In addition to personal job access, this role is intended to view and inspect all jobs of all users active on the clusters, in order to identify problems and give guidance for the userbase as a whole, supporting the administrative staff in these tasks.
Access: Personally started jobs, regardless of project. Additionally, all jobs started from all users on all configured clusters.
Administrator Role
The highest available authority for administrative staff only. This role has to be granted manually by other administrators. No JWT can ever grant this role.
All jobs from all active users on all systems can be accessed, as well as all webinterface views. In addition, the administrative options in the settings view are accessible.
Use Case: General access and ClusterCockpit administrative tasks from the settings page.
Access: General access.
API Role
An optional, technical role given to users in order to enable usage of the RESTful API endpoints. This role has to be granted manually by administrators. No JWT can ever grant this role.
This role can be granted to a specialized “API User”, which does not have a password or any other roles and therefore cannot log in by itself. Such a user is only intended to be used to generate JWT access tokens, for example for scripted API access.
Still, this role can be granted to actual users, for example, administrators to generate personal API tokens for testing.
Use Case: Interact with ClusterCockpit's REST API.
Access: Allows usage of ClusterCockpit's REST API.
7 - Reference
In-depth description of configuration options, file formats, and REST API interfaces.
7.1 - Backend
Reference information regarding the primary ClusterCockpit component “cc-backend” (GitHub Repo).
7.1.1 - Command Line
This page describes the command line options for the cc-backend
executable.
-add-user <username>:[admin,support,manager,api,user]:<password>
Function: Adds a new user to the database. Only one role can be assigned.
Example: -add-user abcduser:manager:somepass
-config <path>
Function: Specifies alternative path to application configuration file.
Default: ./config.json
Example: -config ./configfiles/configuration.json
-del-user <username>
Function: Removes a user from the database by username.
Example: -del-user abcduser
-dev
Function: Enables development components: GraphQL Playground and Swagger UI.
-gops
Function: Go server listens via github.com/google/gops/agent (for debugging).
-import-job <path-to-meta.json>:<path-to-data.json>, ...
Function: Import one or more jobs by comma separated list of paths to meta.json and data.json.
Example: -import-job ./to-import/job1-meta.json:./to-import/job1-data.json,./to-import/job2-meta.json:./to-import/job2-data.json
-init
Function: Sets up the var directory. Initializes the sqlite database file, config.json and the .env environment variable file.
-init-db
Function: Iterates the job-archive and re-initializes the ‘job’, ’tag’, and ‘jobtag’ tables based on archived jobs.
-jwt <username>
Function: Generates and prints a JWT for the user specified by its username.
Example: -jwt abcduser
-logdate
Function: Set this flag to add date and time to log messages.
-loglevel <level>
Function: Sets the loglevel of the running ClusterCockpit instance. “Debug” will print all levels, “Crit” will only log critical log messages.
Arguments: debug | info | warn | err | crit
Default: info
Example: -loglevel debug
-migrate-db
Function: Migrate database to latest supported version and exit.
-server
Function: Start a server, continues listening on configured port (Default: :8080) after initialization and argument handling.
-sync-ldap
Function: Synchronizes the ‘user’ table with LDAP.
-version
Function: Shows version information and exits.
7.1.2 - Configuration
CC-Backend requires a JSON configuration file that specifies the cluster systems to be used. The schema of the configuration is described at the schema documentation.
To override the default, specify the location of a JSON configuration file with the -config <file path>
command line option.
Configuration Options
- addr: Type string. Address where the http (or https) server will listen on (for example: ’localhost:80’). Default :8080.
- apiAllowedIPs: Type array [string]. Addresses from which the secured API endpoints (/users and other auth related endpoints) can be reached.
- user: Type string. Drop root permissions once .env was read and the port was taken. Only applicable if using privileged port.
- group: Type string. Drop root permissions once .env was read and the port was taken. Only applicable if using privileged port.
- disable-authentication: Type bool. Disable authentication (for everything: API, Web-UI, …). Default false.
- embed-static-files: Type bool. If all files in web/frontend/public should be served from within the binary itself (they are embedded) or not. Default true.
- static-files: Type string. Folder where static assets can be found, if embed-static-files is false. No default.
- db-driver: Type string. ‘sqlite3’ or ‘mysql’ (mysql will work for mariadb as well). Default sqlite3.
- db: Type string. For sqlite3 a filename, for mysql a DSN in this format, without query parameters. Default: ./var/job.db.
- job-archive: Type object.
  - kind: Type string. At the moment only file is supported as value.
  - path: Type string. Path to the job-archive. Default: ./var/job-archive.
  - compression: Type integer. Setup automatic compression for jobs older than number of days.
  - retention: Type object.
    - policy: Type string (required). Retention policy. Possible values none, delete, move.
    - includeDB: Type bool. Also remove jobs from database.
    - age: Type integer. Act on jobs with startTime older than age (in days).
    - location: Type string. The target directory for retention. Only applicable for retention policy move.
- disable-archive: Type bool. Keep all metric data in the metric data repositories, do not write to the job-archive. Default false.
- validate: Type bool. Validate all input json documents against json schema.
- ldap: Type object. For LDAP Authentication and user synchronisation. Default nil.
  - url: Type string (required). URL of LDAP directory server.
  - user_base: Type string (required). Base DN of user tree root.
  - search_dn: Type string (required). DN for authenticating LDAP admin account with general read rights.
  - user_bind: Type string (required). Expression used to authenticate users via LDAP bind. Must contain uid={username}.
  - user_filter: Type string (required). Filter to extract users for syncing.
  - username_attr: Type string. Attribute with full user name. Defaults to gecos if not provided.
  - sync_interval: Type string. Interval used for syncing local user table with LDAP directory. Parsed using time.ParseDuration.
  - sync_del_old_users: Type bool. Delete obsolete users in database.
  - syncUserOnLogin: Type bool. Add non-existent user to DB at login attempt if user exists in LDAP directory.
- jwts: Type object (required). For JWT Authentication.
  - max-age: Type string (required). Configure how long a token is valid. As string parsable by time.ParseDuration().
  - cookieName: Type string. Cookie that should be checked for a JWT token.
  - validateUser: Type bool. Deny login for users not in database (but defined in JWT). Overwrite roles in JWT with database roles.
  - trustedIssuer: Type string. Issuer that should be accepted when validating external JWTs.
  - syncUserOnLogin: Type bool. Add non-existent user to DB at login attempt with values provided in JWT.
  - updateUserOnLogin: Type bool. Update existent user in DB at login attempt with values provided in JWT. Currently only the person name is updated.
- oidc: Type object. Default nil.
  - provider: Type string.
  - syncUserOnLogin: Type bool. Add non-existent user to DB at login attempt with values provided in JWT.
  - updateUserOnLogin: Type bool. Update existent user in DB at login attempt with values provided in JWT. Currently only the person name is updated.
- session-max-age: Type string. Specifies for how long a session shall be valid as a string parsable by time.ParseDuration(). If 0 or empty, the session/token does not expire! Default 168h.
- https-cert-file and https-key-file: Type string. If both those options are not empty, use HTTPS using those certificates.
- redirect-http-to: Type string. If not the empty string and addr does not end in “:80”, redirect every request incoming at port 80 to that url.
- ui-defaults: Type object. Default configuration for web interface views. Most options can be overwritten by the user via the web interface. See below for details.
- enable-resampling: Type object. If configured, will enable dynamic zoom in frontend metric plots using the configured values.
  - resolutions: Type array [integer]. Array of resampling target resolutions, in seconds; Example: [600,300,60].
  - trigger: Type integer. Trigger next zoom level at less than this many visible datapoints.
- machine-state-dir: Type string. Where to store MachineState files. TODO: Explain in more detail!
- stop-jobs-exceeding-walltime: Type int. If not zero, automatically mark jobs as stopped that run X seconds longer than their walltime. Only applies if walltime is set for job. Default 0.
- short-running-jobs-duration: Type int. Do not show running jobs shorter than X seconds. Default 300.
- emission-constant: Type integer. Energy Mix CO2 Emission Constant [g/kWh]. If entered, displays estimated CO2 emission for job based on the job's totalEnergy.
- cron-frequency: Type object. Defines frequency of cron job workers.
  - duration-worker: Type string. Default: 5m
  - footprint-worker: Type string. Default: 10m
- clusters: Type array [object] (required). Array of clusters.
  - name: Type string. The name of the cluster.
  - metricDataRepository: Type object.
    - kind: Type string. Can be one of [cc-metric-store, influxdb].
    - url: Type string.
    - token: Type string.
filterRanges
Type object. This option controls the slider ranges for the UI controls of numNodes, duration, and startTime. Example:
"filterRanges": {
"numNodes": { "from": 1, "to": 64 },
"duration": { "from": 0, "to": 86400 },
"startTime": { "from": "2022-01-01T00:00:00Z", "to": null }
}
UI Default Object Fields
- analysis_view_histogramMetrics: Type array [string]. Metrics to show as job count histograms in analysis view. Default ["flops_any", "mem_bw", "mem_used"].
- analysis_view_scatterPlotMetrics: Type array of string array. Initial scatter plot configuration in analysis view. Default [["flops_any", "mem_bw"], ["flops_any", "cpu_load"], ["cpu_load", "mem_bw"]].
- job_view_nodestats_selectedMetrics: Type array [string]. Initial metrics shown in node statistics table of single job view. Default ["flops_any", "mem_bw", "mem_used"].
- job_view_selectedMetrics: Type array [string]. Default ["flops_any", "mem_bw", "mem_used"].
- plot_general_colorBackground: Type bool. Color plot background according to job average threshold limits. Default true.
- plot_general_colorscheme: Type array [string]. Initial color scheme. Default "#00bfff", "#0000ff", "#ff00ff", "#ff0000", "#ff8000", "#ffff00", "#80ff00".
- plot_general_lineWidth: Type int. Initial linewidth. Default 3.
- plot_list_jobsPerPage: Type int. Jobs shown per page in job lists. Default 50.
- plot_list_selectedMetrics: Type array [string]. Initial metric plots shown in jobs lists. Default "cpu_load", "ipc", "mem_used", "flops_any", "mem_bw".
- plot_view_plotsPerRow: Type int. Number of plots per row in single job view. Default 3.
- plot_view_showPolarplot: Type bool. Option to toggle polar plot in single job view. Default true.
- plot_view_showRoofline: Type bool. Option to toggle roofline plot in single job view. Default true.
- plot_view_showStatTable: Type bool. Option to toggle the node statistic table in single job view. Default true.
- system_view_selectedMetric: Type string. Initial metric shown in system view. Default cpu_load.
Some of the ui-defaults keys can be suffixed with :<clustername> in order to have different settings depending on the current cluster. Those are notably job_view_nodestats_selectedMetrics, job_view_selectedMetrics and plot_list_selectedMetrics.
7.1.3 - Environment
All security-related configurations, e.g. keys and passwords, are set using
environment variables. These can also be set by means of a .env file in
the project root.
Environment Variables
An example env file is found in
this directory.
Copy it as .env
into the project root and adapt it for your needs.
- JWT_PUBLIC_KEY and JWT_PRIVATE_KEY: Base64 encoded Ed25519 keys used for JSON Web Token (JWT) authentication. You can generate your own keypair using go run ./tools/gen-keypair/. The release binaries also include the gen-keypair tool for x86-64. For more information, see the JWT documentation.
- SESSION_KEY: Some random bytes used as secret for cookie-based sessions.
- LDAP_ADMIN_PASSWORD: The LDAP admin user password (optional).
- CROSS_LOGIN_JWT_HS512_KEY: Used for token based logins via another authentication service.
- LOGLEVEL: Can be crit, err, warn, info or debug. Can be used to reduce logging. Default is info.
7.1.4 - REST API
REST API Authorization
In ClusterCockpit JWTs are signed using a public/private key pair using ED25519.
Because tokens are signed using public/private key pairs, the signature also
certifies that only the party holding the private key is the one that signed it.
JWT tokens in ClusterCockpit are not encrypted, which means that all information
is clear text. Expiration of the generated tokens can be configured in config.json using
the max-age option in the jwts object. Example:
"jwts": {
"max-age": "168h"
},
The party that generates and signs JWT tokens has to be in possession of the
private key and any party that accepts JWT tokens must possess the public key to
validate them. cc-backend therefore requires both keys, the private one to
sign generated tokens and the public key to validate tokens that are provided by
REST API clients.
Generate ED25519 key pairs
Usage of Swagger UI
To use the Swagger UI for testing you have to run an instance of cc-backend on localhost (and use the default port 8080):
./cc-backend -server
You may want to start the demo as described here .
This Swagger UI is also available as part of cc-backend
if you start it with
the dev
option:
./cc-backend -server -dev
You may access it at this URL.
Swagger API Reference
Non-Interactive Documentation
This reference is rendered using the swaggerui plugin based on the original definition file found in the ClusterCockpit repository, but without a serving backend. This means that all interactivity (“Try It Out”) will not return actual data. However, a Curl call and a compiled Request URL will still be displayed, if an API endpoint is executed.
7.1.5 - Authentication Handbook
Introduction
cc-backend
supports the following authentication methods:
- Local login with credentials stored in SQL database
- Login with authentication to a LDAP directory
- Authentication via JSON Web Token (JWT):
- With token provided in HTTP request header
- With token provided in cookie
- Login via OpenID Connect (against a KeyCloak instance)
All above methods create a session cookie that is then used for subsequent authentication of requests. Multiple authentication methods can be configured at the same time. If LDAP is enabled it takes precedence over local authentication. The OpenID Connect method against a KeyCloak instance enables many more authentication methods using the ability of KeyCloak to act as an Identity Broker.
The REST API uses stateless authentication via a JWT token, which means that every request must be authenticated.
General configuration options
All configuration is part of the `cc-backend` configuration file `config.json`.
All security-sensitive options such as passwords and tokens are passed as
environment variables. `cc-backend` can read a `.env` file upon startup
and set the environment variables contained there.
Duration of session
By default the maximum duration of a session is 7 days. To change this, the
option `session-max-age` has to be set to a string that can be parsed by the
Golang time.ParseDuration() function.
For most use cases the largest unit `h` is the only relevant option.
Example:
"session-max-age": "24h",
To enable unlimited session duration set `session-max-age` either to 0 or to an
empty string.
LDAP authentication
Configuration
To enable LDAP authentication the following set of options are required as
attributes of the `ldap` JSON object:
- `url`: URL of the LDAP directory server. This must be a complete URL including the protocol and not only the host name. Example: `ldaps://ldsrv.mydomain.com`.
- `user_base`: Base DN of user tree root. Example: `ou=people,ou=users,dc=rz,dc=mydomain,dc=com`.
- `search_dn`: DN for authenticating an LDAP admin account with general read rights. This is required for the sync on login and the sync options. Example: `cn=monitoring,ou=adm,ou=profile,ou=manager,dc=rz,dc=mydomain,dc=com`.
- `user_bind`: Expression used to authenticate users via LDAP bind. Must contain `uid={username}`. Example: `uid={username},ou=people,ou=users,dc=rz,dc=mydomain,dc=com`.
- `user_filter`: Filter to extract users for syncing. Example: `(&(objectclass=posixAccount))`.
Optional configuration options are:
- `username_attr`: Attribute with full user name. Defaults to `gecos` if not provided.
- `sync_interval`: Interval used for syncing SQL user table with LDAP directory. Parsed using time.ParseDuration. The sync interval is always relative to the time `cc-backend` was started. Example: `24h`.
- `sync_del_old_users`: Type boolean. Delete users in SQL database if not in LDAP directory anymore. This of course only applies to users that were added from LDAP.
- `syncUserOnLogin`: Type boolean. Add non-existent user to DB at login attempt if user exists in LDAP directory. This option allows users to log in immediately after they are added to the LDAP directory.
The LDAP authentication method requires the environment variable `LDAP_ADMIN_PASSWORD`
for the `search_dn` account that is used to sync users.
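The options above can be put together in a short sketch that reads an `ldap` object, checks the required attributes, applies the `gecos` default, and expands the `{username}` placeholder of `user_bind`. The struct `ldapConfig` and the functions `parseLdapConfig` and `bindDN` are hypothetical and written for this example only; they are not taken from the `cc-backend` sources. The values are the examples given in this section.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// ldapConfig mirrors the option names listed in the text (hypothetical type).
type ldapConfig struct {
	URL             string `json:"url"`
	UserBase        string `json:"user_base"`
	SearchDN        string `json:"search_dn"`
	UserBind        string `json:"user_bind"`
	UserFilter      string `json:"user_filter"`
	UsernameAttr    string `json:"username_attr"`
	SyncInterval    string `json:"sync_interval"`
	SyncDelOldUsers bool   `json:"sync_del_old_users"`
	SyncUserOnLogin bool   `json:"syncUserOnLogin"`
}

// parseLdapConfig decodes the JSON object and enforces the rules from the text.
func parseLdapConfig(raw []byte) (*ldapConfig, error) {
	var c ldapConfig
	if err := json.Unmarshal(raw, &c); err != nil {
		return nil, err
	}
	required := []struct{ name, value string }{
		{"url", c.URL}, {"user_base", c.UserBase}, {"search_dn", c.SearchDN},
		{"user_bind", c.UserBind}, {"user_filter", c.UserFilter},
	}
	for _, r := range required {
		if r.value == "" {
			return nil, fmt.Errorf("ldap: missing required option %q", r.name)
		}
	}
	if !strings.Contains(c.UserBind, "uid={username}") {
		return nil, errors.New("ldap: user_bind must contain uid={username}")
	}
	if c.UsernameAttr == "" {
		c.UsernameAttr = "gecos" // documented default
	}
	return &c, nil
}

// bindDN expands the placeholder. A real implementation should also escape
// special characters in the user name before building a DN.
func (c *ldapConfig) bindDN(username string) string {
	return strings.ReplaceAll(c.UserBind, "{username}", username)
}

const exampleLdap = `{
  "url": "ldaps://ldsrv.mydomain.com",
  "user_base": "ou=people,ou=users,dc=rz,dc=mydomain,dc=com",
  "search_dn": "cn=monitoring,ou=adm,ou=profile,ou=manager,dc=rz,dc=mydomain,dc=com",
  "user_bind": "uid={username},ou=people,ou=users,dc=rz,dc=mydomain,dc=com",
  "user_filter": "(&(objectclass=posixAccount))",
  "sync_interval": "24h"
}`

func main() {
	cfg, err := parseLdapConfig([]byte(exampleLdap))
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.bindDN("fritz"))
	fmt.Println(cfg.UsernameAttr)
}
```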
Usage
If LDAP is configured it is the first authentication method that is tried if a
user logs in using the login form. A sync with the LDAP directory can also be
triggered from the command line using the flag `-sync-ldap`.
Local authentication
No configuration is required for local authentication.
Usage
You can add a user on the command line using the flag `-add-user`:
./cc-backend -add-user <username>:<roles>:<password>
Example:
./cc-backend -add-user fritz:admin,api:myPass
Roles can be admin, support, manager, api, and user.
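The argument format `<username>:<roles>:<password>` can be taken apart as in the following sketch. The type `newUser` and the function `parseAddUser` are hypothetical and are not the parser used by `cc-backend`. Splitting into at most three parts, so that a password may itself contain colons, is an assumption of this example.

```go
package main

import (
	"fmt"
	"strings"
)

// knownRoles lists the role names given in the text.
var knownRoles = map[string]bool{
	"admin": true, "support": true, "manager": true, "api": true, "user": true,
}

// newUser holds the three parts of the argument (hypothetical type).
type newUser struct {
	Username string
	Roles    []string
	Password string
}

// parseAddUser splits "<username>:<roles>:<password>" and checks the roles.
func parseAddUser(arg string) (newUser, error) {
	// At most three parts: everything after the second colon is the password.
	parts := strings.SplitN(arg, ":", 3)
	if len(parts) != 3 || parts[0] == "" || parts[2] == "" {
		return newUser{}, fmt.Errorf("expected <username>:<roles>:<password>, got %q", arg)
	}
	roles := strings.Split(parts[1], ",")
	for _, r := range roles {
		if !knownRoles[r] {
			return newUser{}, fmt.Errorf("unknown role %q", r)
		}
	}
	return newUser{Username: parts[0], Roles: roles, Password: parts[2]}, nil
}

func main() {
	u, err := parseAddUser("fritz:admin,api:myPass")
	if err != nil {
		panic(err)
	}
	fmt.Println(u.Username, u.Roles)
}
```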
Users can be deleted using the flag `-del-user`:
./cc-backend -del-user fritz
Warning
The option `-del-user` as currently implemented will delete ALL users that
match the username independent of their origin. This means it will also delete
user records that were added from LDAP or JWT tokens.
JWT token authentication
JSON web tokens are a standardized method for representing claims securely between two parties. In ClusterCockpit they are used for authorization to use REST APIs as well as a method to delegate authentication to a third party.
Configuration
Authorization control
`cc-backend` uses roles to decide if a user is authorized to access certain
information. The roles and their rights are described in more detail here.
7.1.6 - Job Archive Handbook
7.1.7 - Schemas
ClusterCockpit Schema References for
- Application Configuration
- Cluster Configuration
- Job Data
- Job Statistics
- Units
- Job Archive Job Metadata
- Job Archive Job Metricdata
The schemas in their raw form can be found in the ClusterCockpit GitHub repository.
Manual Updates
Changes to the original JSON schemas found in the repository are not automatically rendered in this reference documentation.
The raw JSON schemas are parsed and rendered for better readability using the json-schema-for-humans utility.
Last Update: 04.12.2024
7.1.7.1 - Application Config Schema
A detailed description of each of the application configuration options can be found in the config documentation.
The following schema in its raw form can be found in the ClusterCockpit GitHub repository.
Manual Updates
Changes to the original JSON schema found in the repository are not automatically rendered in this reference documentation.
Last Update: 04.12.2024
cc-backend configuration file schema
- 1. Property cc-backend configuration file schema > addr
- 2. Property cc-backend configuration file schema > apiAllowedIPs
- 3. Property cc-backend configuration file schema > user
- 4. Property cc-backend configuration file schema > group
- 5. Property cc-backend configuration file schema > disable-authentication
- 6. Property cc-backend configuration file schema > embed-static-files
- 7. Property cc-backend configuration file schema > static-files
- 8. Property cc-backend configuration file schema > db-driver
- 9. Property cc-backend configuration file schema > db
- 10. Property cc-backend configuration file schema > archive
  - 10.1. Property cc-backend configuration file schema > archive > kind
  - 10.2. Property cc-backend configuration file schema > archive > path
  - 10.3. Property cc-backend configuration file schema > archive > compression
  - 10.4. Property cc-backend configuration file schema > archive > retention
    - 10.4.1. Property cc-backend configuration file schema > archive > retention > policy
    - 10.4.2. Property cc-backend configuration file schema > archive > retention > includeDB
    - 10.4.3. Property cc-backend configuration file schema > archive > retention > age
    - 10.4.4. Property cc-backend configuration file schema > archive > retention > location
- 11. Property cc-backend configuration file schema > disable-archive
- 12. Property cc-backend configuration file schema > validate
- 13. Property cc-backend configuration file schema > session-max-age
- 14. Property cc-backend configuration file schema > https-cert-file
- 15. Property cc-backend configuration file schema > https-key-file
- 16. Property cc-backend configuration file schema > redirect-http-to
- 17. Property cc-backend configuration file schema > stop-jobs-exceeding-walltime
- 18. Property cc-backend configuration file schema > short-running-jobs-duration
- 19. Property cc-backend configuration file schema > emission-constant
- 20. Property cc-backend configuration file schema > cron-frequency
- 21. Property cc-backend configuration file schema > enable-resampling
- 22. Property cc-backend configuration file schema > jwts
  - 22.1. Property cc-backend configuration file schema > jwts > max-age
  - 22.2. Property cc-backend configuration file schema > jwts > cookieName
  - 22.3. Property cc-backend configuration file schema > jwts > validateUser
  - 22.4. Property cc-backend configuration file schema > jwts > trustedIssuer
  - 22.5. Property cc-backend configuration file schema > jwts > syncUserOnLogin
- 23. Property cc-backend configuration file schema > oidc
- 24. Property cc-backend configuration file schema > ldap
  - 24.1. Property cc-backend configuration file schema > ldap > url
  - 24.2. Property cc-backend configuration file schema > ldap > user_base
  - 24.3. Property cc-backend configuration file schema > ldap > search_dn
  - 24.4. Property cc-backend configuration file schema > ldap > user_bind
  - 24.5. Property cc-backend configuration file schema > ldap > user_filter
  - 24.6. Property cc-backend configuration file schema > ldap > username_attr
  - 24.7. Property cc-backend configuration file schema > ldap > sync_interval
  - 24.8. Property cc-backend configuration file schema > ldap > sync_del_old_users
  - 24.9. Property cc-backend configuration file schema > ldap > syncUserOnLogin
- 25. Property cc-backend configuration file schema > clusters
  - 25.1. cc-backend configuration file schema > clusters > clusters items
    - 25.1.1. Property cc-backend configuration file schema > clusters > clusters items > name
    - 25.1.2. Property cc-backend configuration file schema > clusters > clusters items > metricDataRepository
      - 25.1.2.1. Property cc-backend configuration file schema > clusters > clusters items > metricDataRepository > kind
      - 25.1.2.2. Property cc-backend configuration file schema > clusters > clusters items > metricDataRepository > url
      - 25.1.2.3. Property cc-backend configuration file schema > clusters > clusters items > metricDataRepository > token
    - 25.1.3. Property cc-backend configuration file schema > clusters > clusters items > filterRanges
      - 25.1.3.1. Property cc-backend configuration file schema > clusters > clusters items > filterRanges > numNodes
      - 25.1.3.2. Property cc-backend configuration file schema > clusters > clusters items > filterRanges > duration
      - 25.1.3.3. Property cc-backend configuration file schema > clusters > clusters items > filterRanges > startTime
- 26. Property cc-backend configuration file schema > ui-defaults
  - 26.1. Property cc-backend configuration file schema > ui-defaults > plot_general_colorBackground
  - 26.2. Property cc-backend configuration file schema > ui-defaults > plot_general_lineWidth
  - 26.3. Property cc-backend configuration file schema > ui-defaults > plot_list_jobsPerPage
  - 26.4. Property cc-backend configuration file schema > ui-defaults > plot_view_plotsPerRow
  - 26.5. Property cc-backend configuration file schema > ui-defaults > plot_view_showPolarplot
  - 26.6. Property cc-backend configuration file schema > ui-defaults > plot_view_showRoofline
  - 26.7. Property cc-backend configuration file schema > ui-defaults > plot_view_showStatTable
  - 26.8. Property cc-backend configuration file schema > ui-defaults > system_view_selectedMetric
  - 26.9. Property cc-backend configuration file schema > ui-defaults > job_view_showFootprint
  - 26.10. Property cc-backend configuration file schema > ui-defaults > job_list_usePaging
  - 26.11. Property cc-backend configuration file schema > ui-defaults > analysis_view_histogramMetrics
  - 26.12. Property cc-backend configuration file schema > ui-defaults > analysis_view_scatterPlotMetrics
  - 26.13. Property cc-backend configuration file schema > ui-defaults > job_view_nodestats_selectedMetrics
  - 26.14. Property cc-backend configuration file schema > ui-defaults > job_view_selectedMetrics
  - 26.15. Property cc-backend configuration file schema > ui-defaults > plot_general_colorscheme
  - 26.16. Property cc-backend configuration file schema > ui-defaults > plot_list_selectedMetrics
Title: cc-backend configuration file schema
Type | object |
Required | No |
Additional properties | Any type allowed |
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
- addr | No | string | No | - | Address where the http (or https) server will listen on (for example: ’localhost:80’). |
- apiAllowedIPs | No | array of string | No | - | Addresses from which secured API endpoints can be reached |
- user | No | string | No | - | Drop root permissions once .env was read and the port was taken. Only applicable if using privileged port. |
- group | No | string | No | - | Drop root permissions once .env was read and the port was taken. Only applicable if using privileged port. |
- disable-authentication | No | boolean | No | - | Disable authentication (for everything: API, Web-UI, …). |
- embed-static-files | No | boolean | No | - | If all files in `web/frontend/public` should be served from within the binary itself (they are embedded) or not. |
- static-files | No | string | No | - | Folder where static assets can be found, if embed-static-files is false. |
- db-driver | No | enum (of string) | No | - | sqlite3 or mysql (mysql will work for mariadb as well). |
- db | No | string | No | - | For sqlite3 a filename, for mysql a DSN in this format: https://github.com/go-sql-driver/mysql#dsn-data-source-name (Without query parameters!). |
- archive | No | object | No | - | Configuration keys for job-archive |
- disable-archive | No | boolean | No | - | Keep all metric data in the metric data repositories, do not write to the job-archive. |
- validate | No | boolean | No | - | Validate all input json documents against json schema. |
- session-max-age | No | string | No | - | Specifies for how long a session shall be valid as a string parsable by time.ParseDuration(). If 0 or empty, the session/token does not expire! |
- https-cert-file | No | string | No | - | Filepath to SSL certificate. If also https-key-file is set use HTTPS using those certificates. |
- https-key-file | No | string | No | - | Filepath to SSL key file. If also https-cert-file is set use HTTPS using those certificates. |
- redirect-http-to | No | string | No | - | If not the empty string and addr does not end in :80, redirect every request incoming at port 80 to that url. |
- stop-jobs-exceeding-walltime | No | integer | No | - | If not zero, automatically mark jobs as stopped running X seconds longer than their walltime. Only applies if walltime is set for job. |
- short-running-jobs-duration | No | integer | No | - | Do not show running jobs shorter than X seconds. |
- emission-constant | No | integer | No | - | . |
- cron-frequency | No | object | No | - | Frequency of cron job workers. |
- enable-resampling | No | object | No | - | Enable dynamic zoom in frontend metric plots. |
+ jwts | No | object | No | - | For JWT token authentication. |
- oidc | No | object | No | - | - |
- ldap | No | object | No | - | For LDAP Authentication and user synchronisation. |
+ clusters | No | array of object | No | - | Configuration for the clusters to be displayed. |
- ui-defaults | No | object | No | - | Default configuration for web UI |
1. Property cc-backend configuration file schema > addr
Type | string |
Required | No |
Description: Address where the http (or https) server will listen on (for example: ’localhost:80’).
2. Property cc-backend configuration file schema > apiAllowedIPs
Type | array of string |
Required | No |
Description: Addresses from which secured API endpoints can be reached
Array restrictions | |
---|---|
Min items | N/A |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
apiAllowedIPs items | - |
2.1. cc-backend configuration file schema > apiAllowedIPs > apiAllowedIPs items
Type | string |
Required | No |
3. Property cc-backend configuration file schema > user
Type | string |
Required | No |
Description: Drop root permissions once .env was read and the port was taken. Only applicable if using privileged port.
4. Property cc-backend configuration file schema > group
Type | string |
Required | No |
Description: Drop root permissions once .env was read and the port was taken. Only applicable if using privileged port.
5. Property cc-backend configuration file schema > disable-authentication
Type | boolean |
Required | No |
Description: Disable authentication (for everything: API, Web-UI, …).
6. Property cc-backend configuration file schema > embed-static-files
Type | boolean |
Required | No |
Description: If all files in `web/frontend/public` should be served from within the binary itself (they are embedded) or not.
7. Property cc-backend configuration file schema > static-files
Type | string |
Required | No |
Description: Folder where static assets can be found, if embed-static-files is false.
8. Property cc-backend configuration file schema > db-driver
Type | enum (of string) |
Required | No |
Description: sqlite3 or mysql (mysql will work for mariadb as well).
Must be one of:
- “sqlite3”
- “mysql”
9. Property cc-backend configuration file schema > db
Type | string |
Required | No |
Description: For sqlite3 a filename, for mysql a DSN in this format: https://github.com/go-sql-driver/mysql#dsn-data-source-name (Without query parameters!).
10. Property cc-backend configuration file schema > archive
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: Configuration keys for job-archive
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ kind | No | enum (of string) | No | - | Backend type for job-archive |
- path | No | string | No | - | Path to job archive for file backend |
- compression | No | integer | No | - | Setup automatic compression for jobs older than number of days |
- retention | No | object | No | - | Configuration keys for retention |
10.1. Property cc-backend configuration file schema > archive > kind
Type | enum (of string) |
Required | Yes |
Description: Backend type for job-archive
Must be one of:
- “file”
- “s3”
10.2. Property cc-backend configuration file schema > archive > path
Type | string |
Required | No |
Description: Path to job archive for file backend
10.3. Property cc-backend configuration file schema > archive > compression
Type | integer |
Required | No |
Description: Setup automatic compression for jobs older than number of days
10.4. Property cc-backend configuration file schema > archive > retention
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: Configuration keys for retention
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ policy | No | enum (of string) | No | - | Retention policy |
- includeDB | No | boolean | No | - | Also remove jobs from database |
- age | No | integer | No | - | Act on jobs with startTime older than age (in days) |
- location | No | string | No | - | The target directory for retention. Only applicable for retention move. |
10.4.1. Property cc-backend configuration file schema > archive > retention > policy
Type | enum (of string) |
Required | Yes |
Description: Retention policy
Must be one of:
- “none”
- “delete”
- “move”
10.4.2. Property cc-backend configuration file schema > archive > retention > includeDB
Type | boolean |
Required | No |
Description: Also remove jobs from database
10.4.3. Property cc-backend configuration file schema > archive > retention > age
Type | integer |
Required | No |
Description: Act on jobs with startTime older than age (in days)
10.4.4. Property cc-backend configuration file schema > archive > retention > location
Type | string |
Required | No |
Description: The target directory for retention. Only applicable for retention move.
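The `archive` properties listed in section 10 can be checked with a small sketch like the one below. The types `archiveConfig` and `retentionConfig` and the function `parseArchive` are hypothetical and do not come from the `cc-backend` sources. The enum values and required flags follow the tables above. Requiring `location` for the `move` policy is an extra assumption of this example, since the schema does not mark it as required. The paths in the usage example are placeholders.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// retentionConfig mirrors properties 10.4.1 to 10.4.4 (hypothetical type).
type retentionConfig struct {
	Policy    string `json:"policy"`
	IncludeDB bool   `json:"includeDB"`
	Age       int    `json:"age"`
	Location  string `json:"location"`
}

// archiveConfig mirrors properties 10.1 to 10.4 (hypothetical type).
type archiveConfig struct {
	Kind        string           `json:"kind"`
	Path        string           `json:"path"`
	Compression int              `json:"compression"`
	Retention   *retentionConfig `json:"retention"`
}

// parseArchive decodes an archive object and checks the documented enums.
func parseArchive(raw []byte) (*archiveConfig, error) {
	var a archiveConfig
	if err := json.Unmarshal(raw, &a); err != nil {
		return nil, err
	}
	if a.Kind != "file" && a.Kind != "s3" {
		return nil, fmt.Errorf("archive: kind must be \"file\" or \"s3\", got %q", a.Kind)
	}
	if r := a.Retention; r != nil {
		switch r.Policy {
		case "none", "delete", "move":
		default:
			return nil, fmt.Errorf("archive: unknown retention policy %q", r.Policy)
		}
		// Assumption of this sketch: a move needs a target directory.
		if r.Policy == "move" && r.Location == "" {
			return nil, errors.New("archive: retention policy \"move\" without location")
		}
	}
	return &a, nil
}

func main() {
	raw := []byte(`{
	  "kind": "file",
	  "path": "./var/job-archive",
	  "compression": 7,
	  "retention": {"policy": "move", "age": 365, "location": "./var/archive-old"}
	}`)
	a, err := parseArchive(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(a.Kind, a.Retention.Policy, a.Retention.Age)
}
```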
11. Property cc-backend configuration file schema > disable-archive
Type | boolean |
Required | No |
Description: Keep all metric data in the metric data repositories, do not write to the job-archive.
12. Property cc-backend configuration file schema > validate
Type | boolean |
Required | No |
Description: Validate all input json documents against json schema.
13. Property cc-backend configuration file schema > session-max-age
Type | string |
Required | No |
Description: Specifies for how long a session shall be valid as a string parsable by time.ParseDuration(). If 0 or empty, the session/token does not expire!
14. Property cc-backend configuration file schema > https-cert-file
Type | string |
Required | No |
Description: Filepath to SSL certificate. If also https-key-file is set use HTTPS using those certificates.
15. Property cc-backend configuration file schema > https-key-file
Type | string |
Required | No |
Description: Filepath to SSL key file. If also https-cert-file is set use HTTPS using those certificates.
16. Property cc-backend configuration file schema > redirect-http-to
Type | string |
Required | No |
Description: If not the empty string and addr does not end in :80, redirect every request incoming at port 80 to that url.
17. Property cc-backend configuration file schema > stop-jobs-exceeding-walltime
Type | integer |
Required | No |
Description: If not zero, automatically mark jobs as stopped running X seconds longer than their walltime. Only applies if walltime is set for job.
18. Property cc-backend configuration file schema > short-running-jobs-duration
Type | integer |
Required | No |
Description: Do not show running jobs shorter than X seconds.
19. Property cc-backend configuration file schema > emission-constant
Type | integer |
Required | No |
Description: .
20. Property cc-backend configuration file schema > cron-frequency
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: Frequency of cron job workers.
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
- duration-worker | No | string | No | - | Duration Update Worker [Defaults to ‘5m’] |
- footprint-worker | No | string | No | - | Metric-Footprint Update Worker [Defaults to ‘10m’] |
20.1. Property cc-backend configuration file schema > cron-frequency > duration-worker
Type | string |
Required | No |
Description: Duration Update Worker [Defaults to ‘5m’]
20.2. Property cc-backend configuration file schema > cron-frequency > footprint-worker
Type | string |
Required | No |
Description: Metric-Footprint Update Worker [Defaults to ‘10m’]
21. Property cc-backend configuration file schema > enable-resampling
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: Enable dynamic zoom in frontend metric plots.
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ trigger | No | integer | No | - | Trigger next zoom level at less than this many visible datapoints. |
+ resolutions | No | array of integer | No | - | Array of resampling target resolutions, in seconds. |
21.1. Property cc-backend configuration file schema > enable-resampling > trigger
Type | integer |
Required | Yes |
Description: Trigger next zoom level at less than this many visible datapoints.
21.2. Property cc-backend configuration file schema > enable-resampling > resolutions
Type | array of integer |
Required | Yes |
Description: Array of resampling target resolutions, in seconds.
Array restrictions | |
---|---|
Min items | N/A |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
resolutions items | - |
21.2.1. cc-backend configuration file schema > enable-resampling > resolutions > resolutions items
Type | integer |
Required | No |
22. Property cc-backend configuration file schema > jwts
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Description: For JWT token authentication.
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ max-age | No | string | No | - | Configure how long a token is valid. As string parsable by time.ParseDuration() |
- cookieName | No | string | No | - | Cookie that should be checked for a JWT token. |
- validateUser | No | boolean | No | - | Deny login for users not in database (but defined in JWT). Overwrite roles in JWT with database roles. |
- trustedIssuer | No | string | No | - | Issuer that should be accepted when validating external JWTs |
- syncUserOnLogin | No | boolean | No | - | Add non-existent user to DB at login attempt with values provided in JWT. |
22.1. Property cc-backend configuration file schema > jwts > max-age
Type | string |
Required | Yes |
Description: Configure how long a token is valid. As string parsable by time.ParseDuration()
22.2. Property cc-backend configuration file schema > jwts > cookieName
Type | string |
Required | No |
Description: Cookie that should be checked for a JWT token.
22.3. Property cc-backend configuration file schema > jwts > validateUser
Type | boolean |
Required | No |
Description: Deny login for users not in database (but defined in JWT). Overwrite roles in JWT with database roles.
22.4. Property cc-backend configuration file schema > jwts > trustedIssuer
Type | string |
Required | No |
Description: Issuer that should be accepted when validating external JWTs
22.5. Property cc-backend configuration file schema > jwts > syncUserOnLogin
Type | boolean |
Required | No |
Description: Add non-existent user to DB at login attempt with values provided in JWT.
23. Property cc-backend configuration file schema > oidc
Type | object |
Required | No |
Additional properties | Any type allowed |
23.1. The following properties are required
- provider
24. Property cc-backend configuration file schema > ldap
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: For LDAP Authentication and user synchronisation.
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ url | No | string | No | - | URL of LDAP directory server. |
+ user_base | No | string | No | - | Base DN of user tree root. |
+ search_dn | No | string | No | - | DN for authenticating LDAP admin account with general read rights. |
+ user_bind | No | string | No | - | Expression used to authenticate users via LDAP bind. Must contain uid={username}. |
+ user_filter | No | string | No | - | Filter to extract users for syncing. |
- username_attr | No | string | No | - | Attribute with full username. Default: gecos |
- sync_interval | No | string | No | - | Interval used for syncing local user table with LDAP directory. Parsed using time.ParseDuration. |
- sync_del_old_users | No | boolean | No | - | Delete obsolete users in database. |
- syncUserOnLogin | No | boolean | No | - | Add non-existent user to DB at login attempt if user exists in LDAP directory |
24.1. Property cc-backend configuration file schema > ldap > url
Type | string |
Required | Yes |
Description: URL of LDAP directory server.
24.2. Property cc-backend configuration file schema > ldap > user_base
Type | string |
Required | Yes |
Description: Base DN of user tree root.
24.3. Property cc-backend configuration file schema > ldap > search_dn
Type | string |
Required | Yes |
Description: DN for authenticating LDAP admin account with general read rights.
24.4. Property cc-backend configuration file schema > ldap > user_bind
Type | string |
Required | Yes |
Description: Expression used to authenticate users via LDAP bind. Must contain uid={username}.
24.5. Property cc-backend configuration file schema > ldap > user_filter
Type | string |
Required | Yes |
Description: Filter to extract users for syncing.
24.6. Property cc-backend configuration file schema > ldap > username_attr
Type | string |
Required | No |
Description: Attribute with full username. Default: gecos
24.7. Property cc-backend configuration file schema > ldap > sync_interval
Type | string |
Required | No |
Description: Interval used for syncing local user table with LDAP directory. Parsed using time.ParseDuration.
24.8. Property cc-backend configuration file schema > ldap > sync_del_old_users
Type | boolean |
Required | No |
Description: Delete obsolete users in database.
24.9. Property cc-backend configuration file schema > ldap > syncUserOnLogin
Type | boolean |
Required | No |
Description: Add non-existent user to DB at login attempt if user exists in LDAP directory
25. Property cc-backend configuration file schema > clusters
Type | array of object |
Required | Yes |
Description: Configuration for the clusters to be displayed.
Array restrictions | |
---|---|
Min items | N/A |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
clusters items | - |
25.1. cc-backend configuration file schema > clusters > clusters items
Type | object |
Required | No |
Additional properties | Any type allowed |
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ name | No | string | No | - | The name of the cluster. |
+ metricDataRepository | No | object | No | - | Type of the metric data repository for this cluster |
+ filterRanges | No | object | No | - | This option controls the slider ranges for the UI controls of numNodes, duration, and startTime. |
25.1.1. Property cc-backend configuration file schema > clusters > clusters items > name
Type | string |
Required | Yes |
Description: The name of the cluster.
25.1.2. Property cc-backend configuration file schema > clusters > clusters items > metricDataRepository
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Description: Type of the metric data repository for this cluster
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ kind | No | enum (of string) | No | - | - |
+ url | No | string | No | - | - |
- token | No | string | No | - | - |
25.1.2.1. Property cc-backend configuration file schema > clusters > clusters items > metricDataRepository > kind
Type | enum (of string) |
Required | Yes |
Must be one of:
- “influxdb”
- “prometheus”
- “cc-metric-store”
- “test”
25.1.2.2. Property cc-backend configuration file schema > clusters > clusters items > metricDataRepository > url
Type | string |
Required | Yes |
25.1.2.3. Property cc-backend configuration file schema > clusters > clusters items > metricDataRepository > token
Type | string |
Required | No |
25.1.3. Property cc-backend configuration file schema > clusters > clusters items > filterRanges
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Description: This option controls the slider ranges for the UI controls of numNodes, duration, and startTime.
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ numNodes | No | object | No | - | UI slider range for number of nodes |
+ duration | No | object | No | - | UI slider range for duration |
+ startTime | No | object | No | - | UI slider range for start time |
25.1.3.1. Property cc-backend configuration file schema > clusters > clusters items > filterRanges > numNodes
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Description: UI slider range for number of nodes
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ from | No | integer | No | - | - |
+ to | No | integer | No | - | - |
25.1.3.1.1. Property cc-backend configuration file schema > clusters > clusters items > filterRanges > numNodes > from
Type | integer |
Required | Yes |
25.1.3.1.2. Property cc-backend configuration file schema > clusters > clusters items > filterRanges > numNodes > to
Type | integer |
Required | Yes |
25.1.3.2. Property cc-backend configuration file schema > clusters > clusters items > filterRanges > duration
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Description: UI slider range for duration
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ from | No | integer | No | - | - |
+ to | No | integer | No | - | - |
25.1.3.2.1. Property cc-backend configuration file schema > clusters > clusters items > filterRanges > duration > from
Type | integer |
Required | Yes |
25.1.3.2.2. Property cc-backend configuration file schema > clusters > clusters items > filterRanges > duration > to
Type | integer |
Required | Yes |
25.1.3.3. Property cc-backend configuration file schema > clusters > clusters items > filterRanges > startTime
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Description: UI slider range for start time
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ from | No | string | No | - | - |
+ to | No | null | No | - | - |
25.1.3.3.1. Property cc-backend configuration file schema > clusters > clusters items > filterRanges > startTime > from
Type | string |
Required | Yes |
Format | date-time |
25.1.3.3.2. Property cc-backend configuration file schema > clusters > clusters items > filterRanges > startTime > to
Type | null |
Required | Yes |
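As an illustration of the three `filterRanges` properties above, the following Python sketch builds one such entry and checks it against the documented types. The concrete values and the `check_filter_ranges` helper are hypothetical and not part of cc-backend; only the key names and types are taken from the tables.

```python
import json
from datetime import datetime

# Illustrative values; only the key names and types come from the tables above.
filter_ranges = {
    "numNodes": {"from": 1, "to": 64},
    "duration": {"from": 0, "to": 86400},
    "startTime": {"from": "2022-01-01T00:00:00Z", "to": None},
}


def check_filter_ranges(fr):
    """Return a list of problems; an empty list means the entry matches the tables."""
    problems = []
    for key in ("numNodes", "duration"):
        rng = fr.get(key)
        if not isinstance(rng, dict):
            problems.append(f"{key}: missing or not an object")
            continue
        for bound in ("from", "to"):
            value = rng.get(bound)
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                problems.append(f"{key}.{bound}: expected integer")
    start = fr.get("startTime")
    if not isinstance(start, dict):
        problems.append("startTime: missing or not an object")
        return problems
    begin = start.get("from")
    if not isinstance(begin, str):
        problems.append("startTime.from: expected date-time string")
    else:
        try:
            # Older Python versions do not accept a trailing "Z", so normalise it.
            datetime.fromisoformat(begin.replace("Z", "+00:00"))
        except ValueError:
            problems.append("startTime.from: not a valid date-time")
    if "to" not in start or start["to"] is not None:
        problems.append("startTime.to: expected null")
    return problems


print(json.dumps(filter_ranges, indent=2))
print(check_filter_ranges(filter_ranges))
```

The JSON printed by the sketch can serve as a starting point for a `filterRanges` block; for authoritative validation, use the raw schema from the repository with a JSON Schema validator.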
26. Property cc-backend configuration file schema > ui-defaults
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: Default configuration for web UI
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ plot_general_colorBackground | No | boolean | No | - | Color plot background according to job average threshold limits |
+ plot_general_lineWidth | No | integer | No | - | Initial linewidth |
+ plot_list_jobsPerPage | No | integer | No | - | Jobs shown per page in job lists |
+ plot_view_plotsPerRow | No | integer | No | - | Number of plots per row in single job view |
+ plot_view_showPolarplot | No | boolean | No | - | Option to toggle polar plot in single job view |
+ plot_view_showRoofline | No | boolean | No | - | Option to toggle roofline plot in single job view |
+ plot_view_showStatTable | No | boolean | No | - | Option to toggle the node statistic table in single job view |
+ system_view_selectedMetric | No | string | No | - | Initial metric shown in system view |
+ job_view_showFootprint | No | boolean | No | - | Option to toggle footprint UI in single job view |
+ job_list_usePaging | No | boolean | No | - | Option to switch from continuous scroll to paging |
+ analysis_view_histogramMetrics | No | array of string | No | - | Metrics to show as job count histograms in analysis view |
+ analysis_view_scatterPlotMetrics | No | array of array | No | - | Initial scatter plot configuration in analysis view |
+ job_view_nodestats_selectedMetrics | No | array of string | No | - | Initial metrics shown in node statistics table of single job view |
+ job_view_selectedMetrics | No | array of string | No | - | - |
+ plot_general_colorscheme | No | array of string | No | - | Initial color scheme |
+ plot_list_selectedMetrics | No | array of string | No | - | Initial metric plots shown in jobs lists |
26.1. Property cc-backend configuration file schema > ui-defaults > plot_general_colorBackground
Type | boolean |
Required | Yes |
Description: Color plot background according to job average threshold limits
26.2. Property cc-backend configuration file schema > ui-defaults > plot_general_lineWidth
Type | integer |
Required | Yes |
Description: Initial linewidth
26.3. Property cc-backend configuration file schema > ui-defaults > plot_list_jobsPerPage
Type | integer |
Required | Yes |
Description: Jobs shown per page in job lists
26.4. Property cc-backend configuration file schema > ui-defaults > plot_view_plotsPerRow
Type | integer |
Required | Yes |
Description: Number of plots per row in single job view
26.5. Property cc-backend configuration file schema > ui-defaults > plot_view_showPolarplot
Type | boolean |
Required | Yes |
Description: Option to toggle polar plot in single job view
26.6. Property cc-backend configuration file schema > ui-defaults > plot_view_showRoofline
Type | boolean |
Required | Yes |
Description: Option to toggle roofline plot in single job view
26.7. Property cc-backend configuration file schema > ui-defaults > plot_view_showStatTable
Type | boolean |
Required | Yes |
Description: Option to toggle the node statistic table in single job view
26.8. Property cc-backend configuration file schema > ui-defaults > system_view_selectedMetric
Type | string |
Required | Yes |
Description: Initial metric shown in system view
26.9. Property cc-backend configuration file schema > ui-defaults > job_view_showFootprint
Type | boolean |
Required | Yes |
Description: Option to toggle footprint UI in single job view
26.10. Property cc-backend configuration file schema > ui-defaults > job_list_usePaging
Type | boolean |
Required | Yes |
Description: Option to switch from continuous scroll to paging
26.11. Property cc-backend configuration file schema > ui-defaults > analysis_view_histogramMetrics
Type | array of string |
Required | Yes |
Description: Metrics to show as job count histograms in analysis view
Array restrictions | |
---|---|
Min items | N/A |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
analysis_view_histogramMetrics items | - |
26.11.1. cc-backend configuration file schema > ui-defaults > analysis_view_histogramMetrics > analysis_view_histogramMetrics items
Type | string |
Required | No |
26.12. Property cc-backend configuration file schema > ui-defaults > analysis_view_scatterPlotMetrics
Type | array of array |
Required | Yes |
Description: Initial scatter plot configuration in analysis view
Array restrictions | |
---|---|
Min items | N/A |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
analysis_view_scatterPlotMetrics items | - |
26.12.1. cc-backend configuration file schema > ui-defaults > analysis_view_scatterPlotMetrics > analysis_view_scatterPlotMetrics items
Type | array of string |
Required | No |
Array restrictions | |
---|---|
Min items | 1 |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
analysis_view_scatterPlotMetrics items items | - |
26.12.1.1. cc-backend configuration file schema > ui-defaults > analysis_view_scatterPlotMetrics > analysis_view_scatterPlotMetrics items > analysis_view_scatterPlotMetrics items items
Type | string |
Required | No |
26.13. Property cc-backend configuration file schema > ui-defaults > job_view_nodestats_selectedMetrics
Type | array of string |
Required | Yes |
Description: Initial metrics shown in node statistics table of single job view
Array restrictions | |
---|---|
Min items | N/A |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
job_view_nodestats_selectedMetrics items | - |
26.13.1. cc-backend configuration file schema > ui-defaults > job_view_nodestats_selectedMetrics > job_view_nodestats_selectedMetrics items
Type | string |
Required | No |
26.14. Property cc-backend configuration file schema > ui-defaults > job_view_selectedMetrics
Type | array of string |
Required | Yes |
Array restrictions | |
---|---|
Min items | N/A |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
job_view_selectedMetrics items | - |
26.14.1. cc-backend configuration file schema > ui-defaults > job_view_selectedMetrics > job_view_selectedMetrics items
Type | string |
Required | No |
26.15. Property cc-backend configuration file schema > ui-defaults > plot_general_colorscheme
Type | array of string |
Required | Yes |
Description: Initial color scheme
Array restrictions | |
---|---|
Min items | N/A |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
plot_general_colorscheme items | - |
26.15.1. cc-backend configuration file schema > ui-defaults > plot_general_colorscheme > plot_general_colorscheme items
Type | string |
Required | No |
26.16. Property cc-backend configuration file schema > ui-defaults > plot_list_selectedMetrics
Type | array of string |
Required | Yes |
Description: Initial metric plots shown in jobs lists
Array restrictions | |
---|---|
Min items | N/A |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
plot_list_selectedMetrics items | - |
26.16.1. cc-backend configuration file schema > ui-defaults > plot_list_selectedMetrics > plot_list_selectedMetrics items
Type | string |
Required | No |
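To make the `ui-defaults` tables above more concrete, here is a minimal Python sketch of a complete `ui-defaults` object together with a type check derived from those tables. All values (metric names, colors, counts) are made up for illustration, and `invalid_keys` is a hypothetical helper, not a cc-backend function.

```python
# Illustrative values; key names and types follow the tables above.
ui_defaults = {
    "plot_general_colorBackground": True,
    "plot_general_lineWidth": 3,
    "plot_list_jobsPerPage": 50,
    "plot_view_plotsPerRow": 3,
    "plot_view_showPolarplot": True,
    "plot_view_showRoofline": True,
    "plot_view_showStatTable": True,
    "system_view_selectedMetric": "cpu_load",
    "job_view_showFootprint": True,
    "job_list_usePaging": False,
    "analysis_view_histogramMetrics": ["flops_any", "mem_bw"],
    "analysis_view_scatterPlotMetrics": [["flops_any", "mem_bw"], ["cpu_load", "mem_bw"]],
    "job_view_nodestats_selectedMetrics": ["flops_any", "mem_bw"],
    "job_view_selectedMetrics": ["flops_any", "mem_bw"],
    "plot_general_colorscheme": ["#00bfff", "#0000ff", "#ff00ff"],
    "plot_list_selectedMetrics": ["cpu_load", "mem_bw"],
}


def is_str_list(value, min_items=0):
    """True if value is a list of strings with at least min_items entries."""
    return (isinstance(value, list) and len(value) >= min_items
            and all(isinstance(item, str) for item in value))


CHECKS = {
    "bool": lambda v: isinstance(v, bool),
    "int": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "str": lambda v: isinstance(v, str),
    "str_list": is_str_list,
    # Each inner array of analysis_view_scatterPlotMetrics needs at least one item.
    "str_list_list": lambda v: isinstance(v, list) and all(is_str_list(i, 1) for i in v),
}

# Expected kind per key, in the order of the property table.
EXPECTED = {
    "plot_general_colorBackground": "bool",
    "plot_general_lineWidth": "int",
    "plot_list_jobsPerPage": "int",
    "plot_view_plotsPerRow": "int",
    "plot_view_showPolarplot": "bool",
    "plot_view_showRoofline": "bool",
    "plot_view_showStatTable": "bool",
    "system_view_selectedMetric": "str",
    "job_view_showFootprint": "bool",
    "job_list_usePaging": "bool",
    "analysis_view_histogramMetrics": "str_list",
    "analysis_view_scatterPlotMetrics": "str_list_list",
    "job_view_nodestats_selectedMetrics": "str_list",
    "job_view_selectedMetrics": "str_list",
    "plot_general_colorscheme": "str_list",
    "plot_list_selectedMetrics": "str_list",
}


def invalid_keys(cfg):
    """Return the keys that are missing or have the wrong type."""
    return [k for k, kind in EXPECTED.items()
            if k not in cfg or not CHECKS[kind](cfg[k])]


print(invalid_keys(ui_defaults))
```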
Generated using json-schema-for-humans on 2024-12-04 at 16:45:59 +0100
7.1.7.2 - Cluster Schema
The following schema in its raw form can be found in the ClusterCockpit GitHub repository.
Manual Updates
Changes to the original JSON schema found in the repository are not automatically rendered in this reference documentation.
Last Update: 04.12.2024
HPC cluster description
- 1. Property
HPC cluster description > name
- 2. Property
HPC cluster description > metricConfig
- 2.1. HPC cluster description > metricConfig > metricConfig items
- 2.1.1. Property
HPC cluster description > metricConfig > metricConfig items > name
- 2.1.2. Property
HPC cluster description > metricConfig > metricConfig items > unit
- 2.1.3. Property
HPC cluster description > metricConfig > metricConfig items > scope
- 2.1.4. Property
HPC cluster description > metricConfig > metricConfig items > timestep
- 2.1.5. Property
HPC cluster description > metricConfig > metricConfig items > aggregation
- 2.1.6. Property
HPC cluster description > metricConfig > metricConfig items > footprint
- 2.1.7. Property
HPC cluster description > metricConfig > metricConfig items > energy
- 2.1.8. Property
HPC cluster description > metricConfig > metricConfig items > lowerIsBetter
- 2.1.9. Property
HPC cluster description > metricConfig > metricConfig items > peak
- 2.1.10. Property
HPC cluster description > metricConfig > metricConfig items > normal
- 2.1.11. Property
HPC cluster description > metricConfig > metricConfig items > caution
- 2.1.12. Property
HPC cluster description > metricConfig > metricConfig items > alert
- 2.1.13. Property
HPC cluster description > metricConfig > metricConfig items > subClusters
- 2.1.13.1. HPC cluster description > metricConfig > metricConfig items > subClusters > subClusters items
- 2.1.13.1.1. Property
HPC cluster description > metricConfig > metricConfig items > subClusters > subClusters items > name
- 2.1.13.1.2. Property
HPC cluster description > metricConfig > metricConfig items > subClusters > subClusters items > footprint
- 2.1.13.1.3. Property
HPC cluster description > metricConfig > metricConfig items > subClusters > subClusters items > energy
- 2.1.13.1.4. Property
HPC cluster description > metricConfig > metricConfig items > subClusters > subClusters items > lowerIsBetter
- 2.1.13.1.5. Property
HPC cluster description > metricConfig > metricConfig items > subClusters > subClusters items > peak
- 2.1.13.1.6. Property
HPC cluster description > metricConfig > metricConfig items > subClusters > subClusters items > normal
- 2.1.13.1.7. Property
HPC cluster description > metricConfig > metricConfig items > subClusters > subClusters items > caution
- 2.1.13.1.8. Property
HPC cluster description > metricConfig > metricConfig items > subClusters > subClusters items > alert
- 2.1.13.1.9. Property
HPC cluster description > metricConfig > metricConfig items > subClusters > subClusters items > remove
- 3. Property
HPC cluster description > subClusters
- 3.1. HPC cluster description > subClusters > subClusters items
- 3.1.1. Property
HPC cluster description > subClusters > subClusters items > name
- 3.1.2. Property
HPC cluster description > subClusters > subClusters items > processorType
- 3.1.3. Property
HPC cluster description > subClusters > subClusters items > socketsPerNode
- 3.1.4. Property
HPC cluster description > subClusters > subClusters items > coresPerSocket
- 3.1.5. Property
HPC cluster description > subClusters > subClusters items > threadsPerCore
- 3.1.6. Property
HPC cluster description > subClusters > subClusters items > flopRateScalar
- 3.1.7. Property
HPC cluster description > subClusters > subClusters items > flopRateSimd
- 3.1.8. Property
HPC cluster description > subClusters > subClusters items > memoryBandwidth
- 3.1.9. Property
HPC cluster description > subClusters > subClusters items > nodes
- 3.1.10. Property
HPC cluster description > subClusters > subClusters items > topology
- 3.1.10.1. Property
HPC cluster description > subClusters > subClusters items > topology > node
- 3.1.10.2. Property
HPC cluster description > subClusters > subClusters items > topology > socket
- 3.1.10.3. Property
HPC cluster description > subClusters > subClusters items > topology > memoryDomain
- 3.1.10.4. Property
HPC cluster description > subClusters > subClusters items > topology > die
- 3.1.10.5. Property
HPC cluster description > subClusters > subClusters items > topology > core
- 3.1.10.6. Property
HPC cluster description > subClusters > subClusters items > topology > accelerators
- 3.1.10.6.1. HPC cluster description > subClusters > subClusters items > topology > accelerators > accelerators items
- 3.1.10.6.1.1. Property
HPC cluster description > subClusters > subClusters items > topology > accelerators > accelerators items > id
- 3.1.10.6.1.2. Property
HPC cluster description > subClusters > subClusters items > topology > accelerators > accelerators items > type
- 3.1.10.6.1.3. Property
HPC cluster description > subClusters > subClusters items > topology > accelerators > accelerators items > model
Title: HPC cluster description
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: Metadata of an HPC cluster
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ name | No | string | No | - | The unique identifier of a cluster |
+ metricConfig | No | array of object | No | - | Metric specifications |
+ subClusters | No | array of object | No | - | Array of cluster hardware partitions |
1. Property HPC cluster description > name
Type | string |
Required | Yes |
Description: The unique identifier of a cluster
2. Property HPC cluster description > metricConfig
Type | array of object |
Required | Yes |
Description: Metric specifications
Array restrictions | |
---|---|
Min items | 1 |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
metricConfig items | - |
2.1. HPC cluster description > metricConfig > metricConfig items
Type | object |
Required | No |
Additional properties | Any type allowed |
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ name | No | string | No | - | Metric name |
+ unit | No | object | No | In embedfs://unit.schema.json | Metric unit |
+ scope | No | string | No | - | Native measurement resolution |
+ timestep | No | integer | No | - | Frequency of timeseries points |
+ aggregation | No | enum (of string) | No | - | How the metric is aggregated |
- footprint | No | enum (of string) | No | - | Is it a footprint metric and what type |
- energy | No | enum (of string) | No | - | Is it used to calculate job energy |
- lowerIsBetter | No | boolean | No | - | Is lower better. |
+ peak | No | number | No | - | Metric peak threshold (Upper metric limit) |
+ normal | No | number | No | - | Metric normal threshold |
+ caution | No | number | No | - | Metric caution threshold (Suspicious but does not require immediate action) |
+ alert | No | number | No | - | Metric alert threshold (Requires immediate action) |
- subClusters | No | array of object | No | - | Array of cluster hardware partition metric thresholds |
2.1.1. Property HPC cluster description > metricConfig > metricConfig items > name
Type | string |
Required | Yes |
Description: Metric name
2.1.2. Property HPC cluster description > metricConfig > metricConfig items > unit
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Defined in | embedfs://unit.schema.json |
Description: Metric unit
2.1.3. Property HPC cluster description > metricConfig > metricConfig items > scope
Type | string |
Required | Yes |
Description: Native measurement resolution
2.1.4. Property HPC cluster description > metricConfig > metricConfig items > timestep
Type | integer |
Required | Yes |
Description: Frequency of timeseries points
2.1.5. Property HPC cluster description > metricConfig > metricConfig items > aggregation
Type | enum (of string) |
Required | Yes |
Description: How the metric is aggregated
Must be one of:
- “sum”
- “avg”
2.1.6. Property HPC cluster description > metricConfig > metricConfig items > footprint
Type | enum (of string) |
Required | No |
Description: Is it a footprint metric and what type
Must be one of:
- “avg”
- “max”
- “min”
2.1.7. Property HPC cluster description > metricConfig > metricConfig items > energy
Type | enum (of string) |
Required | No |
Description: Is it used to calculate job energy
Must be one of:
- “power”
- “energy”
2.1.8. Property HPC cluster description > metricConfig > metricConfig items > lowerIsBetter
Type | boolean |
Required | No |
Description: Is lower better.
2.1.9. Property HPC cluster description > metricConfig > metricConfig items > peak
Type | number |
Required | Yes |
Description: Metric peak threshold (Upper metric limit)
2.1.10. Property HPC cluster description > metricConfig > metricConfig items > normal
Type | number |
Required | Yes |
Description: Metric normal threshold
2.1.11. Property HPC cluster description > metricConfig > metricConfig items > caution
Type | number |
Required | Yes |
Description: Metric caution threshold (Suspicious but does not require immediate action)
2.1.12. Property HPC cluster description > metricConfig > metricConfig items > alert
Type | number |
Required | Yes |
Description: Metric alert threshold (Requires immediate action)
2.1.13. Property HPC cluster description > metricConfig > metricConfig items > subClusters
Type | array of object |
Required | No |
Description: Array of cluster hardware partition metric thresholds
Array restrictions | |
---|---|
Min items | N/A |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
subClusters items | - |
2.1.13.1. HPC cluster description > metricConfig > metricConfig items > subClusters > subClusters items
Type | object |
Required | No |
Additional properties | Any type allowed |
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ name | No | string | No | - | Hardware partition name |
- footprint | No | enum (of string) | No | - | Is it a footprint metric and what type. Overrides the global setting |
- energy | No | enum (of string) | No | - | Is it used to calculate job energy. Overrides the global setting |
- lowerIsBetter | No | boolean | No | - | Is lower better. Overrides the global setting |
- peak | No | number | No | - | - |
- normal | No | number | No | - | - |
- caution | No | number | No | - | - |
- alert | No | number | No | - | - |
- remove | No | boolean | No | - | Remove this metric for this subcluster |
2.1.13.1.1. Property HPC cluster description > metricConfig > metricConfig items > subClusters > subClusters items > name
Type | string |
Required | Yes |
Description: Hardware partition name
2.1.13.1.2. Property HPC cluster description > metricConfig > metricConfig items > subClusters > subClusters items > footprint
Type | enum (of string) |
Required | No |
Description: Is it a footprint metric and what type. Overrides the global setting
Must be one of:
- “avg”
- “max”
- “min”
2.1.13.1.3. Property HPC cluster description > metricConfig > metricConfig items > subClusters > subClusters items > energy
Type | enum (of string) |
Required | No |
Description: Is it used to calculate job energy. Overrides the global setting
Must be one of:
- “power”
- “energy”
2.1.13.1.4. Property HPC cluster description > metricConfig > metricConfig items > subClusters > subClusters items > lowerIsBetter
Type | boolean |
Required | No |
Description: Is lower better. Overrides the global setting
2.1.13.1.5. Property HPC cluster description > metricConfig > metricConfig items > subClusters > subClusters items > peak
Type | number |
Required | No |
2.1.13.1.6. Property HPC cluster description > metricConfig > metricConfig items > subClusters > subClusters items > normal
Type | number |
Required | No |
2.1.13.1.7. Property HPC cluster description > metricConfig > metricConfig items > subClusters > subClusters items > caution
Type | number |
Required | No |
2.1.13.1.8. Property HPC cluster description > metricConfig > metricConfig items > subClusters > subClusters items > alert
Type | number |
Required | No |
2.1.13.1.9. Property HPC cluster description > metricConfig > metricConfig items > subClusters > subClusters items > remove
Type | boolean |
Required | No |
Description: Remove this metric for this subcluster
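The per-subcluster entries described above override or remove a metric's global settings. The following Python sketch shows one plausible reading of that behavior. It is an assumption for illustration, not the cc-backend implementation: the `effective_metric` helper, the partition names, and all values are hypothetical, and the shape of the `unit` object is assumed because it is defined in `unit.schema.json` rather than in this schema.

```python
# Illustrative metricConfig item; key names follow the tables above.
metric = {
    "name": "mem_bw",
    # Assumed shape of a unit object; it is defined in unit.schema.json, not here.
    "unit": {"base": "B/s", "prefix": "G"},
    "scope": "socket",
    "timestep": 60,
    "aggregation": "sum",
    "footprint": "avg",
    "peak": 350,
    "normal": 100,
    "caution": 50,
    "alert": 10,
    "subClusters": [
        {"name": "partition_a", "peak": 200, "normal": 60},
        {"name": "partition_b", "remove": True},
    ],
}

# Keys that a subClusters item may set according to the table above.
OVERRIDABLE = ("footprint", "energy", "lowerIsBetter",
               "peak", "normal", "caution", "alert")


def effective_metric(metric, subcluster):
    """Merge global settings with the overrides of one subcluster.

    Returns None if the metric is marked as removed for that subcluster.
    """
    merged = {k: v for k, v in metric.items() if k != "subClusters"}
    for sc in metric.get("subClusters", []):
        if sc["name"] != subcluster:
            continue
        if sc.get("remove", False):
            return None
        for key in OVERRIDABLE:
            if key in sc:
                merged[key] = sc[key]
    return merged


print(effective_metric(metric, "partition_a"))
print(effective_metric(metric, "partition_b"))
```

In this sketch a subcluster without an entry simply inherits the global thresholds, which matches the fact that all override properties are optional.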
3. Property HPC cluster description > subClusters
Type | array of object |
Required | Yes |
Description: Array of cluster hardware partitions
Array restrictions | |
---|---|
Min items | 1 |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
subClusters items | - |
3.1. HPC cluster description > subClusters > subClusters items
Type | object |
Required | No |
Additional properties | Any type allowed |
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ name | No | string | No | - | Hardware partition name |
+ processorType | No | string | No | - | Processor type |
+ socketsPerNode | No | integer | No | - | Number of sockets per node |
+ coresPerSocket | No | integer | No | - | Number of cores per socket |
+ threadsPerCore | No | integer | No | - | Number of SMT threads per core |
+ flopRateScalar | No | object | No | - | Theoretical node peak flop rate for scalar code in GFlops/s |
+ flopRateSimd | No | object | No | - | Theoretical node peak flop rate for SIMD code in GFlops/s |
+ memoryBandwidth | No | object | No | - | Theoretical node peak memory bandwidth in GB/s |
+ nodes | No | string | No | - | Node list expression |
+ topology | No | object | No | - | Node topology |
3.1.1. Property HPC cluster description > subClusters > subClusters items > name
Type | string |
Required | Yes |
Description: Hardware partition name
3.1.2. Property HPC cluster description > subClusters > subClusters items > processorType
Type | string |
Required | Yes |
Description: Processor type
3.1.3. Property HPC cluster description > subClusters > subClusters items > socketsPerNode
Type | integer |
Required | Yes |
Description: Number of sockets per node
3.1.4. Property HPC cluster description > subClusters > subClusters items > coresPerSocket
Type | integer |
Required | Yes |
Description: Number of cores per socket
3.1.5. Property HPC cluster description > subClusters > subClusters items > threadsPerCore
Type | integer |
Required | Yes |
Description: Number of SMT threads per core
3.1.6. Property HPC cluster description > subClusters > subClusters items > flopRateScalar
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Description: Theoretical node peak flop rate for scalar code in GFlops/s
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
- unit | No | object | No | In embedfs://unit.schema.json | Metric unit |
- value | No | number | No | - | - |
3.1.6.1. Property HPC cluster description > subClusters > subClusters items > flopRateScalar > unit
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://unit.schema.json |
Description: Metric unit
3.1.6.2. Property HPC cluster description > subClusters > subClusters items > flopRateScalar > value
Type | number |
Required | No |
3.1.7. Property HPC cluster description > subClusters > subClusters items > flopRateSimd
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Description: Theoretical node peak flop rate for SIMD code in GFlops/s
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
- unit | No | object | No | In embedfs://unit.schema.json | Metric unit |
- value | No | number | No | - | - |
3.1.7.1. Property HPC cluster description > subClusters > subClusters items > flopRateSimd > unit
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://unit.schema.json |
Description: Metric unit
3.1.7.2. Property HPC cluster description > subClusters > subClusters items > flopRateSimd > value
Type | number |
Required | No |
3.1.8. Property HPC cluster description > subClusters > subClusters items > memoryBandwidth
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Description: Theoretical node peak memory bandwidth in GB/s
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
- unit | No | object | No | In embedfs://unit.schema.json | Metric unit |
- value | No | number | No | - | - |
3.1.8.1. Property HPC cluster description > subClusters > subClusters items > memoryBandwidth > unit
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://unit.schema.json |
Description: Metric unit
3.1.8.2. Property HPC cluster description > subClusters > subClusters items > memoryBandwidth > value
Type | number |
Required | No |
3.1.9. Property HPC cluster description > subClusters > subClusters items > nodes
Type | string |
Required | Yes |
Description: Node list expression
3.1.10. Property HPC cluster description > subClusters > subClusters items > topology
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Description: Node topology
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ node | No | array of integer | No | - | HwThread list of the node |
+ socket | No | array of array | No | - | HwThread lists of sockets |
+ memoryDomain | No | array of array | No | - | HwThread lists of memory domains |
- die | No | array of array | No | - | HwThread lists of dies |
- core | No | array of array | No | - | HwThread lists of cores |
- accelerators | No | array of object | No | - | List of accelerator devices |
3.1.10.1. Property HPC cluster description > subClusters > subClusters items > topology > node
Type | array of integer |
Required | Yes |
Description: HwThread list of the node
Array restrictions | |
---|---|
Min items | N/A |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
node items | - |
3.1.10.1.1. HPC cluster description > subClusters > subClusters items > topology > node > node items
Type | integer |
Required | No |
3.1.10.2. Property HPC cluster description > subClusters > subClusters items > topology > socket
Type | array of array |
Required | Yes |
Description: HwThread lists of sockets
Array restrictions | |
---|---|
Min items | N/A |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
socket items | - |
3.1.10.2.1. HPC cluster description > subClusters > subClusters items > topology > socket > socket items
Type | array of integer |
Required | No |
Array restrictions | |
---|---|
Min items | N/A |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
socket items items | - |
3.1.10.2.1.1. HPC cluster description > subClusters > subClusters items > topology > socket > socket items > socket items items
Type | integer |
Required | No |
3.1.10.3. Property HPC cluster description > subClusters > subClusters items > topology > memoryDomain
Type | array of array |
Required | Yes |
Description: HwThread lists of memory domains
Array restrictions | |
---|---|
Min items | N/A |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
memoryDomain items | - |
3.1.10.3.1. HPC cluster description > subClusters > subClusters items > topology > memoryDomain > memoryDomain items
Type | array of integer |
Required | No |
Array restrictions | |
---|---|
Min items | N/A |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
memoryDomain items items | - |
3.1.10.3.1.1. HPC cluster description > subClusters > subClusters items > topology > memoryDomain > memoryDomain items > memoryDomain items items
Type | integer |
Required | No |
3.1.10.4. Property HPC cluster description > subClusters > subClusters items > topology > die
Type | array of array |
Required | No |
Description: HwThread lists of dies
Array restrictions | |
---|---|
Min items | N/A |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
die items | - |
3.1.10.4.1. HPC cluster description > subClusters > subClusters items > topology > die > die items
Type | array of integer |
Required | No |
Array restrictions | |
---|---|
Min items | N/A |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
die items items | - |
3.1.10.4.1.1. HPC cluster description > subClusters > subClusters items > topology > die > die items > die items items
Type | integer |
Required | No |
3.1.10.5. Property HPC cluster description > subClusters > subClusters items > topology > core
Type | array of array |
Required | No |
Description: HwThread lists of cores
Array restrictions | |
---|---|
Min items | N/A |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
core items | - |
3.1.10.5.1. HPC cluster description > subClusters > subClusters items > topology > core > core items
Type | array of integer |
Required | No |
Array restrictions | |
---|---|
Min items | N/A |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
core items items | - |
3.1.10.5.1.1. HPC cluster description > subClusters > subClusters items > topology > core > core items > core items items
Type | integer |
Required | No |
3.1.10.6. Property HPC cluster description > subClusters > subClusters items > topology > accelerators
Type | array of object |
Required | No |
Description: List of accelerator devices
Array restrictions | |
---|---|
Min items | N/A |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
accelerators items | - |
3.1.10.6.1. HPC cluster description > subClusters > subClusters items > topology > accelerators > accelerators items
Type | object |
Required | No |
Additional properties | Any type allowed |
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ id | No | string | No | - | The unique device id |
+ type | No | enum (of string) | No | - | The accelerator type |
+ model | No | string | No | - | The accelerator model |
3.1.10.6.1.1. Property HPC cluster description > subClusters > subClusters items > topology > accelerators > accelerators items > id
Type | string |
Required | Yes |
Description: The unique device id
3.1.10.6.1.2. Property HPC cluster description > subClusters > subClusters items > topology > accelerators > accelerators items > type
Type | enum (of string) |
Required | Yes |
Description: The accelerator type
Must be one of:
- “Nvidia GPU”
- “AMD GPU”
- “Intel GPU”
3.1.10.6.1.3. Property HPC cluster description > subClusters > subClusters items > topology > accelerators > accelerators items > model
Type | string |
Required | Yes |
Description: The accelerator model
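The accelerator item rules above (three required string properties, with `type` restricted to an enumeration) can be sketched as a small check. This is an illustration under stated assumptions, not the validation code used by cc-backend: the helper `check_accelerator` and the sample device ID and model are hypothetical.

```python
# Allowed values for "type", as listed in the schema documentation above.
ACCELERATOR_TYPES = {"Nvidia GPU", "AMD GPU", "Intel GPU"}
REQUIRED_KEYS = ("id", "type", "model")


def check_accelerator(entry):
    """Return a list of problems found in one accelerators item."""
    if not isinstance(entry, dict):
        return ["entry is not an object"]
    problems = []
    for key in REQUIRED_KEYS:
        if key not in entry:
            problems.append(f"missing required property: {key}")
        elif not isinstance(entry[key], str):
            problems.append(f"property {key} must be a string")
    acc_type = entry.get("type")
    if isinstance(acc_type, str) and acc_type not in ACCELERATOR_TYPES:
        problems.append(f"unknown accelerator type: {acc_type}")
    return problems


# Hypothetical entry; the id and model values are made up for illustration.
sample = {"id": "00000000:49:00.0", "type": "Nvidia GPU", "model": "A100"}
print(check_accelerator(sample))
```

Because additional properties are allowed on accelerator items, the sketch ignores any keys beyond the three documented ones.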
Generated using json-schema-for-humans on 2024-12-04 at 16:45:59 +0100
7.1.7.3 - Job Data Schema
The following schema in its raw form can be found in the ClusterCockpit GitHub repository.
Manual Updates
Changes to the original JSON schema found in the repository are not automatically rendered in this reference documentation.
Last Update: 04.12.2024
Job metric data list
- 1. Property Job metric data list > mem_used
- 2. Property Job metric data list > flops_any
- 3. Property Job metric data list > mem_bw
- 4. Property Job metric data list > net_bw
- 5. Property Job metric data list > ipc
- 6. Property Job metric data list > cpu_user
- 7. Property Job metric data list > cpu_load
- 8. Property Job metric data list > flops_dp
- 9. Property Job metric data list > flops_sp
- 10. Property Job metric data list > vectorization_ratio
- 10.1. Property Job metric data list > vectorization_ratio > node
- 10.2. Property Job metric data list > vectorization_ratio > socket
- 10.3. Property Job metric data list > vectorization_ratio > memoryDomain
- 10.4. Property Job metric data list > vectorization_ratio > core
- 10.5. Property Job metric data list > vectorization_ratio > hwthread
- 11. Property Job metric data list > cpu_power
- 12. Property Job metric data list > mem_power
- 13. Property Job metric data list > acc_utilization
- 14. Property Job metric data list > acc_mem_used
- 15. Property Job metric data list > acc_power
- 16. Property Job metric data list > clock
- 17. Property Job metric data list > eth_read_bw
- 18. Property Job metric data list > eth_write_bw
- 19. Property Job metric data list > filesystems
- 19.1. Job metric data list > filesystems > filesystems items
- 19.1.1. Property Job metric data list > filesystems > filesystems items > name
- 19.1.2. Property Job metric data list > filesystems > filesystems items > type
- 19.1.3. Property Job metric data list > filesystems > filesystems items > read_bw
- 19.1.4. Property Job metric data list > filesystems > filesystems items > write_bw
- 19.1.5. Property Job metric data list > filesystems > filesystems items > read_req
- 19.1.6. Property Job metric data list > filesystems > filesystems items > write_req
- 19.1.7. Property Job metric data list > filesystems > filesystems items > inodes
- 19.1.8. Property Job metric data list > filesystems > filesystems items > accesses
- 19.1.9. Property Job metric data list > filesystems > filesystems items > fsync
- 19.1.10. Property Job metric data list > filesystems > filesystems items > create
- 19.1.11. Property Job metric data list > filesystems > filesystems items > open
- 19.1.12. Property Job metric data list > filesystems > filesystems items > close
- 19.1.13. Property Job metric data list > filesystems > filesystems items > seek
Title: Job metric data list
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: Collection of metric data of an HPC job
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ mem_used | No | object | No | - | Memory capacity used |
+ flops_any | No | object | No | - | Total flop rate with DP flops scaled up |
+ mem_bw | No | object | No | - | Main memory bandwidth |
+ net_bw | No | object | No | - | Total fast interconnect network bandwidth |
- ipc | No | object | No | - | Instructions executed per cycle |
+ cpu_user | No | object | No | - | CPU user active core utilization |
+ cpu_load | No | object | No | - | CPU requested core utilization (load 1m) |
- flops_dp | No | object | No | - | Double precision flop rate |
- flops_sp | No | object | No | - | Single precision flop rate |
- vectorization_ratio | No | object | No | - | Fraction of arithmetic instructions using SIMD instructions |
- cpu_power | No | object | No | - | CPU power consumption |
- mem_power | No | object | No | - | Memory power consumption |
- acc_utilization | No | object | No | - | GPU utilization |
- acc_mem_used | No | object | No | - | GPU memory capacity used |
- acc_power | No | object | No | - | GPU power consumption |
- clock | No | object | No | - | Average core frequency |
- eth_read_bw | No | object | No | - | Ethernet read bandwidth |
- eth_write_bw | No | object | No | - | Ethernet write bandwidth |
+ filesystems | No | array of object | No | - | Array of filesystems |
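In the table above, rows prefixed with `+` appear to mark required properties, which matches the "Required | Yes" entries in the per-property sections that follow. A minimal sketch of a presence check for those top-level metrics is shown below; the function name `missing_required_metrics` and the sample data are hypothetical.

```python
# Top-level properties marked with "+" in the table above.
REQUIRED_METRICS = (
    "mem_used",
    "flops_any",
    "mem_bw",
    "net_bw",
    "cpu_user",
    "cpu_load",
    "filesystems",
)


def missing_required_metrics(job_data):
    """Return the required top-level metric names absent from job_data."""
    return [name for name in REQUIRED_METRICS if name not in job_data]


# Hypothetical, incomplete job data object with empty placeholder values.
sample = {"mem_used": {}, "flops_any": {}, "mem_bw": {}, "filesystems": []}
print(missing_required_metrics(sample))
```

The sketch checks only that the keys exist. It does not inspect the nested scope objects, whose definitions could not be loaded when this documentation was generated.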
1. Property Job metric data list > mem_used
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Description: Memory capacity used
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ node | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
1.1. Property Job metric data list > mem_used > node
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
2. Property Job metric data list > flops_any
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Description: Total flop rate with DP flops scaled up
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
- node | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
- socket | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
- memoryDomain | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
- core | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
- hwthread | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
2.1. Property Job metric data list > flops_any > node
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
2.2. Property Job metric data list > flops_any > socket
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
2.3. Property Job metric data list > flops_any > memoryDomain
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
2.4. Property Job metric data list > flops_any > core
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
2.5. Property Job metric data list > flops_any > hwthread
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
3. Property Job metric data list > mem_bw
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Description: Main memory bandwidth
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
- node | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
- socket | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
- memoryDomain | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
3.1. Property Job metric data list > mem_bw > node
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
3.2. Property Job metric data list > mem_bw > socket
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
3.3. Property Job metric data list > mem_bw > memoryDomain
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
4. Property Job metric data list > net_bw
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Description: Total fast interconnect network bandwidth
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ node | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
4.1. Property Job metric data list > net_bw > node
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
5. Property Job metric data list > ipc
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: Instructions executed per cycle
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
- node | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
- socket | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
- memoryDomain | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
- core | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
- hwthread | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
5.1. Property Job metric data list > ipc > node
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
5.2. Property Job metric data list > ipc > socket
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
5.3. Property Job metric data list > ipc > memoryDomain
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
5.4. Property Job metric data list > ipc > core
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
5.5. Property Job metric data list > ipc > hwthread
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
6. Property Job metric data list > cpu_user
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Description: CPU user active core utilization
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
- node | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
- socket | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
- memoryDomain | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
- core | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
- hwthread | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
6.1. Property Job metric data list > cpu_user > node
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
6.2. Property Job metric data list > cpu_user > socket
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
6.3. Property Job metric data list > cpu_user > memoryDomain
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
6.4. Property Job metric data list > cpu_user > core
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
6.5. Property Job metric data list > cpu_user > hwthread
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
7. Property Job metric data list > cpu_load
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Description: CPU requested core utilization (load 1m)
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ node | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
7.1. Property Job metric data list > cpu_load > node
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
8. Property Job metric data list > flops_dp
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: Double precision flop rate
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
- node | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
- socket | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
- memoryDomain | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
- core | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
- hwthread | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
8.1. Property Job metric data list > flops_dp > node
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
8.2. Property Job metric data list > flops_dp > socket
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
8.3. Property Job metric data list > flops_dp > memoryDomain
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
8.4. Property Job metric data list > flops_dp > core
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
8.5. Property Job metric data list > flops_dp > hwthread
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
9. Property Job metric data list > flops_sp
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: Single precision flop rate
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
- node | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
- socket | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
- memoryDomain | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
- core | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
- hwthread | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
9.1. Property Job metric data list > flops_sp > node
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
9.2. Property Job metric data list > flops_sp > socket
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
9.3. Property Job metric data list > flops_sp > memoryDomain
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
9.4. Property Job metric data list > flops_sp > core
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
9.5. Property Job metric data list > flops_sp > hwthread
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
10. Property Job metric data list > vectorization_ratio
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: Fraction of arithmetic instructions using SIMD instructions
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
- node | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
- socket | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
- memoryDomain | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
- core | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
- hwthread | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
10.1. Property Job metric data list > vectorization_ratio > node
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
10.2. Property Job metric data list > vectorization_ratio > socket
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
10.3. Property Job metric data list > vectorization_ratio > memoryDomain
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
10.4. Property Job metric data list > vectorization_ratio > core
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
10.5. Property Job metric data list > vectorization_ratio > hwthread
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
11. Property Job metric data list > cpu_power
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: CPU power consumption
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
- node | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
- socket | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
11.1. Property Job metric data list > cpu_power > node
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
11.2. Property Job metric data list > cpu_power > socket
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
12. Property Job metric data list > mem_power
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: Memory power consumption
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
- node | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
- socket | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
12.1. Property Job metric data list > mem_power > node
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
12.2. Property Job metric data list > mem_power > socket
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
13. Property Job metric data list > acc_utilization
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: GPU utilization
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ accelerator | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
13.1. Property Job metric data list > acc_utilization > accelerator
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
14. Property Job metric data list > acc_mem_used
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: GPU memory capacity used
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ accelerator | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
14.1. Property Job metric data list > acc_mem_used > accelerator
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
15. Property Job metric data list > acc_power
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: GPU power consumption
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ accelerator | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
15.1. Property Job metric data list > acc_power > accelerator
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
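The three accelerator metrics above (`acc_utilization`, `acc_mem_used`, `acc_power`) are optional at the top level, but each lists `accelerator` as a required property once the metric is present. A minimal sketch of that conditional rule, using a hypothetical helper `accelerator_scope_problems`:

```python
ACCELERATOR_METRICS = ("acc_utilization", "acc_mem_used", "acc_power")


def accelerator_scope_problems(job_data):
    """Return names of present accelerator metrics lacking an 'accelerator' scope."""
    problems = []
    for name in ACCELERATOR_METRICS:
        if name not in job_data:
            continue  # the metric itself is optional
        metric = job_data[name]
        if not isinstance(metric, dict) or "accelerator" not in metric:
            problems.append(name)
    return problems


# Hypothetical job data: acc_power is present but has no accelerator scope.
sample = {"acc_utilization": {"accelerator": {}}, "acc_power": {"node": {}}}
print(accelerator_scope_problems(sample))
```

The contents of the `accelerator` object are not checked, because the referenced definition is not documented on this page.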
16. Property Job metric data list > clock
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: Average core frequency
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
- node | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
- socket | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
- memoryDomain | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
- core | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
- hwthread | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
16.1. Property Job metric data list > clock > node
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
16.2. Property Job metric data list > clock > socket
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
16.3. Property Job metric data list > clock > memoryDomain
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
16.4. Property Job metric data list > clock > core
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
16.5. Property Job metric data list > clock > hwthread
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
17. Property Job metric data list > eth_read_bw
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: Ethernet read bandwidth
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ node | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
17.1. Property Job metric data list > eth_read_bw > node
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
18. Property Job metric data list > eth_write_bw
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: Ethernet write bandwidth
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ node | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
18.1. Property Job metric data list > eth_write_bw > node
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
19. Property Job metric data list > filesystems
Type | array of object |
Required | Yes |
Description: Array of filesystems
Array restrictions | |
---|---|
Min items | 1 |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
filesystems items | - |
19.1. Job metric data list > filesystems > filesystems items
Type | object |
Required | No |
Additional properties | Any type allowed |
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ name | No | string | No | - | - |
+ type | No | enum (of string) | No | - | - |
+ read_bw | No | object | No | - | File system read bandwidth |
+ write_bw | No | object | No | - | File system write bandwidth |
- read_req | No | object | No | - | File system read requests |
- write_req | No | object | No | - | File system write requests |
- inodes | No | object | No | - | File system inodes |
- accesses | No | object | No | - | File system open and close |
- fsync | No | object | No | - | File system fsync |
- create | No | object | No | - | File system create |
- open | No | object | No | - | File system open |
- close | No | object | No | - | File system close |
- seek | No | object | No | - | File system seek |
19.1.1. Property Job metric data list > filesystems > filesystems items > name
Type | string |
Required | Yes |
19.1.2. Property Job metric data list > filesystems > filesystems items > type
Type | enum (of string) |
Required | Yes |
Must be one of:
- “nfs”
- “lustre”
- “gpfs”
- “nvme”
- “ssd”
- “hdd”
- “beegfs”
19.1.3. Property Job metric data list > filesystems > filesystems items > read_bw
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Description: File system read bandwidth
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ node | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
19.1.3.1. Property Job metric data list > filesystems > filesystems items > read_bw > node
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
19.1.4. Property Job metric data list > filesystems > filesystems items > write_bw
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Description: File system write bandwidth
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ node | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
19.1.4.1. Property Job metric data list > filesystems > filesystems items > write_bw > node
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
19.1.5. Property Job metric data list > filesystems > filesystems items > read_req
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: File system read requests
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ node | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
19.1.5.1. Property Job metric data list > filesystems > filesystems items > read_req > node
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
19.1.6. Property Job metric data list > filesystems > filesystems items > write_req
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: File system write requests
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ node | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
19.1.6.1. Property Job metric data list > filesystems > filesystems items > write_req > node
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
19.1.7. Property Job metric data list > filesystems > filesystems items > inodes
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: File system inodes
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ node | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
19.1.7.1. Property Job metric data list > filesystems > filesystems items > inodes > node
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
19.1.8. Property Job metric data list > filesystems > filesystems items > accesses
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: File system open and close
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ node | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
19.1.8.1. Property Job metric data list > filesystems > filesystems items > accesses > node
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
19.1.9. Property Job metric data list > filesystems > filesystems items > fsync
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: File system fsync
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ node | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
19.1.9.1. Property Job metric data list > filesystems > filesystems items > fsync > node
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
19.1.10. Property Job metric data list > filesystems > filesystems items > create
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: File system create
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ node | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
19.1.10.1. Property Job metric data list > filesystems > filesystems items > create > node
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
19.1.11. Property Job metric data list > filesystems > filesystems items > open
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: File system open
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ node | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
19.1.11.1. Property Job metric data list > filesystems > filesystems items > open > node
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
19.1.12. Property Job metric data list > filesystems > filesystems items > close
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: File system close
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ node | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
19.1.12.1. Property Job metric data list > filesystems > filesystems items > close > node
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
19.1.13. Property Job metric data list > filesystems > filesystems items > seek
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: File system seek
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ node | No | object | No | In embedfs://job-metric-data.schema.json | 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️ |
19.1.13.1. Property Job metric data list > filesystems > filesystems items > seek > node
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-data.schema.json |
Description: 😅 ERROR in schema generation, a referenced schema could not be loaded, no documentation here unfortunately 🏜️
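As a rough illustration of the constraints listed for a `filesystems` item (the four required keys and the allowed `type` values), the following sketch checks a single entry. The helper name `check_filesystem_item` and the sample values are hypothetical and not part of ClusterCockpit. Because the referenced `node` schema could not be rendered above, the contents of the `node` objects are left empty here.

```python
# Hypothetical helper, not part of ClusterCockpit: checks one entry of the
# "filesystems" array against the required keys and the "type" enum above.
ALLOWED_FS_TYPES = {"nfs", "lustre", "gpfs", "nvme", "ssd", "hdd", "beegfs"}
REQUIRED_FS_KEYS = ("name", "type", "read_bw", "write_bw")


def check_filesystem_item(item: dict) -> list:
    """Return a list of problems; an empty list means no problem was found."""
    problems = [f"missing required key: {key}" for key in REQUIRED_FS_KEYS if key not in item]
    if "name" in item and not isinstance(item["name"], str):
        problems.append("name must be a string")
    if "type" in item:
        fs_type = item["type"]
        if not isinstance(fs_type, str) or fs_type not in ALLOWED_FS_TYPES:
            problems.append(f"unsupported type: {fs_type!r}")
    for key in ("read_bw", "write_bw"):
        if key in item and not isinstance(item[key], dict):
            problems.append(f"{key} must be an object")
    return problems


# Sample entry with made-up values; the inner "node" objects are placeholders.
example = {
    "name": "scratch",
    "type": "lustre",
    "read_bw": {"node": {}},
    "write_bw": {"node": {}},
}
print(check_filesystem_item(example))
```

A full validation would normally go through a JSON Schema validator with the raw schema files; this sketch only mirrors the top-level rules shown in the tables.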
Generated using json-schema-for-humans on 2024-12-04 at 16:45:59 +0100
7.1.7.4 - Job Statistics Schema
The following schema in its raw form can be found in the ClusterCockpit GitHub repository.
Manual Updates
Changes to the original JSON schema found in the repository are not automatically rendered in this reference documentation.
Last Update: 04.12.2024
Job statistics
- 1. Property Job statistics > unit
- 2. Property Job statistics > avg
- 3. Property Job statistics > min
- 4. Property Job statistics > max
Title: Job statistics
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: Format specification for job metric statistics
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ unit | No | object | No | In embedfs://unit.schema.json | Metric unit |
+ avg | No | number | No | - | Job metric average |
+ min | No | number | No | - | Job metric minimum |
+ max | No | number | No | - | Job metric maximum |
1. Property Job statistics > unit
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Defined in | embedfs://unit.schema.json |
Description: Metric unit
2. Property Job statistics > avg
Type | number |
Required | Yes |
Description: Job metric average
Restrictions | |
---|---|
Minimum | ≥ 0 |
3. Property Job statistics > min
Type | number |
Required | Yes |
Description: Job metric minimum
Restrictions | |
---|---|
Minimum | ≥ 0 |
4. Property Job statistics > max
Type | number |
Required | Yes |
Description: Job metric maximum
Restrictions | |
---|---|
Minimum | ≥ 0 |
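The rules above (a required `unit` object plus required non-negative numbers `avg`, `min` and `max`) can be sketched as a small check. The function name `check_job_statistics` and the sample values are assumptions made for illustration only.

```python
from numbers import Real


def check_job_statistics(stats: dict) -> list:
    """Hypothetical check of a job statistics object; returns found problems."""
    problems = []
    if not isinstance(stats.get("unit"), dict):
        problems.append("unit must be an object")
    for key in ("avg", "min", "max"):
        value = stats.get(key)
        # bool is a subclass of int in Python, so it is excluded explicitly.
        if isinstance(value, bool) or not isinstance(value, Real):
            problems.append(f"{key} must be a number")
        elif value < 0:
            problems.append(f"{key} must be >= 0")
    return problems


# Made-up sample values.
sample = {"unit": {"base": "F/s", "prefix": "G"}, "avg": 12.5, "min": 0, "max": 40.0}
print(check_job_statistics(sample))
```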
7.1.7.5 - Unit Schema
The following schema in its raw form can be found in the ClusterCockpit GitHub repository.
Manual Updates
Changes to the original JSON schema found in the repository are not automatically rendered in this reference documentation.
Last Update: 04.12.2024
Metric unit
Title: Metric unit
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: Format specification for job metric units
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ base | No | enum (of string) | No | - | Metric base unit |
- prefix | No | enum (of string) | No | - | Unit prefix |
1. Property Metric unit > base
Type | enum (of string) |
Required | Yes |
Description: Metric base unit
Must be one of:
- “B”
- “F”
- “B/s”
- “F/s”
- “CPI”
- “IPC”
- “Hz”
- “W”
- “°C”
- ""
2. Property Metric unit > prefix
Type | enum (of string) |
Required | No |
Description: Unit prefix
Must be one of:
- “K”
- “M”
- “G”
- “T”
- “P”
- “E”
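To illustrate how the two enumerations combine, the sketch below builds a display label from a unit object, treating `prefix` as optional and `base` as required. The function `unit_label` is a hypothetical helper; how ClusterCockpit itself renders units is not described in this reference.

```python
# Enumerations copied from the Unit Schema above.
BASE_UNITS = {"B", "F", "B/s", "F/s", "CPI", "IPC", "Hz", "W", "°C", ""}
PREFIXES = {"K", "M", "G", "T", "P", "E"}


def unit_label(unit: dict) -> str:
    """Hypothetical helper: return prefix + base, e.g. 'GB/s'."""
    base = unit.get("base")
    if not isinstance(base, str) or base not in BASE_UNITS:
        raise ValueError(f"unsupported base unit: {base!r}")
    if "prefix" in unit and unit["prefix"] not in PREFIXES:
        raise ValueError(f"unsupported prefix: {unit['prefix']!r}")
    return f"{unit.get('prefix', '')}{base}"


print(unit_label({"base": "B/s", "prefix": "G"}))
print(unit_label({"base": "Hz"}))
```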
7.1.7.6 - Job Archive Metadata Schema
The following schema in its raw form can be found in the ClusterCockpit GitHub repository.
Manual Updates
Changes to the original JSON schema found in the repository are not automatically rendered in this reference documentation.
Last Update: 04.12.2024
Job meta data
- 1. Property Job meta data > jobId
- 2. Property Job meta data > user
- 3. Property Job meta data > project
- 4. Property Job meta data > cluster
- 5. Property Job meta data > subCluster
- 6. Property Job meta data > partition
- 7. Property Job meta data > arrayJobId
- 8. Property Job meta data > numNodes
- 9. Property Job meta data > numHwthreads
- 10. Property Job meta data > numAcc
- 11. Property Job meta data > exclusive
- 12. Property Job meta data > monitoringStatus
- 13. Property Job meta data > smt
- 14. Property Job meta data > walltime
- 15. Property Job meta data > jobState
- 16. Property Job meta data > startTime
- 17. Property Job meta data > duration
- 18. Property Job meta data > resources
- 18.1. Job meta data > resources > resources items
- 19. Property Job meta data > metaData
- 20. Property Job meta data > tags
- 21. Property Job meta data > statistics
- 21.1. Property Job meta data > statistics > mem_used
- 21.2. Property Job meta data > statistics > cpu_load
- 21.3. Property Job meta data > statistics > flops_any
- 21.4. Property Job meta data > statistics > mem_bw
- 21.5. Property Job meta data > statistics > net_bw
- 21.6. Property Job meta data > statistics > file_bw
- 21.7. Property Job meta data > statistics > ipc
- 21.8. Property Job meta data > statistics > cpu_user
- 21.9. Property Job meta data > statistics > flops_dp
- 21.10. Property Job meta data > statistics > flops_sp
- 21.11. Property Job meta data > statistics > rapl_power
- 21.12. Property Job meta data > statistics > acc_used
- 21.13. Property Job meta data > statistics > acc_mem_used
- 21.14. Property Job meta data > statistics > acc_power
- 21.15. Property Job meta data > statistics > clock
- 21.16. Property Job meta data > statistics > eth_read_bw
- 21.17. Property Job meta data > statistics > eth_write_bw
- 21.18. Property Job meta data > statistics > ic_rcv_packets
- 21.19. Property Job meta data > statistics > ic_send_packets
- 21.20. Property Job meta data > statistics > ic_read_bw
- 21.21. Property Job meta data > statistics > ic_write_bw
- 21.22. Property Job meta data > statistics > filesystems
- 21.22.1. Job meta data > statistics > filesystems > filesystems items
- 21.22.1.1. Property Job meta data > statistics > filesystems > filesystems items > name
- 21.22.1.2. Property Job meta data > statistics > filesystems > filesystems items > type
- 21.22.1.3. Property Job meta data > statistics > filesystems > filesystems items > read_bw
- 21.22.1.4. Property Job meta data > statistics > filesystems > filesystems items > write_bw
- 21.22.1.5. Property Job meta data > statistics > filesystems > filesystems items > read_req
- 21.22.1.6. Property Job meta data > statistics > filesystems > filesystems items > write_req
- 21.22.1.7. Property Job meta data > statistics > filesystems > filesystems items > inodes
- 21.22.1.8. Property Job meta data > statistics > filesystems > filesystems items > accesses
- 21.22.1.9. Property Job meta data > statistics > filesystems > filesystems items > fsync
- 21.22.1.10. Property Job meta data > statistics > filesystems > filesystems items > create
- 21.22.1.11. Property Job meta data > statistics > filesystems > filesystems items > open
- 21.22.1.12. Property Job meta data > statistics > filesystems > filesystems items > close
- 21.22.1.13. Property Job meta data > statistics > filesystems > filesystems items > seek
Title: Job meta data
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: Metadata information of an HPC job
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ jobId | No | integer | No | - | The unique identifier of a job |
+ user | No | string | No | - | The unique identifier of a user |
+ project | No | string | No | - | The unique identifier of a project |
+ cluster | No | string | No | - | The unique identifier of a cluster |
+ subCluster | No | string | No | - | The unique identifier of a sub cluster |
- partition | No | string | No | - | The Slurm partition to which the job was submitted |
- arrayJobId | No | integer | No | - | The unique identifier of an array job |
+ numNodes | No | integer | No | - | Number of nodes used |
- numHwthreads | No | integer | No | - | Number of HWThreads used |
- numAcc | No | integer | No | - | Number of accelerators used |
+ exclusive | No | integer | No | - | Specifies how nodes are shared. 0 - Shared among multiple jobs of multiple users, 1 - Job exclusive, 2 - Shared among multiple jobs of same user |
- monitoringStatus | No | integer | No | - | State of monitoring system during job run |
- smt | No | integer | No | - | SMT threads used by job |
- walltime | No | integer | No | - | Requested walltime of job in seconds |
+ jobState | No | enum (of string) | No | - | Final state of job |
+ startTime | No | integer | No | - | Start epoch time stamp in seconds |
+ duration | No | integer | No | - | Duration of job in seconds |
+ resources | No | array of object | No | - | Resources used by job |
- metaData | No | object | No | - | Additional information about the job |
- tags | No | array of object | No | - | List of tags |
+ statistics | No | object | No | - | Job statistic data |
1. Property Job meta data > jobId
Type | integer |
Required | Yes |
Description: The unique identifier of a job
2. Property Job meta data > user
Type | string |
Required | Yes |
Description: The unique identifier of a user
3. Property Job meta data > project
Type | string |
Required | Yes |
Description: The unique identifier of a project
4. Property Job meta data > cluster
Type | string |
Required | Yes |
Description: The unique identifier of a cluster
5. Property Job meta data > subCluster
Type | string |
Required | Yes |
Description: The unique identifier of a sub cluster
6. Property Job meta data > partition
Type | string |
Required | No |
Description: The Slurm partition to which the job was submitted
7. Property Job meta data > arrayJobId
Type | integer |
Required | No |
Description: The unique identifier of an array job
8. Property Job meta data > numNodes
Type | integer |
Required | Yes |
Description: Number of nodes used
Restrictions | |
---|---|
Minimum | > 0 |
9. Property Job meta data > numHwthreads
Type | integer |
Required | No |
Description: Number of HWThreads used
Restrictions | |
---|---|
Minimum | > 0 |
10. Property Job meta data > numAcc
Type | integer |
Required | No |
Description: Number of accelerators used
Restrictions | |
---|---|
Minimum | > 0 |
11. Property Job meta data > exclusive
Type | integer |
Required | Yes |
Description: Specifies how nodes are shared. 0 - Shared among multiple jobs of multiple users, 1 - Job exclusive, 2 - Shared among multiple jobs of same user
Restrictions | |
---|---|
Minimum | ≥ 0 |
Maximum | ≤ 2 |
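The three `exclusive` values can be given readable names when processing job metadata. The enum below is a minimal sketch; the class and member names are hypothetical, and only the numeric values and their meaning come from the description above.

```python
from enum import IntEnum


class NodeSharing(IntEnum):
    """Hypothetical names for the values of the 'exclusive' property."""
    SHARED_MULTI_USER = 0  # shared among multiple jobs of multiple users
    EXCLUSIVE = 1          # job exclusive
    SHARED_SAME_USER = 2   # shared among multiple jobs of the same user


# Converting a raw value raises ValueError for anything outside 0..2.
mode = NodeSharing(1)
print(mode.name)
```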
12. Property Job meta data > monitoringStatus
Type | integer |
Required | No |
Description: State of monitoring system during job run
13. Property Job meta data > smt
Type | integer |
Required | No |
Description: SMT threads used by job
14. Property Job meta data > walltime
Type | integer |
Required | No |
Description: Requested walltime of job in seconds
Restrictions | |
---|---|
Minimum | > 0 |
15. Property Job meta data > jobState
Type | enum (of string) |
Required | Yes |
Description: Final state of job
Must be one of:
- “completed”
- “failed”
- “cancelled”
- “stopped”
- “out_of_memory”
- “timeout”
16. Property Job meta data > startTime
Type | integer |
Required | Yes |
Description: Start epoch time stamp in seconds
Restrictions | |
---|---|
Minimum | > 0 |
17. Property Job meta data > duration
Type | integer |
Required | Yes |
Description: Duration of job in seconds
Restrictions | |
---|---|
Minimum | > 0 |
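Since `startTime` is an epoch time stamp in seconds and `duration` is given in seconds, the time window of a job can be derived from the two fields. The sketch below assumes UTC for display; `job_time_window` is a hypothetical helper and the sample values are made up.

```python
from datetime import datetime, timedelta, timezone


def job_time_window(meta: dict):
    """Hypothetical helper: return (start, end) of a job as aware datetimes."""
    start_time, duration = meta["startTime"], meta["duration"]
    # The schema requires both values to be integers greater than zero.
    if start_time <= 0 or duration <= 0:
        raise ValueError("startTime and duration must be > 0")
    start = datetime.fromtimestamp(start_time, tz=timezone.utc)
    return start, start + timedelta(seconds=duration)


start, end = job_time_window({"startTime": 1700000000, "duration": 3600})
print(start.isoformat(), end.isoformat())
```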
18. Property Job meta data > resources
Type | array of object |
Required | Yes |
Description: Resources used by job
Array restrictions | |
---|---|
Min items | N/A |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
resources items | - |
18.1. Job meta data > resources > resources items
Type | object |
Required | No |
Additional properties | Any type allowed |
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ hostname | No | string | No | - | - |
- hwthreads | No | array of integer | No | - | List of OS processor ids |
- accelerators | No | array of string | No | - | List of accelerator device ids |
- configuration | No | string | No | - | The configuration options of the node |
18.1.1. Property Job meta data > resources > resources items > hostname
Type | string |
Required | Yes |
18.1.2. Property Job meta data > resources > resources items > hwthreads
Type | array of integer |
Required | No |
Description: List of OS processor ids
Array restrictions | |
---|---|
Min items | N/A |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
hwthreads items | - |
18.1.2.1. Job meta data > resources > resources items > hwthreads > hwthreads items
Type | integer |
Required | No |
18.1.3. Property Job meta data > resources > resources items > accelerators
Type | array of string |
Required | No |
Description: List of accelerator device ids
Array restrictions | |
---|---|
Min items | N/A |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
accelerators items | - |
18.1.3.1. Job meta data > resources > resources items > accelerators > accelerators items
Type | string |
Required | No |
18.1.4. Property Job meta data > resources > resources items > configuration
Type | string |
Required | No |
Description: The configuration options of the node
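A `resources` item has one required field (`hostname`) and three optional ones. The following sketch reads such an item into a small data class; the names `Resource` and `parse_resource`, the defaults for absent fields, and the sample host name are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Resource:
    """Hypothetical in-memory form of one 'resources' item."""
    hostname: str
    hwthreads: List[int] = field(default_factory=list)
    accelerators: List[str] = field(default_factory=list)
    configuration: str = ""


def parse_resource(raw: dict) -> Resource:
    # Only hostname is required; the optional fields fall back to empty values.
    if not isinstance(raw.get("hostname"), str):
        raise ValueError("hostname is required and must be a string")
    return Resource(
        hostname=raw["hostname"],
        hwthreads=list(raw.get("hwthreads", [])),
        accelerators=list(raw.get("accelerators", [])),
        configuration=raw.get("configuration", ""),
    )


res = parse_resource({"hostname": "node001", "hwthreads": [0, 1, 2, 3]})
print(res)
```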
19. Property Job meta data > metaData
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: Additional information about the job
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
- jobScript | No | string | No | - | The batch script of the job |
- jobName | No | string | No | - | Slurm Job name |
- slurmInfo | No | string | No | - | Additional Slurm information as shown by scontrol show job |
19.1. Property Job meta data > metaData > jobScript
Type | string |
Required | No |
Description: The batch script of the job
19.2. Property Job meta data > metaData > jobName
Type | string |
Required | No |
Description: Slurm Job name
19.3. Property Job meta data > metaData > slurmInfo
Type | string |
Required | No |
Description: Additional Slurm information as shown by scontrol show job
20. Property Job meta data > tags
Type | array of object |
Required | No |
Description: List of tags
Array restrictions | |
---|---|
Min items | N/A |
Max items | N/A |
Items unicity | True |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
tags items | - |
20.1. Job meta data > tags > tags items
Type | object |
Required | No |
Additional properties | Any type allowed |
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ name | No | string | No | - | - |
+ type | No | string | No | - | - |
20.1.1. Property Job meta data > tags > tags items > name
Type | string |
Required | Yes |
20.1.2. Property Job meta data > tags > tags items > type
Type | string |
Required | Yes |
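The `tags` array requires unique items, and every tag needs a `name` and a `type`. The sketch below reports repeated tags. It compares only the `type` and `name` fields, which is a simplification: a JSON Schema validator compares whole objects, including any additional properties. The helper name and sample tags are hypothetical.

```python
def duplicate_tags(tags: list) -> list:
    """Hypothetical helper: return (type, name) pairs that occur more than once."""
    seen = set()
    duplicates = []
    for tag in tags:
        key = (tag["type"], tag["name"])
        if key in seen:
            if key not in duplicates:
                duplicates.append(key)
        else:
            seen.add(key)
    return duplicates


# Made-up sample tags.
tags = [
    {"type": "app", "name": "gromacs"},
    {"type": "issue", "name": "load-imbalance"},
    {"type": "app", "name": "gromacs"},
]
print(duplicate_tags(tags))
```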
21. Property Job meta data > statistics
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Description: Job statistic data
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ mem_used | No | object | No | In embedfs://job-metric-statistics.schema.json | Memory capacity used (required) |
+ cpu_load | No | object | No | In embedfs://job-metric-statistics.schema.json | CPU requested core utilization (load 1m) (required) |
+ flops_any | No | object | No | In embedfs://job-metric-statistics.schema.json | Total flop rate with DP flops scaled up (required) |
+ mem_bw | No | object | No | In embedfs://job-metric-statistics.schema.json | Main memory bandwidth (required) |
- net_bw | No | object | No | In embedfs://job-metric-statistics.schema.json | Total fast interconnect network bandwidth (required) |
- file_bw | No | object | No | In embedfs://job-metric-statistics.schema.json | Total file IO bandwidth (required) |
- ipc | No | object | No | In embedfs://job-metric-statistics.schema.json | Instructions executed per cycle |
+ cpu_user | No | object | No | In embedfs://job-metric-statistics.schema.json | CPU user active core utilization |
- flops_dp | No | object | No | In embedfs://job-metric-statistics.schema.json | Double precision flop rate |
- flops_sp | No | object | No | In embedfs://job-metric-statistics.schema.json | Single precision flop rate |
- rapl_power | No | object | No | In embedfs://job-metric-statistics.schema.json | CPU power consumption |
- acc_used | No | object | No | In embedfs://job-metric-statistics.schema.json | GPU utilization |
- acc_mem_used | No | object | No | In embedfs://job-metric-statistics.schema.json | GPU memory capacity used |
- acc_power | No | object | No | In embedfs://job-metric-statistics.schema.json | GPU power consumption |
- clock | No | object | No | In embedfs://job-metric-statistics.schema.json | Average core frequency |
- eth_read_bw | No | object | No | In embedfs://job-metric-statistics.schema.json | Ethernet read bandwidth |
- eth_write_bw | No | object | No | In embedfs://job-metric-statistics.schema.json | Ethernet write bandwidth |
- ic_rcv_packets | No | object | No | In embedfs://job-metric-statistics.schema.json | Network interconnect read packets |
- ic_send_packets | No | object | No | In embedfs://job-metric-statistics.schema.json | Network interconnect send packets |
- ic_read_bw | No | object | No | In embedfs://job-metric-statistics.schema.json | Network interconnect read bandwidth |
- ic_write_bw | No | object | No | In embedfs://job-metric-statistics.schema.json | Network interconnect write bandwidth |
- filesystems | No | array of object | No | - | Array of filesystems |
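In the table above, the rows marked with `+` are the required statistics (`mem_used`, `cpu_load`, `flops_any`, `mem_bw`, `cpu_user`). A quick way to spot incomplete `statistics` objects is sketched below; the function name and the sample input are hypothetical.

```python
# Required keys of the "statistics" object, taken from the table above.
REQUIRED_STATISTICS = ("mem_used", "cpu_load", "flops_any", "mem_bw", "cpu_user")


def missing_statistics(statistics: dict) -> list:
    """Hypothetical helper: list required metrics absent from 'statistics'."""
    return [name for name in REQUIRED_STATISTICS if name not in statistics]


# Made-up partial input; the values would be job statistics objects.
partial = {"mem_used": {}, "cpu_load": {}, "mem_bw": {}}
print(missing_statistics(partial))
```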
21.1. Property Job meta data > statistics > mem_used
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-statistics.schema.json |
Description: Memory capacity used (required)
21.2. Property Job meta data > statistics > cpu_load
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-statistics.schema.json |
Description: CPU requested core utilization (load 1m) (required)
21.3. Property Job meta data > statistics > flops_any
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-statistics.schema.json |
Description: Total flop rate with DP flops scaled up (required)
21.4. Property Job meta data > statistics > mem_bw
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-statistics.schema.json |
Description: Main memory bandwidth (required)
21.5. Property Job meta data > statistics > net_bw
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-statistics.schema.json |
Description: Total fast interconnect network bandwidth (required)
21.6. Property Job meta data > statistics > file_bw
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-statistics.schema.json |
Description: Total file IO bandwidth (required)
21.7. Property Job meta data > statistics > ipc
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-statistics.schema.json |
Description: Instructions executed per cycle
21.8. Property Job meta data > statistics > cpu_user
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-statistics.schema.json |
Description: CPU user active core utilization
21.9. Property Job meta data > statistics > flops_dp
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-statistics.schema.json |
Description: Double precision flop rate
21.10. Property Job meta data > statistics > flops_sp
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-statistics.schema.json |
Description: Single precision flops rate
21.11. Property Job meta data > statistics > rapl_power
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-statistics.schema.json |
Description: CPU power consumption
21.12. Property Job meta data > statistics > acc_used
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-statistics.schema.json |
Description: GPU utilization
21.13. Property Job meta data > statistics > acc_mem_used
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-statistics.schema.json |
Description: GPU memory capacity used
21.14. Property Job meta data > statistics > acc_power
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-statistics.schema.json |
Description: GPU power consumption
21.15. Property Job meta data > statistics > clock
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-statistics.schema.json |
Description: Average core frequency
21.16. Property Job meta data > statistics > eth_read_bw
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-statistics.schema.json |
Description: Ethernet read bandwidth
21.17. Property Job meta data > statistics > eth_write_bw
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-statistics.schema.json |
Description: Ethernet write bandwidth
21.18. Property Job meta data > statistics > ic_rcv_packets
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-statistics.schema.json |
Description: Network interconnect read packets
21.19. Property Job meta data > statistics > ic_send_packets
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-statistics.schema.json |
Description: Network interconnect send packet
21.20. Property Job meta data > statistics > ic_read_bw
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-statistics.schema.json |
Description: Network interconnect read bandwidth
21.21. Property Job meta data > statistics > ic_write_bw
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-statistics.schema.json |
Description: Network interconnect write bandwidth
21.22. Property Job meta data > statistics > filesystems
Type | array of object |
Required | No |
Description: Array of filesystems
Array restrictions | |
---|---|
Min items | 1 |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
filesystems items | - |
21.22.1. Job meta data > statistics > filesystems > filesystems items
Type | object |
Required | No |
Additional properties | Any type allowed |
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ name | No | string | No | - | - |
+ type | No | enum (of string) | No | - | - |
+ read_bw | No | object | No | In embedfs://job-metric-statistics.schema.json | File system read bandwidth |
+ write_bw | No | object | No | In embedfs://job-metric-statistics.schema.json | File system write bandwidth |
- read_req | No | object | No | In embedfs://job-metric-statistics.schema.json | File system read requests |
- write_req | No | object | No | In embedfs://job-metric-statistics.schema.json | File system write requests |
- inodes | No | object | No | In embedfs://job-metric-statistics.schema.json | File system write requests |
- accesses | No | object | No | In embedfs://job-metric-statistics.schema.json | File system open and close |
- fsync | No | object | No | In embedfs://job-metric-statistics.schema.json | File system fsync |
- create | No | object | No | In embedfs://job-metric-statistics.schema.json | File system create |
- open | No | object | No | In embedfs://job-metric-statistics.schema.json | File system open |
- close | No | object | No | In embedfs://job-metric-statistics.schema.json | File system close |
- seek | No | object | No | In embedfs://job-metric-statistics.schema.json | File system seek |
21.22.1.1. Property Job meta data > statistics > filesystems > filesystems items > name
Type | string |
Required | Yes |
21.22.1.2. Property Job meta data > statistics > filesystems > filesystems items > type
Type | enum (of string) |
Required | Yes |
Must be one of:
- “nfs”
- “lustre”
- “gpfs”
- “nvme”
- “ssd”
- “hdd”
- “beegfs”
21.22.1.3. Property Job meta data > statistics > filesystems > filesystems items > read_bw
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-statistics.schema.json |
Description: File system read bandwidth
21.22.1.4. Property Job meta data > statistics > filesystems > filesystems items > write_bw
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-statistics.schema.json |
Description: File system write bandwidth
21.22.1.5. Property Job meta data > statistics > filesystems > filesystems items > read_req
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-statistics.schema.json |
Description: File system read requests
21.22.1.6. Property Job meta data > statistics > filesystems > filesystems items > write_req
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-statistics.schema.json |
Description: File system write requests
21.22.1.7. Property Job meta data > statistics > filesystems > filesystems items > inodes
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-statistics.schema.json |
Description: File system write requests
21.22.1.8. Property Job meta data > statistics > filesystems > filesystems items > accesses
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-statistics.schema.json |
Description: File system open and close
21.22.1.9. Property Job meta data > statistics > filesystems > filesystems items > fsync
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-statistics.schema.json |
Description: File system fsync
21.22.1.10. Property Job meta data > statistics > filesystems > filesystems items > create
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-statistics.schema.json |
Description: File system create
21.22.1.11. Property Job meta data > statistics > filesystems > filesystems items > open
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-statistics.schema.json |
Description: File system open
21.22.1.12. Property Job meta data > statistics > filesystems > filesystems items > close
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-statistics.schema.json |
Description: File system close
21.22.1.13. Property Job meta data > statistics > filesystems > filesystems items > seek
Type | object |
Required | No |
Additional properties | Any type allowed |
Defined in | embedfs://job-metric-statistics.schema.json |
Description: File system seek
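The constraints listed above for filesystems items can be sketched as a small check. This is only an illustration: the helper name `check_filesystem_item` and the sample values are hypothetical, and the per-metric statistics objects are treated as opaque because their layout is defined in job-metric-statistics.schema.json.

```python
# Allowed values for the "type" property, copied from the enumeration above.
FS_TYPES = {"nfs", "lustre", "gpfs", "nvme", "ssd", "hdd", "beegfs"}
# Properties marked as required for each filesystems item.
FS_REQUIRED = ("name", "type", "read_bw", "write_bw")


def check_filesystem_item(item: dict) -> list:
    """Return a list of problems found in one filesystems item (hypothetical helper)."""
    problems = [f"missing required property: {key}" for key in FS_REQUIRED if key not in item]
    if "name" in item and not isinstance(item["name"], str):
        problems.append("name must be a string")
    if "type" in item and item["type"] not in FS_TYPES:
        problems.append(f"unknown filesystem type: {item['type']!r}")
    for key in ("read_bw", "write_bw"):
        if key in item and not isinstance(item[key], dict):
            problems.append(f"{key} must be an object")
    return problems


# Sample item with made-up values; the empty objects stand in for statistics objects.
example = {"name": "scratch", "type": "lustre", "read_bw": {}, "write_bw": {}}
print(check_filesystem_item(example))
```

For full validation, a JSON Schema validator run against the raw schema files is the more reliable route; the sketch only mirrors the rules visible in this section.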
Generated using json-schema-for-humans on 2024-12-04 at 16:45:59 +0100
7.1.7.7 - Job Archive Metrics Data Schema
The following schema in its raw form can be found in the ClusterCockpit GitHub repository.
Manual Updates
Changes to the original JSON schema found in the repository are not automatically rendered in this reference documentation.
Last Update: 04.12.2024
Job metric data
- 1. Property Job metric data > unit
- 2. Property Job metric data > timestep
- 3. Property Job metric data > thresholds
- 4. Property Job metric data > statisticsSeries
  - 4.1. Property Job metric data > statisticsSeries > min
  - 4.2. Property Job metric data > statisticsSeries > max
  - 4.3. Property Job metric data > statisticsSeries > mean
  - 4.4. Property Job metric data > statisticsSeries > percentiles
    - 4.4.1. Property Job metric data > statisticsSeries > percentiles > 10
    - 4.4.2. Property Job metric data > statisticsSeries > percentiles > 20
    - 4.4.3. Property Job metric data > statisticsSeries > percentiles > 30
    - 4.4.4. Property Job metric data > statisticsSeries > percentiles > 40
    - 4.4.5. Property Job metric data > statisticsSeries > percentiles > 50
    - 4.4.6. Property Job metric data > statisticsSeries > percentiles > 60
    - 4.4.7. Property Job metric data > statisticsSeries > percentiles > 70
    - 4.4.8. Property Job metric data > statisticsSeries > percentiles > 80
    - 4.4.9. Property Job metric data > statisticsSeries > percentiles > 90
    - 4.4.10. Property Job metric data > statisticsSeries > percentiles > 25
    - 4.4.11. Property Job metric data > statisticsSeries > percentiles > 75
- 5. Property Job metric data > series
  - 5.1. Job metric data > series > series items
Title: Job metric data
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: Metric data of a HPC job
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ unit | No | object | No | In embedfs://unit.schema.json | Metric unit |
+ timestep | No | integer | No | - | Measurement interval in seconds |
- thresholds | No | object | No | - | Metric thresholds for specific system |
- statisticsSeries | No | object | No | - | Statistics series across topology |
+ series | No | array of object | No | - | - |
1. Property Job metric data > unit
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Defined in | embedfs://unit.schema.json |
Description: Metric unit
2. Property Job metric data > timestep
Type | integer |
Required | Yes |
Description: Measurement interval in seconds
3. Property Job metric data > thresholds
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: Metric thresholds for specific system
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
- peak | No | number | No | - | - |
- normal | No | number | No | - | - |
- caution | No | number | No | - | - |
- alert | No | number | No | - | - |
3.1. Property Job metric data > thresholds > peak
Type | number |
Required | No |
3.2. Property Job metric data > thresholds > normal
Type | number |
Required | No |
3.3. Property Job metric data > thresholds > caution
Type | number |
Required | No |
3.4. Property Job metric data > thresholds > alert
Type | number |
Required | No |
4. Property Job metric data > statisticsSeries
Type | object |
Required | No |
Additional properties | Any type allowed |
Description: Statistics series across topology
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
- min | No | array of number | No | - | - |
- max | No | array of number | No | - | - |
- mean | No | array of number | No | - | - |
- percentiles | No | object | No | - | - |
4.1. Property Job metric data > statisticsSeries > min
Type | array of number |
Required | No |
Array restrictions | |
---|---|
Min items | 3 |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
min items | - |
4.1.1. Job metric data > statisticsSeries > min > min items
Type | number |
Required | No |
Restrictions | |
---|---|
Minimum | ≥ 0 |
4.2. Property Job metric data > statisticsSeries > max
Type | array of number |
Required | No |
Array restrictions | |
---|---|
Min items | 3 |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
max items | - |
4.2.1. Job metric data > statisticsSeries > max > max items
Type | number |
Required | No |
Restrictions | |
---|---|
Minimum | ≥ 0 |
4.3. Property Job metric data > statisticsSeries > mean
Type | array of number |
Required | No |
Array restrictions | |
---|---|
Min items | 3 |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
mean items | - |
4.3.1. Job metric data > statisticsSeries > mean > mean items
Type | number |
Required | No |
Restrictions | |
---|---|
Minimum | ≥ 0 |
4.4. Property Job metric data > statisticsSeries > percentiles
Type | object |
Required | No |
Additional properties | Any type allowed |
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
- 10 | No | array of number | No | - | - |
- 20 | No | array of number | No | - | - |
- 30 | No | array of number | No | - | - |
- 40 | No | array of number | No | - | - |
- 50 | No | array of number | No | - | - |
- 60 | No | array of number | No | - | - |
- 70 | No | array of number | No | - | - |
- 80 | No | array of number | No | - | - |
- 90 | No | array of number | No | - | - |
- 25 | No | array of number | No | - | - |
- 75 | No | array of number | No | - | - |
4.4.1. Property Job metric data > statisticsSeries > percentiles > 10
Type | array of number |
Required | No |
Array restrictions | |
---|---|
Min items | 3 |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
10 items | - |
4.4.1.1. Job metric data > statisticsSeries > percentiles > 10 > 10 items
Type | number |
Required | No |
Restrictions | |
---|---|
Minimum | ≥ 0 |
4.4.2. Property Job metric data > statisticsSeries > percentiles > 20
Type | array of number |
Required | No |
Array restrictions | |
---|---|
Min items | 3 |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
20 items | - |
4.4.2.1. Job metric data > statisticsSeries > percentiles > 20 > 20 items
Type | number |
Required | No |
Restrictions | |
---|---|
Minimum | ≥ 0 |
4.4.3. Property Job metric data > statisticsSeries > percentiles > 30
Type | array of number |
Required | No |
Array restrictions | |
---|---|
Min items | 3 |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
30 items | - |
4.4.3.1. Job metric data > statisticsSeries > percentiles > 30 > 30 items
Type | number |
Required | No |
Restrictions | |
---|---|
Minimum | ≥ 0 |
4.4.4. Property Job metric data > statisticsSeries > percentiles > 40
Type | array of number |
Required | No |
Array restrictions | |
---|---|
Min items | 3 |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
40 items | - |
4.4.4.1. Job metric data > statisticsSeries > percentiles > 40 > 40 items
Type | number |
Required | No |
Restrictions | |
---|---|
Minimum | ≥ 0 |
4.4.5. Property Job metric data > statisticsSeries > percentiles > 50
Type | array of number |
Required | No |
Array restrictions | |
---|---|
Min items | 3 |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
50 items | - |
4.4.5.1. Job metric data > statisticsSeries > percentiles > 50 > 50 items
Type | number |
Required | No |
Restrictions | |
---|---|
Minimum | ≥ 0 |
4.4.6. Property Job metric data > statisticsSeries > percentiles > 60
Type | array of number |
Required | No |
Array restrictions | |
---|---|
Min items | 3 |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
60 items | - |
4.4.6.1. Job metric data > statisticsSeries > percentiles > 60 > 60 items
Type | number |
Required | No |
Restrictions | |
---|---|
Minimum | ≥ 0 |
4.4.7. Property Job metric data > statisticsSeries > percentiles > 70
Type | array of number |
Required | No |
Array restrictions | |
---|---|
Min items | 3 |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
70 items | - |
4.4.7.1. Job metric data > statisticsSeries > percentiles > 70 > 70 items
Type | number |
Required | No |
Restrictions | |
---|---|
Minimum | ≥ 0 |
4.4.8. Property Job metric data > statisticsSeries > percentiles > 80
Type | array of number |
Required | No |
Array restrictions | |
---|---|
Min items | 3 |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
80 items | - |
4.4.8.1. Job metric data > statisticsSeries > percentiles > 80 > 80 items
Type | number |
Required | No |
Restrictions | |
---|---|
Minimum | ≥ 0 |
4.4.9. Property Job metric data > statisticsSeries > percentiles > 90
Type | array of number |
Required | No |
Array restrictions | |
---|---|
Min items | 3 |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
90 items | - |
4.4.9.1. Job metric data > statisticsSeries > percentiles > 90 > 90 items
Type | number |
Required | No |
Restrictions | |
---|---|
Minimum | ≥ 0 |
4.4.10. Property Job metric data > statisticsSeries > percentiles > 25
Type | array of number |
Required | No |
Array restrictions | |
---|---|
Min items | 3 |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
25 items | - |
4.4.10.1. Job metric data > statisticsSeries > percentiles > 25 > 25 items
Type | number |
Required | No |
Restrictions | |
---|---|
Minimum | ≥ 0 |
4.4.11. Property Job metric data > statisticsSeries > percentiles > 75
Type | array of number |
Required | No |
Array restrictions | |
---|---|
Min items | 3 |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
75 items | - |
4.4.11.1. Job metric data > statisticsSeries > percentiles > 75 > 75 items
Type | number |
Required | No |
Restrictions | |
---|---|
Minimum | ≥ 0 |
5. Property Job metric data > series
Type | array of object |
Required | Yes |
Array restrictions | |
---|---|
Min items | N/A |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
Each item of this array must be | Description |
---|---|
series items | - |
5.1. Job metric data > series > series items
Type | object |
Required | No |
Additional properties | Any type allowed |
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ hostname | No | string | No | - | - |
- id | No | string | No | - | - |
+ statistics | No | object | No | - | Statistics across time dimension |
+ data | No | array | No | - | - |
5.1.1. Property Job metric data > series > series items > hostname
Type | string |
Required | Yes |
5.1.2. Property Job metric data > series > series items > id
Type | string |
Required | No |
5.1.3. Property Job metric data > series > series items > statistics
Type | object |
Required | Yes |
Additional properties | Any type allowed |
Description: Statistics across time dimension
Property | Pattern | Type | Deprecated | Definition | Title/Description |
---|---|---|---|---|---|
+ avg | No | number | No | - | Series average |
+ min | No | number | No | - | Series minimum |
+ max | No | number | No | - | Series maximum |
5.1.3.1. Property Job metric data > series > series items > statistics > avg
Type | number |
Required | Yes |
Description: Series average
Restrictions | |
---|---|
Minimum | ≥ 0 |
5.1.3.2. Property Job metric data > series > series items > statistics > min
Type | number |
Required | Yes |
Description: Series minimum
Restrictions | |
---|---|
Minimum | ≥ 0 |
5.1.3.3. Property Job metric data > series > series items > statistics > max
Type | number |
Required | Yes |
Description: Series maximum
Restrictions | |
---|---|
Minimum | ≥ 0 |
5.1.4. Property Job metric data > series > series items > data
Type | array |
Required | Yes |
Array restrictions | |
---|---|
Min items | 1 |
Max items | N/A |
Items unicity | False |
Additional items | False |
Tuple validation | See below |
5.1.4.1. At least one of the items must be
Type | number |
Required | No |
Restrictions | |
---|---|
Minimum | ≥ 0 |
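To make the required properties of this schema more concrete, here is a minimal sketch that builds one job metric data document and checks the constraints listed above. The helper `check_job_metric_data`, the host name and all values are hypothetical, and the `unit` object is a placeholder because its layout is defined in unit.schema.json.

```python
def check_job_metric_data(doc: dict) -> list:
    """Check the required properties and array limits described above (sketch only)."""
    problems = [f"missing required property: {k}" for k in ("unit", "timestep", "series") if k not in doc]
    if "timestep" in doc and not isinstance(doc["timestep"], int):
        problems.append("timestep must be an integer")
    for index, item in enumerate(doc.get("series", [])):
        for key in ("hostname", "statistics", "data"):
            if key not in item:
                problems.append(f"series[{index}]: missing required property: {key}")
        stats = item.get("statistics", {})
        for key in ("avg", "min", "max"):
            if key not in stats:
                problems.append(f"series[{index}].statistics: missing {key}")
            elif stats[key] < 0:
                problems.append(f"series[{index}].statistics.{key} must be >= 0")
        if "data" in item and len(item["data"]) < 1:
            problems.append(f"series[{index}].data needs at least one item")
    # The optional statisticsSeries arrays need at least three items each.
    for name in ("min", "max", "mean"):
        values = doc.get("statisticsSeries", {}).get(name)
        if values is not None and len(values) < 3:
            problems.append(f"statisticsSeries.{name} needs at least 3 items")
    return problems


sample = {
    "unit": {"base": "F/s", "prefix": "G"},  # placeholder, see unit.schema.json
    "timestep": 60,
    "series": [
        {
            "hostname": "f2163",
            "statistics": {"avg": 1.5, "min": 0.0, "max": 3.0},
            "data": [0.0, 1.5, 3.0],
        }
    ],
}
print(check_job_metric_data(sample))
```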
Generated using json-schema-for-humans on 2024-12-04 at 16:45:59 +0100
7.2 - Metric Store
Reference information regarding the ClusterCockpit component “cc-metric-store” (GitHub Repo).
7.2.1 - Command Line
This page describes the command line options for the cc-metric-store executable.
-config <path>
Function: Specifies alternative path to application configuration file.
Default: ./config.json
Example: -config ./configfiles/configuration.json
-dev
Function: Enables the Swagger UI REST API documentation and playground
-gops
Function: Go server listens via github.com/google/gops/agent (for debugging).
-version
Function: Shows version information and exits.
Example config:
{
  "metrics": {
    "debug_metric": {
      "frequency": 60,
      "aggregation": "avg"
    },
    "clock": {
      "frequency": 60,
      "aggregation": "avg"
    },
    "cpu_idle": {
      "frequency": 60,
      "aggregation": "avg"
    },
    "cpu_iowait": {
      "frequency": 60,
      "aggregation": "avg"
    },
    "cpu_irq": {
      "frequency": 60,
      "aggregation": "avg"
    },
    "cpu_system": {
      "frequency": 60,
      "aggregation": "avg"
    },
    "cpu_user": {
      "frequency": 60,
      "aggregation": "avg"
    },
    "nv_mem_util": {
      "frequency": 60,
      "aggregation": "avg"
    },
    "nv_temp": {
      "frequency": 60,
      "aggregation": "avg"
    },
    "nv_sm_clock": {
      "frequency": 60,
      "aggregation": "avg"
    },
    "acc_utilization": {
      "frequency": 60,
      "aggregation": "avg"
    },
    "acc_mem_used": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "acc_power": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "flops_any": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "flops_dp": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "flops_sp": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "ib_recv": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "ib_xmit": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "ib_recv_pkts": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "ib_xmit_pkts": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "cpu_power": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "core_power": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "mem_power": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "ipc": {
      "frequency": 60,
      "aggregation": "avg"
    },
    "cpu_load": {
      "frequency": 60,
      "aggregation": null
    },
    "lustre_close": {
      "frequency": 60,
      "aggregation": null
    },
    "lustre_open": {
      "frequency": 60,
      "aggregation": null
    },
    "lustre_statfs": {
      "frequency": 60,
      "aggregation": null
    },
    "lustre_read_bytes": {
      "frequency": 60,
      "aggregation": null
    },
    "lustre_write_bytes": {
      "frequency": 60,
      "aggregation": null
    },
    "net_bw": {
      "frequency": 60,
      "aggregation": null
    },
    "file_bw": {
      "frequency": 60,
      "aggregation": null
    },
    "mem_bw": {
      "frequency": 60,
      "aggregation": "sum"
    },
    "mem_cached": {
      "frequency": 60,
      "aggregation": null
    },
    "mem_used": {
      "frequency": 60,
      "aggregation": null
    },
    "vectorization_ratio": {
      "frequency": 60,
      "aggregation": "avg"
    }
  },
  "checkpoints": {
    "interval": "1h",
    "directory": "./var/checkpoints",
    "restore": "1h"
  },
  "archive": {
    "interval": "24h",
    "directory": "./var/archive"
  },
  "http-api": {
    "address": "localhost:8082",
    "https-cert-file": null,
    "https-key-file": null
  },
  "retention-in-memory": "48h",
  "nats": null,
  "jwt-public-key": "kzfYrYy+TzpanWZHJ5qSdMj5uKUWgq74BWhQG6copP0="
}
7.2.2 - Configuration
All durations are specified as strings with a unit suffix (allowed suffixes: s, m, h, …).
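As a rough illustration of how such duration strings map to seconds, the sketch below handles only the three suffixes named above. It is not the parser used by cc-metric-store: the function name `parse_duration` is hypothetical, accepting chained values such as "1h30m" is an assumption, and the real parser may accept further units.

```python
import re

# Seconds per unit for the suffixes named in the text; other units are not handled here.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)([smh])")


def parse_duration(text: str) -> float:
    """Convert strings such as "48h" or "1h30m" to seconds (simplified sketch)."""
    if not text:
        raise ValueError("empty duration")
    position, total = 0, 0.0
    while position < len(text):
        match = _TOKEN.match(text, position)
        if match is None:
            raise ValueError(f"cannot parse duration: {text!r}")
        total += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        position = match.end()
    return total


print(parse_duration("48h"))
```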
- metrics: Map of metric-name to objects with the following properties
  - frequency: Timestep/Interval/Resolution of this metric
  - aggregation: Can be "sum", "avg" or null
    - null means aggregation across nodes is forbidden for this metric
    - "sum" means that values from the child levels are summed up for the parent level
    - "avg" means that values from the child levels are averaged for the parent level
  - scope: Unused at the moment, should be something like "node", "socket" or "hwthread"
- nats:
  - address: URL of NATS.io server, example: "nats://localhost:4222"
  - username and password: Optional, if provided use those for the connection
  - subscriptions:
    - subscribe-to: Where to expect the measurements to be published
    - cluster-tag: Default value for the cluster tag
- http-api:
  - address: Address to bind to, for example 0.0.0.0:8080
  - https-cert-file and https-key-file: Optional, if provided enable HTTPS using those files as certificate/key
- jwt-public-key: Base64 encoded string, use this to verify requests to the HTTP API
- retention-in-memory: Keep all values in memory for at least that amount of time
- checkpoints:
  - interval: Do checkpoints every X seconds/minutes/hours
  - directory: Path to a directory
  - restore: After a restart, load the last X seconds/minutes/hours of data back into memory
- archive:
  - interval: Move and compress all checkpoints not needed anymore every X seconds/minutes/hours
  - directory: Path to a directory
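The three aggregation settings described above can be pictured with a small sketch. The function `aggregate` and the sample numbers are hypothetical and only mirror the wording of the list; they are not taken from the cc-metric-store implementation.

```python
from statistics import fmean
from typing import Optional, Sequence


def aggregate(child_values: Sequence[float], mode: Optional[str]) -> float:
    """Combine values from child levels into one parent-level value."""
    if mode is None:
        # null in the configuration: aggregation is forbidden for this metric.
        raise ValueError("aggregation is forbidden for this metric")
    if mode == "sum":
        return float(sum(child_values))
    if mode == "avg":
        return fmean(child_values)
    raise ValueError(f"unknown aggregation mode: {mode!r}")


# Made-up per-socket values for two metrics from the example configuration.
print(aggregate([10.0, 12.0], "sum"))      # flops_any uses "sum"
print(aggregate([2000.0, 2400.0], "avg"))  # clock uses "avg"
```

A metric such as cpu_load, configured with null, would raise an error in this sketch instead of producing a combined value.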
7.2.3 - Metric Store REST API
Authentication
JWT tokens
cc-metric-store supports only JWT tokens using the EdDSA/Ed25519 signing method. The token is provided using the Authorization Bearer header.
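To see what such a token carries, the header and payload segments can be decoded with the standard library alone. The sketch below inspects the sample token used in the test scripts on this page. It does not verify the Ed25519 signature, which would require a cryptography library and the matching public key, so it is only useful for debugging. The helper name `decode_segment` is hypothetical.

```python
import base64
import json

# Sample token copied from the test scripts on this page.
TOKEN = (
    "eyJ0eXAiOiJKV1QiLCJhbGciOiJFZERTQSJ9."
    "eyJ1c2VyIjoiYWRtaW4iLCJyb2xlcyI6WyJST0xFX0FETUlOIiwiUk9MRV9BTkFMWVNUIiwiUk9MRV9VU0VSIl19."
    "d-3_3FZTsadPjDEdsWrrQ7nS0edMAR4zjl-eK7rJU3HziNBfI9PDHDIpJVHTNN5E5SlLGLFXctWyKAkwhXL-Dw"
)


def decode_segment(segment: str) -> dict:
    """Decode one base64url-encoded JSON segment of a JWT."""
    padded = segment + "=" * (-len(segment) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))


header_b64, payload_b64, _signature_b64 = TOKEN.split(".")
header = decode_segment(header_b64)
payload = decode_segment(payload_b64)
print(header["alg"])
print(payload)
```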
Example script to test the endpoint:
# Only use the JWT token if JWT authentication has been set up
JWT="eyJ0eXAiOiJKV1QiLCJhbGciOiJFZERTQSJ9.eyJ1c2VyIjoiYWRtaW4iLCJyb2xlcyI6WyJST0xFX0FETUlOIiwiUk9MRV9BTkFMWVNUIiwiUk9MRV9VU0VSIl19.d-3_3FZTsadPjDEdsWrrQ7nS0edMAR4zjl-eK7rJU3HziNBfI9PDHDIpJVHTNN5E5SlLGLFXctWyKAkwhXL-Dw"
curl -X 'GET' 'http://localhost:8081/api/query/' -H "Authorization: Bearer $JWT" -d "{ \"cluster\": \"alex\", \"from\": 1720879275, \"to\": 1720964715, \"queries\": [{\"metric\": \"cpu_load\",\"host\": \"a0124\"}] }"
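The same query can be assembled in Python, which makes the structure of the request body easier to see. This is a sketch under assumptions: the helper `build_query_request` is hypothetical, the token is a placeholder, and the request is only constructed, not sent, so that the example runs without a server.

```python
import json
import urllib.request


def build_query_request(base_url, token, cluster, start, end, metric, host):
    """Build (but do not send) a query request shaped like the curl call above."""
    body = {
        "cluster": cluster,
        "from": start,  # UNIX timestamps in seconds, as in the curl example
        "to": end,
        "queries": [{"metric": metric, "host": host}],
    }
    return urllib.request.Request(
        f"{base_url}/api/query/",
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


request = build_query_request(
    "http://localhost:8081", "<JWT>", "alex", 1720879275, 1720964715, "cpu_load", "a0124"
)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it to a running cc-metric-store.
```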
NATS
TODO
Usage of Swagger UI
This Swagger UI is also available as part of cc-metric-store if you start it with the dev option:
./cc-metric-store -dev
You may access it at this URL.
Payload format for write endpoint
The data comes in InfluxDB line protocol format.
<metric>,cluster=<cluster>,hostname=<hostname>,type=<node/hwthread/etc> value=<value> <epoch_time_in_ns_or_s>
Real example:
proc_run,cluster=fritz,hostname=f2163,type=node value=4i 1725620476214474893
A more detailed description of the ClusterCockpit-flavored InfluxDB line protocol and its types can be found in the CC specification.
Example script to test endpoint:
# Only use the JWT token if JWT authentication has been set up
JWT="eyJ0eXAiOiJKV1QiLCJhbGciOiJFZERTQSJ9.eyJ1c2VyIjoiYWRtaW4iLCJyb2xlcyI6WyJST0xFX0FETUlOIiwiUk9MRV9BTkFMWVNUIiwiUk9MRV9VU0VSIl19.d-3_3FZTsadPjDEdsWrrQ7nS0edMAR4zjl-eK7rJU3HziNBfI9PDHDIpJVHTNN5E5SlLGLFXctWyKAkwhXL-Dw"
curl -X 'POST' 'http://localhost:8081/api/write/?cluster=alex' -H "Authorization: Bearer $JWT" -d "proc_run,cluster=fritz,hostname=f2163,type=node value=4i 1725620476214474893"
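The payload template shown above can be filled in programmatically. The following sketch reproduces the real example line. The function name `format_metric_line` is hypothetical, and the sketch does not escape special characters such as spaces or commas in tag values, which a complete line protocol encoder would need to handle.

```python
def format_metric_line(metric, cluster, hostname, scope_type, value, timestamp):
    """Fill in the template <metric>,cluster=...,hostname=...,type=... value=... <time>."""
    return (
        f"{metric},cluster={cluster},hostname={hostname},type={scope_type} "
        f"value={value} {timestamp}"
    )


# Values taken from the real example above; "4i" marks an integer value.
line = format_metric_line("proc_run", "fritz", "f2163", "node", "4i", 1725620476214474893)
print(line)
```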
Swagger API Reference
Non-Interactive Documentation
This reference is rendered using the swagger-ui plugin based on the original definition file found in the ClusterCockpit repository, but without a serving backend. This means that all interactivity (“Try It Out”) will not return actual data. However, a Curl call and a compiled Request URL will still be displayed if an API endpoint is executed.
7.3 - cc-event-store
cc-event-store
A simple short-term store for job and system events as well as logs in the ClusterCockpit ecosystem. Events and logs were introduced as an extension to the previous CCMetric messages, which carry numeric data from the compute nodes known as metrics (see the lineprotocol specification at cc-specification). Events and logs are strings and, in contrast to the periodic sending of metrics from the cc-metric-collector, can happen at any time. All storage backends have a configuration option for the retention time for which events should be kept. Logs are never deleted.
Configuration
{
  "receiver" : "/path/to/receiver/config/file",
  "storage" : "/path/to/storage/config/file",
  "api" : "/path/to/api/config/file"
}
For the format of each file, see here:
Structure
The cc-event-store has 4 components that are coupled together in the binary.
- The event and log message receivers are reused from cc-metric-collector. There they are used to receive metrics from remote targets but are flexible enough to receive events and logs as well. See cc-metric-collector’s receivers.
- The router forwards the events and logs to the storage manager.
- The storage manager is a frontend to some database backends like SQLite or Postgres. The SQLite backend is the main development target.
- The REST API is mainly used to query the storage backends but can also be used to insert events and logs.
This also explains why cc-event-store uses multiple configuration files, all coupled by a central configuration file. Each component has its own configuration file, which makes it possible to reuse the receivers from cc-metric-collector without any changes; they just require their configuration file.
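The coupling through a central configuration file can be sketched as follows: read the central file, then load each component file it points to. The helper `load_event_store_config` is hypothetical, and treating all three keys as mandatory and resolving paths relative to the working directory are assumptions of this sketch, not documented behavior.

```python
import json
from pathlib import Path

# Keys of the central configuration file shown above.
COMPONENT_KEYS = ("receiver", "storage", "api")


def load_event_store_config(central_path):
    """Load the central file and every component file it references."""
    central = json.loads(Path(central_path).read_text(encoding="utf-8"))
    missing = [key for key in COMPONENT_KEYS if key not in central]
    if missing:
        raise KeyError(f"central configuration lacks: {', '.join(missing)}")
    return {
        key: json.loads(Path(central[key]).read_text(encoding="utf-8"))
        for key in COMPONENT_KEYS
    }
```

Calling `load_event_store_config("config.json")` would return one dictionary per component, so the receiver part could be handed to the reused receiver code unchanged.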
7.3.1 - cc-event-store's REST API
Configuration
{
  "address" : "localhost",
  "port": "8088",
  "idle_timeout": "120s",
  "keep_alives_enabled": true,
  "jwt_public_key": "0123456789ABCDEF",
  "enable_swagger_ui": true
}
- address: Hostname or IP to listen for requests
- port: Port number (as string) to listen at
- idle_timeout: Close connection after this time. Must be a parseable time for time.ParseDuration
- keep_alives_enabled: Keep connections alive for some time
- jwt_public_key: JWT public key used for authentication
- enable_swagger_ui: Enable the Swagger UI, a web-based documentation of the REST API
Endpoints
http://address:port/api/query
http://address:port/api/write?cluster=<cluster>
See generated Swagger documentation or web-based Swagger UI for more information and the data format accepted by the endpoints
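As an illustration of how the address and port options combine into the endpoint URLs listed above, the sketch below derives both URLs from the example configuration. The helper `endpoint_urls` and the cluster name are hypothetical, and the plain http scheme is taken from the endpoint list.

```python
import json
from urllib.parse import urlencode

# Example configuration from above; note that the port is given as a string.
api_config = json.loads("""
{
  "address": "localhost",
  "port": "8088",
  "idle_timeout": "120s",
  "keep_alives_enabled": true,
  "jwt_public_key": "0123456789ABCDEF",
  "enable_swagger_ui": true
}
""")


def endpoint_urls(config: dict, cluster: str) -> dict:
    """Derive the query and write URLs from the address and port options."""
    base = f"http://{config['address']}:{config['port']}"
    return {
        "query": f"{base}/api/query",
        "write": f"{base}/api/write?{urlencode({'cluster': cluster})}",
    }


print(endpoint_urls(api_config, "fritz"))
```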
7.3.2 - cc-event-store's storage backends
Storage component
This component contains different backends for storing CCEvent and CCLog messages. This is only a short-term storage, so all backends have a notion of retention time to delete older entries.
Backends
Each backend uses its own configuration file entries. Check the backend-specific page for more information.
7.3.2.1 - Storage backend for Postgres
Storage backend for Postgres
Configuration
{
  "type" : "postgres",
  "server": "127.0.0.1",
  "port": 5432,
  "database_path" : "database_name",
  "flags" : [
    "open_flag=X"
  ],
  "username" : "myuser",
  "password" : "mypass",
  "connection_timeout" : 1
}
- type: Has to be postgres
- server: IP or name of server (default localhost)
- port: Port number of server (default 5432)
- database_path: The backend connects to this database
- flags: Flags when opening Postgres. For things like connect settings (sslmode=verify-full)
- username: If given, the database is opened with the given username
- password: If given and username is also given, use it to open the database
- connection_timeout: Timeout for connection in seconds (default 1)
Storage
The Postgres backend stores CCEvents and CCLog messages in distinct tables named <cluster>_events and <cluster>_logs respectively. It does not use separate tables to hold specific and recurring parts of CCEvents and CCLog messages (namely the hostname tag, type tag and typeid tag). The timestamps of the messages are stored as UNIX timestamps with a precision of seconds.
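As a small illustration of the naming scheme and the timestamp precision described above, the following sketch derives the two table names for a cluster and reduces a nanosecond timestamp to seconds. The helper names tableNames and toSeconds are assumptions made for this example; the actual SQL schema and code of the backend are not shown here.

```go
package main

import (
	"fmt"
	"time"
)

// tableNames returns the per-cluster table names <cluster>_events and
// <cluster>_logs. Hypothetical helper for illustration.
func tableNames(cluster string) (events, logs string) {
	return cluster + "_events", cluster + "_logs"
}

// toSeconds converts a UNIX timestamp in nanoseconds to a UNIX timestamp
// in seconds, dropping the sub-second part.
func toSeconds(unixNano int64) int64 {
	return time.Unix(0, unixNano).Unix()
}

func main() {
	events, logs := tableNames("testcluster")
	fmt.Println(events, logs)
	fmt.Println(toSeconds(1700000000999999999))
}
```

Because only seconds are kept, two messages that arrive within the same second end up with the same stored timestamp.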
7.3.2.2 - Storage backend for SQLite3
Storage backend for SQLite3
Configuration
{
"type" : "sqlite",
"database_path" : "/path/for/databases",
"flags" : [
"open_flag=X"
],
"username" : "myuser",
"password" : "mypass"
}
- type: Has to be sqlite
- database_path: The backend creates tables based on the cluster names in this path
- flags: Flags when opening SQLite. For things like timeouts (_timeout=5000), storage settings (_journal=WAL), …
- username: If given, the database is opened with the given username
- password: If given and username is also given, use it to open the database
Storage
The SQLite backend stores CCEvents and CCLog messages in distinct tables named <cluster>_events and <cluster>_logs respectively. It does not use separate tables to hold specific and recurring parts of CCEvents and CCLog messages (namely the hostname tag, type tag and typeid tag). The timestamps of the messages are stored as UNIX timestamps with a precision of seconds.
7.4 - cc-metric-collector
cc-metric-collector
A node agent for measuring, processing and forwarding node level metrics. It is part of the ClusterCockpit ecosystem.
The metric collector sends (and receives) metrics in the InfluxDB line protocol as it provides flexibility while keeping a separation between tags (like index columns in relational databases) and fields (like data columns).
There is a single timer loop that triggers all collectors serially, collects the collectors’ data and sends the metrics to the sink. This is done as all data is submitted with a single time stamp. The sinks currently use mostly blocking APIs.
The receiver runs as a go routine side-by-side with the timer loop and asynchronously forwards received metrics to the sink.
Configuration
Configuration is implemented using a single JSON document that is distributed over the network and may be persisted as a file. Supported metrics are documented here.
There is a main configuration file with basic settings that point to the other configuration files for the different components.
{
"sinks": "sinks.json",
"collectors" : "collectors.json",
"receivers" : "receivers.json",
"router" : "router.json",
"interval": "10s",
"duration": "1s"
}
The interval defines how often the metrics should be read and sent to the sink. The duration tells collectors how long one measurement has to take. This is important for some collectors, like the likwid collector. For more information, see here.
See the component READMEs for their configuration:
Installation
$ git clone git@github.com:ClusterCockpit/cc-metric-collector.git
$ make    # downloads LIKWID, builds it as static library with 'direct' accessmode and copies all required files for the collector
$ go get  # requires at least golang 1.16
$ make
For more information, see here.
Running
$ ./cc-metric-collector --help
Usage of metric-collector:
-config string
Path to configuration file (default "./config.json")
-log string
Path for logfile (default "stderr")
-once
Run all collectors only once
Scenarios
The metric collector was designed with flexibility in mind, so it can be used in many scenarios. Here are a few:
flowchart TD
  subgraph a ["Cluster A"]
    nodeA[NodeA with CC collector]
    nodeB[NodeB with CC collector]
    nodeC[NodeC with CC collector]
  end
  a --> db[(Database)]
  db <--> ccweb("Webfrontend")

flowchart TD
  subgraph a [ClusterA]
    direction LR
    nodeA[NodeA with CC collector]
    nodeB[NodeB with CC collector]
    nodeC[NodeC with CC collector]
  end
  subgraph b [ClusterB]
    direction LR
    nodeD[NodeD with CC collector]
    nodeE[NodeE with CC collector]
    nodeF[NodeF with CC collector]
  end
  a --> ccrecv{"CC collector as receiver"}
  b --> ccrecv
  ccrecv --> db[("Database1")]
  ccrecv -.-> db2[("Database2")]
  db <-.-> ccweb("Webfrontend")
Contributing
The ClusterCockpit ecosystem is designed to be used by different HPC computing centers. Since configurations and setups differ between the centers, the centers likely have to put some work into the cc-metric-collector to gather all desired metrics.
You are free to open an issue to request a collector but we would also be happy about PRs.
Contact
7.4.1 - cc-metric-collector's collectors
CCMetric collectors
This folder contains the collectors for the cc-metric-collector.
Configuration
{
"collector_type" : {
<collector specific configuration>
}
}
In contrast to the configuration files for sinks and receivers, the collectors configuration is not a list but a set of dicts. This is required because we didn’t manage to partially read the type before loading the remaining configuration. We are eager to change this to the same format.
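To illustrate the "set of dicts" layout, the sketch below decodes a collectors configuration into a map from collector type to its still-undecoded, collector-specific JSON. The function name splitCollectorConfig is an assumption for this example and does not refer to a function in cc-metric-collector.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// splitCollectorConfig decodes the top-level object. Each key is a collector
// type, each value is kept as raw JSON so that it can be handed to the
// matching collector later. Hypothetical helper for illustration.
func splitCollectorConfig(raw []byte) ([]string, map[string]json.RawMessage, error) {
	var cfg map[string]json.RawMessage
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return nil, nil, err
	}
	names := make([]string, 0, len(cfg))
	for name := range cfg {
		names = append(names, name)
	}
	sort.Strings(names) // stable order for printing
	return names, cfg, nil
}

func main() {
	raw := []byte(`{
		"cpustat": {"exclude_metrics": ["cpu_idle"]},
		"memstat": {}
	}`)
	names, cfg, err := splitCollectorConfig(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, name := range names {
		fmt.Printf("%s -> %s\n", name, cfg[name])
	}
}
```

Since the collector type is the key of the object, each collector type can appear only once in such a configuration.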
Available collectors
cpustat
memstat
iostat
diskstat
loadavg
netstat
ibstat
ibstat_perfquery
tempstat
lustrestat
likwid
nvidia
customcmd
ipmistat
topprocs
nfs3stat
nfs4stat
cpufreq
cpufreq_cpuinfo
numastats
gpfs
beegfs_meta
beegfs_storage
rocm_smi
Todos
- Aggregate metrics to higher topology entity (sum hwthread metrics to socket metric, …). Needs to be configurable
Contributing own collectors
A collector reads data from any source, parses it to metrics and submits these metrics to the metric-collector. A collector provides the following functions:
- Name() string: Returns the name of the collector
- Init(config json.RawMessage) error: Initializes the collector using the given collector-specific config in JSON. Check if needed files/commands exist, …
- Initialized() bool: Checks if the collector is successfully initialized
- Read(duration time.Duration, output chan ccMetric.CCMetric): Read, parse and submit data to the output channel as CCMetric. If the collector has to measure anything for some duration, use the provided function argument duration.
- Close(): Closes down the collector.
It is recommended to call setup() in the Init() function.
Finally, the collector needs to be registered in collectorManager.go. There is a list of collectors called AvailableCollectors which is a map (collector_type_string -> pointer to MetricCollector interface). Add a new entry with a descriptive name and the new collector.
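The registration step can be pictured with the following self-contained sketch. It uses a reduced stand-in interface without the Read() function and a lower-case map availableCollectors; both are assumptions for illustration and differ from the real MetricCollector interface and AvailableCollectors map in collectorManager.go.

```go
package main

import "fmt"

// MetricCollector is a reduced stand-in for the interface described above
// (Read is left out to keep the sketch free of external packages).
type MetricCollector interface {
	Name() string
	Init(config []byte) error
	Initialized() bool
	Close()
}

// sampleCollector is a hypothetical collector used only in this sketch.
type sampleCollector struct {
	name string
	init bool
}

func (m *sampleCollector) Name() string      { return m.name }
func (m *sampleCollector) Initialized() bool { return m.init }
func (m *sampleCollector) Close()            { m.init = false }
func (m *sampleCollector) Init(config []byte) error {
	m.name = "SampleCollector"
	m.init = true
	return nil
}

// availableCollectors maps the collector type string from the configuration
// to a collector instance, similar in spirit to AvailableCollectors.
var availableCollectors = map[string]MetricCollector{
	"sample": &sampleCollector{},
}

// initCollector looks up a collector by its type string and initializes it.
func initCollector(kind string, config []byte) (MetricCollector, error) {
	c, ok := availableCollectors[kind]
	if !ok {
		return nil, fmt.Errorf("unknown collector type %q", kind)
	}
	if err := c.Init(config); err != nil {
		return nil, err
	}
	return c, nil
}

func main() {
	c, err := initCollector("sample", nil)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(c.Name(), c.Initialized())
}
```

The key of the map is the string that users later write into the collectors configuration, so a short and descriptive name is a sensible choice.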
Sample collector
package collectors

import (
	"encoding/json"
	"fmt"
	"time"

	// Logger import path is assumed to sit next to ccMetric in this repository layout
	cclog "github.com/ClusterCockpit/cc-metric-collector/internal/ccLogger"
	lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
)

// Struct for the collector-specific JSON config
type SampleCollectorConfig struct {
	ExcludeMetrics []string `json:"exclude_metrics"`
}

type SampleCollector struct {
	metricCollector
	config SampleCollectorConfig
}

func (m *SampleCollector) Init(config json.RawMessage) error {
	// Check if already initialized
	if m.init {
		return nil
	}
	m.name = "SampleCollector"
	m.setup()
	// Parse the collector-specific configuration, if any
	if len(config) > 0 {
		err := json.Unmarshal(config, &m.config)
		if err != nil {
			return err
		}
	}
	m.meta = map[string]string{"source": m.name, "group": "Sample"}
	m.init = true
	return nil
}

func (m *SampleCollector) Read(interval time.Duration, output chan lp.CCMetric) {
	if !m.init {
		return
	}
	// tags for the metric, if type != node use proper type and type-id
	tags := map[string]string{"type": "node"}
	// GetMetric() is a placeholder for the collector's own data source
	x, err := GetMetric()
	if err != nil {
		cclog.ComponentError(m.name, fmt.Sprintf("Read(): %v", err))
		// Do not submit a metric if reading failed
		return
	}
	// Each metric has exactly one field: value !
	value := map[string]interface{}{"value": int64(x)}
	if y, err := lp.New("sample_metric", tags, m.meta, value, time.Now()); err == nil {
		output <- y
	}
}

func (m *SampleCollector) Close() {
	m.init = false
}
7.4.1.1 - BeeGFS on Demand collector
BeeGFS on Demand
collector
This Collector is to collect BeeGFS on Demand (BeeOND) metadata clientstats.
"beegfs_meta": {
"beegfs_path": "/usr/bin/beegfs-ctl",
"exclude_filesystem": [
"/mnt/ignore_me"
],
"exclude_metrics": [
"ack",
"entInf",
"fndOwn"
]
}
The BeeGFS On Demand (BeeOND) collector uses the beegfs-ctl command to read performance metrics for BeeGFS filesystems.
The reported filesystems can be filtered with the exclude_filesystem option in the configuration.
The path to the beegfs-ctl command can be configured with the beegfs_path option in the configuration.
When using the exclude_metrics option, the excluded metrics are summed as other.
Important: The metrics listed below are similar to the naming of BeeGFS. The collector prefixes these with beegfs_cstorage (beegfs client storage).
For example beegfs metric open -> beegfs_cstorage_open
Available Metrics:
- sum
- ack
- close
- entInf
- fndOwn
- mkdir
- create
- rddir
- refrEnt
- mdsInf
- rmdir
- rmLnk
- mvDirIns
- mvFiIns
- open
- ren
- sChDrct
- sAttr
- sDirPat
- stat
- statfs
- trunc
- symlnk
- unlnk
- lookLI
- statLI
- revalLI
- openLI
- createLI
- hardlnk
- flckAp
- flckEn
- flckRg
- dirparent
- listXA
- getXA
- rmXA
- setXA
- mirror
The collector adds a filesystem
tag to all metrics
7.4.1.2 - BeeGFS on Demand collector
BeeGFS on Demand
collector
This Collector is to collect BeeGFS on Demand (BeeOND) storage stats.
"beegfs_storage": {
"beegfs_path": "/usr/bin/beegfs-ctl",
"exclude_filesystem": [
"/mnt/ignore_me"
],
"exclude_metrics": [
"ack",
"storInf",
"unlnk"
]
}
The BeeGFS On Demand (BeeOND) collector uses the beegfs-ctl command to read performance metrics for BeeGFS filesystems.
The reported filesystems can be filtered with the exclude_filesystem option in the configuration.
The path to the beegfs-ctl command can be configured with the beegfs_path option in the configuration.
When using the exclude_metrics option, the excluded metrics are summed as other.
Important: The metrics listed below are similar to the naming of BeeGFS. The collector prefixes these with beegfs_cstorage_ (beegfs client meta).
For example beegfs metric open -> beegfs_cstorage_
Note: BeeGFS offers a lot of metadata information. It probably makes sense to exclude most of it. Nevertheless, these excluded metrics will be summed as beegfs_cstorage_other.
Available Metrics:
- “sum”
- “ack”
- “sChDrct”
- “getFSize”
- “sAttr”
- “statfs”
- “trunc”
- “close”
- “fsync”
- “ops-rd”
- “MiB-rd/s”
- “ops-wr”
- “MiB-wr/s”
- “endbg”
- “hrtbeat”
- “remNode”
- “storInf”
- “unlnk”
The collector adds a filesystem
tag to all metrics
7.4.1.3 - cpufreq_cpuinfo collector
cpufreq_cpuinfo
collector
"cpufreq_cpuinfo": {}
The cpufreq_cpuinfo collector reads the clock frequency from /proc/cpuinfo and outputs a handful of hwthread metrics.
Metrics:
cpufreq
7.4.1.4 - cpufreq collector
cpufreq
collector
"cpufreq": {
"exclude_metrics": []
}
The cpufreq collector reads the clock frequency from /sys/devices/system/cpu/cpu*/cpufreq and outputs a handful of hwthread metrics.
Metrics:
cpufreq
7.4.1.5 - cpustat collector
cpustat
collector
"cpustat": {
"exclude_metrics": [
"cpu_idle"
]
}
The cpustat collector reads data from /proc/stat and outputs a handful of node and hwthread metrics. If a metric is not required, it can be excluded from forwarding it to the sink.
Metrics:
- cpu_user with unit=Percent
- cpu_nice with unit=Percent
- cpu_system with unit=Percent
- cpu_idle with unit=Percent
- cpu_iowait with unit=Percent
- cpu_irq with unit=Percent
- cpu_softirq with unit=Percent
- cpu_steal with unit=Percent
- cpu_guest with unit=Percent
- cpu_guest_nice with unit=Percent
- cpu_used = cpu_* - cpu_idle with unit=Percent
- num_cpus
7.4.1.6 - customcmd collector
customcmd
collector
"customcmd": {
"exclude_metrics": [
"mymetric"
],
"files" : [
"/var/run/myapp.metrics"
],
"commands" : [
"/usr/local/bin/getmetrics.pl"
]
}
The customcmd collector reads data from files and the output of executed commands. The files and commands can output multiple metrics (separated by newline) but they have to be in the InfluxDB line protocol. If a metric is not parsable, it is skipped. If a metric is not required, it can be excluded from forwarding it to the sink.
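A command or file for the customcmd collector therefore has to emit lines of the form measurement,tag=value field=value timestamp. The following sketch builds one such line with a single field named value. The helper formatLine, the metric name myapp_requests and the tag set are assumptions for this example. It also assumes that names and tag values contain no spaces, commas or equal signs (so no escaping is done) and that the timestamp is given in nanoseconds.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// formatLine renders one metric in InfluxDB line protocol:
//   <name>[,<tag>=<val>...] value=<float> <timestamp>
// Tags are written in sorted key order. No escaping is performed.
func formatLine(name string, tags map[string]string, value float64, unixNano int64) string {
	keys := make([]string, 0, len(tags))
	for k := range tags {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	b.WriteString(name)
	for _, k := range keys {
		b.WriteString("," + k + "=" + tags[k])
	}
	b.WriteString(" value=" + strconv.FormatFloat(value, 'f', -1, 64))
	b.WriteString(" " + strconv.FormatInt(unixNano, 10))
	return b.String()
}

func main() {
	tags := map[string]string{"type": "node"}
	fmt.Println(formatLine("myapp_requests", tags, 42.5, 1700000000000000000))
}
```

A script listed under commands could print several such lines, one per metric, separated by newlines.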
7.4.1.7 - diskstat collector
diskstat
collector
"diskstat": {
"exclude_metrics": [
"disk_total"
]
}
The diskstat collector reads data from /proc/self/mounts and outputs a handful of node metrics. If a metric is not required, it can be excluded from forwarding it to the sink.
Metrics per device (with device tag):
- disk_total (unit GBytes)
- disk_free (unit GBytes)
Global metrics:
- part_max_used (unit percent)
7.4.1.8 - gpfs collector
gpfs
collector
"gpfs": {
"mmpmon_path": "/path/to/mmpmon",
"exclude_filesystem": [
"fs1"
],
"send_bandwidths": true,
"send_total_values": true
}
The gpfs collector uses the mmpmon command to read performance metrics for GPFS / IBM Spectrum Scale filesystems.
The reported filesystems can be filtered with the exclude_filesystem option in the configuration.
The path to the mmpmon command can be configured with the mmpmon_path option in the configuration. If nothing is set, the collector searches in $PATH for mmpmon.
Metrics:
- gpfs_bytes_read
- gpfs_bytes_written
- gpfs_num_opens
- gpfs_num_closes
- gpfs_num_reads
- gpfs_num_writes
- gpfs_num_readdirs
- gpfs_num_inode_updates
- gpfs_bytes_total = gpfs_bytes_read + gpfs_bytes_written (if send_total_values == true)
- gpfs_iops = gpfs_num_reads + gpfs_num_writes (if send_total_values == true)
- gpfs_metaops = gpfs_num_inode_updates + gpfs_num_closes + gpfs_num_opens + gpfs_num_readdirs (if send_total_values == true)
- gpfs_bw_read (if send_bandwidths == true)
- gpfs_bw_write (if send_bandwidths == true)
The collector adds a filesystem tag to all metrics.
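The three total metrics are plain sums of the base counters. The sketch below spells out the formulas from the list above. The struct gpfsCounters and the function derivedTotals are hypothetical names for this example and are not part of the collector.

```go
package main

import "fmt"

// gpfsCounters holds the base counters for one filesystem.
// Hypothetical type for illustration.
type gpfsCounters struct {
	BytesRead, BytesWritten        int64
	NumOpens, NumCloses            int64
	NumReads, NumWrites            int64
	NumReaddirs, NumInodeUpdates   int64
}

// derivedTotals computes the total metrics. If sendTotalValues is false,
// no total metrics are produced.
func derivedTotals(c gpfsCounters, sendTotalValues bool) map[string]int64 {
	out := map[string]int64{}
	if !sendTotalValues {
		return out
	}
	out["gpfs_bytes_total"] = c.BytesRead + c.BytesWritten
	out["gpfs_iops"] = c.NumReads + c.NumWrites
	out["gpfs_metaops"] = c.NumInodeUpdates + c.NumCloses + c.NumOpens + c.NumReaddirs
	return out
}

func main() {
	c := gpfsCounters{
		BytesRead: 100, BytesWritten: 50,
		NumOpens: 1, NumCloses: 1,
		NumReads: 3, NumWrites: 2,
		NumReaddirs: 4, NumInodeUpdates: 2,
	}
	fmt.Println(derivedTotals(c, true))
}
```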
7.4.1.9 - ibstat collector
ibstat
collector
"ibstat": {
"exclude_devices": [
"mlx4"
],
"send_abs_values": true,
"send_derived_values": true
}
The ibstat collector includes all Infiniband devices that can be found below /sys/class/infiniband/ and where any of the ports provides a LID file (/sys/class/infiniband/<dev>/ports/<port>/lid).
The devices can be filtered with the exclude_devices option in the configuration.
For each found LID the collector reads data through the sysfs files below /sys/class/infiniband/<device>. (See: https://www.kernel.org/doc/Documentation/ABI/stable/sysfs-class-infiniband)
Metrics:
- ib_recv
- ib_xmit
- ib_recv_pkts
- ib_xmit_pkts
- ib_total = ib_recv + ib_xmit (if send_total_values == true)
- ib_total_pkts = ib_recv_pkts + ib_xmit_pkts (if send_total_values == true)
- ib_recv_bw (if send_derived_values == true)
- ib_xmit_bw (if send_derived_values == true)
- ib_recv_pkts_bw (if send_derived_values == true)
- ib_xmit_pkts_bw (if send_derived_values == true)
The collector adds a device tag to all metrics.
7.4.1.10 - iostat collector
iostat
collector
"iostat": {
"exclude_metrics": [
"read_ms"
]
}
The iostat collector reads data from /proc/diskstats and outputs a handful of node metrics. If a metric is not required, it can be excluded from forwarding it to the sink.
Metrics:
io_reads
io_reads_merged
io_read_sectors
io_read_ms
io_writes
io_writes_merged
io_writes_sectors
io_writes_ms
io_ioops
io_ioops_ms
io_ioops_weighted_ms
io_discards
io_discards_merged
io_discards_sectors
io_discards_ms
io_flushes
io_flushes_ms
The device name is added as tag device
. For more details, see https://www.kernel.org/doc/html/latest/admin-guide/iostats.html
7.4.1.11 - ipmistat collector
ipmistat
collector
"ipmistat": {
"ipmitool_path": "/path/to/ipmitool",
"ipmisensors_path": "/path/to/ipmi-sensors"
}
The ipmistat collector reads data from ipmitool (ipmitool sensor) or ipmi-sensors (ipmi-sensors --sdr-cache-recreate --comma-separated-output).
The metrics depend on the output of the underlying tools but contain temperature, power and energy metrics.
7.4.1.12 - likwid collector
likwid
collector
The likwid
collector is probably the most complicated collector. The LIKWID library is included as static library with direct access mode. The direct access mode is suitable if the daemon is executed by a root user. The static library does not contain the performance groups, so all information needs to be provided in the configuration.
"likwid": {
"force_overwrite" : false,
"invalid_to_zero" : false,
"liblikwid_path" : "/path/to/liblikwid.so",
"accessdaemon_path" : "/folder/that/contains/likwid-accessD",
"access_mode" : "direct or accessdaemon or perf_event",
"lockfile_path" : "/var/run/likwid.lock",
"eventsets": [
{
"events" : {
"COUNTER0": "EVENT0",
"COUNTER1": "EVENT1"
},
"metrics" : [
{
"name": "sum_01",
"calc": "COUNTER0 + COUNTER1",
"publish": false,
"unit": "myunit",
"type": "hwthread"
}
]
}
],
"globalmetrics" : [
{
"name": "global_sum",
"calc": "sum_01",
"publish": true,
"unit": "myunit",
"type": "hwthread"
}
]
}
The likwid configuration consists of two parts, the eventsets and globalmetrics:
- An event set list itself has two parts, the events and a set of derivable metrics. Each of the events is a counter:event pair in LIKWID’s syntax. The metrics are a list of formulas to derive the metric value from the measurements of the events’ values. Each metric has a name, the formula, a type and a publish flag. There is an optional unit field. Counter names can be used like variables in the formulas, so PMC0+PMC1 sums the measurements for both events configured in the counters PMC0 and PMC1. You can optionally use time for the measurement time and inverseClock for 1.0/baseCpuFrequency. The type tells the LikwidCollector whether it is a metric for each hardware thread (hwthread) or each CPU socket (socket). You may specify a unit for the metric with unit. The last one is the publishing flag. It tells the LikwidCollector whether a metric should be sent to the router or is only used internally to compute a global metric.
- The globalmetrics are metrics which require data from multiple event set measurements to be derived. The inputs are the metrics in the event sets. Similar to the metrics in the event sets, the global metrics are defined by a name, a formula, a type and a publish flag. See event set metrics for details. The only difference is that there is no access to the raw event measurements anymore but only to the metrics. Also time and inverseClock cannot be used anymore. So, the idea is to derive a metric in the eventsets section and reuse it in the globalmetrics part. If you need a metric only for deriving the global metrics, disable forwarding of the event set metrics ("publish": false). Be aware that the combination might be misleading because the “behavior” of a metric changes over time and the multiple measurements might count different computing phases. Similar to the metrics in the eventset, you can specify a metric unit with the unit field.
Additional options:
- force_overwrite: Same as setting LIKWID_FORCE=1. In case counters are already in use, LIKWID overwrites their configuration to do its measurements
- invalid_to_zero: In some cases, the calculations result in NaN or Inf. With this option, all NaN and Inf values are replaced with 0.0. See the separate section below
- access_mode: Specify LIKWID access mode: direct for direct register access as root user or accessdaemon. The access mode perf_event is currently untested.
- accessdaemon_path: Folder of the access daemon likwid-accessD (like /usr/local/sbin)
- liblikwid_path: Location of liblikwid.so including file name like /usr/local/lib/liblikwid.so
- lockfile_path: Location of LIKWID’s lock file if multiple tools should access the hardware counters. Default /var/run/likwid.lock
Available metric types
Hardware performance counters are scattered all over the system nowadays. A counter covers a specific part of the system. While there are hardware thread specific counters for CPU cycles, instructions and so on, some others are specific for a whole CPU socket/package. To address that, the LikwidCollector provides the specification of a type for each metric.
- hwthread: One metric per CPU hardware thread with the tags "type" : "hwthread" and "type-id" : "$hwthread_id"
- socket: One metric per CPU socket/package with the tags "type" : "socket" and "type-id" : "$socket_id"
Note: You cannot specify socket type for a metric that is measured at hwthread type, so some kind of expert knowledge or lookup work in the Likwid Wiki is required. Get the type of each counter from the Architecture pages and as soon as one counter in a metric is socket-specific, the whole metric is socket-specific.
As a guideline:
- All counters FIXCx, PMCy and TMAz have the type hwthread
- All counter names containing BOX have the type socket
- All PWRx counters have type socket, except "PWR1" : "RAPL_CORE_ENERGY" has hwthread type
- All DFCx counters have type socket
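The guideline above can be written down as a small lookup. The sketch below is only an approximation under the stated rules. The functions counterType and metricType are hypothetical helpers, and counters that match none of the rules are reported as unknown so that they can be looked up in the LIKWID documentation.

```go
package main

import (
	"fmt"
	"strings"
)

// counterType applies the guideline to one counter/event pair.
func counterType(counter, event string) string {
	switch {
	case strings.Contains(counter, "BOX"):
		return "socket"
	case strings.HasPrefix(counter, "PWR"):
		// Exception named in the guideline
		if counter == "PWR1" && event == "RAPL_CORE_ENERGY" {
			return "hwthread"
		}
		return "socket"
	case strings.HasPrefix(counter, "DFC"):
		return "socket"
	case strings.HasPrefix(counter, "FIXC"),
		strings.HasPrefix(counter, "PMC"),
		strings.HasPrefix(counter, "TMA"):
		return "hwthread"
	}
	return "unknown"
}

// metricType returns "socket" as soon as one counter used by a metric is
// socket-specific, otherwise "hwthread" (or "unknown" if a counter could
// not be classified).
func metricType(events map[string]string) string {
	result := "hwthread"
	for counter, event := range events {
		switch counterType(counter, event) {
		case "socket":
			return "socket"
		case "unknown":
			result = "unknown"
		}
	}
	return result
}

func main() {
	ipc := map[string]string{"PMC0": "RETIRED_INSTRUCTIONS", "PMC1": "CPU_CLOCKS_UNHALTED"}
	mem := map[string]string{"DFC0": "DRAM_CHANNEL_0", "DFC1": "DRAM_CHANNEL_1"}
	fmt.Println(metricType(ipc), metricType(mem))
}
```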
Help with the configuration
The configuration for the likwid
collector is quite complicated. Most users don’t use LIKWID with the event:counter notation but rely on the performance groups defined by the LIKWID team for each architecture. In order to help with the likwid
collector configuration, we included a script scripts/likwid_perfgroup_to_cc_config.py
that creates the configuration of an eventset
from a performance group (using a LIKWID installation in $PATH
):
$ likwid-perfctr -i
[...]
short name: ICX
[...]
$ likwid-perfctr -a
[...]
MEM_DP
MEM
FLOPS_SP
CLOCK
[...]
$ scripts/likwid_perfgroup_to_cc_config.py ICX MEM_DP
{
"events": {
"FIXC0": "INSTR_RETIRED_ANY",
"FIXC1": "CPU_CLK_UNHALTED_CORE",
"..." : "..."
},
"metrics" : [
{
"calc": "time",
"name": "Runtime (RDTSC) [s]",
"publish": true,
"unit": "seconds",
"type": "hwthread"
},
{
"..." : "..."
}
]
}
You can copy this JSON and add it to the eventsets list. If you specify multiple event sets, you can add globally derived metrics in the extra globalmetrics section with the metric names as variables.
Mixed usage between daemon and users
LIKWID checks the file /var/run/likwid.lock before performing any interfering operations. Who is allowed to access the counters is determined by the owner of the file. If it does not exist, it is created for the current user. So, if you want to temporarily allow counter access to a user (e.g. in a job):
Before (SLURM prolog, …)
chown $JOBUSER /var/run/likwid.lock
After (SLURM epilog, …)
chown $CCUSER /var/run/likwid.lock
invalid_to_zero
option
In some cases LIKWID returns 0.0 for some events that are further used in processing and maybe used as divisor in a calculation. After evaluation of a metric, the result might be NaN or +-Inf. These resulting metrics are commonly not created and forwarded to the router because the InfluxDB line protocol does not support these special floating-point values. If you want to have them sent, this option forces these metric values to be 0.0 instead.
One might think this does not happen often, but frequently used metrics in the world of performance engineering like Instructions-per-Cycle (IPC) or, more frequently, the actual CPU clock are derived with events like CPU_CLK_UNHALTED_CORE (Intel) which do not increment in halted state (as the name implies). There are different power management systems in a chip which can cause a hardware thread to go into such a state. Moreover, if no cycles are executed by the core, many other events are not incremented either (like INSTR_RETIRED_ANY for retired instructions and part of IPC).
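The effect of invalid_to_zero can be sketched as follows. A metric such as IPC is a division of two counter values, and when the divisor is 0.0 the result is NaN or Inf. The functions ratio and filterInvalid are hypothetical helpers written for this example, they are not the collector's implementation.

```go
package main

import (
	"fmt"
	"math"
)

// ratio divides two measurements, e.g. instructions by cycles for IPC.
// With a zero divisor the result is NaN (0/0) or +-Inf (x/0).
func ratio(numerator, divisor float64) float64 {
	return numerator / divisor
}

// filterInvalid returns the value to send and whether it should be sent.
// Without invalidToZero, NaN and Inf results are dropped. With it, they
// are replaced by 0.0 and sent.
func filterInvalid(v float64, invalidToZero bool) (float64, bool) {
	if math.IsNaN(v) || math.IsInf(v, 0) {
		if invalidToZero {
			return 0.0, true
		}
		return 0.0, false
	}
	return v, true
}

func main() {
	// A halted hardware thread: no instructions, no cycles
	ipc := ratio(0, 0)

	v, send := filterInvalid(ipc, false)
	fmt.Println("invalid_to_zero=false:", v, send)

	v, send = filterInvalid(ipc, true)
	fmt.Println("invalid_to_zero=true: ", v, send)
}
```

Keep in mind that a forced 0.0 is indistinguishable from a genuinely measured zero once it reaches the database.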
lockfile_path
option
LIKWID can be configured with a lock file with which the access to the performance monitoring registers can be disabled (only the owner of the lock file is allowed to access the registers). When the lockfile_path
option is set, the collector subscribes to changes to this file to stop monitoring if the owner of the lock file changes. This feature is useful when users should be able to perform own hardware performance counter measurements through LIKWID or any other tool.
send_*_total values
option
- send_core_total_values: Metrics, which are usually collected on a per hardware thread basis, are additionally summed up per CPU core.
- send_socket_total_values: Metrics, which are usually collected on a per hardware thread basis, are additionally summed up per CPU socket.
- send_node_total_values: Metrics, which are usually collected on a per hardware thread basis, are additionally summed up per node.
Example configuration
AMD Zen3
"likwid": {
"force_overwrite" : false,
"invalid_to_zero" : false,
"eventsets": [
{
"events": {
"FIXC1": "ACTUAL_CPU_CLOCK",
"FIXC2": "MAX_CPU_CLOCK",
"PMC0": "RETIRED_INSTRUCTIONS",
"PMC1": "CPU_CLOCKS_UNHALTED",
"PMC2": "RETIRED_SSE_AVX_FLOPS_ALL",
"PMC3": "MERGE",
"DFC0": "DRAM_CHANNEL_0",
"DFC1": "DRAM_CHANNEL_1",
"DFC2": "DRAM_CHANNEL_2",
"DFC3": "DRAM_CHANNEL_3"
},
"metrics": [
{
"name": "ipc",
"calc": "PMC0/PMC1",
"type": "hwthread",
"publish": true
},
{
"name": "flops_any",
"calc": "0.000001*PMC2/time",
"unit": "MFlops/s",
"type": "hwthread",
"publish": true
},
{
"name": "clock",
"calc": "0.000001*(FIXC1/FIXC2)/inverseClock",
"type": "hwthread",
"unit": "MHz",
"publish": true
},
{
"name": "mem1",
"calc": "0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time",
"unit": "Mbyte/s",
"type": "socket",
"publish": false
}
]
},
{
"events": {
"DFC0": "DRAM_CHANNEL_4",
"DFC1": "DRAM_CHANNEL_5",
"DFC2": "DRAM_CHANNEL_6",
"DFC3": "DRAM_CHANNEL_7",
"PWR0": "RAPL_CORE_ENERGY",
"PWR1": "RAPL_PKG_ENERGY"
},
"metrics": [
{
"name": "pwr_core",
"calc": "PWR0/time",
"unit": "Watt",
"type": "socket",
"publish": true
},
{
"name": "pwr_pkg",
"calc": "PWR1/time",
"type": "socket",
"unit": "Watt",
"publish": true
},
{
"name": "mem2",
"calc": "0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time",
"unit": "Mbyte/s",
"type": "socket",
"publish": false
}
]
}
],
"globalmetrics": [
{
"name": "mem_bw",
"calc": "mem1+mem2",
"type": "socket",
"unit": "Mbyte/s",
"publish": true
}
]
}
How to get the eventsets and metrics from LIKWID
The likwid collector reads hardware performance counters at a hwthread and socket level. The configuration looks quite complicated but it is basically copy and paste from LIKWID’s performance groups. The collector went through multiple iterations that tried to use the performance groups directly, but this lacked flexibility. The current way of configuration provides the most flexibility.
The logic is as follows: There are multiple eventsets, each consisting of a list of counters+events and a list of metrics. If you compare a common performance group with the example setting above, there is not much difference:
EVENTSET -> "events": {
FIXC1 ACTUAL_CPU_CLOCK -> "FIXC1": "ACTUAL_CPU_CLOCK",
FIXC2 MAX_CPU_CLOCK -> "FIXC2": "MAX_CPU_CLOCK",
PMC0 RETIRED_INSTRUCTIONS -> "PMC0" : "RETIRED_INSTRUCTIONS",
PMC1 CPU_CLOCKS_UNHALTED -> "PMC1" : "CPU_CLOCKS_UNHALTED",
PMC2 RETIRED_SSE_AVX_FLOPS_ALL -> "PMC2": "RETIRED_SSE_AVX_FLOPS_ALL",
PMC3 MERGE -> "PMC3": "MERGE",
-> }
The metrics are following the same procedure:
METRICS -> "metrics": [
IPC PMC0/PMC1 -> {
-> "name" : "IPC",
-> "calc" : "PMC0/PMC1",
-> "type": "hwthread",
-> "publish": true
-> }
-> ]
The script scripts/likwid_perfgroup_to_cc_config.py
might help you.
7.4.1.13 - loadavg collector
loadavg
collector
"loadavg": {
"exclude_metrics": [
"proc_run"
]
}
The loadavg collector reads data from /proc/loadavg and outputs a handful of node metrics. If a metric is not required, it can be excluded from forwarding it to the sink.
Metrics:
load_one
load_five
load_fifteen
proc_run
proc_total
7.4.1.14 - lustrestat collector
lustrestat
collector
"lustrestat": {
"lctl_command": "/path/to/lctl",
"exclude_metrics": [
"setattr",
"getattr"
],
"send_abs_values" : true,
"send_derived_values" : true,
"send_diff_values": true,
"use_sudo": false
}
The lustrestat collector uses the lctl application with the get_param option to get all llite metrics (Lustre client). The llite metrics are only available for root users. If password-less sudo is configured, you can enable sudo in the configuration (use_sudo).
Metrics:
- lustre_read_bytes (unit bytes)
- lustre_read_requests (unit requests)
- lustre_write_bytes (unit bytes)
- lustre_write_requests (unit requests)
- lustre_open
- lustre_close
- lustre_getattr
- lustre_setattr
- lustre_statfs
- lustre_inode_permission
- lustre_read_bw (if send_derived_values == true, unit bytes/sec)
- lustre_write_bw (if send_derived_values == true, unit bytes/sec)
- lustre_read_requests_rate (if send_derived_values == true, unit requests/sec)
- lustre_write_requests_rate (if send_derived_values == true, unit requests/sec)
- lustre_read_bytes_diff (if send_diff_values == true, unit bytes)
- lustre_read_requests_diff (if send_diff_values == true, unit requests)
- lustre_write_bytes_diff (if send_diff_values == true, unit bytes)
- lustre_write_requests_diff (if send_diff_values == true, unit requests)
- lustre_open_diff (if send_diff_values == true)
- lustre_close_diff (if send_diff_values == true)
- lustre_getattr_diff (if send_diff_values == true)
- lustre_setattr_diff (if send_diff_values == true)
- lustre_statfs_diff (if send_diff_values == true)
- lustre_inode_permission_diff (if send_diff_values == true)
This collector adds a device tag.
7.4.1.15 - memstat collector
memstat
collector
"memstat": {
"exclude_metrics": [
"mem_used"
]
}
The memstat collector reads data from /proc/meminfo and outputs a handful of node metrics. If a metric is not required, it can be excluded from forwarding it to the sink.
Metrics:
- mem_total
- mem_sreclaimable
- mem_slab
- mem_free
- mem_buffers
- mem_cached
- mem_available
- mem_shared
- swap_total
- swap_free
- mem_used = mem_total - (mem_free + mem_buffers + mem_cached)
7.4.1.16 - netstat collector
netstat
collector
"netstat": {
"include_devices": [
"eth0"
],
"send_abs_values" : true,
"send_derived_values" : true
}
The netstat collector reads data from /proc/net/dev and outputs a handful of node metrics. With the include_devices list you can specify which network devices should be measured. Note: Most other collectors use an exclude list instead of an include list.
Metrics:
- net_bytes_in (unit=bytes)
- net_bytes_out (unit=bytes)
- net_pkts_in (unit=packets)
- net_pkts_out (unit=packets)
- net_bytes_in_bw (unit=bytes/sec if send_derived_values == true)
- net_bytes_out_bw (unit=bytes/sec if send_derived_values == true)
- net_pkts_in_bw (unit=packets/sec if send_derived_values == true)
- net_pkts_out_bw (unit=packets/sec if send_derived_values == true)
The device name is added as tag stype=network,stype-id=<device>.
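The derived *_bw metrics are rates in units per second. The text does not spell out the formula, so the following sketch assumes the common approach of dividing the difference between two consecutive counter readings by the elapsed time. The function rate is a hypothetical helper for this example, and skipping a sample when the counter goes backwards (for example after a device reset) is a design choice of the sketch, not documented collector behavior.

```go
package main

import (
	"fmt"
	"time"
)

// rate computes (cur - prev) / elapsed in units per second.
// It reports false if no rate can be derived, i.e. for a non-positive
// elapsed time or a counter that went backwards.
func rate(prev, cur uint64, elapsed time.Duration) (float64, bool) {
	if elapsed <= 0 || cur < prev {
		return 0, false
	}
	return float64(cur-prev) / elapsed.Seconds(), true
}

func main() {
	// Two readings of a byte counter taken 10 seconds apart
	bw, ok := rate(1000000, 3500000, 10*time.Second)
	fmt.Println(bw, ok)
}
```

With this approach the first reading after start cannot produce a rate, because there is no previous value to compare against.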
7.4.1.17 - nfs3stat collector
nfs3stat
collector
"nfs3stat": {
"nfsstat" : "/path/to/nfsstat",
"exclude_metrics": [
"nfs3_total"
]
}
The nfs3stat collector reads data from the nfsstat command and outputs a handful of node metrics. If a metric is not required, it can be excluded from forwarding it to the sink. There is currently no possibility to get the metrics per mount point.
Metrics:
nfs3_total
nfs3_null
nfs3_getattr
nfs3_setattr
nfs3_lookup
nfs3_access
nfs3_readlink
nfs3_read
nfs3_write
nfs3_create
nfs3_mkdir
nfs3_symlink
nfs3_remove
nfs3_rmdir
nfs3_rename
nfs3_link
nfs3_readdir
nfs3_readdirplus
nfs3_fsstat
nfs3_fsinfo
nfs3_pathconf
nfs3_commit
7.4.1.18 - nfs4stat collector
nfs4stat
collector
"nfs4stat": {
"nfsstat" : "/path/to/nfsstat",
"exclude_metrics": [
"nfs4_total"
]
}
The nfs4stat collector reads data from the nfsstat command and outputs a handful of node metrics. If a metric is not required, it can be excluded from forwarding it to the sink. There is currently no possibility to get the metrics per mount point.
Metrics:
nfs4_total
nfs4_null
nfs4_read
nfs4_write
nfs4_commit
nfs4_open
nfs4_open_conf
nfs4_open_noat
nfs4_open_dgrd
nfs4_close
nfs4_setattr
nfs4_fsinfo
nfs4_renew
nfs4_setclntid
nfs4_confirm
nfs4_lock
nfs4_lockt
nfs4_locku
nfs4_access
nfs4_getattr
nfs4_lookup
nfs4_lookup_root
nfs4_remove
nfs4_rename
nfs4_link
nfs4_symlink
nfs4_create
nfs4_pathconf
nfs4_statfs
nfs4_readlink
nfs4_readdir
nfs4_server_caps
nfs4_delegreturn
nfs4_getacl
nfs4_setacl
nfs4_rel_lkowner
nfs4_exchange_id
nfs4_create_session
nfs4_destroy_session
nfs4_sequence
nfs4_get_lease_time
nfs4_reclaim_comp
nfs4_secinfo_no
nfs4_bind_conn_to_ses
7.4.1.19 - nfsiostat collector
nfsiostat
collector
"nfsiostat": {
"exclude_metrics": [
"nfsio_oread"
],
"exclude_filesystems" : [
"/mnt"
],
"use_server_as_stype": false
}
The nfsiostat
collector reads data from /proc/self/mountstats
and outputs a handful node metrics for each NFS filesystem. If a metric or filesystem is not required, it can be excluded from forwarding it to the sink.
Metrics:
- nfsio_nread: Bytes transferred by normal read() calls
- nfsio_nwrite: Bytes transferred by normal write() calls
- nfsio_oread: Bytes transferred by read() calls with O_DIRECT
- nfsio_owrite: Bytes transferred by write() calls with O_DIRECT
- nfsio_pageread: Pages transferred by read() calls
- nfsio_pagewrite: Pages transferred by write() calls
- nfsio_nfsread: Bytes transferred for reading from the server
- nfsio_nfswrite: Pages transferred by writing to the server
The nfsiostat
collector adds the mountpoint to the tags as stype=filesystem,stype-id=<mountpoint>
. If the server address should be used instead of the mountpoint, use the use_server_as_stype
config setting.
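The counters listed above come from the bytes: line that /proc/self/mountstats prints for each NFS mount. The following Go sketch shows one way to turn such a line into named values. The function name parseBytesLine and the mapping of the eight columns to metric names are assumptions made for illustration, not the collector's actual code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Assumed order of the eight counters on a "bytes:" line,
// mapped to the metric names used in this section.
var nfsioNames = []string{
	"nfsio_nread", "nfsio_nwrite", "nfsio_oread", "nfsio_owrite",
	"nfsio_nfsread", "nfsio_nfswrite", "nfsio_pageread", "nfsio_pagewrite",
}

// parseBytesLine converts one "bytes:" line into a map of counters.
func parseBytesLine(line string) (map[string]uint64, error) {
	fields := strings.Fields(line)
	if len(fields) != len(nfsioNames)+1 || fields[0] != "bytes:" {
		return nil, fmt.Errorf("unexpected line: %q", line)
	}
	out := make(map[string]uint64, len(nfsioNames))
	for i, name := range nfsioNames {
		v, err := strconv.ParseUint(fields[i+1], 10, 64)
		if err != nil {
			return nil, err
		}
		out[name] = v
	}
	return out, nil
}

func main() {
	m, err := parseBytesLine("\tbytes:\t100 200 0 0 150 250 3 4")
	if err != nil {
		panic(err)
	}
	fmt.Println(m["nfsio_nread"], m["nfsio_nfswrite"])
}
```

A real reader would additionally track which mount point (or server) the line belongs to in order to set the stype-id tag.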
7.4.1.20 - numastat collector
numastat
collector
"numastats": {}
The numastat
collector reads data from /sys/devices/system/node/node*/numastat
and outputs a handful of memoryDomain metrics. See: https://www.kernel.org/doc/html/latest/admin-guide/numastat.html
Metrics:
- numastats_numa_hit: A process wanted to allocate memory from this node, and succeeded.
- numastats_numa_miss: A process wanted to allocate memory from another node, but ended up with memory from this node.
- numastats_numa_foreign: A process wanted to allocate on this node, but ended up with memory from another node.
- numastats_local_node: A process ran on this node’s CPU, and got memory from this node.
- numastats_other_node: A process ran on a different node’s CPU, and got memory from this node.
- numastats_interleave_hit: Interleaving wanted to allocate from this node and succeeded.
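Each numastat file consists of lines of the form "key value". As a minimal sketch, assuming the metric name is simply the key prefixed with numastats_, the file content could be parsed like this (parseNumastat is a hypothetical helper, not part of the collector):

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseNumastat reads "<key> <value>" lines and prefixes each key
// with "numastats_" to form the metric name.
func parseNumastat(content string) map[string]int64 {
	metrics := make(map[string]int64)
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		f := strings.Fields(sc.Text())
		if len(f) != 2 {
			continue // skip lines that are not "key value"
		}
		v, err := strconv.ParseInt(f[1], 10, 64)
		if err != nil {
			continue // skip non-numeric values
		}
		metrics["numastats_"+f[0]] = v
	}
	return metrics
}

func main() {
	sample := "numa_hit 1200\nnuma_miss 3\nlocal_node 1190\n"
	fmt.Println(parseNumastat(sample)["numastats_numa_hit"])
}
```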
7.4.1.21 - nvidia collector
nvidia
collector
"nvidia": {
"exclude_devices": [
"0","1", "00000000:ff:01.0"
],
"exclude_metrics": [
"nv_fb_mem_used",
"nv_fan"
],
"process_mig_devices": false,
"use_pci_info_as_type_id": true,
"add_pci_info_tag": false,
"add_uuid_meta": false,
"add_board_number_meta": false,
"add_serial_meta": false,
"use_uuid_for_mig_device": false,
"use_slice_for_mig_device": false
}
The nvidia
collector can be configured to leave out specific devices with the exclude_devices
option. It takes IDs as supplied to the NVML with nvmlDeviceGetHandleByIndex()
or the PCI address in NVML format (%08X:%02X:%02X.0
). Metrics (listed below) that should not be sent to the MetricRouter can be excluded with the exclude_metrics
option. Commonly only the physical GPUs are monitored. If MIG devices should be analyzed as well, set process_mig_devices
(adds stype=mig,stype-id=<mig_index>
). With the options use_uuid_for_mig_device
and use_slice_for_mig_device
, the <mig_index>
can be replaced with the UUID (e.g. MIG-6a9f7cc8-6d5b-5ce0-92de-750edc4d8849
) or the MIG slice name (e.g. 1g.5gb
).
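To see what a PCI address in the %08X:%02X:%02X.0 format looks like, the following Go sketch formats a domain, bus and device number with exactly that format string. The helper pciID is hypothetical and only illustrates the notation.

```go
package main

import "fmt"

// pciID formats a PCI address with the format string "%08X:%02X:%02X.0":
// 8 hex digits for the domain, 2 for the bus and 2 for the device.
func pciID(domain, bus, device uint32) string {
	return fmt.Sprintf("%08X:%02X:%02X.0", domain, bus, device)
}

func main() {
	fmt.Println(pciID(0, 0xff, 1)) // prints 00000000:FF:01.0
}
```

Note that the %X verb yields uppercase hex digits. Whether exclude_devices entries are compared case-insensitively is not stated in this section.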
The metrics sent by the nvidia
collector use accelerator
as type
tag. For the type-id
, it uses the device handle index by default. With the use_pci_info_as_type_id
option, the PCI ID is used instead. If both values should be added as tags, activate the add_pci_info_tag
option. It uses the device handle index as type-id
and adds the PCI ID as separate pci_identifier
tag.
Optionally, the UUID, the board part number and the serial number can be added to the meta information. Meta information is not sent to the sinks unless configured otherwise.
Metrics:
nv_util
nv_mem_util
nv_fb_mem_total
nv_fb_mem_used
nv_bar1_mem_total
nv_bar1_mem_used
nv_temp
nv_fan
nv_ecc_mode
nv_perf_state
nv_power_usage
nv_graphics_clock
nv_sm_clock
nv_mem_clock
nv_video_clock
nv_max_graphics_clock
nv_max_sm_clock
nv_max_mem_clock
nv_max_video_clock
nv_ecc_uncorrected_error
nv_ecc_corrected_error
nv_power_max_limit
nv_encoder_util
nv_decoder_util
nv_remapped_rows_corrected
nv_remapped_rows_uncorrected
nv_remapped_rows_pending
nv_remapped_rows_failure
nv_compute_processes
nv_graphics_processes
nv_violation_power
nv_violation_thermal
nv_violation_sync_boost
nv_violation_board_limit
nv_violation_low_util
nv_violation_reliability
nv_violation_below_app_clock
nv_violation_below_base_clock
nv_nvlink_crc_flit_errors
nv_nvlink_crc_errors
nv_nvlink_ecc_errors
nv_nvlink_replay_errors
nv_nvlink_recovery_errors
Some metrics add an additional sub type tag (stype). For example, the nv_nvlink_* metrics set stype=nvlink,stype-id=<link_number>.
7.4.1.22 - rapl collector
rapl
collector
This collector reads running average power limit (RAPL) monitoring attributes to compute average power consumption metrics. See https://www.kernel.org/doc/html/latest/power/powercap/powercap.html#monitoring-attributes.
The Likwid metric collector provides similar functionality.
"rapl": {
"exclude_device_by_id": ["0:1", "0:2"],
"exclude_device_by_name": ["psys"]
}
Metrics:
rapl_average_power
: average power consumption in Watt. The average is computed over the time between the previous and the current measurement.
7.4.1.23 - rocm_smi collector
rocm_smi
collector
"rocm_smi": {
"exclude_devices": [
"0","1", "00000000:ff:01.0"
],
"exclude_metrics": [
"rocm_mm_util",
"rocm_temp_vrsoc"
],
"use_pci_info_as_type_id": true,
"add_pci_info_tag": false,
"add_serial_meta": false
}
The rocm_smi
collector can be configured to leave out specific devices with the exclude_devices
option. It takes logical IDs in the list of available devices or the PCI address similar to NVML format (%08X:%02X:%02X.0
). Metrics (listed below) that should not be sent to the MetricRouter can be excluded with the exclude_metrics
option.
The metrics sent by the rocm_smi
collector use accelerator
as type
tag. For the type-id
, it uses the device handle index by default. With the use_pci_info_as_type_id
option, the PCI ID is used instead. If both values should be added as tags, activate the add_pci_info_tag
option. It uses the device handle index as type-id
and adds the PCI ID as separate pci_identifier
tag.
Optionally, the serial number can be added to the meta information. Meta information is not sent to the sinks unless configured otherwise.
Metrics:
rocm_gfx_util
rocm_umc_util
rocm_mm_util
rocm_avg_power
rocm_temp_mem
rocm_temp_hotspot
rocm_temp_edge
rocm_temp_vrgfx
rocm_temp_vrsoc
rocm_temp_vrmem
rocm_gfx_clock
rocm_soc_clock
rocm_u_clock
rocm_v0_clock
rocm_v1_clock
rocm_d0_clock
rocm_d1_clock
rocm_temp_hbm
Some metrics add an additional sub type tag (stype). For example, the rocm_temp_hbm metrics set stype=device,stype-id=<HBM_slice_number>.
7.4.1.24 - schedstat collector
schedstat
collector
"schedstat": {
}
The schedstat
collector reads data from /proc/schedstat and calculates a load value per hwthread. This can help to detect bad CPU pinning on shared nodes.
Metric:
cpu_load_core
7.4.1.25 - self collector
self
collector
"self": {
"read_mem_stats" : true,
"read_goroutines" : true,
"read_cgo_calls" : true,
"read_rusage" : true
}
The self
collector reads data from Go's runtime and syscall packages and thereby monitors the execution of the cc-metric-collector itself.
Metrics:
- If read_mem_stats == true:
  - total_alloc: The metric reports cumulative bytes allocated for heap objects.
  - heap_alloc: The metric reports bytes of allocated heap objects.
  - heap_sys: The metric reports bytes of heap memory obtained from the OS.
  - heap_idle: The metric reports bytes in idle (unused) spans.
  - heap_inuse: The metric reports bytes in in-use spans.
  - heap_released: The metric reports bytes of physical memory returned to the OS.
  - heap_objects: The metric reports the number of allocated heap objects.
- If read_goroutines == true:
  - num_goroutines: The metric reports the number of goroutines that currently exist.
- If read_cgo_calls == true:
  - num_cgo_calls: The metric reports the number of cgo calls made by the current process.
- If read_rusage == true:
  - rusage_user_time: The metric reports the amount of time that this process has been scheduled in user mode.
  - rusage_system_time: The metric reports the amount of time that this process has been scheduled in kernel mode.
  - rusage_vol_ctx_switch: The metric reports the amount of voluntary context switches.
  - rusage_invol_ctx_switch: The metric reports the amount of involuntary context switches.
  - rusage_signals: The metric reports the number of signals received.
  - rusage_major_pgfaults: The metric reports the number of major faults the process has made which have required loading a memory page from disk.
  - rusage_minor_pgfaults: The metric reports the number of minor faults the process has made which have not required loading a memory page from disk.
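The values above map closely to what Go's standard library exposes. The following sketch, which assumes a Linux system and uses a hypothetical helper selfStats, gathers a subset of them with runtime.ReadMemStats, runtime.NumGoroutine, runtime.NumCgoCall and syscall.Getrusage. It is meant as an illustration of the data sources, not as the collector's code.

```go
package main

import (
	"fmt"
	"runtime"
	"syscall"
)

// selfStats gathers a few of the values listed above from the Go runtime
// and from getrusage(2) for the current process (Linux assumed).
func selfStats() (map[string]uint64, error) {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)

	var ru syscall.Rusage
	if err := syscall.Getrusage(syscall.RUSAGE_SELF, &ru); err != nil {
		return nil, err
	}
	return map[string]uint64{
		"total_alloc":             ms.TotalAlloc,
		"heap_alloc":              ms.HeapAlloc,
		"heap_objects":            ms.HeapObjects,
		"num_goroutines":          uint64(runtime.NumGoroutine()),
		"num_cgo_calls":           uint64(runtime.NumCgoCall()),
		"rusage_vol_ctx_switch":   uint64(ru.Nvcsw),
		"rusage_invol_ctx_switch": uint64(ru.Nivcsw),
		"rusage_major_pgfaults":   uint64(ru.Majflt),
		"rusage_minor_pgfaults":   uint64(ru.Minflt),
	}, nil
}

func main() {
	stats, err := selfStats()
	if err != nil {
		panic(err)
	}
	fmt.Println("goroutines:", stats["num_goroutines"])
}
```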
7.4.1.26 - tempstat collector
tempstat
collector
"tempstat": {
"tag_override" : {
"<device like hwmon1>" : {
"type" : "socket",
"type-id" : "0"
}
},
"exclude_metrics": [
"metric1",
"metric2"
]
}
The tempstat
collector reads the data from /sys/class/hwmon/<device>/tempX_{input,label}
Metrics:
- temp_*: The metric name is taken from the label files.
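The hwmon sysfs interface reports temperatures in tempX_input as millidegrees Celsius and a human-readable sensor name in tempX_label. A minimal sketch of the conversion follows. The name normalization in metricName (lower case, spaces replaced by underscores) is an assumption chosen so that a label like "Package id 0" yields temp_package_id_0; the collector's exact rule may differ.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// metricName derives a metric name from the content of a tempX_label file.
// The normalization used here is an assumption for illustration.
func metricName(label string) string {
	l := strings.ToLower(strings.TrimSpace(label))
	return "temp_" + strings.ReplaceAll(l, " ", "_")
}

// degC converts the content of a tempX_input file (millidegree Celsius)
// to degrees Celsius.
func degC(input string) (float64, error) {
	milli, err := strconv.ParseInt(strings.TrimSpace(input), 10, 64)
	if err != nil {
		return 0, err
	}
	return float64(milli) / 1000.0, nil
}

func main() {
	v, err := degC("45000\n")
	if err != nil {
		panic(err)
	}
	fmt.Println(metricName("Package id 0\n"), v)
}
```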
7.4.1.27 - topprocs collector
topprocs
collector
"topprocs": {
"num_procs": 5
}
The topprocs
collector reads the TopX processes (sorted by CPU utilization, ps -Ao comm --sort=-pcpu
).
In contrast to most other collectors, the metric value is a string
.
7.4.2 - cc-metric-collector's message processor
Message Processor Component
Multiple parts of the ClusterCockpit ecosystem require the processing of CCMessages.
The main CC application using it is cc-metric-collector. There, the processing was originally part of the metric router, the central hub connecting collectors (reading local data), receivers (receiving remote data) and sinks (sending data). Already in early stages, the lack of flexibility caused some trouble:
The sysadmins wanted to keep operating their Ganglia-based monitoring infrastructure while we developed the CC stack. Ganglia expects the core metrics with a specific name and resolution (the right unit prefix), but the CC stack had no way to convert the data, and the CC frontend developers wanted a different resolution for some metrics. The issue was basically the mem_used metric showing the currently used memory of the node. Ganglia wants it in kByte, as provided by the Linux operating system, but CC wanted it in GByte.
With the message processor, the Ganglia sinks can apply the unit prefix changes individually and name the metrics as required by Ganglia.
For developers
Whenever you receive or are about to send a message out, you should provide some processing.
Configuration of component
New operations can be added to the message processor at runtime. Of course, they can also be removed again. For the initial setup, the operations are usually given in a configuration file or in some fields of a configuration file.
The message processor uses the following configuration:
{
"drop_messages": [
"name_of_message_to_drop"
],
"drop_messages_if": [
"condition_when_to_drop_message",
"name == 'drop_this'",
"tag.hostname == 'this_host'",
"meta.unit != 'MB'"
],
"rename_messages" : {
"old_message_name" : "new_message_name"
},
"rename_messages_if": {
"condition_when_to_rename_message" : "new_name"
},
"add_tags_if": [
{
"if" : "condition_when_to_add_tag",
"key": "name_for_new_tag",
"value": "new_tag_value"
}
],
"delete_tags_if": [
{
"if" : "condition_when_to_delete_tag",
"key": "name_of_tag"
}
],
"add_meta_if": [
{
"if" : "condition_when_to_add_meta_info",
"key": "name_for_new_meta_info",
"value": "new_meta_info_value"
}
],
"delete_meta_if": [
{
"if" : "condition_when_to_delete_meta_info",
"key": "name_of_meta_info"
}
],
"add_field_if": [
{
"if" : "condition_when_to_add_field",
"key": "name_for_new_field",
"value": "new_field_value_but_only_string_at_the_moment"
}
],
"delete_field_if": [
{
"if" : "condition_when_to_delete_field",
"key": "name_of_field"
}
],
"move_tag_to_meta_if": [
{
"if" : "condition_when_to_move_tag_to_meta_info_including_its_value",
"key": "name_of_tag",
"value": "name_of_meta_info"
}
],
"move_tag_to_field_if": [
{
"if" : "condition_when_to_move_tag_to_fields_including_its_value",
"key": "name_of_tag",
"value": "name_of_field"
}
],
"move_meta_to_tag_if": [
{
"if" : "condition_when_to_move_meta_info_to_tags_including_its_value",
"key": "name_of_meta_info",
"value": "name_of_tag"
}
],
"move_meta_to_field_if": [
{
"if" : "condition_when_to_move_meta_info_to_fields_including_its_value",
"key": "name_of_meta_info",
"value": "name_of_field"
}
],
"move_field_to_tag_if": [
{
"if" : "condition_when_to_move_field_to_tags_including_its_stringified_value",
"key": "name_of_field",
"value": "name_of_tag"
}
],
"move_field_to_meta_if": [
{
"if" : "condition_when_to_move_field_to_meta_info_including_its_stringified_value",
"key": "name_of_field",
"value": "name_of_meta_info"
}
],
"drop_by_message_type": [
"metric",
"event",
"log",
"control"
],
"change_unit_prefix": {
"name == 'metric_with_wrong_unit_prefix'" : "G",
"only_if_messagetype == 'metric'": "T"
},
"normalize_units": true,
"add_base_env": {
"MY_CONSTANT_FOR_CUSTOM_CONDITIONS": 1.0,
"output_value_for_test_metrics": 42.0
},
"stage_order": [
"rename_messages_if",
"drop_messages"
]
}
The options change_unit_prefix
and normalize_units
are only applied to CCMetrics. It is not possible to delete the field related to each message type as defined in cc-specification. In short:
- CCMetrics always have to have a field named
value
- CCEvents always have to have a field named
event
- CCLogs always have to have a field named
log
- CCControl messages always have to have a field named
control
With add_base_env
, one can specify mykey=myvalue pairs that can be used in conditions like tag.type == mykey
.
The order in which each message is processed can be specified with the stage_order
option. The stage names are the keys in the JSON configuration, thus change_unit_prefix
, move_field_to_meta_if
, etc. Stages can be listed multiple times.
Using the component
In order to load the configuration from a json.RawMessage
:
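mp, err := NewMessageProcessor()
if err != nil {
// without a message processor there is nothing to configure; stop here
log.Error("failed to create new message processor")
}
// FromConfigJSON returns an error if the configuration cannot be applied
if err := mp.FromConfigJSON(configJson); err != nil {
log.Error("failed to load message processor configuration")
}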
After initialization and adding the different operations, the ProcessMessage()
function applies all operations and returns the processed message, or nil if the message should be dropped.
m := lp.CCMetric{}
x, err := mp.ProcessMessage(m)
if err != nil {
// handle error
}
if x != nil {
// process x further
} else {
// this message got dropped
}
Single operations can be added and removed at runtime
type MessageProcessor interface {
// Functions to set the execution order of the processing stages
SetStages([]string) error
DefaultStages() []string
// Function to add variables to the base evaluation environment
AddBaseEnv(env map[string]interface{}) error
// Functions to add and remove rules
AddDropMessagesByName(name string) error
RemoveDropMessagesByName(name string)
AddDropMessagesByCondition(condition string) error
RemoveDropMessagesByCondition(condition string)
AddRenameMetricByCondition(condition string, name string) error
RemoveRenameMetricByCondition(condition string)
AddRenameMetricByName(from, to string) error
RemoveRenameMetricByName(from string)
SetNormalizeUnits(settings bool)
AddChangeUnitPrefix(condition string, prefix string) error
RemoveChangeUnitPrefix(condition string)
AddAddTagsByCondition(condition, key, value string) error
RemoveAddTagsByCondition(condition string)
AddDeleteTagsByCondition(condition, key, value string) error
RemoveDeleteTagsByCondition(condition string)
AddAddMetaByCondition(condition, key, value string) error
RemoveAddMetaByCondition(condition string)
AddDeleteMetaByCondition(condition, key, value string) error
RemoveDeleteMetaByCondition(condition string)
AddMoveTagToMeta(condition, key, value string) error
RemoveMoveTagToMeta(condition string)
AddMoveTagToFields(condition, key, value string) error
RemoveMoveTagToFields(condition string)
AddMoveMetaToTags(condition, key, value string) error
RemoveMoveMetaToTags(condition string)
AddMoveMetaToFields(condition, key, value string) error
RemoveMoveMetaToFields(condition string)
AddMoveFieldToTags(condition, key, value string) error
RemoveMoveFieldToTags(condition string)
AddMoveFieldToMeta(condition, key, value string) error
RemoveMoveFieldToMeta(condition string)
// Read in a JSON configuration
FromConfigJSON(config json.RawMessage) error
// Processing functions for legacy CCMetric and current CCMessage
ProcessMessage(m lp2.CCMessage) (lp2.CCMessage, error)
ProcessMetric(m lp.CCMetric) (lp2.CCMessage, error)
}
Syntax for evaluatable terms
The message processor uses gval
for evaluating the terms. It provides a basic set of operators like string comparison and arithmetic operations.
Accessible for operations are:
- name of the message
- timestamp or time of the message
- type, type-id of the message (also tag_type, tag_type-id and tag_typeid)
- stype, stype-id of the message (if the message has these tags, also tag_stype, tag_stype-id and tag_stypeid)
- value for a CCMetric message (also field_value)
- event for a CCEvent message (also field_event)
- control for a CCControl message (also field_control)
- log for a CCLog message (also field_log)
- messagetype or msgtype. Possible values: event, metric, log and control.
Generally, all tags are accessible with tag_<tagkey>
, tags_<tagkey>
or tags.<tagkey>
. Similarly for all fields with field[s]?[_.]<fieldkey>
. For meta information meta[_.]<metakey>
(there is no metas[_.]<metakey>
).
The syntax of expr
is accepted with some additions:
- Comparing strings: ==, !=, str matches regex (use % instead of \!)
- Combining conditions: &&, ||
- Comparing numbers: ==, !=, <, >, <=, >=
- Test lists: <value> in <list>
- Topological tests: tag_type-id in getCpuListOfType("socket", "1") (test if the metric belongs to socket 1 in local node topology)
Often the operations are written in JSON files for loading them at startup. In JSON, some characters are not allowed. Therefore, the term syntax reflects that:
- use '' instead of "" for strings
- for the regexes, use % instead of \
For operations that should be applied on all messages, use the condition true
.
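The % replacement for regular expressions can be illustrated with a few lines of Go. The sketch assumes that every % in a pattern stands for a backslash and is translated back before the pattern is compiled. The helper matches is hypothetical and independent of the message processor's real evaluation code.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// matches compiles a pattern written with '%' in place of '\'
// and tests whether s matches it.
func matches(pattern, s string) (bool, error) {
	re, err := regexp.Compile(strings.ReplaceAll(pattern, "%", `\`))
	if err != nil {
		return false, err
	}
	return re.MatchString(s), nil
}

func main() {
	// "%d+" becomes `\d+`, i.e. one or more digits
	ok, err := matches("^temp_core_%d+$", "temp_core_12")
	fmt.Println(ok, err)
}
```

Writing the pattern with % keeps the JSON file free of backslash escapes, which would otherwise have to be doubled.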
Overhead
The operations taking conditions are pre-processed, which is commonly the time-consuming part. Still, with each added operation, the time to process a message increases. Moreover, the processing creates a copy of the message.
7.4.3 - cc-metric-collector's receivers
CCMetric receivers
This folder contains the ReceiveManager and receiver implementations for the cc-metric-collector.
Configuration
The configuration file for the receivers is a list of configurations. The type
field in each specifies which receiver to initialize.
{
"myreceivername" : {
"type": "receiver-type",
<receiver-specific configuration>
}
}
This allows specifying several receivers, each under its own name, including multiple receivers of the same type.
Available receivers
- nats: Receive metrics from the NATS network
- prometheus: Scrape data from a Prometheus client
- http: Listen for HTTP POST requests transporting metrics in InfluxDB line protocol
- ipmi: Read IPMI sensor readings
- redfish: Use the Redfish (specification) to query thermal and power metrics
Contributing own receivers
A receiver contains a few functions and is derived from the type Receiver
(in metricReceiver.go
).
For an example, check the sample receiver.
7.4.3.1 - http receiver
http
receiver
The http
receiver can be used to receive metrics through HTTP POST requests.
Configuration structure
{
"<name>": {
"type": "http",
"address" : "",
"port" : "8080",
"path" : "/write",
"idle_timeout": "120s",
"username": "myUser",
"password": "myPW"
}
}
- type: makes the receiver an http receiver
- address: Listen address
- port: Listen port
- path: URL path for the write endpoint
- idle_timeout: Maximum amount of time to wait for the next request when keep-alives are enabled. It should be larger than the measurement interval to keep the connection open.
- keep_alives_enabled: Controls whether HTTP keep-alives are enabled. By default, keep-alives are enabled.
- username: username for basic authentication
- password: password for basic authentication
The HTTP endpoint listens to http://<address>:<port>/<path>.
Debugging
- Install curl
- Use curl to send message to http receiver
curl http://localhost:8080/write \
  --user "myUser:myPW" \
  --data \
  "myMetric,hostname=myHost,type=hwthread,type-id=0,unit=Hz value=400000i 1694777161164284635
myMetric,hostname=myHost,type=hwthread,type-id=1,unit=Hz value=400001i 1694777161164284635"
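The same request as in the curl call can be sent from a Go program. The sketch below posts InfluxDB line protocol data with HTTP basic authentication. To stay self-contained it talks to a stand-in test server (fakeReceiver) instead of a running http receiver; both helper names are hypothetical, and a real setup would use the configured address, port and path instead.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// sendMetrics posts InfluxDB line protocol data with HTTP basic
// authentication and returns the HTTP status code.
func sendMetrics(url, user, password, payload string) (int, error) {
	req, err := http.NewRequest(http.MethodPost, url, strings.NewReader(payload))
	if err != nil {
		return 0, err
	}
	req.SetBasicAuth(user, password)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// fakeReceiver is a stand-in for the http receiver, used only in this
// sketch. It checks the credentials from the example configuration.
func fakeReceiver() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		u, p, ok := r.BasicAuth()
		if !ok || u != "myUser" || p != "myPW" {
			w.WriteHeader(http.StatusUnauthorized)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
}

func main() {
	srv := fakeReceiver()
	defer srv.Close()
	payload := "myMetric,hostname=myHost,type=hwthread,type-id=0,unit=Hz value=400000i 1694777161164284635\n"
	code, err := sendMetrics(srv.URL+"/write", "myUser", "myPW", payload)
	fmt.Println(code, err)
}
```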
7.4.3.2 - IPMI Receiver
IPMI Receiver
The IPMI Receiver uses ipmi-sensors
from the FreeIPMI project to read IPMI sensor readings and sensor data repository (SDR) information. The available metrics depend on the sensors provided by the hardware vendor but typically contain temperature, fan speed, voltage and power metrics.
Configuration structure
{
"<IPMI receiver name>": {
"type": "ipmi",
"interval": "30s",
"fanout": 256,
"username": "<Username>",
"password": "<Password>",
"endpoint": "ipmi-sensors://%h-bmc",
"exclude_metrics": [ "fan_speed", "voltage" ],
"client_config": [
{
"host_list": "n[1,2-4]"
},
{
"host_list": "n[5-6]",
"driver_type": "LAN",
"cli_options": [ "--workaround-flags=..." ],
"password": "<Password 2>"
}
]
}
}
Global settings:
- interval: How often the IPMI sensor metrics should be read and sent to the sink (default: 30 s)
Global and per IPMI device settings (per IPMI device settings overwrite the global settings):
- exclude_metrics: list of excluded metrics, e.g. fan_speed, power, temperature, utilization, voltage
- fanout: Maximum number of simultaneous IPMI connections (default: 64)
- driver_type: Out of band IPMI driver (default: LAN_2_0)
- username: User name to authenticate with
- password: Password to use for authentication
- endpoint: URL of the IPMI device (placeholder %h gets replaced by the hostname)
Per IPMI device settings:
- host_list: List of hosts with the same client configuration
- cli_options: Additional command line options for ipmi-sensors
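How host_list and the %h placeholder in endpoint work together can be sketched in Go. The expansion below handles only a single bracket group with plain numbers, such as n[1,2-4]. The helpers expandHostList and endpointFor are hypothetical; the receiver itself may support a richer host list syntax.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// expandHostList expands a simple expression such as "n[1,2-4]".
// Only one bracket group with plain numbers is handled in this sketch.
func expandHostList(expr string) ([]string, error) {
	open := strings.Index(expr, "[")
	if open < 0 {
		return []string{expr}, nil // single host, nothing to expand
	}
	if !strings.HasSuffix(expr, "]") {
		return nil, fmt.Errorf("missing closing bracket in %q", expr)
	}
	prefix := expr[:open]
	var hosts []string
	for _, part := range strings.Split(expr[open+1:len(expr)-1], ",") {
		first, last, isRange := strings.Cut(part, "-")
		if !isRange {
			last = first
		}
		a, err := strconv.Atoi(first)
		if err != nil {
			return nil, err
		}
		b, err := strconv.Atoi(last)
		if err != nil {
			return nil, err
		}
		for i := a; i <= b; i++ {
			hosts = append(hosts, prefix+strconv.Itoa(i))
		}
	}
	return hosts, nil
}

// endpointFor replaces the %h placeholder with the host name.
func endpointFor(template, host string) string {
	return strings.ReplaceAll(template, "%h", host)
}

func main() {
	hosts, err := expandHostList("n[1,2-4]")
	if err != nil {
		panic(err)
	}
	for _, h := range hosts {
		fmt.Println(endpointFor("ipmi-sensors://%h-bmc", h))
	}
}
```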
7.4.3.3 - nats receiver
nats
receiver
The nats
receiver can be used to receive metrics from the NATS network. It connects to the NATS server at address and port, subscribes to the configured subject and receives metrics in the InfluxDB line protocol.
Configuration structure
{
"<name>": {
"type": "nats",
"address" : "nats-server.example.org",
"port" : "4222",
"subject" : "subject",
"user": "natsuser",
"password": "natssecret",
"nkey_file": "/path/to/nkey_file"
}
}
- type: makes the receiver a nats receiver
- address: Address of the NATS control server
- port: Port of the NATS control server
- subject: Subscribes to this subject and receive metrics
- user: Connect to nats using this user
- password: Connect to nats using this password
- nkey_file: Path to credentials file with NKEY
Debugging
- Install NATS server and command line client
- Start NATS server
nats-server --net nats-server.example.org --port 4222
- Check NATS server works as expected
nats --server=nats-server.example.org:4222 server check
- Use NATS command line client to subscribe to all messages
nats --server=nats-server.example.org:4222 sub ">"
- Use NATS command line client to send message to NATS receiver
nats --server=nats-server.example.org:4222 pub subject \
  "myMetric,hostname=myHost,type=hwthread,type-id=0,unit=Hz value=400000i 1694777161164284635
myMetric,hostname=myHost,type=hwthread,type-id=1,unit=Hz value=400001i 1694777161164284635"
7.4.3.4 - prometheus receiver
prometheus
receiver
The prometheus
receiver can be used to scrape the metrics of a single prometheus
client. It does not use any official Golang library but makes simple HTTP GET requests and parses the response.
Configuration structure
{
"<name>": {
"type": "prometheus",
"address" : "testpromhost",
"port" : "12345",
"path" : "/prometheus",
"interval": "5s",
"ssl" : true
}
}
- type: makes the receiver a prometheus receiver
- address: Hostname or IP of the Prometheus agent
- port: Port of Prometheus agent
- path: Path to the Prometheus endpoint
- interval: Scrape the Prometheus endpoint in this interval (default ‘5s’)
- ssl: Use SSL or not
The receiver requests data from http(s)://<address>:<port>/<path>
.
7.4.3.5 - Redfish receiver
Redfish receiver
The Redfish receiver uses the Redfish (specification) to query thermal and power metrics. Thermal metrics may include various fan speeds and temperatures. Power metrics may include the current power consumption of various hardware components. It may also include the minimum, maximum and average power consumption of these components in a given time interval. The receiver will poll each configured redfish device once in a given interval. Multiple devices can be accessed in parallel to increase throughput.
Configuration structure
{
"<redfish receiver name>": {
"type": "redfish",
"username": "<Username>",
"password": "<Password>",
"endpoint": "https://%h-bmc",
"exclude_metrics": [ "min_consumed_watts" ],
"client_config": [
{
"host_list": "n[1,2-4]"
},
{
"host_list": "n5",
"disable_power_metrics": true,
"disable_processor_metrics": true,
"disable_thermal_metrics": true
},
{
"host_list": "n6",
"username": "<Username 2>",
"password": "<Password 2>",
"endpoint": "https://%h-BMC",
"disable_sensor_metrics": true
}
]
}
}
Global settings:
- fanout: Maximum number of simultaneous redfish connections (default: 64)
- interval: How often the redfish power metrics should be read and sent to the sink (default: 30 s)
- http_insecure: Control whether a client verifies the server’s certificate (default: true == do not verify server’s certificate)
- http_timeout: Time limit for requests made by this HTTP client (default: 10 s)
Global and per redfish device settings (per redfish device settings overwrite the global settings):
- disable_power_metrics: disable collection of power metrics (/redfish/v1/Chassis/{ChassisId}/Power)
- disable_processor_metrics: disable collection of processor metrics (/redfish/v1/Systems/{ComputerSystemId}/Processors/{ProcessorId}/ProcessorMetrics)
- disable_sensors: disable collection of fan, power and thermal sensor metrics (/redfish/v1/Chassis/{ChassisId}/Sensors/{SensorId})
- disable_thermal_metrics: disable collection of thermal metrics (/redfish/v1/Chassis/{ChassisId}/Thermal)
- exclude_metrics: list of excluded metrics
- username: User name to authenticate with
- password: Password to use for authentication
- endpoint: URL of the redfish service (placeholder %h gets replaced by the hostname)
Per redfish device settings:
- host_list: List of hosts with the same client configuration
7.4.4 - cc-metric-collector's router
CC Metric Router
The CCMetric router sits in between the collectors and the sinks and can be used to add and remove tags to/from traversing [CCMessages](https://pkg.go.dev/github.com/ClusterCockpit/cc-energy-manager@v0.0.0-20240919152819-92a17f2da4f7/pkg/cc-message).
Configuration
Note: Use the message processor configuration with option process_messages
.
{
"num_cache_intervals" : 1,
"interval_timestamp" : true,
"hostname_tag" : "hostname",
"max_forward" : 50,
"process_messages": {
"see": "pkg/messageProcessor/README.md"
},
"add_tags" : [
{
"key" : "cluster",
"value" : "testcluster",
"if" : "*"
},
{
"key" : "test",
"value" : "testing",
"if" : "name == 'temp_package_id_0'"
}
],
"delete_tags" : [
{
"key" : "unit",
"value" : "*",
"if" : "*"
}
],
"interval_aggregates" : [
{
"name" : "temp_cores_avg",
"if" : "match('temp_core_%d+', metric.Name())",
"function" : "avg(values)",
"tags" : {
"type" : "node"
},
"meta" : {
"group": "IPMI",
"unit": "degC",
"source": "TempCollector"
}
}
],
"drop_metrics" : [
"not_interesting_metric_at_all"
],
"drop_metrics_if" : [
"match('temp_core_%d+', metric.Name())"
],
"rename_metrics" : {
"metric_12345" : "mymetric"
},
"normalize_units" : true,
"change_unit_prefix" : {
"mem_used" : "G",
"mem_total" : "G"
}
}
There are three main options add_tags
, delete_tags
and interval_timestamp
. add_tags
and delete_tags
are lists consisting of dicts with key
, value
and if
. The value
can be omitted in the delete_tags
part as it only uses the key
for removal. The interval_timestamp
setting means that a unique timestamp is applied to all metrics traversing the router during an interval.
Note: Use the message processor configuration (option process_messages
) instead of add_tags
, delete_tags
, drop_metrics
, drop_metrics_if
, rename_metrics
, normalize_units
and change_unit_prefix
. These options are deprecated and will be removed in future versions. Until then, they are added to the message processor.
Processing order in the router
- Add the hostname_tag tag (if sent by collectors or cache)
- If interval_timestamp == true, change time of metrics
- Check if metric should be dropped (drop_metrics and drop_metrics_if)
- Add tags from add_tags
- Delete tags from delete_tags
- Rename metric based on rename_metrics and store old name as oldname in meta information
- Add tags from add_tags (if you used the new name in the if condition)
- Delete tags from delete_tags (if you used the new name in the if condition)
- Send to sinks
- Move to cache (if num_cache_intervals > 0)
The interval_timestamp
option
The collectors’ Read()
functions are not called simultaneously and therefore the metrics gathered in an interval can have different timestamps. If you want to avoid that and have a common timestamp (the beginning of the interval), set this option to true
and the MetricRouter sets the time.
The num_cache_intervals
option
If the MetricRouter should buffer metrics of intervals in a MetricCache, this option specifies the number of past intervals that should be kept. If num_cache_intervals = 0
, the cache is disabled. With num_cache_intervals = 1
, only the metrics of the last interval are buffered.
A num_cache_intervals > 0
is required to use the interval_aggregates
option.
The hostname_tag
option
By default, the router tags metrics with the hostname for all locally created metrics. The default tag name is hostname
, but it can be changed if your organization requires a different tag name.
The max_forward
option
Every time the router receives a metric through any of the channels, it tries to directly read up to max_forward
metrics from the same channel. This avoids that the router thread goes to sleep and wakes up again with every arriving metric. The default is 50
metrics at once, and max_forward
needs to be greater than 1
.
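The batching idea behind max_forward can be sketched with a Go channel and a non-blocking select. The function drain below is a hypothetical illustration: it blocks for the first message and then reads further messages only while they are immediately available, up to maxForward in total. Whether the router counts the first message towards the limit is an assumption of this sketch.

```go
package main

import "fmt"

// drain reads one message (blocking) and then up to maxForward-1 more
// without blocking, so a burst is handled in a single wake-up.
func drain(ch <-chan string, maxForward int) []string {
	batch := []string{<-ch}
	for len(batch) < maxForward {
		select {
		case m := <-ch:
			batch = append(batch, m)
		default:
			return batch // the channel is empty for now
		}
	}
	return batch
}

func main() {
	ch := make(chan string, 8)
	for i := 0; i < 5; i++ {
		ch <- fmt.Sprintf("metric_%d", i)
	}
	first := drain(ch, 3)  // limited by maxForward
	second := drain(ch, 3) // limited by the channel content
	fmt.Println(len(first), len(second))
}
```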
The rename_metrics
option
deprecated
In the ClusterCockpit world we specified a set of standard metrics. Since some collectors determine the metric names based on files, executables and libraries, they might change from system to system (or installation to installation, OS to OS, …). In order to get the common names, you can rename incoming metrics before sending them to the sink. If the metric name matches the oldname
, it is changed to newname:
{
"oldname" : "newname",
"clock_mhz" : "clock"
}
Conditional manipulation of tags (add_tags
and delete_tags
)
deprecated
Common config format:
{
"key" : "test",
"value" : "testing",
"if" : "name == 'temp_package_id_0'"
}
The delete_tags
option
deprecated
The collectors are free to add whatever key=value
pair to the metric tags (although the usage of tags should be minimized). If you want to delete a tag afterwards, you can do that. When the if
condition matches on a metric, the key
is removed from the metric’s tags.
If you want to remove a tag for all metrics, use the condition wildcard *
. The value
field can be omitted in the delete_tags
case.
Never delete tags:
hostname
type
type-id
The add_tags
option
deprecated
In some cases, metrics should be tagged or an existing tag changed based on some condition. This can be done in the add_tags
section. When the if
condition evaluates to true
, the tag key
is added or gets changed to the new value
.
If the CCMetric name is equal to temp_package_id_0
, it adds an additional tag test=testing
to the metric.
For this metric, a more useful example would be:
[
{
"key" : "type",
"value" : "socket",
"if" : "name == 'temp_package_id_0'"
},
{
"key" : "type-id",
"value" : "0",
"if" : "name == 'temp_package_id_0'"
}
]
The metric temp_package_id_0
corresponds to the temperature of the first CPU socket (=package). With the above configuration, the tags would reflect that, because commonly the TempCollector submits only node
metrics.
In order to match all metrics, you can use *
, for example to add a tag by default. This is useful to attach system-specific tags like cluster=testcluster
:
{
"key" : "cluster",
"value" : "testcluster",
"if" : "*"
}
Dropping metrics
In some cases, you want to drop a metric so that it is not forwarded to the sinks. There are two options, depending on the required specification:
- Based only on the metric name -> drop_metrics section
- An evaluable condition with more overhead -> drop_metrics_if section
The drop_metrics
section
deprecated
The argument is a list of metric names. No further checks are performed, only a comparison of the metric name:
{
"drop_metrics" : [
"drop_metric_1",
"drop_metric_2"
]
}
The example drops all metrics with the name drop_metric_1
and drop_metric_2
.
The drop_metrics_if
section
deprecated
This option takes a list of evaluable conditions and evaluates them one after the other on all metrics incoming from the collectors and the metric cache (aka interval_aggregates).
{
"drop_metrics_if" : [
"match('drop_metric_%d+', name)",
"match('cpu', type) && type-id == 0"
]
}
The first line is comparable with the example in drop_metrics; it drops all metrics starting with drop_metric_ and ending with a number. The second line drops all metrics of the first hardware thread (not recommended).
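The difference in overhead between the two drop options can be sketched as follows: an exact-name set needs one map lookup, while conditions must be evaluated one by one. This sketch is an illustration only. It replaces the condition language with Go regular expressions on the metric name, which is an assumption, and the names dropper, newDropper and shouldDrop are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// dropper holds both kinds of drop rules in simplified form.
type dropper struct {
	names    map[string]struct{} // exact names, as in drop_metrics
	patterns []*regexp.Regexp    // name patterns, standing in for drop_metrics_if
}

// newDropper builds the name set and compiles all patterns once.
func newDropper(names, patterns []string) (*dropper, error) {
	d := &dropper{names: make(map[string]struct{}, len(names))}
	for _, n := range names {
		d.names[n] = struct{}{}
	}
	for _, p := range patterns {
		re, err := regexp.Compile(p)
		if err != nil {
			return nil, err
		}
		d.patterns = append(d.patterns, re)
	}
	return d, nil
}

// shouldDrop checks the cheap exact-name set first and the patterns second.
func (d *dropper) shouldDrop(name string) bool {
	if _, ok := d.names[name]; ok {
		return true
	}
	for _, re := range d.patterns {
		if re.MatchString(name) {
			return true
		}
	}
	return false
}

func main() {
	d, err := newDropper(
		[]string{"drop_metric_1", "drop_metric_2"},
		[]string{`^sub_metric_\d+$`},
	)
	if err != nil {
		panic(err)
	}
	for _, name := range []string{"drop_metric_1", "sub_metric_17", "mem_used"} {
		fmt.Printf("%-14s drop=%v\n", name, d.shouldDrop(name))
	}
}
```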
Manipulating the metric units
The normalize_units
option
deprecated
The cc-metric-collector tries to read the data from the system as it is reported. If available, it tries to read the metric unit from the system as well (e.g. from /proc/meminfo
). The problem is that, depending on the source, the metric units are named differently. Just think about byte
, Byte
, B
, bytes
, …
The cc-units package provides a normalization option to use the same metric unit name for all metrics. If this option is set to true, all unit meta tags are normalized.
The change_unit_prefix
section
deprecated
It is often the case that metrics are reported by the system using a rather outdated unit prefix (for example, /proc/meminfo still uses kByte although current memory sizes are in the GByte range). If you want to change the prefix of a unit, you can do that with the help of cc-units. The setting works on the metric name and requires the new prefix for the metric. The cc-units package determines the scaling factor.
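How a scaling factor follows from two prefixes can be sketched in a few lines. This is not the cc-units implementation; the prefix table and the function convertPrefix are hypothetical, and the sketch assumes decimal prefixes only.

```go
package main

import (
	"fmt"
	"math"
)

// exponent maps a decimal unit prefix to its power of ten. The table is a
// hypothetical stand-in for what cc-units provides.
var exponent = map[string]int{"": 0, "K": 3, "M": 6, "G": 9, "T": 12}

// convertPrefix rescales value from one prefix to another, for example from
// "K" (kilo) to "G" (giga).
func convertPrefix(value float64, from, to string) (float64, error) {
	f, ok := exponent[from]
	if !ok {
		return 0, fmt.Errorf("unknown prefix %q", from)
	}
	t, ok := exponent[to]
	if !ok {
		return 0, fmt.Errorf("unknown prefix %q", to)
	}
	if f >= t {
		return value * math.Pow10(f-t), nil
	}
	// Dividing by a positive power of ten avoids the rounding error of
	// multiplying by an inexact negative power.
	return value / math.Pow10(t-f), nil
}

func main() {
	v, err := convertPrefix(16000000, "K", "G")
	if err != nil {
		panic(err)
	}
	fmt.Printf("16000000 KByte = %g GByte\n", v)
}
```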
Aggregate metric values of the current interval with the interval_aggregates
option
Note: interval_aggregates
works only if num_cache_intervals
> 0 and is experimental
In some cases, you need to derive new metrics based on the metrics arriving during an interval. This can be done in the interval_aggregates
section. The logic is similar to the other metric manipulation and filtering options. A cache stores all metrics that arrive during an interval. At the beginning of the next interval, the list of metrics is submitted to the MetricAggregator. It derives new metrics and submits them back to the MetricRouter, so they are sent in the next interval but have the timestamp of the previous interval beginning.
"interval_aggregates" : [
{
"name" : "new_metric_name",
"if" : "match('sub_metric_%d+', metric.Name())",
"function" : "avg(values)",
"tags" : {
"key" : "value",
"type" : "node"
},
"meta" : {
"key" : "value",
"group": "IPMI",
"unit": "<copy>"
}
}
]
The above configuration collects all metric values of metrics for which the if condition evaluates to true. Afterwards it calculates the average (avg) of the values (the list of all metrics’ field value), creates a new CCMetric with the name new_metric_name, and adds the tags in tags and the meta information in meta. The special value <copy> searches the input metrics and copies the value of the first match of key to the new CCMetric.
If you are not interested in the input metrics sub_metric_%d+
at all, you can add the same condition used here to the drop_metrics_if
section to drop them.
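The aggregation step for an avg(values) function can be sketched as below. This is an illustration under assumptions, not the MetricAggregator itself: the metric type is reduced to a name and a value, the condition is replaced by a Go regular expression on the name, and aggregateAvg is a hypothetical helper.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// metric is a reduced stand-in for a CCMetric: only name and value.
type metric struct {
	Name  string
	Value float64
}

// aggregateAvg averages the values of all cached metrics whose name matches
// the pattern and returns them as a new metric with the given name.
func aggregateAvg(newName, pattern string, cache []metric) (metric, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return metric{}, err
	}
	var sum float64
	var n int
	for _, m := range cache {
		if re.MatchString(m.Name) {
			sum += m.Value
			n++
		}
	}
	if n == 0 {
		return metric{}, errors.New("no cached metric matched " + pattern)
	}
	return metric{Name: newName, Value: sum / float64(n)}, nil
}

func main() {
	// Metrics that arrived during one interval.
	cache := []metric{
		{Name: "sub_metric_0", Value: 10},
		{Name: "sub_metric_1", Value: 20},
		{Name: "sub_metric_2", Value: 30},
		{Name: "mem_used", Value: 100},
	}
	m, err := aggregateAvg("new_metric_name", `^sub_metric_\d+$`, cache)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s = %g\n", m.Name, m.Value)
}
```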
Use cases for interval_aggregates
:
- Combine multiple metrics of a collector into a new one, like the MemstatCollector does for mem_used:
{
"name" : "mem_used",
"if" : "source == 'MemstatCollector'",
"function" : "sum(mem_total) - (sum(mem_free) + sum(mem_buffers) + sum(mem_cached))",
"tags" : {
"type" : "node"
},
"meta" : {
"group": "<copy>",
"unit": "<copy>",
"source": "<copy>"
}
}
Order of operations
The router performs the above mentioned options in a specific order. In order to get the logic you want for a specific metric, it is crucial to know the processing order:
- Add the hostname tag (c)
- Manipulate the timestamp to the interval timestamp (c,r)
- Drop metrics based on drop_metrics and drop_metrics_if (c,r)
- Add tags based on add_tags (c,r)
- Delete tags based on del_tags (c,r)
- Rename metric based on rename_metric (c,r)
  - Add tags based on add_tags to still work if the configuration uses the new name (c,r)
  - Delete tags based on del_tags to still work if the configuration uses the new name (c,r)
- Normalize units when normalize_units is set (c,r)
- Convert unit prefix based on change_unit_prefix (c,r)
Legend:
- ‘c’ if metric is coming from a collector
- ‘r’ if metric is coming from a receiver
7.4.5 - cc-metric-collector's sinks
CCMetric sinks
This folder contains the SinkManager and sink implementations for the cc-metric-collector.
Available sinks:
- stdout: Print all metrics to stdout, stderr or a file
- http: Send metrics to an HTTP server as POST requests
- influxdb: Send metrics to an InfluxDB database
- influxasync: Send metrics to an InfluxDB database with non-blocking write API
- nats: Publish metrics to the NATS network overlay system
- ganglia: Publish metrics in the Ganglia Monitoring System using the gmetric CLI tool
- libganglia: Publish metrics in the Ganglia Monitoring System directly using libganglia.so
- prometheus: Publish metrics for the Prometheus Monitoring System
Configuration
The configuration file for the sinks is a JSON object that maps a name of your choice to a sink configuration. The type field in each configuration specifies which sink to initialize.
{
"mystdout" : {
"type" : "stdout",
"meta_as_tags" : [
"unit"
]
},
"metricstore" : {
"type" : "http",
"host" : "localhost",
"port" : "4123",
"database" : "ccmetric",
"password" : "<jwt token>"
}
}
Contributing own sinks
A sink contains five functions and is derived from the type sink:
- Init(name string, config json.RawMessage) error
- Write(point CCMetric) error
- Flush() error
- Close()
- New<Typename>(name string, config json.RawMessage) (Sink, error) (calls the Init() function)
The data structures should be set up in Init()
like opening a file or server connection. The Write()
function writes/sends the data. For non-blocking sinks, the Flush()
method tells the sink to drain its internal buffers. The Close()
function should tear down anything created in Init()
.
Finally, the sink needs to be registered in the sinkManager.go
. There is a list of sinks called AvailableSinks
which is a map (sink_type_string
-> pointer to sink interface
). Add a new entry with a descriptive name and the new sink.
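The registration idea can be sketched with a self-contained example. It is a simplification and not the code in sinkManager.go: the Sink interface is reduced, metrics are plain strings, and the map stores constructor functions instead of sink pointers. The names availableSinks, newSink and printSink are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Sink is a reduced, hypothetical version of the sink interface.
type Sink interface {
	Write(point string) error
	Flush() error
	Close()
	Name() string
}

// printSink is a toy sink that prints every point.
type printSink struct{ name string }

func (s *printSink) Write(point string) error { fmt.Println(s.name, point); return nil }
func (s *printSink) Flush() error             { return nil }
func (s *printSink) Close()                   {}
func (s *printSink) Name() string             { return s.name }

func newPrintSink(name string, config json.RawMessage) (Sink, error) {
	return &printSink{name: name}, nil
}

// availableSinks maps the "type" string from the configuration to a constructor.
var availableSinks = map[string]func(name string, config json.RawMessage) (Sink, error){
	"print": newPrintSink,
}

// newSink reads the "type" field and calls the registered constructor.
func newSink(name string, config json.RawMessage) (Sink, error) {
	var probe struct {
		Type string `json:"type"`
	}
	if err := json.Unmarshal(config, &probe); err != nil {
		return nil, err
	}
	ctor, ok := availableSinks[probe.Type]
	if !ok {
		return nil, fmt.Errorf("unknown sink type %q", probe.Type)
	}
	return ctor(name, config)
}

func main() {
	s, err := newSink("mysink", json.RawMessage(`{"type": "print"}`))
	if err != nil {
		panic(err)
	}
	defer s.Close()
	_ = s.Write("temp,hostname=node01 value=42 1700000000")
}
```

Registering a new sink then amounts to adding one more entry to the map.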
Sample sink
package sinks

import (
	"encoding/json"
	"fmt"
	"log"

	lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
)

type SampleSinkConfig struct {
	defaultSinkConfig // defines JSON tags for 'name' and 'meta_as_tags'
}

type SampleSink struct {
	sink                    // declares 'name' and 'meta_as_tags'
	config SampleSinkConfig // entry point to the SampleSinkConfig
}

// Initialize the sink by giving it a name and reading in the config JSON
func (s *SampleSink) Init(name string, config json.RawMessage) error {
	s.name = fmt.Sprintf("SampleSink(%s)", name) // Always specify a name here
	// Read in the config JSON
	if len(config) > 0 {
		err := json.Unmarshal(config, &s.config)
		if err != nil {
			return err
		}
	}
	return nil
}

// Code to submit a single CCMetric to the sink
func (s *SampleSink) Write(point lp.CCMetric) error {
	log.Print(point)
	return nil
}

// If the sink uses batched sends internally, you can tell it to flush its buffers
func (s *SampleSink) Flush() error {
	return nil
}

// Close sink: close network connection, close files, close libraries, ...
func (s *SampleSink) Close() {}

// New function to create a new instance of the sink
func NewSampleSink(name string, config json.RawMessage) (Sink, error) {
	s := new(SampleSink)
	err := s.Init(name, config)
	return s, err
}
7.4.5.1 - ganglia sink
ganglia
sink
The ganglia
sink uses the gmetric
tool of the Ganglia Monitoring System to submit the metrics
Configuration structure
{
"<name>": {
"type": "ganglia",
"gmetric_path" : "/path/to/gmetric",
"add_ganglia_group" : true,
"process_messages" : {
"see" : "docs of message processor for valid fields"
},
"meta_as_tags" : []
}
}
- type: makes the sink a ganglia sink
- gmetric_path: Path to gmetric executable (optional). If not given, the sink searches in $PATH for gmetric.
- add_ganglia_group: Add --group=X based on meta information to the gmetric call. Some old versions of gmetric do not support the --group option.
- process_messages: Process messages with given rules before progressing or dropping, see here (optional)
- meta_as_tags: print all meta information as tags in the output (deprecated, optional)
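How the add_ganglia_group option affects the gmetric call can be sketched by building the argument list. This only illustrates the idea and is not the sink's code; the function gmetricArgs is hypothetical, and the sketch only assembles arguments without executing gmetric.

```go
package main

import (
	"fmt"
	"strings"
)

// gmetricArgs assembles command-line arguments for one metric. The --group
// argument is only added when addGroup is set and a group is known.
// (Hypothetical helper for illustration.)
func gmetricArgs(name, value, valueType, unit, group string, addGroup bool) []string {
	args := []string{
		"--name=" + name,
		"--value=" + value,
		"--type=" + valueType,
	}
	if unit != "" {
		args = append(args, "--units="+unit)
	}
	if addGroup && group != "" {
		args = append(args, "--group="+group)
	}
	return args
}

func main() {
	args := gmetricArgs("mem_used", "1024", "double", "MByte", "memory", true)
	fmt.Println("gmetric " + strings.Join(args, " "))
}
```

With add_ganglia_group set to false, the same call simply omits the --group argument, which is what older gmetric versions require.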
7.4.5.2 - http sink
http
sink
The http sink uses POST requests to an HTTP server to submit the metrics in the InfluxDB line-protocol format. It uses JSON web tokens for authentication. The sink creates batches of metrics before sending, to reduce the HTTP traffic.
Configuration structure
{
"<name>": {
"type": "http",
"url" : "https://my-monitoring.example.com:1234/api/write",
"jwt" : "blabla.blabla.blabla",
"username": "myUser",
"password": "myPW",
"timeout": "5s",
"idle_connection_timeout" : "5s",
"flush_delay": "2s",
"batch_size": 1000,
"precision": "s",
"process_messages" : {
"see" : "docs of message processor for valid fields"
},
"meta_as_tags" : []
}
}
- type: makes the sink an http sink
- url: The full URL of the endpoint
- jwt: JSON web tokens for authentication (using the Bearer scheme)
- username: username for basic authentication
- password: password for basic authentication
- timeout: General timeout for the HTTP client (default ‘5s’)
- max_retries: Maximum number of retries to connect to the HTTP server
- idle_connection_timeout: Timeout for idle connections (default ‘120s’). Should be larger than the measurement interval to keep the connection open
- flush_delay: Batch all writes arriving during this duration (default ‘1s’, batching can be disabled by setting it to 0)
- batch_size: Maximal batch size. If batch_size is reached before the end of flush_delay, the metrics are sent without further delay
- precision: Precision of the timestamp. Valid values are ‘s’, ‘ms’, ‘us’ and ‘ns’ (default is ‘s’)
- process_messages: Process messages with given rules before progressing or dropping, see here (optional)
- meta_as_tags: print all meta information as tags in the output (deprecated, optional)
Using http
sink for communication with cc-metric-store
The cc-metric-store only accepts metrics with a timestamp precision in seconds, so it is required to use "precision": "s"
.
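A batched write as described for the http sink boils down to one POST request whose body holds several line-protocol lines and whose header carries the token. The following is a minimal sketch using only the Go standard library; it is not the sink's implementation, newWriteRequest is a hypothetical helper, and the URL and token are the placeholder values from the configuration above.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"strings"
)

// newWriteRequest builds one POST request that carries a whole batch of
// line-protocol lines. The JWT is sent with the Bearer scheme if given.
func newWriteRequest(url, jwt string, lines []string) (*http.Request, error) {
	body := strings.Join(lines, "\n") + "\n"
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewBufferString(body))
	if err != nil {
		return nil, err
	}
	if jwt != "" {
		req.Header.Set("Authorization", "Bearer "+jwt)
	}
	return req, nil
}

func main() {
	// Timestamps are in seconds, matching "precision": "s".
	batch := []string{
		"temp,hostname=node01,type=node value=42.5 1700000000",
		"load_one,hostname=node01,type=node value=1.2 1700000000",
	}
	req, err := newWriteRequest("https://my-monitoring.example.com:1234/api/write", "blabla.blabla.blabla", batch)
	if err != nil {
		panic(err)
	}
	// The request is only printed here; a real sink would pass it to an
	// http.Client configured with the timeout settings.
	fmt.Println(req.Method, req.URL, req.Header.Get("Authorization"))
}
```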
7.4.5.3 - influxasync sink
influxasync
sink
The influxasync
sink uses the official InfluxDB golang client to write the metrics to an InfluxDB database in a non-blocking fashion. It provides only support for V2 write endpoints (InfluxDB 1.8.0 or later).
Configuration structure
{
"<name>": {
"type": "influxasync",
"database" : "mymetrics",
"host": "dbhost.example.com",
"port": "4222",
"user": "exampleuser",
"password" : "examplepw",
"organization": "myorg",
"ssl": true,
"batch_size": 200,
"retry_interval" : "1s",
"retry_exponential_base" : 2,
"precision": "s",
"max_retries": 20,
"max_retry_time" : "168h",
"process_messages" : {
"see" : "docs of message processor for valid fields"
},
"meta_as_tags" : []
}
}
- type: makes the sink an influxasync sink
- database: All metrics are written to this bucket
- host: Hostname of the InfluxDB database server
- port: Port number (as string) of the InfluxDB database server
- user: Username for basic authentication
- password: Password for basic authentication
- organization: Organization in the InfluxDB
- ssl: Use SSL connection
- batch_size: batch up metrics internally, default 100
- retry_interval: Base retry interval for failed write requests, default 1s
- retry_exponential_base: The retry interval is exponentially increased with this base, default 2
- max_retries: Maximal number of retry attempts
- max_retry_time: Maximal time to retry failed writes, default 168h (one week)
- precision: Precision of the timestamp. Valid values are ‘s’, ‘ms’, ‘us’ and ‘ns’ (default is ‘s’)
- process_messages: Process messages with given rules before progressing or dropping, see here (optional)
- meta_as_tags: print all meta information as tags in the output (deprecated, optional)
For information about the calculation of the retry interval settings, see the official influxdb-client-go documentation.
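To get a feeling for how retry_interval and retry_exponential_base interact, the following sketch computes a plain exponential backoff. It is deliberately simplified and is not the algorithm of the InfluxDB client, which may add randomization; the function retryDelay and the cap of 125 seconds are assumptions made for this illustration.

```go
package main

import (
	"fmt"
	"time"
)

// retryDelay returns interval * expBase^attempt, capped at maxInterval.
// Attempt 0 is the first retry. (Hypothetical, deterministic simplification.)
func retryDelay(interval time.Duration, expBase, attempt uint, maxInterval time.Duration) time.Duration {
	d := interval
	for i := uint(0); i < attempt; i++ {
		d *= time.Duration(expBase)
		if d >= maxInterval {
			return maxInterval
		}
	}
	if d > maxInterval {
		return maxInterval
	}
	return d
}

func main() {
	// retry_interval "1s" and retry_exponential_base 2 as in the example
	// configuration; the cap is an example value.
	for attempt := uint(0); attempt < 10; attempt++ {
		fmt.Printf("retry %d after %v\n", attempt+1, retryDelay(time.Second, 2, attempt, 125*time.Second))
	}
}
```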
Using influxasync
sink for communication with cc-metric-store
The cc-metric-store only accepts metrics with a timestamp precision in seconds, so it is required to use "precision": "s"
.
7.4.5.4 - influxdb sink
influxdb
sink
The influxdb
sink uses the official InfluxDB golang client to write the metrics to an InfluxDB database in a blocking fashion. It provides only support for V2 write endpoints (InfluxDB 1.8.0 or later).
Configuration structure
{
"<name>": {
"type": "influxdb",
"database" : "mymetrics",
"host": "dbhost.example.com",
"port": "4222",
"user": "exampleuser",
"password" : "examplepw",
"organization": "myorg",
"ssl": true,
"flush_delay" : "1s",
"batch_size" : 1000,
"use_gzip": true,
"precision": "s",
"process_messages" : {
"see" : "docs of message processor for valid fields"
},
"meta_as_tags" : []
}
}
- type: makes the sink an influxdb sink
- database: All metrics are written to this bucket
- host: Hostname of the InfluxDB database server
- port: Port number (as string) of the InfluxDB database server
- user: Username for basic authentication
- password: Password for basic authentication
- organization: Organization in the InfluxDB
- ssl: Use SSL connection
- flush_delay: Group incoming metrics into a single batch
- batch_size: Maximal batch size. If batch_size is reached before the end of flush_delay, the metrics are sent without further delay
- precision: Precision of the timestamp. Valid values are ‘s’, ‘ms’, ‘us’ and ‘ns’ (default is ‘s’)
- process_messages: Process messages with given rules before progressing or dropping, see here (optional)
- meta_as_tags: print all meta information as tags in the output (deprecated, optional)
Influx client options:
- batch_size: Maximal batch size
- meta_as_tags: move meta information keys to tags (optional)
- http_request_timeout: HTTP request timeout
- retry_interval: retry interval
- max_retry_interval: maximum delay between each retry attempt
- retry_exponential_base: base for the exponential retry delay
- max_retries: maximum count of retry attempts of failed writes
- max_retry_time: maximum total retry timeout
- use_gzip: Specify whether to use GZip compression in write requests
Using influxdb
sink for communication with cc-metric-store
The cc-metric-store only accepts metrics with a timestamp precision in seconds, so it is required to use "precision": "s"
.
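The effect of the precision setting on the timestamp can be sketched as an integer division of a nanosecond timestamp. The helper scaleTimestamp is hypothetical and only illustrates the four valid values; it is not taken from the sink code.

```go
package main

import (
	"fmt"
	"time"
)

// scaleTimestamp converts a Unix timestamp in nanoseconds to the requested
// precision ("s", "ms", "us" or "ns") by truncating the finer digits.
func scaleTimestamp(unixNano int64, precision string) (int64, error) {
	switch precision {
	case "s":
		return unixNano / int64(time.Second), nil
	case "ms":
		return unixNano / int64(time.Millisecond), nil
	case "us":
		return unixNano / int64(time.Microsecond), nil
	case "ns":
		return unixNano, nil
	}
	return 0, fmt.Errorf("invalid precision %q", precision)
}

func main() {
	now := time.Now().UnixNano()
	for _, p := range []string{"s", "ms", "us", "ns"} {
		ts, err := scaleTimestamp(now, p)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-2s %d\n", p, ts)
	}
}
```

For cc-metric-store only the first variant applies, since it accepts timestamps in seconds.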
7.4.5.5 - libganglia sink
libganglia
sink
The libganglia
sink interacts directly with the library of the Ganglia Monitoring System to submit the metrics. Consequently, it needs to be installed on all nodes. But this is commonly the case if you want to use Ganglia, because it requires at least a node daemon (gmond
or ganglia-monitor
) to work.
The libganglia
sink has probably less overhead compared to the ganglia
sink because it does not require any process generation but initializes the environment and UDP connections only once.
Configuration structure
{
"<name>": {
"type": "libganglia",
"gmond_config" : "/path/to/gmetric/config",
"cluster_name": "MyCluster",
"add_ganglia_group" : true,
"add_type_to_name": true,
"add_units" : true,
"process_messages" : {
"see" : "docs of message processor for valid fields"
},
"meta_as_tags" : []
}
}
- type: makes the sink a libganglia sink
- gmond_config: Path to the Ganglia configuration file gmond.conf (default: /etc/ganglia/gmond.conf)
- cluster_name: Set a cluster name for the metric. If not set, it is taken from gmond_config
- add_ganglia_group: Add a Ganglia metric group based on meta information. Some old versions of gmetric do not support the --group option
- add_type_to_name: Ganglia commonly uses only node-level metrics but with cc-metric-collector, there are metrics for cpus, memory domains, CPU sockets and the whole node. To keep such metrics distinguishable, this option prefixes the metric name with <type><type-id>_ or device_ depending on the metric tags and meta information. For metrics of the whole node (type=node), no prefix is added
- add_units: Add metric value unit if there is a unit entry in the metric tags or meta information
- process_messages: Process messages with given rules before progressing or dropping, see here (optional)
- meta_as_tags: print all meta information as tags in the output (deprecated, optional)
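The naming rule of add_type_to_name can be sketched as follows. This covers only the <type><type-id>_ prefix and the node case; the device_ variant is left out because its exact trigger is not described here. The function gangliaName is hypothetical.

```go
package main

import "fmt"

// gangliaName prefixes the metric name with "<type><type-id>_" unless the
// option is disabled, the type tag is missing, or the metric is node-level.
// (Hypothetical helper; the device_ prefix is not handled.)
func gangliaName(name string, tags map[string]string, addTypeToName bool) string {
	if !addTypeToName {
		return name
	}
	t, ok := tags["type"]
	if !ok || t == "node" {
		return name
	}
	return t + tags["type-id"] + "_" + name
}

func main() {
	socket := map[string]string{"type": "socket", "type-id": "0"}
	node := map[string]string{"type": "node"}
	fmt.Println(gangliaName("temp", socket, true))
	fmt.Println(gangliaName("mem_used", node, true))
}
```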
Ganglia Installation
My development system is Ubuntu 20.04. To install the required libraries with apt
:
$ sudo apt install libganglia1
The libganglia.so
gets installed in /usr/lib
. The Ganglia headers libganglia1-dev
are not required.
I added a Makefile
in the sinks
subfolder that searches for the library in /usr
and creates a symlink (sinks/libganglia.so
) for running/building the cc-metric-collector. So just type make
before running/building in the main folder or the sinks
subfolder.
7.4.5.6 - nats sink
nats
sink
The nats
sink publishes all metrics into a NATS network. The publishing key is the database name provided in the configuration file
Configuration structure
{
"<name>": {
"type": "nats",
"database" : "mymetrics",
"host": "dbhost.example.com",
"port": "4222",
"user": "exampleuser",
"password" : "examplepw",
"nkey_file": "/path/to/nkey_file",
"flush_delay": "10s",
"precision": "s",
"process_messages" : {
"see" : "docs of message processor for valid fields"
},
"meta_as_tags" : []
}
}
- type: makes the sink a nats sink
- database: All metrics are published with this subject
- host: Hostname of the NATS server
- port: Port number (as string) of the NATS server
- user: Username for basic authentication
- password: Password for basic authentication
- nkey_file: Path to credentials file with NKEY
- flush_delay: Maximum time until metrics are sent out (default ‘5s’)
- precision: Precision of the timestamp. Valid values are ‘s’, ‘ms’, ‘us’ and ‘ns’ (default is ‘s’)
- process_messages: Process messages with given rules before progressing or dropping, see here (optional)
- meta_as_tags: print all meta information as tags in the output (deprecated, optional)
Using nats
sink for communication with cc-metric-store
The cc-metric-store only accepts metrics with a timestamp precision in seconds, so it is required to use "precision": "s"
.
7.4.5.7 - prometheus sink
prometheus
sink
The prometheus
sink publishes all metrics via an HTTP server ready to be scraped by a Prometheus server. It creates gauge metrics for all node metrics and gauge vectors for all metrics with a subtype like ‘device’, ‘cpu’ or ‘socket’.
Configuration structure
{
"<name>": {
"type": "prometheus",
"host": "localhost",
"port": "8080",
"path": "metrics",
"process_messages" : {
"see" : "docs of message processor for valid fields"
},
"meta_as_tags" : []
}
}
- type: makes the sink a prometheus sink
- host: The HTTP server gets bound to that IP/hostname
- port: Port number (as string) for the HTTP server
- path: Path where the metrics should be served. The metrics will be published at host:port/path
- group_as_namespace: Most metrics contain a group as meta information like ‘memory’ or ‘load’. With this option, the metric names are extended to group_name if possible.
- process_messages: Process messages with given rules before progressing or dropping, see here (optional)
- meta_as_tags: print all meta information as tags in the output (deprecated, optional)
7.4.5.8 - stdout sink
stdout
sink
The stdout sink is the simplest sink provided by cc-metric-collector. It writes all metrics in InfluxDB line-protocol format to the configurable output file or the common special files stdout and stderr.
Configuration structure
{
"<name>": {
"type": "stdout",
"output_file" : "mylogfile.log",
"process_messages" : {
"see" : "docs of message processor for valid fields"
},
"meta_as_tags" : []
}
}
- type: makes the sink a stdout sink
- output_file: Write all data to the selected file (optional). There are two ‘special’ files: stdout and stderr. If this option is not provided, the default value is stdout
- process_messages: Process messages with given rules before progressing or dropping, see here (optional)
- meta_as_tags: print all meta information as tags in the output (deprecated, optional)
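What the stdout sink does can be approximated in a few lines: format a metric in InfluxDB line protocol and write it to the selected output. The sketch below is not the sink's code. It skips the escaping rules of the line protocol, uses a single field called value, and opening the file in append mode is an assumption; toLineProtocol and openOutput are hypothetical helpers.

```go
package main

import (
	"fmt"
	"os"
	"sort"
	"strconv"
	"strings"
)

// toLineProtocol renders "name,tag=val,... value=<v> <timestamp>" with tags
// sorted by key. Escaping of special characters is omitted in this sketch.
func toLineProtocol(name string, tags map[string]string, value float64, unixSec int64) string {
	keys := make([]string, 0, len(tags))
	for k := range tags {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	b.WriteString(name)
	for _, k := range keys {
		b.WriteString("," + k + "=" + tags[k])
	}
	b.WriteString(" value=" + strconv.FormatFloat(value, 'f', -1, 64))
	b.WriteString(" " + strconv.FormatInt(unixSec, 10))
	return b.String()
}

// openOutput maps the output_file setting to a file handle. An empty value
// and "stdout" select standard output, "stderr" selects standard error.
func openOutput(outputFile string) (*os.File, error) {
	switch outputFile {
	case "", "stdout":
		return os.Stdout, nil
	case "stderr":
		return os.Stderr, nil
	}
	return os.OpenFile(outputFile, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o600)
}

func main() {
	out, err := openOutput("stdout")
	if err != nil {
		panic(err)
	}
	tags := map[string]string{"hostname": "node01", "type": "node"}
	fmt.Fprintln(out, toLineProtocol("temp", tags, 42.5, 1700000000))
}
```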
7.5 - Commit message naming conventions
Introduction
ClusterCockpit uses goreleaser for building and uploading releases. In this process the release notes including all notable changes are automatically generated based on special commit message tags. Moreover GitHub will parse special characters and words to link and close issues.
Reference issue tickets
It is good practice to always create a ticket for notable changes. This allows commenting on and discussing source code changes. Any commit that contributes to the ticket should reference the ticket id (in the commit message or description). This is achieved in GitHub by prefixing the ticket id with a number sign character (#):
This change contributes to #235
GitHub will detect if a pull request or commit uses special keywords to close a ticket:
- close, closes, closed
- fix, fixes, fixed
- resolve, resolves, resolved
The ticket will not be closed before the commit appears on the main branch. Example:
This change fixes #423
Control release notes with preconfigured commit message prefixes
Commits with one of the following prefixes will appear in the release notes:
- feat: Mark a commit to contain changes related to new features
- fix: Mark a commit to contain changes related to bug fixes
- sec: Mark a commit to contain changes related to security fixes
- doc: Mark a commit to contain changes related to documentation updates
- [feat|fix] dep: Mark a commit that is related to a dependency introduction or change
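The prefix convention can be checked mechanically, for example in a commit hook. The sketch below is not part of goreleaser or ClusterCockpit. It reads the [feat|fix] dep: entry as the two literal prefixes "feat dep:" and "fix dep:", which is an assumption, and the section names it returns as well as the function releaseSection are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// prefixes lists the commit message prefixes that appear in release notes,
// together with illustrative section names.
var prefixes = []struct{ prefix, section string }{
	{"feat dep:", "dependencies"},
	{"fix dep:", "dependencies"},
	{"feat:", "new features"},
	{"fix:", "bug fixes"},
	{"sec:", "security fixes"},
	{"doc:", "documentation updates"},
}

// releaseSection reports which release-notes section a commit message would
// fall into, and false if it carries none of the prefixes.
func releaseSection(message string) (string, bool) {
	for _, p := range prefixes {
		if strings.HasPrefix(message, p.prefix) {
			return p.section, true
		}
	}
	return "", false
}

func main() {
	for _, msg := range []string{
		"feat: add node list view",
		"fix dep: bump a dependency",
		"This change fixes #423",
	} {
		section, ok := releaseSection(msg)
		fmt.Printf("%q -> %q (in release notes: %v)\n", msg, section, ok)
	}
}
```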
8 - Web Interface
Home
The entrypoint for each login via the login mask is a table containing each configured cluster as a row with the following columns:
- Name: The configured cluster’s name
- Running Jobs: Number of jobs currently running longer than 5 minutes (or the configured shortRunning amount of time)
  - Clicking the link will forward to the job list with preset filters for cluster and running jobs
- Total Jobs: Number of jobs in the respective job-archive
  - Clicking the link will forward to the job list with preset filter for cluster
- Status View: Link to the status view of the respective cluster
  - This column is only shown for users with admin authority.
- Systems View: Link to the nodes view of the respective cluster
  - This column is only shown for users with admin authority.
Navigation Bar
The navigation bar allows direct access to ClusterCockpit’s different views and functions. Depending on the user’s authorization, the selectable views can differ.
For most viewports, the navigation bar is rendered fully expanded:
Item | Title | Description |
---|---|---|
1 | Home Button | Leads back to the home table |
2 | Views | Leads to ClusterCockpit’s different views, which change depending on user authority |
3 | Searchbar | Top-Level Searchbar, see full usage information here |
4 | Documentation | Leads to this Documentation |
5 | Settings | Leads to ClusterCockpit settings page |
6 | Logout | Logs out the active user |
Adaptive Render Versions
On smaller viewports, the navigation bar will be rendered in one of two collapsed states:
8.1 - Settings
The settings view allows non-privileged users to choose their preferred paging style, to customize how metric plots are rendered, and to generate personalized tokens for use with the API. Customization options include line width, number of plots per row (where applicable), whether backgrounds should be colored, and the color scheme of multi-line metric plots.
Privileged users will also find an administrative interface for handling local user accounts. This includes creating local accounts from the interface, editing user roles, listing and deleting existing users, generating JSON Web Tokens for API usage, and delegating managed projects for manager role users.
User Options
Field | Options | Note |
---|---|---|
Paging Type | Classic / Continuous | Style of paging in job lists |
Line Width | # Pixels | Width of the lines in the timeseries plots |
Plots Per Row | # Plots | How many plots to show next to each other on pages such as the job or nodes views |
Colored Backgrounds | Yes / No | Color plot backgrounds indicating mean values within warning thresholds |
Color Scheme | See Below | Render multi-line metric plots in different color ranges |
Generate JWT
This function will generate and return a personalized JWT, printed into the “Display JWT” field.
If working with the ClusterCockpit API, this token is required to authorize the user against the REST API endpoints.
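As a minimal sketch of how such a token might be attached to an API call, the following builds (but does not send) an HTTP request with Python's standard library. The host name `cc.example.org`, the path `/api/jobs/`, and the use of a `Bearer` authorization header are assumptions made for illustration; check the REST API reference of your installation for the actual endpoints and header scheme.

```python
import urllib.request


def build_api_request(base_url: str, jwt: str, path: str = "/api/jobs/") -> urllib.request.Request:
    """Build (but do not send) a GET request carrying the JWT as a bearer token.

    The default path and the Bearer scheme are assumptions for this sketch.
    """
    url = base_url.rstrip("/") + path
    request = urllib.request.Request(url, method="GET")
    request.add_header("Accept", "application/json")
    request.add_header("Authorization", f"Bearer {jwt}")
    return request


# Hypothetical host; paste the token from the "Display JWT" field instead of the placeholder.
request = build_api_request("https://cc.example.org", "<JWT TOKEN>")
print(request.full_url)
```

The request object can then be passed to `urllib.request.urlopen` once the placeholder values are replaced with real ones.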
Color Schemes
Name | Colors |
---|---|
Default | |
Autumn | |
Beach | |
BlueRed | |
Rainbow | |
Binary | |
GistEarth | |
BlueWaves | |
BlueGreenRedYellow |
Admin Options
Create User
New users can be created directly via the web interface. On successful creation a green response message will be returned, and the user is directly visible in the “Special Users” table, provided the user has at least two roles, or a single role other than `user`.
Error messages will also be displayed if the user creation process failed. No user account is saved to the database in this case.
Field | Option | Note |
---|---|---|
Username (ID) | string | Required, must be unique |
Password | string | Only API users are allowed to have a blank password; users with a blank password can only authenticate via JWT |
Project | string | Only manager users can have a project |
Name | string | Name of the user, optional, can be blank |
Email Address | string | User’s email, optional, can be blank |
Role | Select one | See roles for more detailed information |
Role | API | Allowed to interact with REST API |
Role | User (default) | Same as if created via LDAP sync |
Role | Manager | Allows to inspect jobs and users of given project |
Role | Support | Allows to inspect jobs and users of all projects, has no admin view or settings access |
Role | Admin | General access |
Special Users
This table does not contain users who have `user` as their only role saved in the database. This is the case for all users created by LDAP import, and thus, these users will not be shown here. However, LDAP users’ roles can still be edited, and they will appear in the table as soon as an authority higher than `user`, or a second authority, is granted.
All other special case users, e.g. new users manually created with the `support` role, will appear in the list.
User accounts can be deleted by pressing the respective button displayed for each user entry. A verification pop-up window will appear to prevent accidental user deletion.
Additionally, JWT tokens for specific users can be generated here as well.
Column | Example | Description |
---|---|---|
Username | abcd1 | Username of this user |
Name | Paul Atreides | Name of this user |
Project(s) | abcd | Managed project(s) of this user |
Email | demo@demo.com | Email address of this user |
Roles | admin,api | Role(s) of this user |
JWT | Press button to reveal freshly generated token | Generate a JWT for this user for use with the CC REST API endpoints |
Delete | Press button to verify deletion | Delete this user |
Edit User Role
On creation, users can only have one role. However, it is possible to assign multiple roles to a user account. The addition or removal of roles is performed here.
Enter an existing `username` and select an existing (for removal) or new (for addition) role in the drop-down menu.
Then press the respective button to remove the selected authority from, or add it to, the user account. Errors will be displayed if existing roles are added or non-existing roles are removed.
Edit Managed Projects
On creation, users can only have one managed project. However, it is possible to assign multiple projects to a manager account. The addition or removal of projects is performed here.
Enter an existing `username` and select an existing (for removal) or new (for addition) project by entering the respective `projectId`.
Then press the respective button to remove the selected project from, or add it to, the manager account. Errors will be displayed if existing projects are added, non-existing projects are removed, or if the user account is not authorized to manage projects at all.
Scramble Names (Presentation Mode)
Activating this switch will replace all user names, person names, and project names with random strings. It is intended for presentations on a production system, where sensitive information must be withheld from a public audience.
Metric Plot Resampling
If “Resampling” of metric plots is enabled in the configuration file (`config.json`), and read correctly on start-up, this informational display will list both the number of data points at which the next resolution will be requested (“Trigger”) and the applicable resolutions themselves.
Note: Changes to the resampling options have to be performed by changing the configuration file and restarting the application.
Edit Notice Shown On Homepage
The contents of the text form field will be written into `$CCPATH/var/notice.txt` on submission. If this file does not exist, it will be created.
If any content is found, an informational card will be rendered above the home site table. The content will also be mirrored within the form field itself.
Removing any content from the form field, and submitting, will clear the file and remove the rendered card from the homepage. This state is indicated by the placeholder text “No Content.” being shown in the form field.
8.2 - Searchbar
The top searchbar handles page-wide searches either by entering a search term directly as `<query>`, or by using a “keyword” in the form of `<keyword>:<query>`. Entering a search term directly will start a hierarchical search which returns the first match in the hierarchy (see table below). It is recommended to supply the search with a keyword to specify the searched entity. For example, `jobName:myJobName` will specifically search for all jobs which have the queried string (or a part thereof) in their metadata `jobName` field. For all keywords with examples, see the table below.
Both keywords and queries are trimmed of all spaces before performing the search, returning the same results independently of location and number of spaces, e.g. `name : Paul` and `name: paul` are both handled identically.
Unprocessable queries will return a message detailing the cause of the error.
Available Keywords
Keyword | Example Query | Destination | Note |
---|---|---|---|
No Keyword Used | abcd100 | Joblist or User Joblist | Performs hierarchical search jobId -> username -> name -> projectId -> jobName |
JobId | jobId:123456 | Joblist | Allows multiple identical matches, e.g. JobIds from different clusters |
JobName | jobName:myJobName | Joblist | Works with partial queries. Allows multiple identical matches, e.g. JobNames from different clusters. An additional Last 30 Days filter is active by default. |
ProjectId | projectId:abcd100 | Joblist | All Jobs of the given project |
Username | username:abcd100a | Users Table | Only active users are returned. Users without jobs are not shown. An additional Last 30 Days filter is active by default. Admin Only |
Name | name:Paul | Users Table | Works with partial queries. Only active users are returned. Users without jobs are not shown. An additional Last 30 Days filter is active by default. Admin Only |
ArrayJobId | arrayJobId:891011 | Joblist | All Jobs of the given arrayJobId. An additional Last 30 Days filter is active by default. |
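The `<keyword>:<query>` convention can be illustrated with a small parser. This is only a sketch under stated assumptions, not the searchbar's actual implementation: the function name `parse_search` is hypothetical, keywords are matched exactly as spelled in the example queries above, and "trimming" is interpreted as removing leading and trailing spaces.

```python
from typing import Optional, Tuple

# Keywords as spelled in the example queries of the table above.
KEYWORDS = {"jobId", "jobName", "projectId", "username", "name", "arrayJobId"}


def parse_search(raw: str) -> Tuple[Optional[str], str]:
    """Split a searchbar input into (keyword, query).

    Returns (None, query) when no keyword is given, which would start the
    hierarchical search jobId -> username -> name -> projectId -> jobName.
    """
    keyword, separator, query = raw.partition(":")
    if not separator:
        return None, raw.strip()
    keyword, query = keyword.strip(), query.strip()
    if keyword not in KEYWORDS:
        raise ValueError(f"unknown keyword: {keyword!r}")
    if not query:
        raise ValueError("empty query")
    return keyword, query


print(parse_search("name : Paul"))
print(parse_search("abcd100"))
```

Raising an error for unknown keywords or empty queries mirrors the statement that unprocessable queries return a message detailing the cause.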
8.3 - Plots
Most plots visible in the ClusterCockpit web interface are implemented via uPlot or Chart.js, which both offer various functionality to the user.
Metric Plots
The main plot component of ClusterCockpit renders the metric values retrieved from the systems in a time dependent manner.
Interactivity
A selector crosshair is shown when hovering over the rendered data, and data points corresponding to the legend are highlighted.
It is possible to zoom in by dragging a selection square with your mouse. Double-clicking into the plot will reset the zoom.
Note: The plot’s Y-axis is scaled with regard to the `normal` metric threshold at first, i.e. the threshold will either be the highest rendered value (spaced line), or will be used to cut off outliers (10 x normal threshold). Resetting by double-clicking will re-render the plot with regard to the highest value of the dataset, i.e. adapt the Y-axis to match said maximum value.
Resampling of Data
If “Resampling” of metric plots is enabled in the configuration file (`config.json`), data is initially loaded at the coarsest resolution. Zooming into the dataset, as described above, will continuously trigger a reload of the data in finer resolutions, until the highest resolution is reached. A finer resolution is requested from the backend as soon as the number of visible data points falls below a configured amount (“Trigger”).
Please note: While archived data is read from disk, and therefore can be resampled in the backend directly, resampling of data for `running` jobs requires the use of a matching version of CC-Metric-Store.
Running job metric data read from older versions of CCMS will still be returned correctly, but always in the metric’s configured timestep.
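The trigger rule described above can be sketched as a small decision function. The function name, the parameter names, and the convention that a resolution is a timestep in seconds (smaller means finer) are assumptions for illustration and are not taken from the ClusterCockpit code base.

```python
from typing import Optional, Sequence


def next_resolution(
    visible_points: int,
    current: int,
    resolutions: Sequence[int],
    trigger: int,
) -> Optional[int]:
    """Return the next finer resolution to request, or None if no reload is needed.

    Assumption: resolutions are timesteps in seconds, so a smaller value is finer.
    """
    if visible_points >= trigger:
        return None  # enough points visible, keep the current resolution
    finer = [r for r in resolutions if r < current]
    # Step to the closest finer resolution; None means the finest is already shown.
    return max(finer) if finer else None


# Hypothetical configuration values.
print(next_resolution(visible_points=20, current=600, resolutions=[600, 300, 120, 60], trigger=30))
```

Stepping one level at a time matches the description of a continuous reload in finer resolutions while zooming in.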
Conditional Legends
Hovering over the rendered data will display a legend as a hovering box colored in yellow. Depending on the amount of data shown, this legend will render differently:
- Single Dataset: Runtime and Dataset Identifier Only
- 2 to 6 Datasets: Runtime, Line Color and Dataset Identifier
- 7 to 12 Datasets: Runtime and Dataset Identifier Only
- More than 12 Datasets: No Legend
- Statistics Datasets: Runtime and Dataset Identifier Only (See below)
The “no legend” case is required to not clutter the display in case of high data volume, e.g. core granularity data for more than 128 cores, which would result in more than 128 legend entries, possibly blocking the plotting area of metric graphs below.
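The legend rules listed above amount to a simple lookup on the number of datasets. The following sketch only restates those rules; the function name and the returned labels are hypothetical.

```python
def legend_mode(num_datasets: int, is_statistics: bool = False) -> str:
    """Map the number of plotted datasets to the legend variant described above."""
    if is_statistics or num_datasets == 1:
        return "runtime+identifier"
    if 2 <= num_datasets <= 6:
        return "runtime+color+identifier"
    if 7 <= num_datasets <= 12:
        return "runtime+identifier"
    return "none"  # more than 12 datasets (or no data): no legend


for count in (1, 4, 10, 128):
    print(count, legend_mode(count))
```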
Example
Colored Backgrounds
The plot’s background is colored depending on the average value of the viewed metric with respect to its configured threshold values. The three cases are:
- White: Metric average within expected parameters. No performance impact.
- Yellow: Metric average below expected parameters, but not yet critical. Possible performance impact.
- Red: Metric average unexpectedly low. Indicator for suboptimal usage of resources. Performance impact to be expected.
Example
Statistics Variant
In the job list views, high amounts of data are by default rendered as a statistical representation of the numerous, single datasets:
- Maximum: The maximum value of the base datasets at each point in time, plotted over time. Colored in green.
- Median: The median value of the base datasets at each point in time, plotted over time. Colored in black.
- Minimum: The minimum value of the base datasets at each point in time, plotted over time. Colored in red.
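To make the three series concrete, the following sketch reduces several base datasets (for example, one per node) to a minimum, median, and maximum series, one value per point in time. It assumes all datasets share the same timestamps; the function name is hypothetical and this is not the code ClusterCockpit uses.

```python
from statistics import median
from typing import Dict, List, Sequence


def statistics_series(datasets: Sequence[Sequence[float]]) -> Dict[str, List[float]]:
    """Reduce N datasets of equal length to min/median/max series over time."""
    # zip(*datasets) yields, per timestep, the values of all datasets
    # (it stops at the shortest dataset if lengths differ).
    columns = list(zip(*datasets))
    return {
        "min": [min(column) for column in columns],
        "median": [median(column) for column in columns],
        "max": [max(column) for column in columns],
    }


# Three hypothetical node-level datasets with three timesteps each.
nodes = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
print(statistics_series(nodes))
```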
Example
Histograms
Histograms display (binned) data allowing distributions of the respective data source to be visualized. Data highlighting, zooming, and resetting the zoom work as described for metric plots.
Example
Roofline Plot
A roofline plot, or roofline model, represents the utilization of available resources as the relation between computation and memory usage.
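In the standard roofline model, the attainable performance at a given operational intensity is bounded by either the peak compute rate or the memory bandwidth. The sketch below shows that bound only; the function and parameter names are hypothetical, and it makes no claim about how ClusterCockpit computes or draws its roofs.

```python
def attainable_flops(intensity: float, peak_flops: float, mem_bandwidth: float) -> float:
    """Roofline bound: min(peak compute rate, operational intensity * memory bandwidth).

    intensity     -- floating point operations per byte moved (FLOP/Byte)
    peak_flops    -- peak compute rate (e.g. GFLOP/s)
    mem_bandwidth -- peak memory bandwidth (e.g. GByte/s)
    """
    return min(peak_flops, intensity * mem_bandwidth)


# Hypothetical machine: 1000 GFLOP/s peak, 200 GByte/s memory bandwidth.
print(attainable_flops(0.5, 1000.0, 200.0))   # memory bound
print(attainable_flops(10.0, 1000.0, 200.0))  # compute bound
```

Jobs plotted close to the sloped part of the roof are limited by memory bandwidth, while jobs close to the flat part are limited by compute throughput.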
Dotted Roofline
Roofline models rendered as dotted plots display the utilization of hardware resources over time.
Example
Heatmap Roofline
The roofline model shown in the analysis view, as the single exception, is rendered as a heatmap. This is because the displayed data is derived from more than one job, since the analysis view returns all jobs matching the selected filters. The roofline therefore colors regions of accumulated activity in increasing shades of red, depicting the regions below the roofs in which the returned jobs primarily perform.
Example
Polar Plots
A polar, or radar, plot represents the utilization of key metrics. Both the maximum and the average utilization as a fraction of the 100% theoretical maximum (labelled as `1.0`) are rendered on a number of axes equal to the displayed key metrics. This leads to an increasing area, which in turn marks increasingly optimal resource usage. In principle, this is a graphic representation of data also shown in the footprint component.
By clicking on one of the two legends, the respective dataset will be hidden. This can be useful if high overlap reduces visibility.
Example
Scatter / Bubble Plot
Bubble scatter plots show the position of the averages of two selected metrics in relation to each other.
Each circle represents one job, while the size of a circle is proportional to its node hours. Darker circles mean multiple jobs have the same averages for the respective metric selection.
Example
8.4 - Filters
The ClusterCockpit filter component is used for reducing the number of jobs, either for direct display in job list views, or to specify the data source for collecting information displayed in user or project tables, as well as the analysis view.
Filter Options
Multiple filters can be easily combined by selecting more than one option of the available filters.
By clicking on the respective filter pill, colored in blue, and located right of the filter component, one can directly access the respective filters’ menu for editing, or removing, the filter.
At the moment, the following filters are implemented:
Cluster/Partition
Select a configured cluster, or a specified partition of a given cluster, and display only jobs started on that cluster (and partition).
Options: All cluster names, and nested partition names, configured in config.json
Default: Any Cluster (Any Partition)
Job States
Select one or more job states, and display only jobs matching the selected criteria.
Options: running, completed, failed, cancelled, stopped, timeout, preempted, out_of_memory
Default: All states
Start Time
Select the timeframe in which jobs were started, and display only jobs matching the selected criteria.
Options: Free selection of date `dd.mm.YYYY` and time `hh:mm` for `from` and `to` limits.
Default: All Starttimes
Preset: Jobs started one month ago until $now
Duration
Select the duration of jobs, and display only jobs matching the selected criteria.
Options: Duration less than `hh:mm`, duration more than `hh:mm`, duration between two duration selections. Only one of the three options can be used at a time.
Default: All Durations
Tags
Select one or more job tags, and display only jobs tagged with the selected tags.
Options: All available tags. It is possible to search within the list of tags.
Default: No selection
Resources
Select a named node or specify an amount of used resources, and display only jobs matching the selected criteria.
Options:
- Named node free text field: Enter a hostname here to only return jobs which were run on this node.
- Range selectors: Select a range of allocated job resources ranging from the minimal to the maximum configured resource count of all clusters. If the cluster filter is set, the ranges are limited to the respective resources’ configuration. Available resources are:
- Nodes
- HWThreads
- Accelerators (if available)
Default: No named node, full resource ranges of all configured clusters
Energy
Specify total consumed energy, and display only jobs matching the selected range.
Options: “Total Job Energy” in kWh.
Default: No selection
Note: This filter applies to completed jobs only, not to jobs in state `running`.
Statistics
Specify ranges of metric statistics, and display only jobs matching the selected criteria.
Note: Only metrics for which the `footprint` flag is set in the respective metric’s configuration will be available here.
Example Options:
- FLOPs (Avg.): Select Range `From-To` by dragging the slider or entering values directly.
- Memory Bandwidth (Avg.): Select Range `From-To` by dragging the slider or entering values directly.
- Load (Avg.): Select Range `From-To` by dragging the slider or entering values directly.
- Memory Used (Max.): Select Range `From-To` by dragging the slider or entering values directly.
Default: Full metric statistics ranges as configured
Start Time Quick Selections
Quickly select a preconfigured range of job start times. Will display as named start time filter.
When the returned URL is copied and shared, the named filter value will transfer over.
Options: Last 6 hours, Last 24 hours, Last 7 Days, Last 30 Days
Default: No selection
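Each quick selection corresponds to a start time range ending at the current time. The following sketch resolves a named selection into such a range; the dictionary and function names are hypothetical, and the use of UTC is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

# Named quick selections as listed above.
QUICK_RANGES = {
    "Last 6 hours": timedelta(hours=6),
    "Last 24 hours": timedelta(hours=24),
    "Last 7 Days": timedelta(days=7),
    "Last 30 Days": timedelta(days=30),
}


def start_time_range(name: str, now: Optional[datetime] = None) -> Tuple[datetime, datetime]:
    """Resolve a named quick selection into a (from, to) start time range."""
    now = now or datetime.now(timezone.utc)
    return now - QUICK_RANGES[name], now


reference = datetime(2024, 1, 10, 10, 0, tzinfo=timezone.utc)
print(start_time_range("Last 6 hours", reference))
```

Keeping the name rather than fixed timestamps is what allows a shared URL to carry the named filter, so the range is resolved relative to the time the link is opened.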
8.5 - Views
Usage descriptions for each view of the ClusterCockpit web interface.
8.5.1 - My Jobs
The “My Jobs” View is available to all users regardless of authority and displays the user’s personal jobs, i.e. jobs started by this user’s username on the cluster systems.
The view is a personal variant of the user job view and therefore also consists of three components: Basic information about the user’s jobs, selectable statistic histograms of the jobs, and a generalized job list.
Users are able to change the sorting, select and reorder the rendered metrics, filter, and activate a periodic reload of the data.
User Information and Basic Distributions
The top row always displays personal usage information, independent of the selected filters.
Additional histograms depicting the distribution of job duration and number of nodes occupied by the returned jobs are affected by the selected filters.
Information displayed:
- Username
- Total Jobs
- Short Jobs (as defined by the configuration, default: less than 300 second runtime)
- Total Walltime
- Total Core Hours
Selectable Histograms
Histograms depicting the distribution of the selected jobs’ statistics can be selected from the top navbar “Select Histograms” button. The displayed data is based on the jobs returned from active filters, and will be pulled from the database.
Note: Only metrics for which the `footprint` flag is set in the respective metric’s configuration will be available here.
Job List
The job list displays all jobs started by your username on the systems. Additional filters will always respect this limitation. For a detailed description of the job list component, see the related documentation.
8.5.2 - User Jobs
The “User Jobs” View is only available to management and supporting staff and displays jobs of the selected user, i.e. jobs started by this user’s username on the cluster systems.
The view consists of three components: Basic information about the user’s jobs, selectable statistic histograms of the jobs, and a generalized job list.
Users are able to change the sorting, select and reorder the rendered metrics, filter, and activate a periodic reload of the data.
User Information and Basic Distributions
The top row always displays information about the user, independent of the selected filters.
Additional histograms depicting the distribution of job duration and number of nodes occupied by the returned jobs are affected by the selected filters.
Information displayed:
- Username
- Total Jobs
- Short Jobs (as defined by the configuration, default: less than 300 second runtime)
- Total Walltime
- Total Core Hours
Selectable Histograms
Histograms depicting the distribution of the selected jobs’ statistics can be selected from the top navbar “Select Histograms” button. The displayed data is based on the jobs returned from active filters, and will be pulled from the database.
Note: Only metrics for which the `footprint` flag is set in the respective metric’s configuration will be available here.
Job List
The job list displays all jobs started by this user’s username on the systems. Additional filters will always respect this limitation. For a detailed description of the job list component, see the related documentation.
8.5.3 - Job List
The primary view of ClusterCockpit’s web interface is the tabular listing of jobs, which displays various information about the jobs returned by the selected filters. This information includes the job’s full meta data, such as runtime or job state, as well as an optional footprint, allowing quick assessment of the job’s performance.
Most importantly, the list displays a selectable array of metrics as time dependent metric plots, which allows detailed insight into the job’s performance at a glance.
Note: For users with the `manager` role, this view is labelled as ‘Managed Jobs’. Displayed jobs are limited to jobs started by users of the managed projects (usergroups); otherwise the functionality is identical, e.g. filtering or footprint display.
Job List Toolbar
Several options allow configuration of the displayed data, which are also persisted for each user individually, either for general usage or by cluster.
Sorting
Basic selection of sorting parameter and direction. By default, jobs are sorted by starting timestamp in descending order (latest jobs first). Other selections to sort by are:
- Duration
- Number of Nodes
- Number of Hardware-Threads
- Number of Accelerators
- Total Energy Consumed
- Additional configured Metric Statistics
- …
Note: Only metrics for which the `footprint` flag is set in the respective metric’s configuration will be available as additional sorting options.
Switching of the sort direction is achieved by clicking on the arrow icon next to the desired sorting parameter.
Metrics
Selection of metrics shown in the tabular view for each job. The list is compiled from all available configured metrics of the ClusterCockpit instance, and the tabular view will be updated upon applying the changes.
In addition to the metric names themselves, the availability by cluster is indicated as a comma-separated list next to the metric identifier. This information will change to the availability by partition if the cluster filter is active.
It is furthermore possible to edit the order of the selected metrics. This can be achieved by dragging and dropping the metric selectors to the desired order, where the topmost metric will be displayed next to the “Job Info” column, and additional metrics will be added on the right side.
Lastly, the optional “Footprint” Column can be activated (and deactivated) here. It will always be rendered next to the “Job Info” column, while metrics start right of the “Footprint” column, if activated.
Filters
Selection of filters applied to the queried jobs. By default, no filters are activated if the view was opened via the navigation bar. At multiple locations throughout the web interface, direct links will lead to this view with one or more preset filters active, e.g. selecting a cluster’s “running jobs” from the home page will open this view displaying only running jobs of that cluster.
Possible options are:
- Cluster/Partition: Filter by configured cluster (and partitions thereof)
- Job State: Filter by defined job state(s)
- Start Time: Filter by start timestamp
- Duration: Filter by job duration
- Tags: Filter by tags assigned to jobs
- Resources: Filter by allocated resources or named node
- Energy: Filter by consumed total energy (for completed jobs only)
- Statistics: Filter by average usage of defined metrics
Each filter and its default value is described in detail here.
Job Count
The total number of jobs returned by the backend for the given set of filters.
Search and Reload
Search for specific jobname, project or username (privileged only) using the searchbox by selecting from the dropdown and entering the query.
Force a complete reload of the table data, or set a timed periodic reload (30, 60, 120, 300 Seconds).
Search for specific project
If the Job-List was opened via a ProjectId-Link or the Projects List, the text search will be fixed to the selected project, and allows for filtering jobnames and users in that project, as indicated by the placeholder text.
If desired, the fixed project can be removed by pressing the button right of the input field, returning the joblist to its default state.
Job List Table
The main component of the job list view renders data pulled from the database, the job archive (completed jobs) and the configured metric data source (running jobs).
Job Info
The meta data containing general information about the job is represented in the “Job Info” column, which is always the first column to be rendered. From here, users can navigate to the detailed view of one specific job as well as the user or project specific job lists.
Field | Example | Description | Destination |
---|---|---|---|
Job Id | 123456 | The JobId of the job assigned by the scheduling daemon | Job View |
Job Name | myJobName | The name of the job as supplied by the user | - |
Username | abcd10 | The username of the submitting user | User Jobs |
Project | abcd | The name of the usergroup the submitting user belongs to | Joblist with preset Filter |
Resources | n100 | Indicator for the allocated resources. Single resources will be displayed by name, i.e. exclusive single-node jobs or shared resources. Multiples of resources will be indicated by icons for nodes, CPU Threads, and accelerators. | - |
Partition | main | The cluster partition this job was started on | - |
Start Timestamp | 10.1.2024, 10:00:00 | The epoch timestamp the job was started at, formatted for human readability | - |
Duration | 0:21:10 | The runtime of the job, will be updated for running jobs on reload. Additionally indicates the state of the job as colored pill | - |
Walltime | 24:00:00 | The allocated walltime for the job as per job submission script | - |
Footprint
The optional footprint column will show base metrics for job performance at a glance, and will hint to performance (and performance problems) in regard to configurable metric thresholds.
Note: Only metrics for which the `footprint` flag is set in the respective metric’s configuration will be shown in this view.
Examples:
Field | Description | Note |
---|---|---|
cpu_load | Average CPU utilization | - |
flops_any | Floprate calculated as f_any = (f_double x 2) + f_single | - |
mem_bw | Average memory bandwidth used | Non-GPU Cluster only |
mem_used | Maximum memory used | Non-GPU Cluster only |
acc_utilization | Average accelerator utilization | GPU Cluster Only |
Colors and icons differentiate between the different warning states based on the configured thresholds of the metrics. Reported metric values below the warning threshold simply report bad performance in one or more metrics, and should therefore be inspected by the user for future performance improvement.
Metric values colored in blue, however, usually report performance above the expected levels, which is exactly why these metrics should be inspected as well. The “maximum” thresholds are often the theoretically achievable performance of the respective hardware component, but they are rarely reached in practice. Inspecting jobs reporting such levels can reveal averaging errors, unrealistic spikes in the metric data, or even bugs in the code of ClusterCockpit.
Color | Level | Description | Note |
---|---|---|---|
Blue | Info | Metric value below maximum configured peak threshold | Job performance above expected parameters - Inspection recommended |
Green | OK | Metric value below normal configured threshold | Job performance within expected parameters |
Yellow | Caution | Metric value below configured caution threshold | Job performance might be impacted |
Red | Warning | Metric value below configured warning threshold | Job performance impacted with high probability - Inspection recommended |
Dark Grey | Error | Metric value extremely above maximum configured threshold | Inspection required - Metric spikes in affected metrics can lead to erroneous average values |
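The level table can be read as a chain of threshold comparisons. The sketch below is one possible interpretation, not the actual footprint code: the field names `peak`, `normal`, `caution`, and `warning` are hypothetical, higher values are assumed to be better, and "extremely above maximum" is simplified to "above the peak threshold".

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    """Hypothetical per-metric thresholds, ordered peak > normal > caution > warning."""
    peak: float
    normal: float
    caution: float
    warning: float


def footprint_level(value: float, t: Thresholds) -> str:
    """Classify a footprint value into the levels of the table above."""
    if value > t.peak:
        return "Error"    # dark grey: above the maximum (simplifying assumption)
    if value > t.normal:
        return "Info"     # blue: above expected, but below peak
    if value > t.caution:
        return "OK"       # green: within expected parameters
    if value > t.warning:
        return "Caution"  # yellow: performance might be impacted
    return "Warning"      # red: at or below the warning threshold


thresholds = Thresholds(peak=100.0, normal=80.0, caution=50.0, warning=20.0)
for value in (500.0, 90.0, 60.0, 30.0, 10.0):
    print(value, footprint_level(value, thresholds))
```

How values exactly on a threshold are treated is a choice made for this sketch only.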
Metric Row
Selected metrics are rendered here in the selected order as metric line plots. Aspects of the rendering can be configured on the settings page.
8.5.4 - Job
The job view displays all data related to one specific job in full detail, and allows detailed inspection of all metrics at several scopes, as well as manual tagging of the job.
Top Bar
The top bar of each job view replicates the “Job Info” and “Footprint” seen in the job list, and additionally renders general metric information in specialized plots.
For shared jobs, a list of jobs which run (or ran) concurrently is shown as well.
Job Info
Identical to the job list equivalent, this component displays meta data containing general information about the job. From here, users can navigate to the detailed view of one specific job as well as the user or project specific job lists.
Field | Example | Description | Destination |
---|---|---|---|
Job Id | 123456 | The JobId of the job assigned by the scheduling daemon | Job View |
Job Name | myJobName | The name of the job as supplied by the user | - |
Username | abcd10 | The username of the submitting user | User Jobs |
Project | abcd | The name of the usergroup the submitting user belongs to | Joblist with preset Filter |
Resources | n100 | Indicator for the allocated resources. Single resources will be displayed by name, i.e. exclusive single-node jobs or shared resources. Multiples of resources will be indicated by icons for nodes, CPU Threads, and accelerators. | - |
Partition | main | The cluster partition this job was started on | - |
Start Timestamp | 10.1.2024, 10:00:00 | The epoch timestamp the job was started at, formatted for human readability | - |
Duration | 0:21:10 | The runtime of the job, will be updated for running jobs on reload. Additionally indicates the state of the job as colored pill | - |
Walltime | 24:00:00 | The allocated walltime for the job as per job submission script | - |
At the bottom, all tags attached to the job are listed. Users can manage attached tags via the “manage X Tag(s)” button.
Concurrent Jobs
In the case of a shared job, a second tab next to the job info will display all jobs which were run on the same hardware at the same time. “At the same time” is defined as “has a starting or ending time which lies between the starting and ending time of the reference job” for this purpose.
A cautious period of five minutes is applied to both limits, in order to restrict display of jobs which have too little overlap, and would just clutter the resulting list of jobs.
Each overlapping job is listed with its `jobId` as a link leading to this job’s detailed job view.
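The overlap definition above can be written as a small predicate. This sketch follows the quoted definition literally and assumes the five-minute period shrinks the reference window on both sides; names and the use of epoch seconds are hypothetical, and the real query may differ.

```python
MARGIN_SECONDS = 5 * 60  # the "cautious period" of five minutes


def is_concurrent(
    ref_start: int,
    ref_stop: int,
    job_start: int,
    job_stop: int,
    margin: int = MARGIN_SECONDS,
) -> bool:
    """True if the job starts or ends within the reference job's runtime.

    All times are epoch seconds. The margin shrinks the reference window on
    both sides, so jobs with very little overlap are not listed (assumption).
    """
    lower = ref_start + margin
    upper = ref_stop - margin
    return lower <= job_start <= upper or lower <= job_stop <= upper


# Reference job runs from t=1000 to t=5000.
print(is_concurrent(1000, 5000, 4000, 9000))  # starts well inside the window
print(is_concurrent(1000, 5000, 4900, 9000))  # overlaps by less than the margin
```

Read literally, the definition does not match a job that starts before and ends after the reference job; the sketch keeps that behavior rather than extending the definition.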
Footprint
Identical to the job list equivalent, this component will show base metrics for job performance at a glance, and will hint to job quality and problems in regard to configurable metric thresholds. In contrast to the job list, it is always active and shown in the detailed job view.
Note: Only metrics for which the `footprint` flag is set in the respective metric’s configuration will be shown in this view.
Examples:
Field | Description | Note |
---|---|---|
cpu_load | Average CPU utilization | - |
flops_any | Floprate calculated as f_any = (f_double x 2) + f_single | - |
mem_bw | Average memory bandwidth used | - |
mem_used | Maximum memory used | Non-GPU Cluster only |
acc_utilization | Average accelerator utilization | GPU Cluster Only |
Colors and icons differentiate between the different warning states based on the configured thresholds of the metrics. Reported metric values below the warning threshold simply report bad performance in one or more metrics, and should therefore be inspected by the user for future performance improvement.
Metric values colored in blue, however, usually report performance above the expected levels, which is exactly why these metrics should be inspected as well. The “maximum” thresholds are often the theoretically achievable performance of the respective hardware component, but they are rarely reached in practice. Inspecting jobs reporting such levels can reveal averaging errors, unrealistic spikes in the metric data, or even bugs in the code of ClusterCockpit.
Color | Level | Description | Note |
---|---|---|---|
Blue | Info | Metric value below maximum configured peak threshold | Job performance above expected parameters - Inspection recommended |
Green | OK | Metric value below normal configured threshold | Job performance within expected parameters |
Yellow | Caution | Metric value below configured caution threshold | Job performance might be impacted |
Red | Warning | Metric value below configured warning threshold | Job performance impacted with high probability - Inspection recommended |
Dark Grey | Error | Metric value far above maximum configured threshold | Inspection required - Metric spikes in affected metrics can lead to erroneous average values |
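One way to read the table above is as a chain of threshold comparisons. The sketch below is only an illustration under assumptions: the function name, the parameter names (`alert`, `caution`, `normal`, `peak`), and in particular `error_factor`, which stands in for "far above the maximum", are hypothetical and not taken from the ClusterCockpit code.

```python
def footprint_level(value, alert, caution, normal, peak, error_factor=2.0):
    """Map a metric value to a footprint level name.

    Assumptions: thresholds are ordered alert < caution < normal < peak,
    higher values are better, and "far above peak" means more than
    error_factor times the peak threshold (hypothetical choice).
    """
    if value > peak * error_factor:
        return "Error"    # dark grey: implausibly high, likely a spike
    if value < alert:
        return "Warning"  # red
    if value < caution:
        return "Caution"  # yellow
    if value < normal:
        return "OK"       # green
    return "Info"         # blue: above expected levels


# Hypothetical thresholds for illustration only.
thresholds = dict(alert=10, caution=20, normal=50, peak=100)
levels = [footprint_level(v, **thresholds) for v in (5, 15, 40, 80, 500)]
```

Here `levels` covers all five rows of the table, from a value below the warning threshold up to a value far beyond the peak.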
Examples
Polar Representation
Next to the footprints, a second tab will render the polar plot representation of the configured footprint metrics. Both the maximum and the average are rendered.
Roofline Representation
A roofline plot representing the utilization of available resources as the relation between computation and memory usage over time (color scale blue -> red).
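For orientation, the quantities behind a roofline plot can be sketched with the standard roofline model. The function names and the example peak values below are assumptions made for illustration; they are not taken from the ClusterCockpit code or from any cluster configuration.

```python
def arithmetic_intensity(flops_any, mem_bw):
    """Operations per byte moved, e.g. GFLOP/s divided by GB/s."""
    return flops_any / mem_bw


def roofline_limit(intensity, peak_flops, peak_mem_bw):
    """Attainable performance in the standard roofline model:
    bounded by memory bandwidth at low intensity and by peak
    compute performance at high intensity."""
    return min(peak_flops, intensity * peak_mem_bw)


# Hypothetical node: 1000 GFLOP/s peak, 200 GB/s peak memory bandwidth.
intensity = arithmetic_intensity(flops_any=100.0, mem_bw=50.0)
memory_bound = roofline_limit(intensity, peak_flops=1000.0, peak_mem_bw=200.0)
compute_bound = roofline_limit(10.0, peak_flops=1000.0, peak_mem_bw=200.0)
```

Each point in the plot corresponds to one such pair of intensity and achieved performance at a given time, with the color scale encoding the time axis.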
Metric Plot Table
The view's middle section consists of metric plots for each metric selected in the “Metrics” selector, which defaults to all configured metrics.
The data shown per metric defaults to the smallest available granularity of the metric with data of all nodes, but can be changed at will by using the drop down selectors above each plot.
If available, the statistical representation can be selected as well, by scope (e.g. stats series (node)).
Tagging
Manual tagging of jobs is performed by using the “Manage Tags” option.
Tags are categorized into three “Scopes” of visibility:
- Admin: Only administrators can create and attach these tags. Only visible for administrators and support personnel.
- Global: Administrators and support personnel can create and attach these tags. Visible for everyone.
- Private: Everyone can create and attach private tags, only visible to the creator.
Available tags are listed, colored by scope, and can be added to the job's database entry simply by pressing the respective button.
The list can be filtered for specific tags by using the “Search Tags” prompt.
New tags can be created by entering a new type:name combination in the search prompt, which will display a button for creating this new tag. Privileged users will additionally be able to select the “Scope” (see above) of the new tag.
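The scope rules and the type:name convention described above can be summarized in a short sketch. All names here (`Scope`, `parse_tag`, `can_create`, `can_see`, and the role strings) are hypothetical and only mirror the rules stated in the text; the actual backend may organize this differently.

```python
from enum import Enum


class Scope(Enum):
    ADMIN = "admin"
    GLOBAL = "global"
    PRIVATE = "private"


def parse_tag(text):
    """Split a 'type:name' entry into its two parts."""
    tag_type, sep, name = text.partition(":")
    tag_type, name = tag_type.strip(), name.strip()
    if not sep or not tag_type or not name:
        raise ValueError("expected a 'type:name' combination")
    return tag_type, name


def can_create(role, scope):
    """Who may create and attach a tag of the given scope.
    Role strings ('admin', 'support', 'user') are assumptions."""
    if scope is Scope.ADMIN:
        return role == "admin"
    if scope is Scope.GLOBAL:
        return role in ("admin", "support")
    return True  # private tags: everyone


def can_see(role, username, scope, owner):
    """Who may see a tag of the given scope."""
    if scope is Scope.ADMIN:
        return role in ("admin", "support")
    if scope is Scope.GLOBAL:
        return True
    return username == owner  # private tags: creator only


parsed = parse_tag("bottleneck: memory")
```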
Statistics and Meta Data
At the bottom of the job view, additional information about the job is collected. By default, the statistics of selected metrics are shown in tabular form, each in the respective metric's native granularity.
Statistics Table
The statistics table collects all metric statistical values (min, max, avg) for each allocated node and each granularity.
The metrics to be displayed can be selected using the “Metrics” selection pop-up window. In the header, next to the metric name, a second drop down allows the selection of the displayed granularity.
Core and Accelerator metrics default to their respective native granularities automatically.
Job Script
This tab displays the job script with which this job was started on the systems.
Slurm Info
This tab displays information returned from the SLURM batch process management software.
8.5.5 - Users
This view lists all users who are, or were, active on the configured clusters. Information about the total number of jobs, walltimes, and calculation usages is shown.
It is possible to filter the list by username using the equally named prompt, which also accepts partial queries.
The filter component allows limitation of the returned users based on job parameters like start timestamp or memory usage.
The table can be sorted by clicking the respective icon next to the column headers.
For users with manager authority, this view will be titled ‘Managed Users’ in the navigation bar. Managers will only be able to see other user accounts of the managed projects.
Details
Column | Description | Note |
---|---|---|
User Name | The user account jobs are associated with | Links to the user's job list with preset filter returning only jobs of this user and additional histograms |
Name | The name of the user | |
Total Jobs | Users’ total of all started jobs | |
Total Walltime | Users’ total requested walltime | |
Total Core Hours | Users’ total of all used core hours | |
Total Accelerator Hours | Users’ total of all used accelerator hours | Please Note: This column is always shown, and will return 0 for clusters without installed accelerators |
8.5.6 - Projects
This view lists all projects (usergroups) which are, or were, active on the configured clusters. Information about the total number of jobs, walltimes, and calculation usages is shown.
It is possible to filter the list by project name using the equally named prompt, which also accepts partial queries.
The filter component allows limitation of the returned projects based on job parameters like start timestamp or memory usage.
The table can be sorted by clicking the respective icon next to the column headers.
For users with manager authority, this view will be titled ‘Managed Projects’ in the navigation bar. Managers will only be able to see collected data of managed projects.
Details
Column | Description | Note |
---|---|---|
Project Name | The project (usergroup) jobs are associated with | Links to a job list with preset filter returning only jobs of this project |
Total Jobs | Project total of all started jobs | |
Total Walltime | Project total requested walltime | |
Total Core Hours | Project total of all used core hours | |
Total Accelerator Hours | Project total of all used accelerator hours | Please Note: This column is always shown, and will return 0 for clusters without installed accelerators |
8.5.7 - Tags
This view lists all tags currently used within the ClusterCockpit instance:
- The Tag Type of the tag(s) is displayed as a dark grey header, collecting all tags which share it, with a total count shown on the right.
- The Names of all tags sharing one Tag Type, the number of matching jobs per name, and the scope are rendered as pills below the header, colored accordingly (see below).
Each tag's pill is clickable, and leads to a job list with a preset filter matching only jobs tagged with this specific label.
Tags are categorized into three “Scopes” of visibility, and colored accordingly:
- Admin (Cyan): Only administrators can create and attach these tags. Only visible for administrators and support personnel.
- Global (Purple): Administrators and support personnel can create and attach these tags. Visible for everyone.
- Private (Yellow): Everyone can create and attach private tags, only visible to the creator.
8.5.8 - Nodes
The nodes view, or systems view, is always called in respect to one specified cluster. It displays the current state of all nodes in that cluster in respect to one selected metric, rendered in form of metric plots, and independent of job meta data, i.e. without consideration for job start and end timestamps.
Selection Bar
Selections regarding the display, and update, of the plots rendered in the node table can be performed here:
- Find Node: Filter the node table by hostname. Partial queries are possible.
- Displayed Timerange: Select the timeframe to be rendered in the node table
  - Custom: Select the timestamps from and to between which the data should be fetched. It is possible to select date and time.
  - 15 Minutes, 30 Minutes, 1 Hour, 2 Hours, 4 Hours, 12 Hours, 24 Hours
- Metric: Select the metric to be fetched for all nodes. If no data can be fetched, messages are displayed per node.
- (Periodic) Reload: Force a reload of fresh data from the backend, or set a periodic reload at a specified interval
  - 30 Seconds, 60 Seconds, 120 Seconds, 5 Minutes
Node Table
Nodes (hosts) are ordered alphanumerically in this table, rendering the selected metric in the selected timeframe.
Each heading links to the singular node view of the respective host.
8.5.9 - Node
The node view is always called in respect to one specified cluster and one specified node (host). It displays the current state of all metrics for that node, rendered in form of metric plots, and independent of job meta data, i.e. without consideration for job start and end timestamps.
Selection Bar
Information and selections regarding the data of the plots rendered in the node table can be performed here:
- Name: The hostname of the selected node
- Displayed Timerange: Select the timeframe to be rendered in the node table
  - Custom: Select the timestamps from and to between which the data should be fetched. It is possible to select date and time.
  - 15 Minutes, 30 Minutes, 1 Hour, 2 Hours, 4 Hours, 12 Hours, 24 Hours
- Activity: Number of jobs currently allocated to this node. Exclusively used nodes will always display 1 if a job is running at the moment, or 0 if not.
  - The “Show List” button leads to the job list with a preset filter fetching only jobs currently allocated on this node.
- (Periodic) Reload: Force a reload of fresh data from the backend, or set a periodic reload at a specified interval
  - 30 Seconds, 60 Seconds, 120 Seconds, 5 Minutes
Node Table
Metrics are ordered alphanumerically in this table, rendering each metric in the selected timeframe.
8.5.10 - Analysis
The analysis view is always called in respect to one specified cluster. It collects and renders data based on the jobs returned by the active filters, which can be specified to a high detail, allowing analysis of specific aspects.
General Information
The general information section of the analysis view is always rendered and consists of the following elements:
Totals
Total counts of collected data based on the returned jobs matching the requested filters:
- Total Jobs
- Total Short Jobs (By default defined as jobs shorter than 5 minutes)
- Total Walltime
- Total Node Hours
- Total Core Hours
- Total Accelerator Hours
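The totals listed above can be derived from the filtered job list roughly as follows. This is a sketch under assumptions: the `Job` fields are hypothetical, durations are taken in seconds, and core and accelerator counts are assumed to be totals across all nodes of a job.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical fields for illustration; not the backend's job schema.
    duration_s: int
    num_nodes: int
    num_cores: int
    num_acc: int = 0


SHORT_JOB_LIMIT_S = 5 * 60  # default definition of a short job


def totals(jobs):
    """Aggregate counts and resource hours over a list of jobs."""
    def hours(seconds):
        return seconds / 3600

    return {
        "total_jobs": len(jobs),
        "short_jobs": sum(1 for j in jobs if j.duration_s < SHORT_JOB_LIMIT_S),
        "walltime_h": hours(sum(j.duration_s for j in jobs)),
        "node_hours": hours(sum(j.duration_s * j.num_nodes for j in jobs)),
        "core_hours": hours(sum(j.duration_s * j.num_cores for j in jobs)),
        "acc_hours": hours(sum(j.duration_s * j.num_acc for j in jobs)),
    }


jobs = [Job(7200, 2, 144), Job(180, 1, 72), Job(3600, 4, 288, 16)]
summary = totals(jobs)
```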
Top Users and Projects
The ten most active users or projects are rendered in a combination of pie chart and tabular legend with values displayed. By default, the top ten users with the most jobs matching the selected filters will be shown.
Hovering over one of the pie chart fractions will display a legend featuring the identifier and value of the selected parameter.
The selection can be changed directly in the headers of the pie chart and the table. The available options are:
Element | Options |
---|---|
Pie Chart | Users, Projects |
Table | Walltime, Node Hours, Core Hours, Accelerator Hours |
The selection is saved for each user and cluster, and will select the last chosen types of list as default the next time this view is opened.
“User Names” and “Project Codes” are rendered as links, leading to user job lists or project job lists with preset filters for cluster and entity ID.
Heatmap Roofline
A roofline plot representing the utilization of available resources as the relation between computation and memory for all jobs matching the filters. In order to represent the data in a meaningful way, the time information of the raw data is abstracted and represented as a heat map, with increasingly red sections of the roofline plot being the most populated regions of utilization.
Histograms
Two histograms depicting the duration and number of allocated cores distributions for the returned jobs matching the filters.
Selectable Data Representations
The second half of the analysis view consists of areas reserved for rendering user-selected data representations.
- Select Plots for Histograms: Opens a selector listing all configured metrics of the respective cluster. One or more metrics can be selected, and the data returned will be rendered as average distributions normalized by node hours (core hours, accelerator hours; depending on the metric).
- Select Plots in Scatter Plots: Opens a selector which allows selection of user chosen combinations of configured metrics for the respective cluster. Selected duplets will be rendered as scatter bubble plots for each selected pair of metrics.
Average Distribution Histograms
These histograms show the distribution of the normalized averages of all jobs matching the filters, split into 50 bins for high detail.
Normalization is achieved by weighting the selected metric data job averages by node hours (default), or by either accelerator hours (for native accelerator scope metrics) or core hours (for native core scope metrics).
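The weighting described above can be pictured as a weighted histogram: each job's average falls into one of 50 bins, and contributes its node hours (or core or accelerator hours) instead of a plain count. The function below is a simplified sketch of that idea with hypothetical names; it is not the query the backend actually runs.

```python
def weighted_histogram(averages, weights, bins=50):
    """Bin job averages into equal-width bins, adding each job's
    weight (e.g. node hours) to its bin instead of counting jobs."""
    counts = [0.0] * bins
    if not averages:
        return counts
    lo, hi = min(averages), max(averages)
    # Fall back to a width of 1.0 if all averages are identical.
    width = (hi - lo) / bins or 1.0
    for avg, weight in zip(averages, weights):
        # The maximum value would land one past the end; clamp it.
        idx = min(int((avg - lo) / width), bins - 1)
        counts[idx] += weight
    return counts


# Three jobs: per-job metric averages and their node hours.
hist = weighted_histogram([0.0, 10.0, 10.0], [2.0, 1.0, 3.0])
```

In this example the two jobs with the same average share the last bin, so long-running or large jobs dominate the distribution in proportion to the resources they used.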
User Defined Scatterplots
Bubble scatter plots show the position of the averages of two selected metrics in relation to each other.
Each circle represents one job, while the size of a circle is proportional to its node hours. Darker circles mean multiple jobs have the same averages for the respective metric selection.
8.5.11 - Status
The status view is always called in respect to one specified cluster. It displays the current state of utilization of the respective cluster's resources, as well as user and project top lists and distribution histograms of the allocated resources per job.
2 Minutes.
Utilization Information
For each subcluster, utilization is displayed in two parts rendered in one row.
Gauges
Simple gauge representation of the current utilization of available resources
Field | Description | Note |
---|---|---|
Allocated Nodes | Number of nodes currently allocated in respect to maximum available | - |
Flop Rate (Any) | Currently achieved flop rate in respect to theoretical maximum | Floprate calculated as f_any = (f_double x 2) + f_single |
MemBW Rate | Currently achieved memory bandwidth in respect to technical maximum | - |
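As a worked example of the gauge values, the combined flop rate follows the formula from the table, and each gauge relates a current value to its maximum. The function names and the numbers below are illustrative assumptions only.

```python
def flops_any(f_double, f_single):
    """Combined flop rate as given in the table:
    f_any = (f_double x 2) + f_single."""
    return f_double * 2 + f_single


def utilization_percent(current, maximum):
    """Share of the maximum that is currently achieved, in percent."""
    if maximum <= 0:
        return 0.0
    return 100.0 * current / maximum


# Hypothetical subcluster: 100 GFLOP/s double, 50 GFLOP/s single precision,
# measured against an assumed theoretical maximum of 1000 GFLOP/s.
rate = flops_any(f_double=100.0, f_single=50.0)
gauge = utilization_percent(rate, maximum=1000.0)
```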
Roofline
A roofline plot representing the utilization of available resources as the relation between computation and memory for each currently allocated, running job at the time of the latest data retrieval. Therefore, no time information is represented (all dots in blue, representing one job each).
Top Users and Projects
The ten most active users or projects are rendered in a combination of pie chart and tabular legend. By default, the top ten users or projects with the most allocated, running jobs are listed.
The selection can be changed directly in the table's header at Number of ..., and can be changed to
- Jobs (Default)
- Nodes
- Cores
- Accelerators
The selection is saved for each user and cluster, and will select the last chosen type of list as default the next time this view is rendered.
Hovering over one of the pie chart fractions will display a legend featuring the identifier and value of the selected parameter.
“User Names” and “Project Codes” are rendered as links, leading to user job lists or project job lists with preset filters for cluster, entity ID, and state == running.
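The top lists described in this section amount to grouping running jobs by user or project and ranking the groups by the selected quantity. A minimal sketch, assuming jobs are plain dictionaries with hypothetical keys:

```python
from collections import Counter


def top_entities(jobs, key="user", weight="jobs", limit=10):
    """Rank users or projects by number of jobs (default) or by a
    summed resource count such as nodes, cores, or accelerators."""
    totals = Counter()
    for job in jobs:
        totals[job[key]] += 1 if weight == "jobs" else job[weight]
    return totals.most_common(limit)


# Hypothetical running jobs; field names are assumptions.
running = [
    {"user": "alice", "project": "p1", "nodes": 4},
    {"user": "bob", "project": "p1", "nodes": 16},
    {"user": "alice", "project": "p2", "nodes": 2},
]

by_jobs = top_entities(running)
by_nodes = top_entities(running, weight="nodes")
by_project = top_entities(running, key="project")
```

Switching the `weight` argument corresponds to changing the selection in the table header, which can reorder the list even though the underlying jobs are the same.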
Statistic Histograms
Several histograms depicting the utilization of the cluster's resources, based on all currently running jobs, are rendered here:
- Duration Distribution
- Number of Nodes Distribution
- Number of Cores Distribution
- Number of Accelerators Distribution
Additional Histograms showing specified footprint metrics across all systems can be selected via the “Select histograms” menu next to the refresher tool.
Only metrics for which the footprint flag is set in the respective metrics’ configuration will be shown.