1 - BeeGFS on Demand collector
Toplevel beegfsmetaMetric
This collector collects BeeGFS on Demand (BeeOND) metadata client statistics.
"beegfs_meta": {
"beegfs_path": "/usr/bin/beegfs-ctl",
"exclude_filesystem": [
"/mnt/ignore_me"
],
"exclude_metrics": [
"ack",
"entInf",
"fndOwn"
]
}
The BeeGFS On Demand (BeeOND) collector uses the beegfs-ctl command to read performance metrics for BeeGFS filesystems.
The reported filesystems can be filtered with the exclude_filesystem option in the configuration.
The path to the beegfs-ctl command can be configured with the beegfs_path option in the configuration.
When using the exclude_metrics option, the excluded metrics are summed as other.
Important: The metrics listed below are similar to the names used by BeeGFS. The collector prefixes them with beegfs_cstorage (beegfs client storage).
For example, the beegfs metric open becomes beegfs_cstorage_open.
Available Metrics:
- sum
- ack
- close
- entInf
- fndOwn
- mkdir
- create
- rddir
- refrEnt
- mdsInf
- rmdir
- rmLnk
- mvDirIns
- mvFiIns
- open
- ren
- sChDrct
- sAttr
- sDirPat
- stat
- statfs
- trunc
- symlnk
- unlnk
- lookLI
- statLI
- revalLI
- openLI
- createLI
- hardlnk
- flckAp
- flckEn
- flckRg
- dirparent
- listXA
- getXA
- rmXA
- setXA
- mirror
The collector adds a filesystem tag to all metrics.
2 - BeeGFS on Demand collector
Toplevel beegfsstorageMetric
This collector collects BeeGFS on Demand (BeeOND) storage statistics.
"beegfs_storage": {
"beegfs_path": "/usr/bin/beegfs-ctl",
"exclude_filesystem": [
"/mnt/ignore_me"
],
"exclude_metrics": [
"ack",
"storInf",
"unlnk"
]
}
The BeeGFS On Demand (BeeOND) collector uses the beegfs-ctl command to read performance metrics for BeeGFS filesystems.
The reported filesystems can be filtered with the exclude_filesystem option in the configuration.
The path to the beegfs-ctl command can be configured with the beegfs_path option in the configuration.
When using the exclude_metrics option, the excluded metrics are summed as other.
Important: The metrics listed below are similar to the names used by BeeGFS. The collector prefixes them with beegfs_cstorage_ (beegfs client storage).
For example, the beegfs metric open becomes beegfs_cstorage_open.
Note: BeeGFS exposes a lot of metadata information. It probably makes sense to exclude most of it. Nevertheless, these excluded metrics will be summed as beegfs_cstorage_other.
Available Metrics:
- sum
- ack
- sChDrct
- getFSize
- sAttr
- statfs
- trunc
- close
- fsync
- ops-rd
- MiB-rd/s
- ops-wr
- MiB-wr/s
- endbg
- hrtbeat
- remNode
- storInf
- unlnk
The collector adds a filesystem tag to all metrics.
6 - customcmd collector
Toplevel customCmdMetric
"customcmd": {
"exclude_metrics": [
"mymetric"
],
"files" : [
"/var/run/myapp.metrics"
],
"commands" : [
"/usr/local/bin/getmetrics.pl"
]
}
The customcmd collector reads data from files and from the output of executed commands. The files and commands can output multiple metrics (separated by newlines), but they have to be in the InfluxDB line protocol. If a metric is not parsable, it is skipped. If a metric is not required, it can be excluded from forwarding to the sink.
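For illustration, a metrics file or command output in the InfluxDB line protocol could look like this (the metric names and values are made up):
$ cat /var/run/myapp.metrics
myapp_requests,type=node value=1234 1680000000000000000
myapp_latency_ms,type=node value=12.5 1680000000000000000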
7 - diskstat collector
Toplevel diskstatMetric
"diskstat": {
"exclude_metrics": [
"disk_total"
]
}
The diskstat collector reads data from /proc/self/mounts and outputs a handful of node metrics. If a metric is not required, it can be excluded from forwarding to the sink.
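For orientation, /proc/self/mounts lists one mount per line (device, mount point, filesystem type, options); the device names below are only examples:
$ cat /proc/self/mounts
/dev/sda1 / ext4 rw,relatime 0 0
/dev/nvme0n1p1 /home ext4 rw,relatime 0 0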
Metrics per device (with device tag):
- disk_total (unit GBytes)
- disk_free (unit GBytes)
Global metrics:
- part_max_used (unit percent)
8 - gpfs collector
Toplevel gpfsMetric
"ibstat": {
"mmpmon_path": "/path/to/mmpmon",
"exclude_filesystem": [
"fs1"
],
"send_bandwidths": true,
"send_total_values": true
}
The gpfs collector uses the mmpmon command to read performance metrics for GPFS / IBM Spectrum Scale filesystems.
The reported filesystems can be filtered with the exclude_filesystem option in the configuration.
The path to the mmpmon command can be configured with the mmpmon_path option in the configuration. If nothing is set, the collector searches in $PATH for mmpmon.
Metrics:
- gpfs_bytes_read
- gpfs_bytes_written
- gpfs_num_opens
- gpfs_num_closes
- gpfs_num_reads
- gpfs_num_writes
- gpfs_num_readdirs
- gpfs_num_inode_updates
- gpfs_bytes_total = gpfs_bytes_read + gpfs_bytes_written (if send_total_values == true)
- gpfs_iops = gpfs_num_reads + gpfs_num_writes (if send_total_values == true)
- gpfs_metaops = gpfs_num_inode_updates + gpfs_num_closes + gpfs_num_opens + gpfs_num_readdirs (if send_total_values == true)
- gpfs_bw_read (if send_bandwidths == true)
- gpfs_bw_write (if send_bandwidths == true)
The collector adds a filesystem tag to all metrics.
9 - ibstat collector
Toplevel infinibandMetric
"ibstat": {
"exclude_devices": [
"mlx4"
],
"send_abs_values": true,
"send_derived_values": true
}
The ibstat collector includes all InfiniBand devices that can be found below /sys/class/infiniband/ and where any of the ports provides a LID file (/sys/class/infiniband/<dev>/ports/<port>/lid).
The devices can be filtered with the exclude_devices option in the configuration.
For each found LID the collector reads data through the sysfs files below /sys/class/infiniband/<device>. (See: https://www.kernel.org/doc/Documentation/ABI/stable/sysfs-class-infiniband)
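For reference, the LID file and the port counters used by the collector are plain sysfs files; the device and port below (mlx5_0, port 1) are only examples:
$ cat /sys/class/infiniband/mlx5_0/ports/1/lid
0x2f
$ cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_data
123456789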
Metrics:
- ib_recv
- ib_xmit
- ib_recv_pkts
- ib_xmit_pkts
- ib_total = ib_recv + ib_xmit (if send_total_values == true)
- ib_total_pkts = ib_recv_pkts + ib_xmit_pkts (if send_total_values == true)
- ib_recv_bw (if send_derived_values == true)
- ib_xmit_bw (if send_derived_values == true)
- ib_recv_pkts_bw (if send_derived_values == true)
- ib_xmit_pkts_bw (if send_derived_values == true)
The collector adds a device tag to all metrics.
11 - ipmistat collector
Toplevel ipmiMetric
"ipmistat": {
"ipmitool_path": "/path/to/ipmitool",
"ipmisensors_path": "/path/to/ipmi-sensors",
}
The ipmistat collector reads data from ipmitool (ipmitool sensor) or ipmi-sensors (ipmi-sensors --sdr-cache-recreate --comma-separated-output).
The metrics depend on the output of the underlying tools but contain temperature, power and energy metrics.
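To give an impression of the parsed input, ipmitool sensor prints one sensor per line; the sensor names, values and threshold columns below are illustrative and depend on the BMC:
$ ipmitool sensor
CPU1 Temp        | 45.000     | degrees C  | ok    | na | na | na | 85.000 | 90.000 | na
PSU1 Power       | 180.000    | Watts      | ok    | na | na | na | na     | na     | na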
12 - likwid collector
Toplevel likwidMetric
The likwid collector is probably the most complicated collector. The LIKWID library is included as a static library with direct access mode. The direct access mode is suitable if the daemon is executed by a root user. The static library does not contain the performance groups, so all information needs to be provided in the configuration.
"likwid": {
"force_overwrite" : false,
"invalid_to_zero" : false,
"liblikwid_path" : "/path/to/liblikwid.so",
"accessdaemon_path" : "/folder/that/contains/likwid-accessD",
"access_mode" : "direct or accessdaemon or perf_event",
"lockfile_path" : "/var/run/likwid.lock",
"eventsets": [
{
"events" : {
"COUNTER0": "EVENT0",
"COUNTER1": "EVENT1"
},
"metrics" : [
{
"name": "sum_01",
"calc": "COUNTER0 + COUNTER1",
"publish": false,
"unit": "myunit",
"type": "hwthread"
}
]
}
],
"globalmetrics" : [
{
"name": "global_sum",
"calc": "sum_01",
"publish": true,
"unit": "myunit",
"type": "hwthread"
}
]
}
The likwid configuration consists of two parts, the eventsets and globalmetrics:
- An event set list itself has two parts, the events and a set of derivable metrics. Each of the events is a counter:event pair in LIKWID’s syntax. The metrics are a list of formulas to derive the metric value from the measurements of the events’ values. Each metric has a name, the formula, a type and a publish flag. There is an optional unit field. Counter names can be used like variables in the formulas, so PMC0+PMC1 sums the measurements for the two events configured in the counters PMC0 and PMC1. You can optionally use time for the measurement time and inverseClock for 1.0/baseCpuFrequency. The type tells the LikwidCollector whether it is a metric for each hardware thread (hwthread) or each CPU socket (socket). You may specify a unit for the metric with unit. The last one is the publishing flag. It tells the LikwidCollector whether a metric should be sent to the router or is only used internally to compute a global metric.
- The globalmetrics are metrics which require data from multiple event set measurements to be derived. The inputs are the metrics in the event sets. Similar to the metrics in the event sets, the global metrics are defined by a name, a formula, a type and a publish flag. See event set metrics for details. The only difference is that there is no access to the raw event measurements anymore but only to the metrics. Also, time and inverseClock cannot be used anymore. So, the idea is to derive a metric in the eventsets section and reuse it in the globalmetrics part. If you need a metric only for deriving the global metrics, disable forwarding of the event set metrics ("publish": false). Be aware that the combination might be misleading because the “behavior” of a metric changes over time and the multiple measurements might count different computing phases. Similar to the metrics in the eventset, you can specify a metric unit with the unit field.
Additional options:
- force_overwrite: Same as setting LIKWID_FORCE=1. In case counters are already in use, LIKWID overwrites their configuration to do its measurements.
- invalid_to_zero: In some cases, the calculations result in NaN or Inf. With this option, all NaN and Inf values are replaced with 0.0. See the separate section below.
- access_mode: Specify the LIKWID access mode: direct for direct register access as root user or accessdaemon. The access mode perf_event is currently untested.
- accessdaemon_path: Folder of the accessDaemon likwid-accessD (like /usr/local/sbin).
- liblikwid_path: Location of liblikwid.so including the file name, like /usr/local/lib/liblikwid.so.
- lockfile_path: Location of LIKWID’s lock file if multiple tools should access the hardware counters. Default /var/run/likwid.lock.
Available metric types
Hardware performance counters are scattered all over the system nowadays. A counter covers a specific part of the system. While there are hardware thread specific counters for CPU cycles, instructions and so on, some others are specific for a whole CPU socket/package. To address that, the LikwidCollector provides the specification of a type for each metric.
- hwthread: One metric per CPU hardware thread with the tags "type" : "hwthread" and "type-id" : "$hwthread_id"
- socket: One metric per CPU socket/package with the tags "type" : "socket" and "type-id" : "$socket_id"
Note: You cannot specify the socket type for a metric that is measured at hwthread type, so some kind of expert knowledge or lookup work in the LIKWID Wiki is required. Get the type of each counter from the Architecture pages; as soon as one counter in a metric is socket-specific, the whole metric is socket-specific.
As a guideline:
- All counters FIXCx, PMCy and TMAz have the type hwthread
- All counter names containing BOX have the type socket
- All PWRx counters have type socket, except "PWR1" : "RAPL_CORE_ENERGY", which has hwthread type
- All DFCx counters have type socket
Help with the configuration
The configuration for the likwid collector is quite complicated. Most users don’t use LIKWID with the event:counter notation but rely on the performance groups defined by the LIKWID team for each architecture. In order to help with the likwid collector configuration, we included a script scripts/likwid_perfgroup_to_cc_config.py that creates the configuration of an eventset from a performance group (using a LIKWID installation in $PATH):
$ likwid-perfctr -i
[...]
short name: ICX
[...]
$ likwid-perfctr -a
[...]
MEM_DP
MEM
FLOPS_SP
CLOCK
[...]
$ scripts/likwid_perfgroup_to_cc_config.py ICX MEM_DP
{
"events": {
"FIXC0": "INSTR_RETIRED_ANY",
"FIXC1": "CPU_CLK_UNHALTED_CORE",
"..." : "..."
},
"metrics" : [
{
"calc": "time",
"name": "Runtime (RDTSC) [s]",
"publish": true,
"unit": "seconds"
"type": "hwthread"
},
{
"..." : "..."
}
]
}
You can copy this JSON and add it to the eventsets list. If you specify multiple event sets, you can add globally derived metrics in the extra globalmetrics section with the metric names as variables.
Mixed usage between daemon and users
LIKWID checks the file /var/run/likwid.lock before performing any interfering operations. Who is allowed to access the counters is determined by the owner of the file. If it does not exist, it is created for the current user. So, if you want to temporarily allow counter access to a user (e.g. in a job):
Before (SLURM prolog, …)
chown $JOBUSER /var/run/likwid.lock
After (SLURM epilog, …)
chown $CCUSER /var/run/likwid.lock
invalid_to_zero option
In some cases LIKWID returns 0.0 for some events that are further used in processing and may be used as divisor in a calculation. After evaluation of a metric, the result might be NaN or +-Inf. These resulting metrics are commonly not created and not forwarded to the router because the InfluxDB line protocol does not support these special floating-point values. If you want to have them sent, this option forces these metric values to be 0.0 instead.
One might think this does not happen often, but frequently used metrics in the world of performance engineering like Instructions-per-Cycle (IPC) or, even more often, the actual CPU clock are derived from events like CPU_CLK_UNHALTED_CORE (Intel) which do not increment in halted state (as the name implies). There are different power management systems in a chip which can cause a hardware thread to go into such a state. Moreover, if no cycles are executed by the core, many other events are not incremented either (like INSTR_RETIRED_ANY for retired instructions, part of IPC).
lockfile_path option
LIKWID can be configured with a lock file with which the access to the performance monitoring registers can be disabled (only the owner of the lock file is allowed to access the registers). When the lockfile_path option is set, the collector subscribes to changes to this file and stops monitoring if the owner of the lock file changes. This feature is useful when users should be able to perform their own hardware performance counter measurements through LIKWID or any other tool.
send_*_total_values option
- send_core_total_values: Metrics, which are usually collected on a per hardware thread basis, are additionally summed up per CPU core.
- send_socket_total_values: Metrics, which are usually collected on a per hardware thread basis, are additionally summed up per CPU socket.
- send_node_total_values: Metrics, which are usually collected on a per hardware thread basis, are additionally summed up per node.
Example configuration
AMD Zen3
"likwid": {
"force_overwrite" : false,
"invalid_to_zero" : false,
"eventsets": [
{
"events": {
"FIXC1": "ACTUAL_CPU_CLOCK",
"FIXC2": "MAX_CPU_CLOCK",
"PMC0": "RETIRED_INSTRUCTIONS",
"PMC1": "CPU_CLOCKS_UNHALTED",
"PMC2": "RETIRED_SSE_AVX_FLOPS_ALL",
"PMC3": "MERGE",
"DFC0": "DRAM_CHANNEL_0",
"DFC1": "DRAM_CHANNEL_1",
"DFC2": "DRAM_CHANNEL_2",
"DFC3": "DRAM_CHANNEL_3"
},
"metrics": [
{
"name": "ipc",
"calc": "PMC0/PMC1",
"type": "hwthread",
"publish": true
},
{
"name": "flops_any",
"calc": "0.000001*PMC2/time",
"unit": "MFlops/s",
"type": "hwthread",
"publish": true
},
{
"name": "clock",
"calc": "0.000001*(FIXC1/FIXC2)/inverseClock",
"type": "hwthread",
"unit": "MHz",
"publish": true
},
{
"name": "mem1",
"calc": "0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time",
"unit": "Mbyte/s",
"type": "socket",
"publish": false
}
]
},
{
"events": {
"DFC0": "DRAM_CHANNEL_4",
"DFC1": "DRAM_CHANNEL_5",
"DFC2": "DRAM_CHANNEL_6",
"DFC3": "DRAM_CHANNEL_7",
"PWR0": "RAPL_CORE_ENERGY",
"PWR1": "RAPL_PKG_ENERGY"
},
"metrics": [
{
"name": "pwr_core",
"calc": "PWR0/time",
"unit": "Watt"
"type": "socket",
"publish": true
},
{
"name": "pwr_pkg",
"calc": "PWR1/time",
"type": "socket",
"unit": "Watt"
"publish": true
},
{
"name": "mem2",
"calc": "0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time",
"unit": "Mbyte/s",
"type": "socket",
"publish": false
}
]
}
],
"globalmetrics": [
{
"name": "mem_bw",
"calc": "mem1+mem2",
"type": "socket",
"unit": "Mbyte/s",
"publish": true
}
]
}
How to get the eventsets and metrics from LIKWID
The likwid collector reads hardware performance counters at hwthread and socket level. The configuration looks quite complicated, but it is basically copy & paste from LIKWID’s performance groups. The collector went through multiple iterations and tried to use the performance groups directly, but that lacked flexibility. The current way of configuration provides the most flexibility.
The logic is as follows: There are multiple eventsets, each consisting of a list of counters+events and a list of metrics. If you compare a common performance group with the example setting above, there is not much difference:
EVENTSET -> "events": {
FIXC1 ACTUAL_CPU_CLOCK -> "FIXC1": "ACTUAL_CPU_CLOCK",
FIXC2 MAX_CPU_CLOCK -> "FIXC2": "MAX_CPU_CLOCK",
PMC0 RETIRED_INSTRUCTIONS -> "PMC0" : "RETIRED_INSTRUCTIONS",
PMC1 CPU_CLOCKS_UNHALTED -> "PMC1" : "CPU_CLOCKS_UNHALTED",
PMC2 RETIRED_SSE_AVX_FLOPS_ALL -> "PMC2": "RETIRED_SSE_AVX_FLOPS_ALL",
PMC3 MERGE -> "PMC3": "MERGE",
-> }
The metrics are following the same procedure:
METRICS -> "metrics": [
IPC PMC0/PMC1 -> {
-> "name" : "IPC",
-> "calc" : "PMC0/PMC1",
-> "type": "hwthread",
-> "publish": true
-> }
-> ]
The script scripts/likwid_perfgroup_to_cc_config.py might help you.
14 - lustrestat collector
Toplevel lustreMetric
"lustrestat": {
"lctl_command": "/path/to/lctl",
"exclude_metrics": [
"setattr",
"getattr"
],
"send_abs_values" : true,
"send_derived_values" : true,
"send_diff_values": true,
"use_sudo": false
}
The lustrestat collector uses the lctl application with the get_param option to get all llite metrics (Lustre client). The llite metrics are only available for root users. If password-less sudo is configured, you can enable the use_sudo option in the configuration.
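Roughly, the collector runs something like the following (output shortened, values made up; the exact counters depend on the Lustre version):
$ sudo lctl get_param llite.*.stats
llite.example-ffff88810a2b3c00.stats=
snapshot_time             1680000000.123456789 secs.nsecs
read_bytes                1234 samples [bytes] 4096 1048576 523214848
write_bytes               321 samples [bytes] 4096 1048576 98566144
open                      42 samples [reqs]
close                     42 samples [reqs]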
Metrics:
- lustre_read_bytes (unit bytes)
- lustre_read_requests (unit requests)
- lustre_write_bytes (unit bytes)
- lustre_write_requests (unit requests)
- lustre_open
- lustre_close
- lustre_getattr
- lustre_setattr
- lustre_statfs
- lustre_inode_permission
- lustre_read_bw (if send_derived_values == true, unit bytes/sec)
- lustre_write_bw (if send_derived_values == true, unit bytes/sec)
- lustre_read_requests_rate (if send_derived_values == true, unit requests/sec)
- lustre_write_requests_rate (if send_derived_values == true, unit requests/sec)
- lustre_read_bytes_diff (if send_diff_values == true, unit bytes)
- lustre_read_requests_diff (if send_diff_values == true, unit requests)
- lustre_write_bytes_diff (if send_diff_values == true, unit bytes)
- lustre_write_requests_diff (if send_diff_values == true, unit requests)
- lustre_open_diff (if send_diff_values == true)
- lustre_close_diff (if send_diff_values == true)
- lustre_getattr_diff (if send_diff_values == true)
- lustre_setattr_diff (if send_diff_values == true)
- lustre_statfs_diff (if send_diff_values == true)
- lustre_inode_permission_diff (if send_diff_values == true)
This collector adds a device tag.
16 - netstat collector
Toplevel netstatMetric
"netstat": {
"include_devices": [
"eth0"
],
"send_abs_values" : true,
"send_derived_values" : true
}
The netstat collector reads data from /proc/net/dev and outputs a handful of node metrics. With the include_devices list you can specify which network devices should be measured. Note: Most other collectors use an exclude list instead of an include list.
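For orientation, /proc/net/dev provides the per-interface receive and transmit counters the collector parses (example values):
$ cat /proc/net/dev
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth0: 123456789  654321    0    0    0     0          0         0 98765432  543210    0    0    0     0       0          0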
Metrics:
- net_bytes_in (unit=bytes)
- net_bytes_out (unit=bytes)
- net_pkts_in (unit=packets)
- net_pkts_out (unit=packets)
- net_bytes_in_bw (unit=bytes/sec if send_derived_values == true)
- net_bytes_out_bw (unit=bytes/sec if send_derived_values == true)
- net_pkts_in_bw (unit=packets/sec if send_derived_values == true)
- net_pkts_out_bw (unit=packets/sec if send_derived_values == true)
The device name is added as tag stype=network,stype-id=<device>.
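Put together, a published metric might look roughly like this in the InfluxDB line protocol (tag set and field name are illustrative):
net_bytes_in,type=node,stype=network,stype-id=eth0,unit=bytes value=123456789 1680000000000000000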
19 - nfsiostat collector
Toplevel nfsiostatMetric
"nfsiostat": {
"exclude_metrics": [
"nfsio_oread"
],
"exclude_filesystems" : [
"/mnt",
],
"use_server_as_stype": false
}
The nfsiostat collector reads data from /proc/self/mountstats and outputs a handful of node metrics for each NFS filesystem. If a metric or filesystem is not required, it can be excluded from forwarding to the sink.
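As a rough sketch (abridged; see the kernel documentation for the full format), the relevant part of /proc/self/mountstats for an NFS mount looks like:
device nfsserver:/export mounted on /mnt/nfs with fstype nfs statvers=1.1
        bytes:  123456 78910 0 0 123456 78910 30 19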
Metrics:
- nfsio_nread: Bytes transferred by normal read() calls
- nfsio_nwrite: Bytes transferred by normal write() calls
- nfsio_oread: Bytes transferred by read() calls with O_DIRECT
- nfsio_owrite: Bytes transferred by write() calls with O_DIRECT
- nfsio_pageread: Pages transferred by read() calls
- nfsio_pagewrite: Pages transferred by write() calls
- nfsio_nfsread: Bytes transferred for reading from the server
- nfsio_nfswrite: Bytes transferred for writing to the server
The nfsiostat collector adds the mountpoint to the tags as stype=filesystem,stype-id=<mountpoint>. If the server address should be used instead of the mountpoint, use the use_server_as_stype config setting.
21 - nvidia collector
Toplevel nvidiaMetric
"nvidia": {
"exclude_devices": [
"0","1", "0000000:ff:01.0"
],
"exclude_metrics": [
"nv_fb_mem_used",
"nv_fan"
],
"process_mig_devices": false,
"use_pci_info_as_type_id": true,
"add_pci_info_tag": false,
"add_uuid_meta": false,
"add_board_number_meta": false,
"add_serial_meta": false,
"use_uuid_for_mig_device": false,
"use_slice_for_mig_device": false
}
The nvidia collector can be configured to leave out specific devices with the exclude_devices option. It takes IDs as supplied to the NVML with nvmlDeviceGetHandleByIndex() or the PCI address in NVML format (%08X:%02X:%02X.0). Metrics (listed below) that should not be sent to the MetricRouter can be excluded with the exclude_metrics option. Commonly only the physical GPUs are monitored. If MIG devices should be analyzed as well, set process_mig_devices (adds stype=mig,stype-id=<mig_index>). With the options use_uuid_for_mig_device and use_slice_for_mig_device, the <mig_index> can be replaced with the UUID (e.g. MIG-6a9f7cc8-6d5b-5ce0-92de-750edc4d8849) or the MIG slice name (e.g. 1g.5gb).
The metrics sent by the nvidia collector use accelerator as type tag. For the type-id, it uses the device handle index by default. With the use_pci_info_as_type_id option, the PCI ID is used instead. If both values should be added as tags, activate the add_pci_info_tag option. It uses the device handle index as type-id and adds the PCI ID as separate pci_identifier tag.
Optionally, it is possible to add the UUID, the board part number and the serial to the meta information. They are not sent to the sinks (unless configured otherwise).
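A published GPU metric could therefore look roughly like this in the InfluxDB line protocol (tags, field name and values are illustrative, here with use_pci_info_as_type_id enabled):
nv_util,type=accelerator,type-id=00000000:0A:00.0 value=87 1680000000000000000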
Metrics:
nv_util
nv_mem_util
nv_fb_mem_total
nv_fb_mem_used
nv_bar1_mem_total
nv_bar1_mem_used
nv_temp
nv_fan
nv_ecc_mode
nv_perf_state
nv_power_usage
nv_graphics_clock
nv_sm_clock
nv_mem_clock
nv_video_clock
nv_max_graphics_clock
nv_max_sm_clock
nv_max_mem_clock
nv_max_video_clock
nv_ecc_uncorrected_error
nv_ecc_corrected_error
nv_power_max_limit
nv_encoder_util
nv_decoder_util
nv_remapped_rows_corrected
nv_remapped_rows_uncorrected
nv_remapped_rows_pending
nv_remapped_rows_failure
nv_compute_processes
nv_graphics_processes
nv_violation_power
nv_violation_thermal
nv_violation_sync_boost
nv_violation_board_limit
nv_violation_low_util
nv_violation_reliability
nv_violation_below_app_clock
nv_violation_below_base_clock
nv_nvlink_crc_flit_errors
nv_nvlink_crc_errors
nv_nvlink_ecc_errors
nv_nvlink_replay_errors
nv_nvlink_recovery_errors
Some metrics add an additional sub type tag (stype), e.g. the nv_nvlink_* metrics set stype=nvlink,stype-id=<link_number>.
23 - rocm_smi collector
Toplevel rocmsmiMetric
"rocm_smi": {
"exclude_devices": [
"0","1", "0000000:ff:01.0"
],
"exclude_metrics": [
"rocm_mm_util",
"rocm_temp_vrsoc"
],
"use_pci_info_as_type_id": true,
"add_pci_info_tag": false,
"add_serial_meta": false,
}
The rocm_smi collector can be configured to leave out specific devices with the exclude_devices option. It takes logical IDs in the list of available devices or the PCI address similar to NVML format (%08X:%02X:%02X.0). Metrics (listed below) that should not be sent to the MetricRouter can be excluded with the exclude_metrics option.
The metrics sent by the rocm_smi collector use accelerator as type tag. For the type-id, it uses the device handle index by default. With the use_pci_info_as_type_id option, the PCI ID is used instead. If both values should be added as tags, activate the add_pci_info_tag option. It uses the device handle index as type-id and adds the PCI ID as separate pci_identifier tag.
Optionally, it is possible to add the serial to the meta information. It is not sent to the sinks (unless configured otherwise).
Metrics:
rocm_gfx_util
rocm_umc_util
rocm_mm_util
rocm_avg_power
rocm_temp_mem
rocm_temp_hotspot
rocm_temp_edge
rocm_temp_vrgfx
rocm_temp_vrsoc
rocm_temp_vrmem
rocm_gfx_clock
rocm_soc_clock
rocm_u_clock
rocm_v0_clock
rocm_v1_clock
rocm_d0_clock
rocm_d1_clock
rocm_temp_hbm
Some metrics add an additional sub type tag (stype), e.g. the rocm_temp_hbm metrics set stype=device,stype-id=<HBM_slice_number>.
25 - self collector
Toplevel selfMetric
"self": {
"read_mem_stats" : true,
"read_goroutines" : true,
"read_cgo_calls" : true,
"read_rusage" : true
}
The self collector reads the data from the runtime and syscall packages, so it monitors the execution of the cc-metric-collector itself.
Metrics:
- If read_mem_stats == true:
  - total_alloc: The metric reports cumulative bytes allocated for heap objects.
  - heap_alloc: The metric reports bytes of allocated heap objects.
  - heap_sys: The metric reports bytes of heap memory obtained from the OS.
  - heap_idle: The metric reports bytes in idle (unused) spans.
  - heap_inuse: The metric reports bytes in in-use spans.
  - heap_released: The metric reports bytes of physical memory returned to the OS.
  - heap_objects: The metric reports the number of allocated heap objects.
- If read_goroutines == true:
  - num_goroutines: The metric reports the number of goroutines that currently exist.
- If read_cgo_calls == true:
  - num_cgo_calls: The metric reports the number of cgo calls made by the current process.
- If read_rusage == true:
  - rusage_user_time: The metric reports the amount of time that this process has been scheduled in user mode.
  - rusage_system_time: The metric reports the amount of time that this process has been scheduled in kernel mode.
  - rusage_vol_ctx_switch: The metric reports the number of voluntary context switches.
  - rusage_invol_ctx_switch: The metric reports the number of involuntary context switches.
  - rusage_signals: The metric reports the number of signals received.
  - rusage_major_pgfaults: The metric reports the number of major faults the process has made which have required loading a memory page from disk.
  - rusage_minor_pgfaults: The metric reports the number of minor faults the process has made which have not required loading a memory page from disk.
27 - topprocs collector
Toplevel topprocsMetric
"topprocs": {
"num_procs": 5
}
The topprocs collector reads the TopX processes (sorted by CPU utilization, ps -Ao comm --sort=-pcpu).
In contrast to most other collectors, the metric value is a string.
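The underlying command simply lists process names by descending CPU usage; an illustrative run matching num_procs: 5 could look like this (process names are made up):
$ ps -Ao comm --sort=-pcpu | head -n 6
COMMAND
cc-metric-collector
python3
sshd
systemd
kworker/0:1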