## Endpoint
| Query | Default | Purpose |
|---|---|---|
| `start` | execution `start_time` | ISO 8601 start timestamp |
| `end` | execution `end_time`, or current time | ISO 8601 end timestamp |
| `step` | `15s` | Query resolution (e.g. `5s`, `15s`, `1m`) |
## Available series

### GPU (via DCGM)
| Metric | Unit |
|---|---|
| `gpuUtilizationPercent` | 0–100 |
| `gpuMemoryUtilizationPercent` | 0–100 |
| `gpuTemperatureCelsius` | °C |
| `gpuPowerWatt` | W |
| `gpuPowerLimitWatt` | W |
| `gpuClockSmMhz` | MHz |
| `gpuClockMemMhz` | MHz |
| `gpuPcieThroughputRxBytesPerSec` | bytes/sec |
| `gpuPcieThroughputTxBytesPerSec` | bytes/sec |
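The power and temperature series above are enough to flag likely throttling. A minimal sketch, assuming the client returns each series as a plain list of floats (the input format and the thresholds `power_margin` and `temp_limit_c` are assumptions, not part of the API):

```python
# Sketch: flag suspected GPU throttling from sampled series.
# Series names match the table above; the list-of-floats input
# format and both thresholds are assumptions -- adapt to your client.
def throttle_suspected(power_w, power_limit_w, temp_c,
                       power_margin=0.95, temp_limit_c=83.0):
    """Return True if any sample runs near the power limit or too hot."""
    near_limit = any(p >= power_margin * lim
                     for p, lim in zip(power_w, power_limit_w))
    too_hot = any(t >= temp_limit_c for t in temp_c)
    return near_limit or too_hot

# One sample reaches 291 W against a 300 W limit (97% of the cap),
# so throttling is suspected even though temperatures are fine.
print(throttle_suspected([240.0, 291.0], [300.0, 300.0], [71.0, 76.0]))
```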
### System
| Metric | Unit |
|---|---|
| `systemRamTotalBytes` | bytes |
| `systemRamUsedBytes` | bytes |
| `systemCpuUsagePercent` | 0–100 |
## Example
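A request is a GET with the three query parameters from the table above. The base URL below is hypothetical (the source does not give the endpoint path), so substitute your deployment's actual path:

```python
from urllib.parse import urlencode

# Hypothetical base URL -- the source does not specify the endpoint
# path, so replace this with your deployment's actual metrics route.
BASE = "https://api.example.com/v1/runs/RUN_ID/metrics"

def build_metrics_url(base, start, end, step="15s"):
    """Build a metrics query URL from ISO 8601 timestamps and a step."""
    params = {"start": start, "end": end, "step": step}
    return f"{base}?{urlencode(params)}"

url = build_metrics_url(
    BASE,
    start="2024-05-01T09:00:00Z",
    end="2024-05-01T10:00:00Z",
    step="30s",
)
print(url)
```

Omitting `start` and `end` falls back to the execution's `start_time` and `end_time`, per the defaults above.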
## What it’s good for
- Confirming GPU utilisation — if `gpuUtilizationPercent` is consistently low during the heavy phase of a run, you’re likely bottlenecked on data loading or CPU pre-processing
- Diagnosing OOMs — `gpuMemoryUtilizationPercent` climbing to 100% just before a crash points to a memory issue rather than a code bug
- Power and thermal investigations — `gpuPowerWatt` against `gpuPowerLimitWatt` and `gpuTemperatureCelsius` show whether the GPU is throttling
- Cost attribution — combined with the run’s wall-clock time, the metrics let you compute cost per GPU-hour-of-actual-work
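The cost-attribution idea can be sketched as follows, assuming a `gpuUtilizationPercent` series sampled at a fixed step; the hourly price and the idea of scaling wall-clock GPU-hours by mean utilisation are illustrative assumptions, not part of the API:

```python
# Sketch: estimate cost per GPU-hour of *useful* work from a sampled
# gpuUtilizationPercent series. The hourly price and fixed sampling
# step are assumptions -- plug in your own numbers.
def cost_per_useful_gpu_hour(util_percent, step_seconds, price_per_hour):
    """Scale wall-clock GPU-hours by mean utilisation; divide cost by that."""
    wall_hours = len(util_percent) * step_seconds / 3600.0
    mean_util = sum(util_percent) / len(util_percent) / 100.0
    useful_hours = wall_hours * mean_util
    total_cost = wall_hours * price_per_hour
    return total_cost / useful_hours  # simplifies to price_per_hour / mean_util

# A run averaging 50% utilisation on a $2/hour GPU effectively costs
# $4 per GPU-hour of actual work.
print(round(cost_per_useful_gpu_hour([40.0, 60.0, 50.0], 15, 2.0), 2))
```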

