## Endpoint
| Query | Default | Purpose |
|---|---|---|
| `start` | execution `start_time` | ISO 8601 start timestamp |
| `end` | execution `end_time`, or current time | ISO 8601 end timestamp |
| `step` | `15s` | Query resolution (e.g. `5s`, `15s`, `1m`) |
## Available series

### GPU (via DCGM)
| Metric | Unit |
|---|---|
| `gpuUtilizationPercent` | 0–100 |
| `gpuMemoryUtilizationPercent` | 0–100 |
| `gpuTemperatureCelsius` | °C |
| `gpuPowerWatt` | W |
| `gpuPowerLimitWatt` | W |
| `gpuClockSmMhz` | MHz |
| `gpuClockMemMhz` | MHz |
| `gpuPcieThroughputRxBytesPerSec` | bytes/sec |
| `gpuPcieThroughputTxBytesPerSec` | bytes/sec |
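The power and temperature series above are enough to flag likely throttling. A minimal sketch, assuming the client returns each series as a plain list of floats (the input format and the thresholds `power_margin` and `temp_limit_c` are assumptions, not part of the API):

```python
# Sketch: flag suspected GPU throttling from sampled series.
# Series names match the table above; the list-of-floats input
# format and both thresholds are assumptions -- adapt to your client.
def throttle_suspected(power_w, power_limit_w, temp_c,
                       power_margin=0.95, temp_limit_c=83.0):
    """Return True if any sample runs near the power limit or too hot."""
    near_limit = any(p >= power_margin * lim
                     for p, lim in zip(power_w, power_limit_w))
    too_hot = any(t >= temp_limit_c for t in temp_c)
    return near_limit or too_hot

# One sample reaches 291 W against a 300 W limit (97% of the cap),
# so throttling is suspected even though temperatures are fine.
print(throttle_suspected([240.0, 291.0], [300.0, 300.0], [71.0, 76.0]))
```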
### System
| Metric | Unit |
|---|---|
| `systemRamTotalBytes` | bytes |
| `systemRamUsedBytes` | bytes |
| `systemCpuUsagePercent` | 0–100 |
## Example
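A request is a GET with the three query parameters from the table above. The base URL below is hypothetical (the source does not give the endpoint path), so substitute your deployment's actual path:

```python
from urllib.parse import urlencode

# Hypothetical base URL -- the source does not specify the endpoint
# path, so replace this with your deployment's actual metrics route.
BASE = "https://api.example.com/v1/runs/RUN_ID/metrics"

def build_metrics_url(base, start, end, step="15s"):
    """Build a metrics query URL from ISO 8601 timestamps and a step."""
    params = {"start": start, "end": end, "step": step}
    return f"{base}?{urlencode(params)}"

url = build_metrics_url(
    BASE,
    start="2024-05-01T09:00:00Z",
    end="2024-05-01T10:00:00Z",
    step="30s",
)
print(url)
```

Omitting `start` and `end` falls back to the execution's `start_time` and `end_time`, per the defaults above.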
## What it’s good for
- Confirming GPU utilisation — if `gpuUtilizationPercent` is consistently low during the heavy phase of a run, you’re likely bottlenecked on data loading or CPU pre-processing
- Diagnosing OOMs — `gpuMemoryUtilizationPercent` climbing to 100% just before a crash points to a memory issue rather than a code bug
- Power and thermal investigations — `gpuPowerWatt` against `gpuPowerLimitWatt` and `gpuTemperatureCelsius` show whether the GPU is throttling
- Cost attribution — combined with the run’s wall-clock time, the metrics let you compute cost per GPU-hour-of-actual-work
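The cost-attribution idea can be sketched as follows, assuming a `gpuUtilizationPercent` series sampled at a fixed step; the hourly price and the idea of scaling wall-clock GPU-hours by mean utilisation are illustrative assumptions, not part of the API:

```python
# Sketch: estimate cost per GPU-hour of *useful* work from a sampled
# gpuUtilizationPercent series. The hourly price and fixed sampling
# step are assumptions -- plug in your own numbers.
def cost_per_useful_gpu_hour(util_percent, step_seconds, price_per_hour):
    """Scale wall-clock GPU-hours by mean utilisation; divide cost by that."""
    wall_hours = len(util_percent) * step_seconds / 3600.0
    mean_util = sum(util_percent) / len(util_percent) / 100.0
    useful_hours = wall_hours * mean_util
    total_cost = wall_hours * price_per_hour
    return total_cost / useful_hours  # simplifies to price_per_hour / mean_util

# A run averaging 50% utilisation on a $2/hour GPU effectively costs
# $4 per GPU-hour of actual work.
print(round(cost_per_useful_gpu_hour([40.0, 60.0, 50.0], 15, 2.0), 2))
```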

