For GPU runs, Lyceum Cloud records DCGM-sourced GPU metrics and system telemetry into Prometheus. The metrics endpoint queries that data per execution, so you can attribute time, debug stalls, and confirm a job actually used the GPU it was billed for.

Endpoint

GET /api/v2/external/execution/{execution_id}/metrics
| Query | Default | Purpose |
| --- | --- | --- |
| start | execution start_time | ISO 8601 start timestamp |
| end | execution end_time, or current time | ISO 8601 end timestamp |
| step | 15s | Query resolution (e.g. 5s, 15s, 1m) |
By default the endpoint returns the entire execution at 15-second resolution. Narrow the window or change the step when you’re zooming into a specific phase.
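For instance, zooming into a ten-minute phase at 5-second resolution might look like the sketch below. The execution ID and timestamps are illustrative, not real values; the request itself is the same curl call shown later in this page.

```shell
# Build a narrowed query window (hypothetical values for illustration).
EXEC="example-execution-id"
START="2024-05-01T12:00:00Z"
END="2024-05-01T12:10:00Z"
URL="https://api.lyceum.technology/api/v2/external/execution/$EXEC/metrics?start=$START&end=$END&step=5s"
echo "$URL"
# Then fetch it with your API key:
#   curl "$URL" -H "Authorization: Bearer $LYCEUM_API_KEY" | jq
```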

Available series

GPU (via DCGM)

| Metric | Unit |
| --- | --- |
| gpuUtilizationPercent | 0–100 |
| gpuMemoryUtilizationPercent | 0–100 |
| gpuTemperatureCelsius | °C |
| gpuPowerWatt | W |
| gpuPowerLimitWatt | W |
| gpuClockSmMhz | MHz |
| gpuClockMemMhz | MHz |
| gpuPcieThroughputRxBytesPerSec | bytes/sec |
| gpuPcieThroughputTxBytesPerSec | bytes/sec |

System

| Metric | Unit |
| --- | --- |
| systemRamTotalBytes | bytes |
| systemRamUsedBytes | bytes |
| systemCpuUsagePercent | 0–100 |
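The two RAM series combine into a usage percentage. The jq sketch below assumes a Prometheus-style payload of `[timestamp, value]` pairs per series; the actual response shape may differ, so inspect a real response with `| jq` before relying on the field paths.

```shell
# Sketch: derive RAM usage percent from one sample of the two RAM series.
# The JSON shape here is an assumption for illustration only.
cat > system.json <<'EOF'
{"systemRamTotalBytes": [["2024-05-01T12:00:00Z", 68719476736]],
 "systemRamUsedBytes":  [["2024-05-01T12:00:00Z", 17179869184]]}
EOF
jq '.systemRamUsedBytes[0][1] / .systemRamTotalBytes[0][1] * 100' system.json
```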

Example

curl "https://api.lyceum.technology/api/v2/external/execution/$EXEC/metrics?step=5s" \
  -H "Authorization: Bearer $LYCEUM_API_KEY" | jq

What it’s good for

  • Confirming GPU utilisation — if gpuUtilizationPercent is consistently low during the heavy phase of a run, you’re likely bottlenecked on data loading or CPU pre-processing
  • Diagnosing OOMs — gpuMemoryUtilizationPercent climbing to 100% just before a crash points to a memory issue rather than a code bug
  • Power and thermal investigations — gpuPowerWatt against gpuPowerLimitWatt and gpuTemperatureCelsius show whether the GPU is throttling
  • Cost attribution — combined with the run’s wall-clock time, the metrics let you compute cost per GPU-hour-of-actual-work
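The OOM check above can be sketched with jq. As before, the payload shape (series name mapping to `[timestamp, value]` pairs) is assumed for illustration; verify the real field paths against an actual response first.

```shell
# Sketch: find the peak gpuMemoryUtilizationPercent in a saved response.
# A peak near 100 just before a crash suggests an OOM rather than a code bug.
cat > metrics.json <<'EOF'
{"gpuMemoryUtilizationPercent": [["2024-05-01T12:00:00Z", 41.0],
                                 ["2024-05-01T12:00:15Z", 97.5]]}
EOF
jq '[.gpuMemoryUtilizationPercent[][1]] | max' metrics.json
```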