Documentation Index
Fetch the complete documentation index at: https://docs.lyceum.technology/llms.txt
Use this file to discover all available pages before exploring further.
For GPU runs, Lyceum Cloud records DCGM-sourced GPU metrics and system telemetry into Prometheus. The metrics endpoint queries that data per execution, so you can attribute time, debug stalls, and confirm a job actually used the GPU it was billed for.
Endpoint
GET /api/v2/external/execution/{execution_id}/metrics
| Query | Default | Purpose |
|---|
start | execution start_time | ISO 8601 start timestamp |
end | execution end_time, or current time | ISO 8601 end timestamp |
step | 15s | Query resolution (e.g. 5s, 15s, 1m) |
By default the endpoint returns the entire execution at 15-second resolution. Narrow the window or change the step when you’re zooming into a specific phase.
Available series
GPU (via DCGM)
| Metric | Unit |
|---|
gpuUtilizationPercent | 0–100 |
gpuMemoryUtilizationPercent | 0–100 |
gpuTemperatureCelsius | °C |
gpuPowerWatt | W |
gpuPowerLimitWatt | W |
gpuClockSmMhz | MHz |
gpuClockMemMhz | MHz |
gpuPcieThroughputRxBytesPerSec | bytes/sec |
gpuPcieThroughputTxBytesPerSec | bytes/sec |
System
| Metric | Unit |
|---|
systemRamTotalBytes | bytes |
systemRamUsedBytes | bytes |
systemCpuUsagePercent | 0–100 |
Example
curl "https://api.lyceum.technology/api/v2/external/execution/$EXEC/metrics?step=5s" \
-H "Authorization: Bearer $LYCEUM_API_KEY" | jq
What it’s good for
- Confirming GPU utilisation — if
gpuUtilizationPercent is consistently low during the heavy phase of a run, you’re likely bottlenecked on data loading or CPU pre-processing
- Diagnosing OOMs —
gpuMemoryUtilizationPercent climbing to 100% just before a crash points to a memory issue rather than a code bug
- Power and thermal investigations —
gpuPowerWatt against gpuPowerLimitWatt and gpuTemperatureCelsius show whether the GPU is throttling
- Cost attribution — combined with the run’s wall-clock time, the metrics let you compute cost per GPU-hour-of-actual-work