Serverless inference offers pay-per-request pricing with no idle compute cost, which makes it the right choice for low-volume or spiky traffic where reserving GPU replicas would be wasteful. The trade-off is cold starts: serverless endpoints scale from zero, so the first request after a quiet period waits for a container to warm up. For workloads that need consistent sub-100 ms first-token latency, dedicated deployments are the better fit.
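
To get a rough feel for how often a traffic pattern would hit a cold start, the sketch below counts requests that arrive after a quiet gap. It is a simplification under stated assumptions: `idle_timeout_s` is a hypothetical parameter for how long an endpoint stays warm after its last request (the real scale-to-zero behaviour is not specified here), and request duration and concurrency are ignored.

```python
from typing import Sequence


def cold_start_fraction(arrival_times_s: Sequence[float], idle_timeout_s: float) -> float:
    """Estimate the fraction of requests that would find the endpoint scaled to zero.

    arrival_times_s: request arrival times in seconds (any order).
    idle_timeout_s:  ASSUMED warm window after the last request; hypothetical value.
    """
    if not arrival_times_s:
        return 0.0
    times = sorted(arrival_times_s)
    cold = 1  # the very first request always starts from zero
    for prev, cur in zip(times, times[1:]):
        # A gap longer than the warm window means the next request is cold.
        if cur - prev > idle_timeout_s:
            cold += 1
    return cold / len(times)


# Hypothetical trace: a short burst, a long quiet gap, another burst, one late request.
arrivals = [0, 5, 10, 1000, 1005, 5000]
print(cold_start_fraction(arrivals, idle_timeout_s=300))
```

If a large share of requests come out cold in an estimate like this and first-token latency matters, that points toward a dedicated deployment.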
How it differs from dedicated
| | Dedicated | Serverless |
|---|---|---|
| Billing | Per replica-hour | Per request |
| Idle cost | Yes — replicas stay warm | No |
| Cold-start latency | None (after first replica is healthy) | Yes, for the first request after idle |
| Best for | Production traffic, steady load | Sporadic or low-volume traffic |

