Serverless inference offers pay-per-request pricing with no idle compute cost — the right choice for low-volume or spiky traffic where reserving GPU replicas would be wasteful. The trade-off is cold starts. Serverless endpoints scale from zero, so the first request after a quiet period waits for a container to warm up. For workloads that need consistent sub-100ms first-token latency, dedicated deployments are the better fit.
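To see how much a cold start costs in practice, you can time the first request after an idle period against the requests that follow it. The sketch below is a minimal illustration, not part of any official client: `send_request` is a hypothetical zero-argument callable that wraps whatever code you use to call your endpoint.

```python
import statistics
import time
from typing import Callable, List


def time_requests(send_request: Callable[[], object], n: int = 5) -> List[float]:
    """Call send_request n times in sequence and return each latency in seconds.

    send_request is a hypothetical wrapper around your own endpoint call.
    """
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        send_request()
        latencies.append(time.perf_counter() - start)
    return latencies


def cold_start_overhead(latencies: List[float]) -> float:
    """Estimate cold-start overhead: first latency minus the median of the rest."""
    if len(latencies) < 2:
        raise ValueError("need at least two measurements")
    return latencies[0] - statistics.median(latencies[1:])
```

Run `time_requests` after the endpoint has been idle for a while, then pass the result to `cold_start_overhead`. If the estimate is large relative to your latency budget, that points toward a dedicated deployment.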

How it differs from dedicated

| | Dedicated | Serverless |
| --- | --- | --- |
| Billing | Per replica-hour | Per request |
| Idle cost | Yes, replicas stay warm | No |
| Cold-start latency | None (after first replica is healthy) | Yes, for the first request after idle |
| Best for | Production traffic, steady load | Sporadic or low-volume traffic |
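The two billing models imply a break-even request volume: below it, paying per request is cheaper, and above it, a warm replica is. The following is a rough sketch of that arithmetic. The function names, the 730-hour month, and any prices you pass in are illustrative assumptions, not published rates.

```python
def monthly_cost_serverless(requests_per_month: float, price_per_request: float) -> float:
    """Per-request billing: cost scales with traffic, and idle time is free."""
    return requests_per_month * price_per_request


def monthly_cost_dedicated(
    replicas: int, price_per_replica_hour: float, hours: float = 730.0
) -> float:
    """Per replica-hour billing: cost is fixed regardless of traffic.

    hours defaults to 730, an assumed average month length.
    """
    return replicas * price_per_replica_hour * hours


def break_even_requests(
    replicas: int,
    price_per_replica_hour: float,
    price_per_request: float,
    hours: float = 730.0,
) -> float:
    """Monthly request count at which both billing models cost the same."""
    if price_per_request <= 0:
        raise ValueError("price_per_request must be positive")
    return monthly_cost_dedicated(replicas, price_per_replica_hour, hours) / price_per_request
```

For example, with made-up prices of 2.00 per replica-hour and 0.001 per request, one replica over a 730-hour month breaks even at about 1.46 million requests. This comparison covers cost only and ignores the cold-start latency difference.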