Serverless inference is coming soon. Use Dedicated Inference for production workloads in the meantime.
Serverless inference will offer pay-per-request pricing with no idle compute cost — the right choice for low-volume or spiky traffic where reserving GPU replicas would be wasteful. The trade-off is cold starts. Serverless endpoints scale from zero, so the first request after a quiet period waits for a container to warm up. For workloads that need consistent sub-100ms first-token latency, dedicated deployments will remain the better fit.
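To make the trade-off concrete, here is a minimal client sketch for tolerating a cold start. The endpoint URL, request shape, and `output` response field are all hypothetical placeholders, not a published Lyceum API: the point is only that the first call after an idle period should get a generous timeout and one retry.

```python
import requests

# Hypothetical endpoint and request/response shape for illustration only.
ENDPOINT = "https://example.lyceum.technology/v1/serverless/infer"

def infer(prompt: str) -> str:
    # The first request after idle may wait for a container to warm up,
    # so allow a long timeout and retry once if it times out.
    for attempt in range(2):
        try:
            resp = requests.post(ENDPOINT, json={"prompt": prompt}, timeout=60)
            resp.raise_for_status()
            return resp.json()["output"]  # hypothetical response field
        except requests.Timeout:
            if attempt == 1:
                raise
    raise RuntimeError("unreachable")
```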

How it differs from dedicated

| | Dedicated | Serverless |
|---|---|---|
| Billing | Per replica-hour | Per request |
| Idle cost | Yes (replicas stay warm) | No |
| Cold-start latency | None (after first replica is healthy) | Yes, for the first request after idle |
| Best for | Production traffic, steady load | Sporadic or low-volume traffic |
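The billing difference implies a break-even request volume below which serverless is the cheaper option. A rough sketch of that arithmetic follows; the `REPLICA_HOUR_USD` and `PER_REQUEST_USD` figures and the 730-hour month are illustrative assumptions, not published Lyceum pricing.

```python
# Illustrative prices only -- substitute real pricing when published.
REPLICA_HOUR_USD = 2.50   # assumed dedicated cost per replica-hour
PER_REQUEST_USD = 0.002   # assumed serverless cost per request

def monthly_cost_dedicated(replicas: int, hours: float = 730.0) -> float:
    """Dedicated bills per replica-hour whether or not requests arrive."""
    return replicas * hours * REPLICA_HOUR_USD

def monthly_cost_serverless(requests_per_month: int) -> float:
    """Serverless bills per request, with no idle cost."""
    return requests_per_month * PER_REQUEST_USD

# Break-even request volume against a single dedicated replica:
break_even = monthly_cost_dedicated(1) / PER_REQUEST_USD
print(f"Below ~{break_even:,.0f} requests/month, serverless is cheaper.")
```

Under these assumed prices the break-even point is roughly 900,000 requests per month per replica, which is why sporadic or low-volume traffic favors serverless.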

Request early access

Email team@lyceum.technology to be notified when serverless inference becomes available.