Serverless Inference

Serverless inference is coming soon. Use Dedicated Inference for production workloads in the meantime.

Serverless inference will offer pay-per-request pricing with no idle compute cost — the right choice for low-volume or spiky traffic where reserving GPU replicas would be wasteful. The trade-off is cold starts. Serverless endpoints scale from zero, so the first request after a quiet period waits for a container to warm up. For workloads that need consistent sub-100ms first-token latency, dedicated deployments will remain the better fit.

How it differs from dedicated

	Dedicated	Serverless
Billing	Per replica-hour	Per request
Idle cost	Yes — replicas stay warm	No
Cold-start latency	None (after first replica is healthy)	Yes, for the first request after idle
Best for	Production traffic, steady load	Sporadic or low-volume traffic
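
To make the billing trade-off in the table concrete, here is a minimal sketch that estimates the monthly request volume at which per-request billing would cost the same as keeping replicas warm. Every price in it is a hypothetical placeholder, not a published rate, and the function names are illustrative only; substitute real figures once serverless pricing is announced.

```python
HOURS_PER_MONTH = 730.0  # average hours in a month (assumption)


def dedicated_monthly_cost(replicas: int, price_per_replica_hour: float,
                           hours: float = HOURS_PER_MONTH) -> float:
    """Cost of keeping `replicas` warm for the whole month, used or idle."""
    return replicas * price_per_replica_hour * hours


def serverless_monthly_cost(requests: int, price_per_request: float) -> float:
    """Cost of paying only for the requests actually served."""
    return requests * price_per_request


def break_even_requests(replicas: int, price_per_replica_hour: float,
                        price_per_request: float,
                        hours: float = HOURS_PER_MONTH) -> float:
    """Monthly request count at which both billing models cost the same."""
    if price_per_request <= 0:
        raise ValueError("price_per_request must be positive")
    return dedicated_monthly_cost(replicas, price_per_replica_hour, hours) / price_per_request


if __name__ == "__main__":
    # Hypothetical prices, for illustration only.
    threshold = break_even_requests(
        replicas=1, price_per_replica_hour=2.00, price_per_request=0.002
    )
    print(f"Break-even at roughly {threshold:,.0f} requests per month")
```

Below the break-even volume, per-request billing is the cheaper option under these assumptions; above it, a dedicated replica costs less. The sketch ignores cold-start latency, which may matter more than cost for latency-sensitive workloads.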

Request early access

Email team@lyceum.technology to be notified when serverless inference becomes available.
