Use this file to discover all available pages before exploring further.
A dedicated deployment runs a model on reserved GPU replicas and exposes it through an OpenAI-compatible Chat Completions endpoint. You’re billed for the time the replicas are running, not per request.This is the right tool when you need predictable latency, full control over the model and the GPU it runs on, and the ability to point any OpenAI-compatible client at your own endpoint.
Each deployment is identified by a deployment_id and is made up of:
A model — referenced by its Hugging Face model ID (e.g. meta-llama/Llama-3.1-8B-Instruct). For gated models you supply an HF token at create time.
A hardware profile — the GPU type each replica runs on
Replicas — one or more identical copies of the model serving requests. The replica count autoscales between min_replicas and max_replicas.
A scaling target — target_rps (requests per second per replica) and target_latency_p95_ms define when to scale up or down. A stabilisation window prevents flapping.
The platform exposes one OpenAI-compatible endpoint per deployment; load balancing across replicas is handled for you.
Deployments that receive no requests for 1 hour are automatically scaled down to zero instances and marked as paused. This stops billing for idle replicas.When a paused deployment receives a new request, it is automatically resumed — scaled back to its min_replicas — and the autoscaler takes over from there, scaling up or down based on traffic volume. The first request after a pause will have higher latency while the replicas start up.Scale to zero is enabled for all deployments. There is nothing to configure — it works alongside the existing autoscaler settings.
lyceum infer status <deployment_id> # add --all to include stopped deploymentslyceum infer stop <deployment_id>lyceum infer models # public catalogue + your deploymentslyceum infer chat -d <deployment_id> -p "Hello" # send a chat completion
POST /api/v2/external/inference/create
Required body fields: hf_model_id, hardware_profile, target_rps, target_latency_p95_ms, stabilisation_window. Optional: hf_token (for gated models), min_replicas and max_replicas (both default to 1).Returns deployment_id and initial status. The deployment starts in created and transitions through provisioning to running once the first replica is healthy. Poll GET /inference/get?deployment_id=<id> to track progress.
Because the endpoint is OpenAI-compatible, you can drop in the official openai client by overriding base_url and api_key. Set the model field to your deployment_id:
Anything that speaks the OpenAI Chat Completions protocol — LangChain, LlamaIndex, Continue, your own client — works the same way: point it at the Lyceum base URL and use your API key.