A dedicated deployment runs a model on reserved GPU replicas and exposes it through an OpenAI-compatible Chat Completions endpoint. You’re billed for the time the replicas are running, not per request. This is the right tool when you need predictable latency, full control over the model and the GPU it runs on, and the ability to point any OpenAI-compatible client at your own endpoint.

Concepts

Each deployment is identified by a deployment_id and is made up of:
  • A model — referenced by its Hugging Face model ID (e.g. meta-llama/Llama-3.1-8B-Instruct). For gated models you supply an HF token at create time.
  • A hardware profile — the GPU type each replica runs on.
  • Replicas — one or more identical copies of the model serving requests. The replica count autoscales between min_replicas and max_replicas.
  • A scaling target — target_rps (requests per second per replica) and target_latency_p95_ms define when to scale up or down. A stabilisation window prevents flapping.
The platform exposes one OpenAI-compatible endpoint per deployment; load balancing across replicas is handled for you.
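The interplay between the scaling target and the replica bounds can be sketched as follows. This is an illustrative model only — the threshold logic and function name are assumptions, not the platform's actual autoscaling algorithm (which also applies the stabilisation window before scaling down):

```python
def desired_replicas(current_rps: float, p95_latency_ms: float,
                     replicas: int, min_replicas: int, max_replicas: int,
                     target_rps: float = 10.0,
                     target_latency_p95_ms: float = 5000.0) -> int:
    """Illustrative scaling decision: scale up when either target is
    exceeded, scale down when there is clear headroom, otherwise hold."""
    per_replica_rps = current_rps / replicas
    # Scale up if either the throughput or the latency target is breached.
    if per_replica_rps > target_rps or p95_latency_ms > target_latency_p95_ms:
        return min(replicas + 1, max_replicas)
    # Scale down only with comfortable headroom on both signals.
    if per_replica_rps < target_rps / 2 and p95_latency_ms < target_latency_p95_ms / 2:
        return max(replicas - 1, min_replicas)
    return replicas
```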

When to use what

| Use case | Pick |
| --- | --- |
| Production traffic, predictable latency, your own model | Dedicated (this page) |
| One-off LLM job that runs and exits | A regular Python or Docker run |
| Pay-per-request, low-volume, no infra | Serverless inference (coming soon) |
| Distributed multi-node training of new models | Clusters (enterprise) |

Deploying

lyceum infer deploy meta-llama/Llama-3.1-8B-Instruct \
  -g gpu.a100 \
  --min-replicas 1 \
  --max-replicas 3
| Flag | Default | Description |
| --- | --- | --- |
| -g, --gpu | gpu.a100 | Hardware profile (e.g. gpu.a100, gpu.h100) |
| --min-replicas | 1 | Minimum replicas kept warm |
| --max-replicas | 1 | Maximum replicas the deployment can scale to |
| --target-rps | 10.0 | Target requests/sec per replica |
| --target-latency | 5000.0 | Target p95 latency in milliseconds |
| --stabilisation | 300 | Scale-down stabilisation window (seconds) |
| -t, --hf-token | (none) | Hugging Face token for gated models |
| -w, --wait | off | Block until the deployment has healthy replicas |
lyceum infer status <deployment_id>      # add --all to include stopped deployments
lyceum infer stop <deployment_id>
lyceum infer models                      # public catalogue + your deployments
lyceum infer chat -d <deployment_id> -p "Hello"   # send a chat completion

Calling your deployment

Deployments speak the OpenAI Chat Completions format, so any OpenAI-compatible client works. Set the model field to your deployment_id:
POST /api/v2/external/v1/chat/completions
Authorization: Bearer <api_key>
Content-Type: application/json

{
  "model": "<deployment_id>",
  "messages": [{"role": "user", "content": "Hello"}]
}
Response shape mirrors OpenAI’s: id, choices, usage, model, created.
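A minimal sketch of assembling that request from the Python standard library — the helper name is illustrative, and the URL is the endpoint shown above prefixed with the Lyceum API host:

```python
import json
import urllib.request

def build_chat_request(api_key: str, deployment_id: str, prompt: str):
    """Assemble the raw Chat Completions request shown above."""
    url = "https://api.lyceum.technology/api/v2/external/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = {
        "model": deployment_id,
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, headers, body

# To actually send it (requires a live deployment and a real API key):
# url, headers, body = build_chat_request("lk_...", "<deployment_id>", "Hello")
# req = urllib.request.Request(url, data=json.dumps(body).encode(),
#                              headers=headers, method="POST")
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```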

Using the OpenAI SDK

Because the endpoint is OpenAI-compatible, you can drop in the official openai client by overriding base_url and api_key. Set the model field to your deployment_id:
from openai import OpenAI

client = OpenAI(
    api_key="lk_...",
    base_url="https://api.lyceum.technology/api/v2/external",
)

resp = client.chat.completions.create(
    model="<deployment_id>",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
Anything that speaks the OpenAI Chat Completions protocol — LangChain, LlamaIndex, Continue, your own client — works the same way: point it at the Lyceum base URL and use your API key.

Batch and streaming

For asynchronous batch processing on large input files, the platform implements the OpenAI Batch API:
| Method | Endpoint | Purpose |
| --- | --- | --- |
| POST | /files | Upload a JSONL input file |
| POST | /batches | Create a batch job referencing the file |
| GET | /batches/{batch_id} | Check status |
| POST | /batches/{batch_id}/cancel | Cancel a batch |
| GET | /batches | List your batches |
| GET | /files/{file_id}/content | Download results |
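Assuming the input file follows the OpenAI Batch input format (one JSON object per line with custom_id, method, url, and body fields), a small helper to build it might look like this — the function name and custom_id scheme are illustrative:

```python
import json

def batch_input_lines(deployment_id, prompts):
    """Build JSONL lines for a batch job, one chat request per prompt,
    following the OpenAI Batch input format."""
    lines = []
    for i, prompt in enumerate(prompts):
        lines.append(json.dumps({
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": deployment_id,
                "messages": [{"role": "user", "content": prompt}],
            },
        }))
    return lines

# Write the file, then upload it via POST /files and reference it
# when creating the batch via POST /batches.
with open("batch_input.jsonl", "w") as f:
    f.write("\n".join(batch_input_lines("<deployment_id>", ["Hello", "Hi"])))
```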
For server-sent-events streaming of inference results, see Streaming Inference.
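As a sketch of the consuming side, assuming the stream uses OpenAI-style server-sent events (each event is a `data: <json>` line carrying a chunk with a choices[0].delta, terminated by `data: [DONE]`):

```python
import json

def iter_stream_deltas(sse_lines):
    """Yield content deltas from OpenAI-style server-sent-event lines.
    Assumes each event is 'data: <json>' and the stream ends with
    'data: [DONE]'."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines between events
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            yield delta
```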

Active models

Inspect running deployments and replica health with lyceum infer status <deployment_id>; add --all to include stopped deployments.