A dedicated deployment runs a model on reserved GPU replicas and exposes it through an OpenAI-compatible Chat Completions endpoint. You’re billed for the time the replicas are running, not per request. This is the right tool when you need predictable latency, full control over the model and the GPU it runs on, and the ability to point any OpenAI-compatible client at your own endpoint.

Concepts

Each deployment is identified by a deployment_id and is made up of:
  • A model — referenced by its Hugging Face model ID (e.g. meta-llama/Llama-3.1-8B-Instruct). For gated models you supply an HF token at create time.
  • A hardware profile — the GPU type each replica runs on.
  • Replicas — one or more identical copies of the model serving requests. The replica count autoscales between min_replicas and max_replicas.
  • A scaling target — target_rps (requests per second per replica) and target_latency_p95_ms define when to scale up or down. A stabilisation window prevents flapping.
The platform exposes one OpenAI-compatible endpoint per deployment; load balancing across replicas is handled for you.
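The interplay between the scaling target and the replica bounds can be sketched as follows. This is an illustrative model only — the threshold logic and function name are assumptions, not the platform's actual autoscaling algorithm (which also applies the stabilisation window before scaling down):

```python
def desired_replicas(current_rps: float, p95_latency_ms: float,
                     replicas: int, min_replicas: int, max_replicas: int,
                     target_rps: float = 10.0,
                     target_latency_p95_ms: float = 5000.0) -> int:
    """Illustrative scaling decision: scale up when either target is
    exceeded, scale down when there is clear headroom, otherwise hold."""
    per_replica_rps = current_rps / replicas
    # Scale up if either the throughput or the latency target is breached.
    if per_replica_rps > target_rps or p95_latency_ms > target_latency_p95_ms:
        return min(replicas + 1, max_replicas)
    # Scale down only with comfortable headroom on both signals.
    if per_replica_rps < target_rps / 2 and p95_latency_ms < target_latency_p95_ms / 2:
        return max(replicas - 1, min_replicas)
    return replicas
```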

When to use what

| Use case | Pick |
| --- | --- |
| Production traffic, predictable latency, your own model | Dedicated (this page) |
| One-off LLM job that runs and exits | A regular Python or Docker run |
| Pay-per-request, low-volume, no infra | Serverless inference (coming soon) |
| Distributed multi-node training of new models | Clusters (enterprise) |

Deploying

lyceum infer deploy meta-llama/Llama-3.1-8B-Instruct \
  -g gpu.a100 \
  --min-replicas 1 \
  --max-replicas 3
| Flag | Default | Description |
| --- | --- | --- |
| -g, --gpu | gpu.a100 | Hardware profile (e.g. gpu.a100, gpu.h100) |
| --min-replicas | 1 | Minimum replicas kept warm |
| --max-replicas | 1 | Maximum replicas the deployment can scale to |
| --target-rps | 10.0 | Target requests/sec per replica |
| --target-latency | 5000.0 | Target p95 latency in milliseconds |
| --stabilisation | 300 | Scale-down stabilisation window (seconds) |
| -t, --hf-token | (none) | Hugging Face token for gated models |
| -w, --wait | off | Block until the deployment has healthy replicas |
lyceum infer status <deployment_id>      # add --all to include stopped deployments
lyceum infer stop <deployment_id>
lyceum infer models                      # public catalogue + your deployments
lyceum infer chat -d <deployment_id> -p "Hello"   # send a chat completion

Calling your deployment

Deployments speak the OpenAI Chat Completions format, so any OpenAI-compatible client works. Set the model field to your deployment_id:
POST /api/v2/external/v1/chat/completions
Authorization: Bearer <api_key>
Content-Type: application/json

{
  "model": "<deployment_id>",
  "messages": [{"role": "user", "content": "Hello"}]
}
Response shape mirrors OpenAI’s: id, choices, usage, model, created.
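A minimal sketch of assembling that request from the Python standard library — the helper name is illustrative, and the URL is the endpoint shown above prefixed with the Lyceum API host:

```python
import json
import urllib.request

def build_chat_request(api_key: str, deployment_id: str, prompt: str):
    """Assemble the raw Chat Completions request shown above."""
    url = "https://api.lyceum.technology/api/v2/external/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = {
        "model": deployment_id,
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, headers, body

# To actually send it (requires a live deployment and a real API key):
# url, headers, body = build_chat_request("lk_...", "<deployment_id>", "Hello")
# req = urllib.request.Request(url, data=json.dumps(body).encode(),
#                              headers=headers, method="POST")
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```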

Using the OpenAI SDK

Because the endpoint is OpenAI-compatible, you can drop in the official openai client by overriding base_url and api_key. Set the model field to your deployment_id:
from openai import OpenAI

client = OpenAI(
    api_key="lk_...",
    base_url="https://api.lyceum.technology/api/v2/external",
)

resp = client.chat.completions.create(
    model="<deployment_id>",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
Anything that speaks the OpenAI Chat Completions protocol — LangChain, LlamaIndex, Continue, your own client — works the same way: point it at the Lyceum base URL and use your API key.

Batch and streaming

For asynchronous batch processing on large input files, the platform implements the OpenAI Batch API:
| Method | Endpoint | Purpose |
| --- | --- | --- |
| POST | /files | Upload a JSONL input file |
| POST | /batches | Create a batch job referencing the file |
| GET | /batches/{batch_id} | Check status |
| POST | /batches/{batch_id}/cancel | Cancel a batch |
| GET | /batches | List your batches |
| GET | /files/{file_id}/content | Download results |
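Assuming the input file follows the OpenAI Batch input format (one JSON object per line with custom_id, method, url, and body fields), a small helper to build it might look like this — the function name and custom_id scheme are illustrative:

```python
import json

def batch_input_lines(deployment_id, prompts):
    """Build JSONL lines for a batch job, one chat request per prompt,
    following the OpenAI Batch input format."""
    lines = []
    for i, prompt in enumerate(prompts):
        lines.append(json.dumps({
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": deployment_id,
                "messages": [{"role": "user", "content": prompt}],
            },
        }))
    return lines

# Write the file, then upload it via POST /files and reference it
# when creating the batch via POST /batches.
with open("batch_input.jsonl", "w") as f:
    f.write("\n".join(batch_input_lines("<deployment_id>", ["Hello", "Hi"])))
```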
For server-sent-events streaming of inference results, see Streaming Inference.
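As a sketch of the consuming side, assuming the stream uses OpenAI-style server-sent events (each event is a `data: <json>` line carrying a chunk with a choices[0].delta, terminated by `data: [DONE]`):

```python
import json

def iter_stream_deltas(sse_lines):
    """Yield content deltas from OpenAI-style server-sent-event lines.
    Assumes each event is 'data: <json>' and the stream ends with
    'data: [DONE]'."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines between events
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            yield delta
```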

Active models

Inspect running deployments and replica health with lyceum infer status <deployment_id>; add --all to include stopped deployments.