Concepts
Each deployment is identified by a `deployment_id` and is made up of:
- A model — referenced by its Hugging Face model ID (e.g. meta-llama/Llama-3.1-8B-Instruct). For gated models you supply an HF token at create time.
- A hardware profile — the GPU type each replica runs on.
- Replicas — one or more identical copies of the model serving requests. The replica count autoscales between `min_replicas` and `max_replicas`.
- A scaling target — `target_rps` (requests per second per replica) and `target_latency_p95_ms` define when to scale up or down. A stabilisation window prevents flapping.
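The scaling behaviour can be sketched as a simple proportional policy. This is an illustration only — the platform's actual controller also weighs p95 latency and the stabilisation window; the function below assumes scaling on request rate alone.

```python
import math

def desired_replicas(current_rps: float, target_rps: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Replica count needed to keep per-replica load near target_rps,
    clamped to the configured bounds."""
    needed = math.ceil(current_rps / target_rps) if target_rps > 0 else min_replicas
    return max(min_replicas, min(max_replicas, needed))

# 45 req/s against a 10 req/s-per-replica target wants 5 replicas,
# but max_replicas=4 caps it.
print(desired_replicas(45.0, 10.0, min_replicas=1, max_replicas=4))  # 4
```

The stabilisation window delays scale-down decisions, so a brief dip in traffic does not immediately tear replicas down only to recreate them moments later.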
When to use what
| Use case | Pick |
|---|---|
| Production traffic, predictable latency, your own model | Dedicated (this page) |
| One-off LLM job that runs and exits | A regular Python or Docker run |
| Pay-per-request, low-volume, no infra | Serverless inference (coming soon) |
| Distributed multi-node training of new models | Clusters (enterprise) |
Deploying
- CLI
- REST API
| Flag | Default | Description |
|---|---|---|
| -g, --gpu | gpu.a100 | Hardware profile (e.g. gpu.a100, gpu.h100) |
| --min-replicas | 1 | Minimum replicas kept warm |
| --max-replicas | 1 | Maximum replicas the deployment can scale to |
| --target-rps | 10.0 | Target requests/sec per replica |
| --target-latency | 5000.0 | Target p95 latency in milliseconds |
| --stabilisation | 300 | Scale-down stabilisation window (seconds) |
| -t, --hf-token | — | Hugging Face token for gated models |
| -w, --wait | off | Block until the deployment has healthy replicas |
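The same options map onto the REST API. As a sketch only: the endpoint path and field names below are assumptions mirroring the CLI flags, not a documented schema — check the REST API reference for the actual shape.

```python
import json

# Hypothetical create-deployment body mirroring the CLI flags above.
body = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "gpu": "gpu.a100",                 # --gpu
    "min_replicas": 1,                 # --min-replicas
    "max_replicas": 3,                 # --max-replicas
    "target_rps": 10.0,                # --target-rps
    "target_latency_p95_ms": 5000.0,   # --target-latency
    "stabilisation_window_s": 300,     # --stabilisation
}

# This JSON would be POSTed to the deployments endpoint.
payload = json.dumps(body, indent=2)
print(payload)
```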
Calling your deployment
Deployments speak the OpenAI Chat Completions format, so any OpenAI-compatible client works. Set the `model` field to your `deployment_id`. Responses carry the standard Chat Completions fields: id, choices, usage, model, created.
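A minimal request using only the standard library. The base URL, API key, and deployment_id below are placeholders — substitute your own values.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com/v1"   # placeholder endpoint
DEPLOYMENT_ID = "dep-abc123"              # placeholder deployment_id

# Standard Chat Completions payload; `model` carries the deployment_id.
payload = {
    "model": DEPLOYMENT_ID,
    "messages": [{"role": "user", "content": "Hello!"}],
}

def chat(payload: dict) -> dict:
    """POST the payload to the deployment and return the parsed response."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer YOUR_API_KEY"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# resp = chat(payload)
# print(resp["choices"][0]["message"]["content"])
```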
Using the OpenAI SDK
Because the endpoint is OpenAI-compatible, you can drop in the official `openai` client by overriding `base_url` and `api_key`. Set the `model` field to your `deployment_id`:
Batch and streaming
For asynchronous batch processing on large input files, the platform implements the OpenAI Batch API:

| Method | Endpoint | Purpose |
|---|---|---|
| POST | /files | Upload a JSONL input file |
| POST | /batches | Create a batch job referencing the file |
| GET | /batches/{batch_id} | Check status |
| POST | /batches/{batch_id}/cancel | Cancel a batch |
| GET | /batches | List your batches |
| GET | /files/{file_id}/content | Download results |
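Preparing the input file can be sketched as follows. Each JSONL line follows the OpenAI Batch API input format (`custom_id`, `method`, `url`, `body`); the deployment_id and prompts are placeholders.

```python
import json

prompts = ["Summarise document A", "Summarise document B"]

# One JSONL line per request, in the OpenAI batch input format.
lines = []
for i, prompt in enumerate(prompts):
    lines.append(json.dumps({
        "custom_id": f"req-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "dep-abc123",  # your deployment_id
            "messages": [{"role": "user", "content": prompt}],
        },
    }))

jsonl = "\n".join(lines)
print(jsonl)
```

Upload the resulting file via POST /files, create the job with POST /batches referencing the returned file id, then poll GET /batches/{batch_id} until it completes and fetch the output with GET /files/{file_id}/content.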
Active models
Inspect running deployments and replica health.

