POST /api/v2/external/inference/create
Create Dedicated Deployment
curl --request POST \
  --url https://api.example.com/api/v2/external/inference/create \
  --header 'Content-Type: application/json' \
  --data '
{
  "hf_model_id": "<string>",
  "target_rps": 123,
  "target_latency_p95_ms": 123,
  "stabilisation_window": 123,
  "gpu_type": "<string>",
  "hf_token": "<string>",
  "min_replicas": 1,
  "max_replicas": 1,
  "gpu_count": 1,
  "gpu_vram_gb": 2,
  "vllm_args": "<string>"
}
'
Example response:
{
  "deployment_id": "3c90c3cc-0d44-4b50-8888-8dd25736052a",
  "status": "<string>"
}
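
For readers calling the endpoint from Python rather than curl, the request above might be built roughly as follows. This is a minimal sketch using only the standard library. The function names `build_create_request` and `create_deployment` are hypothetical, the numeric values in `example_body` are illustrative placeholders rather than recommendations, and the reference above shows no authentication header, so none is added here.

```python
import json
import urllib.request

# URL taken from the curl example above.
CREATE_URL = "https://api.example.com/api/v2/external/inference/create"


def build_create_request(body: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request from the curl example."""
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        CREATE_URL,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_deployment(body: dict, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response body."""
    request = build_create_request(body)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Only the five required fields; the numbers are illustrative assumptions.
example_body = {
    "hf_model_id": "meta-llama/Llama-3.2-1B",
    "target_rps": 5,
    "target_latency_p95_ms": 500,
    "stabilisation_window": 300,
    "gpu_type": "h100",
}
```

Calling `create_deployment(example_body)` would perform the network request; `build_create_request` is kept separate so the payload can be inspected without sending anything.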

Body

application/json

Request body for POST /api/v2/external/inference/create.

hf_model_id
string
required

Hugging Face model ID, e.g. 'meta-llama/Llama-3.2-1B'

target_rps
number
required

Target requests per second per replica, used to decide when to scale up

target_latency_p95_ms
number
required

Target p95 latency in milliseconds, used to decide when to scale up

stabilisation_window
integer
required

Stabilisation window, in seconds, applied to scale-down

gpu_type
string
required

GPU type, e.g. 'h100', 'a100'

hf_token
string | null

Hugging Face access token, needed for gated models

min_replicas
integer
default: 1

Minimum replicas to keep running

Required range: x >= 1

max_replicas
integer
default: 1

Maximum replicas allowed

Required range: x >= 1

gpu_count
integer
default: 1

Number of GPUs per replica

Required range: x >= 1

gpu_vram_gb
integer | null

GPU VRAM in GB per GPU

Required range: x >= 1

vllm_args
string | null

Extra vLLM engine arguments, comma-separated, without the leading '--'
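
The constraints listed for the body fields (five required fields, defaults of 1, and the x >= 1 ranges) can be checked on the client before sending. The sketch below is one possible way to do that; `prepare_body` and `join_vllm_args` are hypothetical helper names, not part of the API. It assumes that a field set to `None` can be treated as omitted, and that a vLLM flag with a value is written as 'name=value' inside the comma-separated string; the reference only states that arguments are comma-separated without the leading '--'.

```python
from typing import Sequence

REQUIRED_FIELDS = (
    "hf_model_id",
    "target_rps",
    "target_latency_p95_ms",
    "stabilisation_window",
    "gpu_type",
)
# Defaults documented for the optional integer fields.
DEFAULTS = {"min_replicas": 1, "max_replicas": 1, "gpu_count": 1}
# Fields documented with "Required range: x >= 1".
MIN_ONE_FIELDS = ("min_replicas", "max_replicas", "gpu_count", "gpu_vram_gb")


def join_vllm_args(args: Sequence[str]) -> str:
    """Strip a leading '--' from each argument and join with commas."""
    cleaned = []
    for arg in args:
        arg = arg.strip()
        if arg.startswith("--"):
            arg = arg[2:]
        if arg:
            cleaned.append(arg)
    return ",".join(cleaned)


def prepare_body(fields: dict) -> dict:
    """Apply documented defaults and check documented constraints."""
    missing = [name for name in REQUIRED_FIELDS if fields.get(name) is None]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    # Assumption: a None value is equivalent to omitting the field.
    provided = {key: value for key, value in fields.items() if value is not None}
    body = {**DEFAULTS, **provided}
    for name in MIN_ONE_FIELDS:
        value = body.get(name)
        if value is None:
            continue  # gpu_vram_gb has no default and may be absent
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            raise ValueError(f"{name} must be an integer >= 1, got {value!r}")
    return body
```

For example, `join_vllm_args(["--max-model-len=4096", "enforce-eager"])` yields a string suitable for `vllm_args` under the assumptions above, and `prepare_body` fills in `min_replicas`, `max_replicas` and `gpu_count` when they are left out.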

Response

Successful Response

Response body for POST /api/v2/external/inference/create.

deployment_id
string<uuid>
required

Stable UUID used for all subsequent inference calls

status
string
required

Initial deployment status, e.g. 'pending'
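
Because `deployment_id` is used for all subsequent inference calls, a client may want to validate and store it as soon as the response arrives. The following is a minimal sketch of that step; `CreateResponse` and `parse_create_response` are hypothetical names, and 'pending' is the only status value the reference mentions, so no other values are assumed.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class CreateResponse:
    deployment_id: uuid.UUID
    status: str


def parse_create_response(payload: dict) -> CreateResponse:
    """Check the two required response fields and parse the UUID."""
    for name in ("deployment_id", "status"):
        if name not in payload:
            raise ValueError(f"response is missing required field {name!r}")
    status = payload["status"]
    if not isinstance(status, str):
        raise ValueError(f"status must be a string, got {status!r}")
    # uuid.UUID raises ValueError if the string is not a well-formed UUID.
    deployment_id = uuid.UUID(str(payload["deployment_id"]))
    return CreateResponse(deployment_id=deployment_id, status=status)
```

Parsing into `uuid.UUID` rejects a malformed identifier immediately, instead of letting it surface later as a failed inference call.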