Create Dedicated Deployment
POST
Body
application/json
Request body for POST /api/v2/external/inference/create.
HuggingFace model ID, e.g. 'meta-llama/Llama-3.2-1B'
Target requests per second per replica for scale-up
Target p95 latency in milliseconds for scale-up
Stabilisation window in seconds for scale-down
GPU type, e.g. 'h100', 'a100'
HuggingFace token for gated models
Minimum replicas to keep running
Required range: x >= 1
Maximum replicas allowed
Required range: x >= 1
Number of GPUs per replica
Required range: x >= 1
GPU VRAM in GB per GPU
Required range: x >= 1
Extra vLLM engine args, comma-separated without leading --
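
As a rough illustration of the body described above, the following sketch builds a JSON payload, enforces the documented x >= 1 ranges, and formats extra vLLM engine args as a comma-separated string without leading dashes. The JSON key names (model_id, gpu_type, min_replicas, and so on), the base URL, and the helper function names are all assumptions made for this example; the listing above gives only the field descriptions and the request path, so check the actual schema before using it. Authentication is not covered here.

```python
import json
import urllib.request

# Hypothetical base URL: the reference only documents the path.
BASE_URL = "https://api.example.com"
CREATE_PATH = "/api/v2/external/inference/create"


def format_engine_args(args):
    """Join extra vLLM engine args with commas, dropping any leading dashes."""
    return ",".join(arg.lstrip("-") for arg in args)


def build_payload(model_id, gpu_type, vram_gb, min_replicas=1,
                  max_replicas=1, gpus_per_replica=1, engine_args=()):
    """Build the request body. All key names below are assumed, not documented."""
    # Enforce the documented "x >= 1" ranges before anything is sent.
    ranged = {
        "min_replicas": min_replicas,
        "max_replicas": max_replicas,
        "gpus_per_replica": gpus_per_replica,
        "vram_gb": vram_gb,
    }
    for name, value in ranged.items():
        if value < 1:
            raise ValueError(f"{name} must be >= 1, got {value}")

    payload = {"model_id": model_id, "gpu_type": gpu_type, **ranged}
    if engine_args:
        payload["engine_args"] = format_engine_args(engine_args)
    return payload


def build_request(payload, base_url=BASE_URL):
    """Wrap the payload in a POST request object without sending it."""
    return urllib.request.Request(
        base_url + CREATE_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = build_payload(
    "meta-llama/Llama-3.2-1B", "h100", vram_gb=80,
    engine_args=["--max-model-len=4096", "enforce-eager"],
)
request = build_request(payload)
```

Passing the resulting object to urllib.request.urlopen would send it. The sketch stops short of that because the real host and any required credentials are not stated here.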

