A dedicated deployment is made up of one or more replicas — identical copies of the model serving requests. Each replica reports its own status (pending, running, stopped, failed) and a health flag updated by periodic health checks. The replica count autoscales between min_replicas and max_replicas based on the deployment’s scaling target. This page covers the read-side endpoints for inspecting what’s running, plus the model catalogue endpoints for discovering what’s available.
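The clamping between `min_replicas` and `max_replicas` can be sketched as follows. Only the two bounds come from the deployment's scaling config; the load metric and per-replica target used here are illustrative assumptions, not the platform's actual autoscaling algorithm.

```python
import math

def desired_replicas(current_load: float, target_per_replica: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Derive a replica count from load, clamped to the configured bounds.

    `current_load` and `target_per_replica` are hypothetical metrics;
    only min_replicas/max_replicas correspond to the deployment config.
    """
    if target_per_replica <= 0:
        return min_replicas
    wanted = math.ceil(current_load / target_per_replica)
    # Never scale below min_replicas or above max_replicas.
    return max(min_replicas, min(max_replicas, wanted))
```

For example, a load of 10 requests/s against a target of 4 per replica yields 3 replicas, but a spike to 100 requests/s is still capped at `max_replicas`.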

Inspecting deployments

| Method | Endpoint | Purpose |
| --- | --- | --- |
| GET | `/inference/list` | All your deployments |
| GET | `/inference/get?deployment_id={id}` | Deployment details with replicas |
| DELETE | `/inference/stop` | Stop a deployment |
| POST | `/v1/chat/completions` | Send a chat completion to a deployment |
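A minimal sketch of a request body for `POST /v1/chat/completions`, in the familiar OpenAI-style shape. Whether the `model` field names the deployment, and which fields are required, are assumptions — check the API reference for the exact schema.

```python
def chat_completion_payload(deployment_model: str, user_message: str) -> dict:
    # Build an OpenAI-style chat completion body for POST /v1/chat/completions.
    # The `model` field naming the deployment is an assumption.
    return {
        "model": deployment_model,
        "messages": [
            {"role": "user", "content": user_message},
        ],
    }
```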
GET /inference/get returns the deployment’s status, scaling config, and a list of replicas with their individual health and last health-check timestamps. By default it returns only active deployments — pass include_terminated=true to see stopped ones too (useful for auditing or cost analysis).
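As a sketch, the query string for `GET /inference/get` might be built like this. The base URL is a placeholder; the `deployment_id` and `include_terminated` parameters are from the description above.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.example.com"  # placeholder base URL (assumption)

def get_deployment_url(deployment_id: str, include_terminated: bool = False) -> str:
    # include_terminated=true also surfaces stopped deployments,
    # e.g. for auditing or cost analysis.
    params = {"deployment_id": deployment_id}
    if include_terminated:
        params["include_terminated"] = "true"
    return f"{BASE_URL}/inference/get?{urlencode(params)}"
```

By default the flag is omitted, matching the endpoint's active-only behaviour.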

Picking a model to deploy

There’s no platform-wide model catalogue API to browse before deploying — any Hugging Face model ID can be passed directly to POST /inference/create. As a convenience, the dashboard’s Dedicated Inference page offers a curated grid of popular models.
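Putting this together, a request body for `POST /inference/create` might look like the sketch below. Only the idea of passing a Hugging Face model ID and the `min_replicas`/`max_replicas` bounds come from this page; the exact field names are assumptions.

```python
def create_deployment_payload(hf_model_id: str,
                              min_replicas: int = 1,
                              max_replicas: int = 1) -> dict:
    # Body for POST /inference/create. Any Hugging Face model ID is accepted;
    # the scaling bounds mirror the autoscaling config described above.
    # Field names are assumptions, not the confirmed schema.
    if min_replicas > max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")
    return {
        "model": hf_model_id,
        "min_replicas": min_replicas,
        "max_replicas": max_replicas,
    }
```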