pending, running, stopped, failed) and a health flag updated by periodic health checks. The replica count autoscales between min_replicas and max_replicas based on the deployment’s scaling target.
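The clamp between `min_replicas` and `max_replicas` can be sketched as follows. This is a minimal illustration of the scaling math, not the platform's actual autoscaler; the function name and the idea of a per-replica load target are assumptions for illustration.

```python
import math

def desired_replicas(current_load: float, target_per_replica: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Hypothetical autoscaling sketch: pick enough replicas so each one
    stays near the scaling target, clamped to the configured bounds."""
    if target_per_replica <= 0:
        return min_replicas
    needed = math.ceil(current_load / target_per_replica)
    return max(min_replicas, min(max_replicas, needed))
```

For example, with a target of 4 units of load per replica and bounds of 1–5, a load of 10 yields 3 replicas, while a load of 100 is clamped to the maximum of 5.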
This page covers the read-side endpoints for inspecting what’s running, plus guidance on choosing which model to deploy.
Inspecting deployments
| Method | Endpoint | Purpose |
|---|---|---|
| GET | /inference/list | All your deployments |
| GET | /inference/get?deployment_id={id} | Deployment details with replicas |
| DELETE | /inference/stop | Stop a deployment |
| POST | /v1/chat/completions | Send a chat completion to a deployment |
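As a sketch of the chat-completion call, the request below follows the familiar OpenAI-style shape. The base URL, Bearer-token auth header, and exact body fields are assumptions, not confirmed details of this platform:

```python
import json

def chat_completion_request(base_url: str, api_key: str,
                            model: str, user_message: str):
    """Build (but don't send) a POST /v1/chat/completions request.
    Auth scheme and body fields are assumed, OpenAI-style."""
    url = f"{base_url}/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }).encode()
    return url, headers, body
```

Send the result with any HTTP client; the response should contain the completion under the usual `choices` key if the endpoint is OpenAI-compatible.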
GET /inference/get returns the deployment’s status, scaling config, and a list of replicas with their individual health and last health-check timestamps. By default it returns only active deployments — pass include_terminated=true to see stopped ones too (useful for auditing or cost analysis).
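A typical use of the per-replica health data is to flag replicas that are failing their checks. The field names below (`replicas`, `replica_id`, `healthy`, `last_health_check`) are assumptions about the response shape, not a documented schema:

```python
def unhealthy_replicas(deployment: dict) -> list[str]:
    """Return the IDs of replicas whose health flag is false.
    Field names are assumed, not taken from a published schema."""
    return [r["replica_id"]
            for r in deployment.get("replicas", [])
            if not r.get("healthy", False)]

# Illustrative response shape for GET /inference/get (assumed fields).
sample = {
    "deployment_id": "dep-123",
    "status": "running",
    "replicas": [
        {"replica_id": "r-1", "healthy": True,
         "last_health_check": "2024-01-01T00:00:00Z"},
        {"replica_id": "r-2", "healthy": False,
         "last_health_check": "2024-01-01T00:00:00Z"},
    ],
}
```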
Picking a model to deploy
There’s no platform-wide model catalogue API to browse before deploying: any Hugging Face model ID can be passed to POST /inference/create. The dashboard’s Dedicated Inference page offers a curated grid of popular models as a convenience.
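A deployment request with an arbitrary Hugging Face model ID might be prepared like this. Only the endpoint path and the use of a model ID come from this page; the body field names, auth header, and replica parameters are assumptions:

```python
import json
import urllib.request

def create_deployment_request(base_url: str, api_key: str, hf_model_id: str,
                              min_replicas: int = 1,
                              max_replicas: int = 2) -> urllib.request.Request:
    """Prepare (but don't send) a POST /inference/create request.
    Body fields other than the model ID are hypothetical."""
    body = json.dumps({
        "model": hf_model_id,          # any Hugging Face model ID
        "min_replicas": min_replicas,  # assumed scaling bounds fields
        "max_replicas": max_replicas,
    }).encode()
    return urllib.request.Request(
        f"{base_url}/inference/create",
        data=body,
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen` (against a real base URL and key) would then create the deployment, which you can inspect via the endpoints above.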
