Dedicated deployments host any Hugging Face model on reserved GPU replicas and expose it through an OpenAI-compatible Chat Completions endpoint.

Deploy

CLI

lyceum infer deploy meta-llama/Llama-3.1-8B-Instruct \
  --hardware-profile gpu.a100 \
  --min-replicas 1 \
  --max-replicas 3

REST API

curl -X POST https://api.lyceum.technology/api/v2/external/inference/create \
  -H "Authorization: Bearer $LYCEUM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "hf_model_id": "meta-llama/Llama-3.1-8B-Instruct",
    "hardware_profile": "gpu.a100",
    "min_replicas": 1,
    "max_replicas": 3
  }'
For gated models, include hf_token in the request body or pass --hf-token to the CLI.
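
If you're scripting the create call, a minimal sketch with Python's requests library is below. The request body matches the curl example above; the shape of the create response isn't documented here, so the printed body is illustrative only.

import os

import requests

payload = {
    "hf_model_id": "meta-llama/Llama-3.1-8B-Instruct",
    "hardware_profile": "gpu.a100",
    "min_replicas": 1,
    "max_replicas": 3,
}
# For gated models, include the token as described in the note above
if os.environ.get("HF_TOKEN"):
    payload["hf_token"] = os.environ["HF_TOKEN"]

resp = requests.post(
    "https://api.lyceum.technology/api/v2/external/inference/create",
    headers={"Authorization": f"Bearer {os.environ['LYCEUM_API_KEY']}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())  # response should include the deployment ID used in later steps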

Wait for it to be healthy

lyceum infer status <deployment_id> --wait
Or poll the API:
curl "https://api.lyceum.technology/api/v2/external/inference/get?deployment_id=<deployment_id>" \
  -H "Authorization: Bearer $LYCEUM_API_KEY"

Call the deployment

curl https://api.lyceum.technology/api/v2/external/v1/chat/completions \
  -H "Authorization: Bearer $LYCEUM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<deployment_id>",
    "messages": [{"role": "user", "content": "Say hello in three languages."}]
  }'
The response body mirrors OpenAI's Chat Completions schema, with the usual top-level fields: id, choices, usage, model, created.
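
The same call from plain Python with the requests library, reading the fields listed above (choices[0].message.content follows the standard OpenAI layout, as the SDK example below also shows):

import os

import requests

r = requests.post(
    "https://api.lyceum.technology/api/v2/external/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['LYCEUM_API_KEY']}"},
    json={
        "model": "<deployment_id>",
        "messages": [{"role": "user", "content": "Say hello in three languages."}],
    },
)
r.raise_for_status()
data = r.json()
print(data["model"], data["created"])
print(data["choices"][0]["message"]["content"])
print(data["usage"])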

From the OpenAI Python SDK

Because the endpoint is OpenAI-compatible, you can point the official openai client at it:
from openai import OpenAI

client = OpenAI(
    api_key="<your_lyceum_api_key>",
    base_url="https://api.lyceum.technology/api/v2/external",
)

resp = client.chat.completions.create(
    model="<deployment_id>",
    messages=[{"role": "user", "content": "hello"}],
)
print(resp.choices[0].message.content)

Stop the deployment

CLI

lyceum infer stop <deployment_id>

REST API

curl -X DELETE https://api.lyceum.technology/api/v2/external/inference/stop \
  -H "Authorization: Bearer $LYCEUM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"deployment_id": "<deployment_id>"}'