
Inference Commands

Spin up models on dedicated GPU instances for inference. The Lyceum inference system handles model deployment, querying, and lifecycle management. Currently, all models are deployed on NVIDIA A100 GPUs and are billed per second of uptime. A typical end-to-end session is sketched after the command table below.

Commands

| Command | Description |
| --- | --- |
| lyceum infer deploy | Deploy a model on a dedicated GPU instance |
| lyceum infer models | List all public models and private deployments |
| lyceum infer chat | Query a deployed model with a prompt |
| lyceum infer spindown | Spin down a deployed model to free GPU capacity |
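These commands compose into a simple deploy-query-spindown session. The sketch below assumes the model ID used elsewhere on this page, that your Hugging Face token is already stored in a shell variable named HF_TOKEN (an illustrative name, not one the CLI requires), and that deploy returns once the instance is ready; confirm the exact behavior with lyceum infer deploy --help.

# End-to-end sketch: deploy, query, then spin down so per-second billing stops
MODEL="mistralai/Mistral-Small-24B-Instruct-2501"
lyceum infer deploy "$MODEL" --hf-token "$HF_TOKEN"
lyceum infer chat -m "$MODEL" -p "Explain quantum computing in two sentences."
lyceum infer spindown "$MODEL"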

lyceum infer deploy

Deploy a model from Hugging Face on a dedicated GPU instance.
lyceum infer deploy <model_id>

Arguments

| Argument | Description |
| --- | --- |
| model_id | (required) Hugging Face model ID (e.g., mistralai/Mistral-Small-24B-Instruct-2501) |

Options

| Option | Description |
| --- | --- |
| --hf-token | Hugging Face token for gated models (required for Mistral, Llama, etc.) |

You will need a Hugging Face token to deploy gated models. Get your token at https://huggingface.co/

Examples

# Deploy a Mistral model (gated, so an HF token is required)
lyceum infer deploy mistralai/Mistral-Small-24B-Instruct-2501 --hf-token YOUR_TOKEN_HERE

# Deploy Llama with HF token
lyceum infer deploy meta-llama/Llama-3-70B-Instruct --hf-token YOUR_TOKEN_HERE
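If you deploy several gated models, you can keep the token in a shell variable instead of pasting it into every command; HF_TOKEN below is just an illustrative name.

# Store the token in a variable and reuse it across deploys
export HF_TOKEN=YOUR_TOKEN_HERE
lyceum infer deploy meta-llama/Llama-3-70B-Instruct --hf-token "$HF_TOKEN"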

Finding Model IDs

Model IDs can be found on Hugging Face by navigating to a model page and copying its namespace and name. For example, the page https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501 corresponds to the model ID mistralai/Mistral-Small-24B-Instruct-2501.

lyceum infer models

List all available public models and your private deployments.
lyceum infer models

Examples

# View all models
lyceum infer models

lyceum infer chat

Query a deployed model with a prompt, with support for plain chat, image input, and batch processing.
lyceum infer chat [OPTIONS]

Options

| Option | Description |
| --- | --- |
| -p, --prompt | The message, or a path to a prompt file (.txt/.yaml/.xml) |
| -m, --model | Model to use. Default: gpt-4 |
| -t, --tokens | Max output tokens. Default: 1000 |
| -n, --no-stream | Disable streaming response |
| --type | Output type (e.g., json, markdown). Default: text |
| -i, --image | Image path or base64 string |
| --url | Image URL |
| --dir | Directory of images |
| --base64 | Treat image input as base64 |
| -b, --batch | JSONL file for batch processing |
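For longer prompts, -p also accepts a file path. A plain-text file simply contains the prompt itself; the structure expected of .yaml and .xml prompt files is not documented on this page.

# Write a multi-line prompt to a file, then pass the path to -p
cat > prompt.txt <<'EOF'
You are a concise technical writer.
Explain the difference between supervised and unsupervised learning, with one example of each.
EOF
lyceum infer chat -m gpt-4 -p prompt.txt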

Examples

# Basic chat with a model
lyceum infer chat -m gpt-4 -p "Explain quantum computing"

# Use a custom deployed model
lyceum infer chat -m mistralai/Mistral-Small-24B-Instruct-2501 -p "Write a Python function to sort a list"

# Chat with prompt from file
lyceum infer chat -m gpt-4 -p prompt.txt

# Disable streaming
lyceum infer chat -m gpt-4 -p "What is AI?" --no-stream

# Request JSON output
lyceum infer chat -m gpt-4 -p "List 3 programming languages" --type json

# Process image with text prompt
lyceum infer chat -m gpt-4 -p "Describe this image" -i image.png

# Process image from URL
lyceum infer chat -m gpt-4 -p "What's in this image?" --url https://example.com/image.jpg

# Batch processing from JSONL file
lyceum infer chat -b requests.jsonl
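The per-line schema of the batch file is not specified on this page. As a rough sketch, assuming each line is a JSON object with model and prompt fields (hypothetical field names, not a documented schema), a two-request file could be prepared like this:

# Create a two-request batch file (field names are assumptions) and submit it
cat > requests.jsonl <<'EOF'
{"model": "gpt-4", "prompt": "Summarize the plot of Hamlet in one sentence."}
{"model": "gpt-4", "prompt": "List three common uses of GPU inference."}
EOF
lyceum infer chat -b requests.jsonl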

lyceum infer spindown

Spin down a deployed model to stop per-second uptime charges and free GPU capacity.
lyceum infer spindown <model_id>

Arguments

| Argument | Description |
| --- | --- |
| model_id | (required) Model ID of the deployment to spin down |

Examples

# Spin down a model
lyceum infer spindown mistralai/Mistral-Small-24B-Instruct-2501

Help

Every inference command is self-documenting. Use the --help flag to see available arguments and options:
lyceum infer --help
lyceum infer deploy --help
lyceum infer chat --help