Skip to content

REST API

The Mullama daemon exposes a REST API for model management, text generation, system monitoring, and health checks. These endpoints complement the OpenAI and Anthropic compatibility APIs.

Base URL: http://localhost:8080 Content-Type: application/json


Management Endpoints

List Models

Retrieve all models known to the daemon (loaded and available).

GET /api/models

Example:

curl http://localhost:8080/api/models

Response:

{
  "models": [
    {
      "alias": "llama3.2:1b",
      "path": "/home/user/.cache/mullama/models/bartowski/Llama-3.2-1B-Instruct-GGUF/Llama-3.2-1B-Instruct-Q4_K_M.gguf",
      "parameters": 1236000000,
      "context_size": 4096,
      "gpu_layers": 35,
      "is_default": true,
      "active_requests": 0
    },
    {
      "alias": "qwen2.5:7b",
      "path": "/home/user/.cache/mullama/models/Qwen/Qwen2.5-7B-Instruct-GGUF/qwen2.5-7b-instruct-q4_k_m.gguf",
      "parameters": 7615000000,
      "context_size": 4096,
      "gpu_layers": 0,
      "is_default": false,
      "active_requests": 1
    }
  ]
}

Get Model Details

Retrieve detailed information about a specific loaded model.

GET /api/models/:name

Example:

curl http://localhost:8080/api/models/llama3.2:1b

Response:

{
  "alias": "llama3.2:1b",
  "path": "/home/user/.cache/mullama/models/bartowski/Llama-3.2-1B-Instruct-GGUF/Llama-3.2-1B-Instruct-Q4_K_M.gguf",
  "parameters": 1236000000,
  "context_size": 4096,
  "gpu_layers": 35,
  "is_default": true,
  "active_requests": 0,
  "architecture": "LlamaForCausalLM",
  "vocab_size": 128256,
  "capabilities": {
    "vision": false,
    "tools": false,
    "thinking": false,
    "json": true
  }
}

Pull Model

Download a model from HuggingFace. Supports streaming progress updates.

POST /api/models/pull

Request Body:

{
  "name": "llama3.2:1b",
  "stream": true
}
Field Type Required Description
name string Yes Model alias or HuggingFace spec
stream boolean No Stream progress updates (default: false)

Example:

curl -X POST http://localhost:8080/api/models/pull \
  -H "Content-Type: application/json" \
  -d '{"name": "llama3.2:1b", "stream": true}'

Non-Streaming Response:

{
  "status": "success",
  "model": "llama3.2:1b",
  "repo": "bartowski/Llama-3.2-1B-Instruct-GGUF",
  "size_bytes": 858993459
}

Streaming Response (NDJSON):

{"status":"downloading","progress":0.0,"total_bytes":858993459,"downloaded_bytes":0}
{"status":"downloading","progress":0.25,"total_bytes":858993459,"downloaded_bytes":214748364}
{"status":"downloading","progress":0.50,"total_bytes":858993459,"downloaded_bytes":429496729}
{"status":"downloading","progress":0.75,"total_bytes":858993459,"downloaded_bytes":644245094}
{"status":"downloading","progress":1.0,"total_bytes":858993459,"downloaded_bytes":858993459}
{"status":"verifying"}
{"status":"success","model":"llama3.2:1b"}

Load Model

Load a model into memory for inference.

POST /api/models/load

Request Body:

{
  "alias": "my-model",
  "path": "/path/to/model.gguf",
  "gpu_layers": 35,
  "context_size": 4096,
  "set_default": false
}
Field Type Required Description
alias string Yes Name/alias for the model
path string No Path to GGUF file (resolves alias if omitted)
gpu_layers integer No GPU layers to offload (default: 0)
context_size integer No Context window size (default: 4096)
mmproj string No Path to vision projector
set_default boolean No Set as default model

Example:

curl -X POST http://localhost:8080/api/models/load \
  -H "Content-Type: application/json" \
  -d '{
    "alias": "my-model",
    "path": "/opt/models/custom.gguf",
    "gpu_layers": 35,
    "context_size": 8192
  }'

Response:

{
  "status": "loaded",
  "alias": "my-model",
  "parameters": 7000000000,
  "context_size": 8192,
  "gpu_layers": 35
}

Unload Model

Unload a model from memory, freeing resources.

POST /api/models/:name/unload

Example:

curl -X POST http://localhost:8080/api/models/llama3.2:1b/unload

Response:

{
  "status": "unloaded",
  "alias": "llama3.2:1b"
}

Delete Model

Remove a model from disk entirely.

DELETE /api/models/:name

Example:

curl -X DELETE http://localhost:8080/api/models/my-model

Response:

{
  "status": "deleted",
  "alias": "my-model"
}

Generation Endpoints

Text Generation

Generate text from a prompt. Supports streaming.

POST /api/generate

Request Body:

{
  "model": "llama3.2:1b",
  "prompt": "Explain quantum computing in simple terms",
  "max_tokens": 512,
  "temperature": 0.7,
  "stream": false
}
Field Type Required Default Description
model string No daemon default Model to use
prompt string Yes -- Input text
max_tokens integer No 512 Maximum tokens
temperature float No 0.7 Sampling temperature
top_p float No 0.9 Nucleus sampling
top_k integer No 40 Top-k sampling
stop array No -- Stop sequences
stream boolean No false Enable streaming
system string No -- System prompt

Example:

curl -X POST http://localhost:8080/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:1b",
    "prompt": "What is the capital of France?",
    "max_tokens": 256
  }'

Non-Streaming Response:

{
  "model": "llama3.2:1b",
  "response": "The capital of France is Paris.",
  "done": true,
  "total_duration_ms": 450,
  "prompt_tokens": 8,
  "completion_tokens": 7,
  "tokens_per_second": 15.5
}

Streaming Response (NDJSON):

curl -X POST http://localhost:8080/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:1b", "prompt": "Hello!", "stream": true}'
{"model":"llama3.2:1b","response":"Hello","done":false}
{"model":"llama3.2:1b","response":"!","done":false}
{"model":"llama3.2:1b","response":" How","done":false}
{"model":"llama3.2:1b","response":" can","done":false}
{"model":"llama3.2:1b","response":" I","done":false}
{"model":"llama3.2:1b","response":" help","done":false}
{"model":"llama3.2:1b","response":"?","done":false}
{"model":"llama3.2:1b","response":"","done":true,"total_duration_ms":320,"prompt_tokens":3,"completion_tokens":7,"tokens_per_second":21.8}

Chat Completion

Generate a chat completion from conversation history. Supports streaming.

POST /api/chat

Request Body:

{
  "model": "llama3.2:1b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"}
  ],
  "max_tokens": 256,
  "stream": false
}
Field Type Required Default Description
model string No daemon default Model to use
messages array Yes -- Conversation messages
max_tokens integer No 512 Maximum tokens
temperature float No 0.7 Sampling temperature
stream boolean No false Enable streaming

Example:

curl -X POST http://localhost:8080/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:1b",
    "messages": [
      {"role": "user", "content": "What is the meaning of life?"}
    ]
  }'

Response:

{
  "model": "llama3.2:1b",
  "message": {
    "role": "assistant",
    "content": "The meaning of life is a profound philosophical question..."
  },
  "done": true,
  "total_duration_ms": 890,
  "prompt_tokens": 12,
  "completion_tokens": 45
}

Generate Embeddings

Generate vector embeddings for input text.

POST /api/embeddings

Request Body:

{
  "model": "nomic-embed",
  "input": "Hello, world!"
}
Field Type Required Description
model string No Model to use (default: daemon default)
input string or array Yes Text(s) to embed

Example (single text):

curl -X POST http://localhost:8080/api/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed",
    "input": "Hello, world!"
  }'

Example (batch):

curl -X POST http://localhost:8080/api/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed",
    "input": ["Hello, world!", "How are you?", "Goodbye!"]
  }'

Response:

{
  "model": "nomic-embed",
  "embeddings": [
    [0.0023, -0.0091, 0.0152, 0.0284, -0.0037, ...],
  ],
  "dimensions": 768,
  "total_tokens": 4
}

System Endpoints

System Status

Get comprehensive system status including uptime, loaded models, and statistics.

GET /api/system/status

Example:

curl http://localhost:8080/api/system/status

Response:

{
  "version": "0.3.0",
  "uptime_secs": 3600,
  "models_loaded": 2,
  "http_endpoint": "http://0.0.0.0:8080",
  "ipc_endpoint": "ipc:///tmp/mullama.sock",
  "default_model": "llama3.2:1b",
  "stats": {
    "requests_total": 150,
    "tokens_generated": 45000,
    "active_requests": 1,
    "gpu_available": true,
    "memory_used_bytes": 4500000000,
    "memory_total_bytes": 16000000000
  }
}

Health Check

Simple health check that returns 200 if the daemon is running and healthy.

GET /health

Example:

curl http://localhost:8080/health

Response:

{
  "status": "ok"
}

HTTP status codes:

  • 200 -- Daemon is healthy
  • 503 -- Daemon is starting up or shutting down

Detailed Status

Returns more detailed status information about the daemon state.

GET /status

Example:

curl http://localhost:8080/status

Response:

{
  "status": "ok",
  "version": "0.3.0",
  "uptime_secs": 3600,
  "models_loaded": 2,
  "active_requests": 0
}

List Default Models

Get the list of pre-configured default models available for quick setup.

GET /api/defaults

Example:

curl http://localhost:8080/api/defaults

Response:

[
  {
    "name": "llama3.2-1b",
    "description": "Meta Llama 3.2 1B - Fast and lightweight",
    "size_hint": "1B",
    "tags": ["chat", "instruct", "fast", "lightweight"],
    "from": "hf:bartowski/Llama-3.2-1B-Instruct-GGUF",
    "has_thinking": false,
    "has_vision": false,
    "has_tools": false
  },
  {
    "name": "deepseek-r1-7b",
    "description": "DeepSeek R1 7B - Advanced reasoning model",
    "size_hint": "7B",
    "tags": ["reasoning", "thinking", "chain-of-thought"],
    "from": "hf:bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF",
    "has_thinking": true,
    "has_vision": false,
    "has_tools": false
  }
]

Use Default Model

Download and load one of the pre-configured default models.

POST /api/defaults/:name/use

Example:

curl -X POST http://localhost:8080/api/defaults/llama3.2-1b/use

Response:

{
  "status": "loading",
  "model": "llama3.2-1b",
  "message": "Downloading and loading model..."
}

Prometheus Metrics

Exposes metrics in Prometheus text exposition format for monitoring integration.

GET /metrics

Example:

curl http://localhost:8080/metrics

Response:

# HELP mullama_info Mullama daemon information
# TYPE mullama_info gauge
mullama_info{version="0.3.0"} 1

# HELP mullama_uptime_seconds Daemon uptime in seconds
# TYPE mullama_uptime_seconds counter
mullama_uptime_seconds 3600

# HELP mullama_models_loaded Number of currently loaded models
# TYPE mullama_models_loaded gauge
mullama_models_loaded 2

# HELP mullama_requests_total Total number of requests processed
# TYPE mullama_requests_total counter
mullama_requests_total{endpoint="chat"} 120
mullama_requests_total{endpoint="generate"} 25
mullama_requests_total{endpoint="embeddings"} 5

# HELP mullama_requests_active Currently active requests
# TYPE mullama_requests_active gauge
mullama_requests_active 1

# HELP mullama_tokens_generated_total Total tokens generated
# TYPE mullama_tokens_generated_total counter
mullama_tokens_generated_total 45000

# HELP mullama_tokens_per_second Average tokens per second
# TYPE mullama_tokens_per_second gauge
mullama_tokens_per_second{model="llama3.2:1b"} 35.2

# HELP mullama_prompt_tokens_total Total prompt tokens processed
# TYPE mullama_prompt_tokens_total counter
mullama_prompt_tokens_total 12500

# HELP mullama_model_parameters Model parameter count
# TYPE mullama_model_parameters gauge
mullama_model_parameters{model="llama3.2:1b"} 1236000000

# HELP mullama_model_context_size Model context size
# TYPE mullama_model_context_size gauge
mullama_model_context_size{model="llama3.2:1b"} 4096

# HELP mullama_model_gpu_layers Model GPU layers
# TYPE mullama_model_gpu_layers gauge
mullama_model_gpu_layers{model="llama3.2:1b"} 35

# HELP mullama_memory_used_bytes Memory used by models
# TYPE mullama_memory_used_bytes gauge
mullama_memory_used_bytes 4500000000

# HELP mullama_request_duration_seconds Request processing duration
# TYPE mullama_request_duration_seconds histogram
mullama_request_duration_seconds_bucket{le="0.1"} 10
mullama_request_duration_seconds_bucket{le="0.5"} 85
mullama_request_duration_seconds_bucket{le="1.0"} 130
mullama_request_duration_seconds_bucket{le="5.0"} 148
mullama_request_duration_seconds_bucket{le="+Inf"} 150
mullama_request_duration_seconds_count 150
mullama_request_duration_seconds_sum 95.5

Streaming Format

Streaming responses use Newline-Delimited JSON (NDJSON). Each line is a complete JSON object followed by a newline character (\n).

Content-Type: application/x-ndjson

The client reads line-by-line and parses each line as JSON. The final line contains "done": true with summary statistics.

curl with streaming

Use --no-buffer or -N with curl to see streaming output in real-time:

curl -N -X POST http://localhost:8080/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:1b", "prompt": "Hello!", "stream": true}'


Error Responses

All endpoints return errors in a consistent format:

{
  "error": {
    "message": "Model 'unknown-model' not found",
    "type": "not_found",
    "code": "model_not_found"
  }
}

Common HTTP Status Codes:

Code Meaning Common Causes
200 Success Request completed
400 Bad request Invalid JSON, missing required fields
404 Not found Model not loaded, invalid endpoint
409 Conflict Model already loaded/unloaded
413 Payload too large Request body exceeds limit
429 Too many requests Rate limit exceeded (if configured)
500 Internal error Model inference failed
503 Service unavailable Daemon starting up, model loading

Error Types:

Type Description
invalid_request Malformed request body
not_found Requested resource does not exist
model_error Model loading or inference failure
conflict Conflicting operation
internal_error Unexpected server error
overloaded Too many concurrent requests

CORS

The API server enables CORS with permissive settings by default:

  • Allowed Origins: Any (*)
  • Allowed Methods: Any
  • Allowed Headers: Any

This allows browser-based applications and the embedded Web UI to communicate with the API without restrictions. Configure allowed origins in the configuration for production deployments.


Route Summary

Method Endpoint Description
GET /api/models List all loaded models
GET /api/models/:name Get model details
POST /api/models/pull Download model from HuggingFace
POST /api/models/load Load model into memory
POST /api/models/:name/unload Unload model from memory
DELETE /api/models/:name Delete model from disk
POST /api/generate Text generation
POST /api/chat Chat completion
POST /api/embeddings Generate embeddings
GET /api/system/status System status
GET /api/defaults List default models
POST /api/defaults/:name/use Load a default model
GET /health Health check
GET /status Detailed status
GET /metrics Prometheus metrics
POST /v1/chat/completions OpenAI chat
POST /v1/completions OpenAI completions
POST /v1/embeddings OpenAI embeddings
GET /v1/models OpenAI models list
POST /v1/messages Anthropic messages