HTTP Server

ZigLlama ships an HTTP server that exposes an OpenAI-compatible REST API. Any client that speaks the OpenAI chat-completions protocol (openai-python, langchain, curl) can target ZigLlama by changing only the base URL. The server also adds ZigLlama-specific extensions for advanced sampling, grammar-constrained decoding, and Mirostat perplexity control.


Server Architecture

ServerConfig

Every tunable is captured in a single configuration struct:

pub const ServerConfig = struct {
    host: []const u8 = "127.0.0.1",
    port: u16 = 8080,
    max_connections: u32 = 100,
    timeout_seconds: u32 = 30,
    cors_enabled: bool = true,
    api_key: ?[]const u8 = null,
    max_tokens: u32 = 2048,
    enable_streaming: bool = true,
};

Defaults

The server binds to 127.0.0.1:8080 with CORS enabled and no API key. Set api_key to a non-null value to require Authorization: Bearer <key> on every request.
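A minimal sketch of overriding these defaults, assuming the struct is initialised directly from Zig code. The struct is copied from above so the snippet compiles on its own; `requiresAuth` is a hypothetical helper, not part of ZigLlama, and the override values are illustrative.

```zig
const std = @import("std");

// Copied from the ServerConfig definition above so this sketch is self-contained.
pub const ServerConfig = struct {
    host: []const u8 = "127.0.0.1",
    port: u16 = 8080,
    max_connections: u32 = 100,
    timeout_seconds: u32 = 30,
    cors_enabled: bool = true,
    api_key: ?[]const u8 = null,
    max_tokens: u32 = 2048,
    enable_streaming: bool = true,
};

// Hypothetical helper: authentication is required exactly when a key is set.
pub fn requiresAuth(config: ServerConfig) bool {
    return config.api_key != null;
}

pub fn main() void {
    // Fields that are not named keep their declared defaults.
    const defaults = ServerConfig{};
    const custom = ServerConfig{
        .port = 9000,
        .api_key = "sk-my-secret-key", // illustrative only; load real keys at runtime
    };

    std.debug.print("{s}:{d} auth={}\n", .{ defaults.host, defaults.port, requiresAuth(defaults) });
    std.debug.print("{s}:{d} auth={}\n", .{ custom.host, custom.port, requiresAuth(custom) });
}
```

Because every field has a default, `ServerConfig{}` is a valid configuration, and callers only name the fields they want to change.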

Endpoint Routing

Routing is performed via a simple method + path comparison inside handleRequest. The server recognises four endpoints:

flowchart TD
    REQ[Incoming Request] --> CORS{CORS preflight?}
    CORS -- OPTIONS --> PREFLIGHT[Return 204]
    CORS -- No --> AUTH{API key valid?}
    AUTH -- No --> E401[401 Unauthorized]
    AUTH -- Yes --> ROUTE{Route}
    ROUTE -- "GET /health" --> HEALTH[handleHealth]
    ROUTE -- "GET /v1/models" --> MODELS[handleModels]
    ROUTE -- "POST /v1/chat/completions" --> CHAT[handleChatCompletions]
    ROUTE -- "POST /v1/completions" --> COMP[handleCompletions]
    ROUTE -- else --> E404[404 Not Found]
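
The routing step of the flowchart can be sketched as a plain method + path comparison. This is an illustration under assumptions, not the body of handleRequest: the `Route` enum and `matchRoute` function are hypothetical names, and the CORS and authentication checks that precede routing are omitted.

```zig
const std = @import("std");

// Hypothetical enum naming the routing outcomes in the flowchart.
pub const Route = enum { health, models, chat_completions, completions, not_found };

// Compare method first, then path; anything unmatched falls through to 404.
pub fn matchRoute(method: []const u8, path: []const u8) Route {
    const eql = std.mem.eql;
    if (eql(u8, method, "GET")) {
        if (eql(u8, path, "/health")) return .health;
        if (eql(u8, path, "/v1/models")) return .models;
    } else if (eql(u8, method, "POST")) {
        if (eql(u8, path, "/v1/chat/completions")) return .chat_completions;
        if (eql(u8, path, "/v1/completions")) return .completions;
    }
    return .not_found;
}

pub fn main() void {
    std.debug.print("{s}\n", .{@tagName(matchRoute("POST", "/v1/chat/completions"))});
    std.debug.print("{s}\n", .{@tagName(matchRoute("GET", "/v1/chat/completions"))});
}
```

In this sketch a known path requested with the wrong method is treated like any other unmatched request and maps to `not_found`, which follows the single "else" branch in the diagram.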

OpenAI-Compatible API

POST /v1/chat/completions

The primary endpoint. Accepts a JSON body conforming to ChatCompletionRequest:

pub const ChatCompletionRequest = struct {
    model: []const u8,
    messages: []ChatMessage,
    max_tokens: ?u32 = null,
    temperature: ?f32 = null,
    top_p: ?f32 = null,
    top_k: ?u32 = null,
    frequency_penalty: ?f32 = null,
    presence_penalty: ?f32 = null,
    stop: ?[][]const u8 = null,
    stream: ?bool = null,
    // ZigLlama extensions (see below)
    sampling_strategy: ?[]const u8 = null,
    grammar: ?[]const u8 = null,
    mirostat_tau: ?f32 = null,
    typical_mass: ?f32 = null,
};

Each ChatMessage carries a role ("system", "user", or "assistant") and a content string. The server converts the message array into a prompt using the auto-detected chat template (LLaMA 2, LLaMA 3, ChatML, Mistral, etc.).
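
To make the message-to-prompt conversion concrete, here is a minimal sketch for one of the listed templates, ChatML. The field types of `ChatMessage` and the `renderChatML` function are assumptions for illustration; ZigLlama's own template code and its auto-detection logic are not shown in this document.

```zig
const std = @import("std");

// Assumed shape of ChatMessage; the text above only names the two fields.
pub const ChatMessage = struct {
    role: []const u8,
    content: []const u8,
};

// Render messages in ChatML form into `buf` and return the used portion.
// Fails with error.NoSpaceLeft if `buf` is too small.
pub fn renderChatML(buf: []u8, messages: []const ChatMessage) ![]const u8 {
    var len: usize = 0;
    for (messages) |msg| {
        const piece = try std.fmt.bufPrint(
            buf[len..],
            "<|im_start|>{s}\n{s}<|im_end|>\n",
            .{ msg.role, msg.content },
        );
        len += piece.len;
    }
    // Open an assistant turn so the model generates the reply next.
    const tail = try std.fmt.bufPrint(buf[len..], "<|im_start|>assistant\n", .{});
    len += tail.len;
    return buf[0..len];
}

pub fn main() !void {
    var buf: [512]u8 = undefined;
    const messages = [_]ChatMessage{
        .{ .role = "system", .content = "You are a helpful assistant." },
        .{ .role = "user", .content = "Explain attention in transformers." },
    };
    const prompt = try renderChatML(&buf, &messages);
    std.debug.print("{s}", .{prompt});
}
```

Other templates differ only in the delimiters wrapped around each role and message, which is why a single message array can serve every supported model family.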

Request Example

{
  "model": "llama-7b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain attention in transformers."}
  ],
  "max_tokens": 256,
  "temperature": 0.7,
  "top_p": 0.9,
  "stream": false
}

Response

The response follows ChatCompletionResponse:

{
  "id": "req_65f1a2b_0",
  "object": "chat.completion",
  "created": 1700000000,
  "model": "llama-7b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Attention allows the model to weigh..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 48,
    "total_tokens": 72
  }
}

POST /v1/completions

Plain text completion (no chat formatting). Accepts CompletionRequest with a prompt field instead of messages:

{
  "model": "llama-7b",
  "prompt": "The future of AI is",
  "max_tokens": 50,
  "temperature": 0.8
}

Returns a CompletionResponse whose choices[].text contains the generated continuation.

GET /v1/models

Lists every model the server is aware of:

{
  "object": "list",
  "data": [
    {"id": "llama-7b", "object": "model", "created": 1677610602, "owned_by": "zigllama"},
    {"id": "llama-13b", "object": "model", "created": 1677610602, "owned_by": "zigllama"},
    {"id": "gpt2-124m", "object": "model", "created": 1677610602, "owned_by": "zigllama"},
    {"id": "mistral-7b", "object": "model", "created": 1677610602, "owned_by": "zigllama"}
  ]
}

GET /health

Returns a lightweight health-check payload:

{"status": "healthy", "service": "zigllama", "version": "1.0.0"}

Load balancer integration

Point your load balancer's health probe at /health. The endpoint allocates no memory beyond the static response string and returns in sub-millisecond time.


ZigLlama Extensions

The following request fields are not part of the OpenAI specification. All of them are optional, so standard clients that never send them keep working unchanged:

| Field | Type | Description |
| --- | --- | --- |
| sampling_strategy | ?string | "mirostat" or "typical" -- selects an advanced sampler. |
| grammar | ?string | JSON schema or regex pattern for grammar-constrained output. |
| mirostat_tau | ?f32 | Target surprise value for Mirostat V2 (default 3.0). |
| typical_mass | ?f32 | Probability mass for Typical sampling (default 0.9). |

Mirostat-controlled generation

{
  "model": "llama-7b",
  "messages": [{"role": "user", "content": "Write a poem."}],
  "sampling_strategy": "mirostat",
  "mirostat_tau": 5.0,
  "max_tokens": 200
}

Grammar-constrained JSON output

{
  "model": "llama-7b",
  "messages": [{"role": "user", "content": "List 3 colours."}],
  "grammar": "{\"type\": \"array\", \"items\": {\"type\": \"string\"}}",
  "max_tokens": 100
}

Streaming via Server-Sent Events

When "stream": true, the server responds with Content-Type: text/event-stream and writes one SSE frame per generated token:

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1677652288,"model":"llama-7b","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1677652288,"model":"llama-7b","choices":[{"index":0,"delta":{"content":" world"},"finish_reason":null}]}

data: [DONE]

SSE protocol

Each frame is prefixed with data: and terminated by two newlines (\n\n). The final frame is always the literal string data: [DONE]\n\n. Response headers include Cache-Control: no-cache, which stops intermediaries from caching the stream, and Connection: keep-alive, which holds the connection open between frames.

The StreamChunk and StreamChoice structs mirror their non-streaming counterparts, except that message is replaced by delta containing only the incremental content.
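
The framing rules above can be sketched in a few lines. This is an assumption-laden illustration rather than ZigLlama's writer: `formatFrame` and `done_frame` are hypothetical names, and the JSON payload is passed in as an already-serialised string.

```zig
const std = @import("std");

// Terminal frame that closes every stream.
pub const done_frame = "data: [DONE]\n\n";

// Wrap one serialised JSON chunk as an SSE frame:
// a "data: " prefix followed by the payload and a blank-line terminator.
pub fn formatFrame(buf: []u8, json_payload: []const u8) ![]const u8 {
    const frame = try std.fmt.bufPrint(buf, "data: {s}\n\n", .{json_payload});
    return frame;
}

pub fn main() !void {
    var buf: [256]u8 = undefined;
    const chunk = "{\"choices\":[{\"index\":0,\"delta\":{\"content\":\"Hello\"},\"finish_reason\":null}]}";
    const frame = try formatFrame(&buf, chunk);
    std.debug.print("{s}{s}", .{ frame, done_frame });
}
```

Each payload must stay on a single line, because a newline inside the JSON would end the data field early; compact JSON serialisation satisfies this without extra escaping.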


Authentication

Authentication is optional and controlled by ServerConfig.api_key:

server_config.api_key = "sk-my-secret-key";

When set, every request must include:

Authorization: Bearer sk-my-secret-key

The server checks the header before routing. A missing or invalid key returns:

{"error": {"message": "Missing or invalid authorization header", "type": "invalid_request_error", "code": 401}}
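
A minimal sketch of the header check, assuming the raw Authorization header value is available as an optional string. `isAuthorized` is a hypothetical helper written for this example, not a function from http_server.zig.

```zig
const std = @import("std");

// Returns true when the request may proceed to routing.
// `configured_key` stands in for ServerConfig.api_key; `auth_header` is the raw
// Authorization header value, or null when the header is absent.
pub fn isAuthorized(configured_key: ?[]const u8, auth_header: ?[]const u8) bool {
    const key = configured_key orelse return true; // no key configured: auth disabled
    const header = auth_header orelse return false; // key configured, header missing
    const prefix = "Bearer ";
    if (!std.mem.startsWith(u8, header, prefix)) return false;
    // Plain comparison for clarity; a hardened server may prefer a
    // constant-time comparison to avoid leaking timing information.
    return std.mem.eql(u8, header[prefix.len..], key);
}

pub fn main() void {
    const key: ?[]const u8 = "sk-my-secret-key";
    std.debug.print("{}\n", .{isAuthorized(key, "Bearer sk-my-secret-key")});
    std.debug.print("{}\n", .{isAuthorized(key, "Bearer wrong")});
    std.debug.print("{}\n", .{isAuthorized(key, null)});
    std.debug.print("{}\n", .{isAuthorized(null, null)});
}
```

A `false` result corresponds to the 401 response shown above; with no key configured, every request passes.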

Key storage

Pass the API key via the --api-key CLI flag or an environment variable. Do not hard-code keys in source files.


CORS Configuration

Cross-Origin Resource Sharing headers are injected when cors_enabled is true (the default):

| Header | Value |
| --- | --- |
| Access-Control-Allow-Origin | * |
| Access-Control-Allow-Methods | GET, POST, OPTIONS |
| Access-Control-Allow-Headers | Content-Type, Authorization |
| Access-Control-Max-Age | 86400 seconds (preflight cache) |

OPTIONS preflight requests are answered with an empty body and status 204. Disable CORS via --no-cors when the server sits behind a reverse proxy that handles it.
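
For illustration, the preflight reply described above could be assembled at compile time as follows. The constant names are hypothetical and the exact bytes ZigLlama writes are an assumption; only the status code and header values come from this section.

```zig
const std = @import("std");

// Header values as listed in the CORS table.
const cors_headers =
    "Access-Control-Allow-Origin: *\r\n" ++
    "Access-Control-Allow-Methods: GET, POST, OPTIONS\r\n" ++
    "Access-Control-Allow-Headers: Content-Type, Authorization\r\n" ++
    "Access-Control-Max-Age: 86400\r\n";

// Status line, CORS headers, then the blank line that ends the header block.
// No body follows a 204 response.
pub const preflight_response: []const u8 =
    "HTTP/1.1 204 No Content\r\n" ++ cors_headers ++ "\r\n";

pub fn main() void {
    std.debug.print("{s}", .{preflight_response});
}
```

Because the reply never varies, building it with `++` at compile time means answering a preflight needs no allocation or formatting at request time.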


Deployment Considerations

Production readiness

ZigLlama's HTTP server is designed for educational and development use. For high-traffic production deployments, consider placing it behind a reverse proxy (NGINX, Caddy) that provides TLS termination, rate limiting, and connection pooling.

Recommended architecture:

graph LR
    Client --> LB[Load Balancer / TLS]
    LB --> ZS1[ZigLlama :8080]
    LB --> ZS2[ZigLlama :8081]
    ZS1 --> Model[Shared Model File via mmap]
    ZS2 --> Model

Key parameters for tuning:

| Parameter | Default | Guidance |
| --- | --- | --- |
| max_connections | 100 | Match to expected concurrent users. |
| timeout_seconds | 30 | Increase for large max_tokens values. |
| max_tokens | 2048 | Cap to prevent runaway generation. |
| enable_streaming | true | Disable only when clients do not support SSE. |

Memory planning:

The server's memory footprint is dominated by the loaded model. A 7B-parameter model in Q4_0 quantisation requires approximately 3.5 GB. The HTTP layer adds negligible overhead -- each in-flight request allocates at most 1 MB for the request body and a few kilobytes for response serialisation.
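
The arithmetic behind these figures can be checked with a short sketch. Both functions are hypothetical helpers; the model estimate assumes exactly 4 bits per weight and ignores the per-block scale factors that quantised files also store, and 1 MB is taken as 1,000,000 bytes.

```zig
const std = @import("std");

const mb: u64 = 1_000_000;

// Upper bound on request-body buffers: one 1 MB body per open connection.
pub fn worstCaseRequestBytes(max_connections: u32) u64 {
    return @as(u64, max_connections) * mb;
}

// Idealised weight size at `bits_per_weight` bits per parameter.
pub fn weightBytes(params: u64, bits_per_weight: u64) u64 {
    return params * bits_per_weight / 8;
}

pub fn main() void {
    const model = weightBytes(7_000_000_000, 4); // 7B parameters at 4 bits each
    const http = worstCaseRequestBytes(100); // default max_connections
    std.debug.print("model ~{d} MB, request buffers <= {d} MB\n", .{ model / mb, http / mb });
}
```

With the default of 100 connections, the worst-case request buffers total about 100 MB, under 3% of the roughly 3,500 MB estimated for the model weights.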


Source Reference

| File | Key Types |
| --- | --- |
| src/server/http_server.zig | ZigLlamaServer, ServerConfig, ChatCompletionRequest, ChatCompletionResponse, StreamChunk |
| src/server/cli.zig | CLI argument parser, startup banner |