HTTP Server¶
ZigLlama ships an HTTP server that exposes an OpenAI-compatible REST API. Any client that speaks the OpenAI chat-completions protocol -- openai-python, langchain, curl -- can target ZigLlama with no code changes beyond pointing it at the server's base URL. The server also adds ZigLlama-specific extensions for advanced sampling, grammar-constrained decoding, and Mirostat sampling.
Server Architecture¶
ServerConfig¶
Every tunable is captured in a single configuration struct:
pub const ServerConfig = struct {
    host: []const u8 = "127.0.0.1",
    port: u16 = 8080,
    max_connections: u32 = 100,
    timeout_seconds: u32 = 30,
    cors_enabled: bool = true,
    api_key: ?[]const u8 = null,
    max_tokens: u32 = 2048,
    enable_streaming: bool = true,
};
Defaults
The server binds to 127.0.0.1:8080 with CORS enabled and no API key. Set api_key to a non-null value to require Authorization: Bearer <key> on every request.
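Because every field has a default, a configuration only needs to name the fields that differ. The following is a minimal sketch of that pattern. The struct is repeated from above so the snippet compiles on its own, and lanConfig is a hypothetical helper rather than part of ZigLlama.

```zig
const std = @import("std");

// Copy of the ServerConfig struct shown above.
pub const ServerConfig = struct {
    host: []const u8 = "127.0.0.1",
    port: u16 = 8080,
    max_connections: u32 = 100,
    timeout_seconds: u32 = 30,
    cors_enabled: bool = true,
    api_key: ?[]const u8 = null,
    max_tokens: u32 = 2048,
    enable_streaming: bool = true,
};

/// Hypothetical helper: a config for an instance reachable from other hosts.
/// The key is passed in by the caller instead of being hard-coded.
fn lanConfig(api_key: ?[]const u8) ServerConfig {
    return ServerConfig{
        .host = "0.0.0.0",
        .api_key = api_key,
        .max_tokens = 512,
    };
}

pub fn main() void {
    const dev = ServerConfig{}; // all defaults: loopback only, no auth
    const lan = lanConfig(null); // a real caller would pass a key here
    std.debug.print("dev: {s}:{d}\n", .{ dev.host, dev.port });
    std.debug.print("lan: {s}:{d} max_tokens={d}\n", .{ lan.host, lan.port, lan.max_tokens });
}
```

Fields that are not named (port, cors_enabled, and so on) keep the defaults listed in the struct.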
Endpoint Routing¶
Routing is performed via a simple method + path comparison inside handleRequest. The server recognises four endpoints:
flowchart TD
REQ[Incoming Request] --> CORS{CORS preflight?}
CORS -- OPTIONS --> PREFLIGHT[Return 204]
CORS -- No --> AUTH{API key valid?}
AUTH -- No --> E401[401 Unauthorized]
AUTH -- Yes --> ROUTE{Route}
ROUTE -- "GET /health" --> HEALTH[handleHealth]
ROUTE -- "GET /v1/models" --> MODELS[handleModels]
ROUTE -- "POST /v1/chat/completions" --> CHAT[handleChatCompletions]
ROUTE -- "POST /v1/completions" --> COMP[handleCompletions]
ROUTE -- else --> E404[404 Not Found]
OpenAI-Compatible API¶
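As an overview of the endpoints covered in this section, the method + path comparison described under Endpoint Routing can be sketched as below. Route and matchRoute are hypothetical names chosen for illustration; the actual handleRequest may be organised differently.

```zig
const std = @import("std");

/// Hypothetical route tags, one per endpoint plus a fallback.
const Route = enum { health, models, chat_completions, completions, not_found };

/// Compare the method first, then the exact path.
fn matchRoute(method: []const u8, path: []const u8) Route {
    const eql = std.mem.eql;
    if (eql(u8, method, "GET")) {
        if (eql(u8, path, "/health")) return .health;
        if (eql(u8, path, "/v1/models")) return .models;
    } else if (eql(u8, method, "POST")) {
        if (eql(u8, path, "/v1/chat/completions")) return .chat_completions;
        if (eql(u8, path, "/v1/completions")) return .completions;
    }
    return .not_found; // the server answers 404 in this case
}

pub fn main() void {
    const r = matchRoute("POST", "/v1/chat/completions");
    std.debug.print("{s}\n", .{@tagName(r)});
}
```

A known path requested with the wrong method (for example GET /v1/completions) falls through to not_found in this sketch.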
POST /v1/chat/completions¶
The primary endpoint. Accepts a JSON body conforming to ChatCompletionRequest:
pub const ChatCompletionRequest = struct {
    model: []const u8,
    messages: []ChatMessage,
    max_tokens: ?u32 = null,
    temperature: ?f32 = null,
    top_p: ?f32 = null,
    top_k: ?u32 = null,
    frequency_penalty: ?f32 = null,
    presence_penalty: ?f32 = null,
    stop: ?[][]const u8 = null,
    stream: ?bool = null,
    // ZigLlama extensions (see below)
    sampling_strategy: ?[]const u8 = null,
    grammar: ?[]const u8 = null,
    mirostat_tau: ?f32 = null,
    typical_mass: ?f32 = null,
};
Each ChatMessage carries a role ("system", "user", or "assistant") and a content string. The server converts the message array into a single prompt using the auto-detected chat template (LLaMA 2, LLaMA 3, ChatML, Mistral, etc.).
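To illustrate what the template step does, here is a minimal sketch that renders a message array in ChatML form. The ChatMessage layout and the renderChatMl function are assumptions made for this example; ZigLlama's own template code and its handling of other formats are not shown here.

```zig
const std = @import("std");

/// Assumed shape of ChatMessage: a role and a content string.
const ChatMessage = struct {
    role: []const u8,
    content: []const u8,
};

/// Render messages as ChatML into `buf` and return the written slice.
/// Fails with error.NoSpaceLeft if `buf` is too small.
fn renderChatMl(buf: []u8, messages: []const ChatMessage) ![]const u8 {
    var len: usize = 0;
    for (messages) |m| {
        const part = try std.fmt.bufPrint(
            buf[len..],
            "<|im_start|>{s}\n{s}<|im_end|>\n",
            .{ m.role, m.content },
        );
        len += part.len;
    }
    // Open an assistant turn so the model continues from here.
    const tail = try std.fmt.bufPrint(buf[len..], "<|im_start|>assistant\n", .{});
    len += tail.len;
    return buf[0..len];
}

pub fn main() !void {
    var buf: [512]u8 = undefined;
    const msgs = [_]ChatMessage{
        .{ .role = "system", .content = "You are a helpful assistant." },
        .{ .role = "user", .content = "Explain attention in transformers." },
    };
    const prompt = try renderChatMl(&buf, &msgs);
    std.debug.print("{s}", .{prompt});
}
```

Other templates differ only in the markers placed around each turn, so the same loop structure applies.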
Request Example¶
{
  "model": "llama-7b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain attention in transformers."}
  ],
  "max_tokens": 256,
  "temperature": 0.7,
  "top_p": 0.9,
  "stream": false
}
Response¶
The response follows ChatCompletionResponse:
{
  "id": "req_65f1a2b_0",
  "object": "chat.completion",
  "created": 1700000000,
  "model": "llama-7b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Attention allows the model to weigh..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 48,
    "total_tokens": 72
  }
}
POST /v1/completions¶
Plain text completion (no chat formatting). Accepts a CompletionRequest with a prompt field instead of messages, and returns a CompletionResponse whose choices[].text contains the generated continuation.
GET /v1/models¶
Lists every model the server is aware of:
{
  "object": "list",
  "data": [
    {"id": "llama-7b", "object": "model", "created": 1677610602, "owned_by": "zigllama"},
    {"id": "llama-13b", "object": "model", "created": 1677610602, "owned_by": "zigllama"},
    {"id": "gpt2-124m", "object": "model", "created": 1677610602, "owned_by": "zigllama"},
    {"id": "mistral-7b", "object": "model", "created": 1677610602, "owned_by": "zigllama"}
  ]
}
GET /health¶
Returns a lightweight health-check payload.
Load balancer integration
Point your load balancer's health probe at /health. The endpoint allocates no memory beyond the static response string and returns in sub-millisecond time.
ZigLlama Extensions¶
The following request fields are not part of the OpenAI specification. All of them are optional and default to null, so standard clients that never send them are unaffected:
| Field | Type | Description |
|---|---|---|
| sampling_strategy | ?string | "mirostat" or "typical" -- selects an advanced sampler. |
| grammar | ?string | JSON schema or regex pattern for grammar-constrained output. |
| mirostat_tau | ?f32 | Target surprise value for Mirostat V2 (default 3.0). |
| typical_mass | ?f32 | Probability mass for Typical sampling (default 0.9). |
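The way these optional fields combine with their documented defaults can be sketched as follows. Sampler, Extensions, and resolveSampler are hypothetical names, and rejecting an unknown strategy with an error is an assumption; the server may instead fall back to its default sampler.

```zig
const std = @import("std");

/// Hypothetical resolved sampler; the server's internal type may differ.
const Sampler = union(enum) {
    default,
    mirostat: struct { tau: f32 },
    typical: struct { mass: f32 },
};

/// The subset of ChatCompletionRequest that selects a sampler.
const Extensions = struct {
    sampling_strategy: ?[]const u8 = null,
    mirostat_tau: ?f32 = null,
    typical_mass: ?f32 = null,
};

/// Pick a sampler, filling in the defaults from the table (3.0 and 0.9).
fn resolveSampler(ext: Extensions) error{UnknownStrategy}!Sampler {
    const name = ext.sampling_strategy orelse return Sampler.default;
    if (std.mem.eql(u8, name, "mirostat")) {
        return Sampler{ .mirostat = .{ .tau = ext.mirostat_tau orelse 3.0 } };
    }
    if (std.mem.eql(u8, name, "typical")) {
        return Sampler{ .typical = .{ .mass = ext.typical_mass orelse 0.9 } };
    }
    return error.UnknownStrategy; // assumption: unknown names are rejected
}

pub fn main() !void {
    const s = try resolveSampler(.{ .sampling_strategy = "mirostat" });
    std.debug.print("mirostat tau = {d}\n", .{s.mirostat.tau});
}
```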
Mirostat-controlled generation
Grammar-constrained JSON output
Streaming via Server-Sent Events¶
When "stream": true, the server responds with Content-Type: text/event-stream and writes one SSE frame per generated token:
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1677652288,"model":"llama-7b","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1677652288,"model":"llama-7b","choices":[{"index":0,"delta":{"content":" world"},"finish_reason":null}]}

data: [DONE]
SSE protocol
Each frame is prefixed with data: and terminated by two newlines (\n\n). The final frame is always the literal string data: [DONE]\n\n. Headers include Cache-Control: no-cache, which stops intermediaries from caching the stream, and Connection: keep-alive, which keeps the connection open between frames.
The StreamChunk and StreamChoice structs mirror their non-streaming counterparts, except that message is replaced by delta containing only the incremental content.
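The framing rules above are small enough to sketch directly. The helper names below (sseFrame, sse_done, ssePayload) are hypothetical and the JSON payload is treated as an opaque, already-serialised string; this is an illustration of the wire format, not the server's actual writer.

```zig
const std = @import("std");

/// Wrap one serialised JSON chunk as an SSE frame: "data: <payload>\n\n".
fn sseFrame(buf: []u8, json_payload: []const u8) ![]const u8 {
    return try std.fmt.bufPrint(buf, "data: {s}\n\n", .{json_payload});
}

/// Terminal frame sent after the last token.
const sse_done = "data: [DONE]\n\n";

/// Client-side inverse: return the payload of a data line.
/// Returns null both for the [DONE] sentinel and for non-data lines,
/// which is a simplification a real client would likely refine.
fn ssePayload(line: []const u8) ?[]const u8 {
    const prefix = "data: ";
    if (!std.mem.startsWith(u8, line, prefix)) return null;
    const body = std.mem.trim(u8, line[prefix.len..], "\r\n");
    if (std.mem.eql(u8, body, "[DONE]")) return null;
    return body;
}

pub fn main() !void {
    var buf: [128]u8 = undefined;
    const frame = try sseFrame(&buf, "{\"delta\":{\"content\":\"Hello\"}}");
    std.debug.print("{s}{s}", .{ frame, sse_done });
}
```

A client reads the stream line by line, passes each non-empty line through something like ssePayload, and appends delta.content from each parsed chunk until the sentinel arrives.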
Authentication¶
Authentication is optional and controlled by ServerConfig.api_key. A null key (the default) disables the check.
When a key is set, every request must include an Authorization: Bearer <key> header.
The server checks the header before routing. A missing or invalid key returns:
{"error": {"message": "Missing or invalid authorization header", "type": "invalid_request_error", "code": 401}}
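The check described above can be sketched as a single predicate. isAuthorized is a hypothetical function written for this page, not the server's actual code, and it uses a plain byte comparison; a hardened deployment might prefer a constant-time comparison.

```zig
const std = @import("std");

/// Returns true when the request may proceed.
/// `expected_key` mirrors ServerConfig.api_key; `auth_header` is the raw
/// Authorization header value, or null when the header is absent.
fn isAuthorized(expected_key: ?[]const u8, auth_header: ?[]const u8) bool {
    const key = expected_key orelse return true; // authentication disabled
    const header = auth_header orelse return false; // header missing
    const prefix = "Bearer ";
    if (!std.mem.startsWith(u8, header, prefix)) return false;
    return std.mem.eql(u8, header[prefix.len..], key);
}

pub fn main() void {
    std.debug.print("{}\n", .{isAuthorized("example-key", "Bearer example-key")});
}
```

When the predicate returns false, the server responds with the 401 body shown above.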
Key storage
Pass the API key via the --api-key CLI flag or an environment variable. Do not hard-code keys in source files.
CORS Configuration¶
Cross-Origin Resource Sharing headers are injected when cors_enabled is true (the default):
| Header | Value |
|---|---|
| Access-Control-Allow-Origin | * |
| Access-Control-Allow-Methods | GET, POST, OPTIONS |
| Access-Control-Allow-Headers | Content-Type, Authorization |
| Access-Control-Max-Age | 86400 (preflight cache) |
OPTIONS preflight requests are answered with an empty body and status 204. Disable CORS via --no-cors when the server sits behind a reverse proxy that handles it.
Deployment Considerations¶
Production readiness
ZigLlama's HTTP server is designed for educational and development use. For high-traffic production deployments, consider placing it behind a reverse proxy (NGINX, Caddy) that provides TLS termination, rate limiting, and connection pooling.
Recommended architecture:
graph LR
Client --> LB[Load Balancer / TLS]
LB --> ZS1[ZigLlama :8080]
LB --> ZS2[ZigLlama :8081]
ZS1 --> Model[Shared Model File via mmap]
ZS2 --> Model
Key parameters for tuning:
| Parameter | Default | Guidance |
|---|---|---|
| max_connections | 100 | Match to expected concurrent users. |
| timeout_seconds | 30 | Increase for large max_tokens values. |
| max_tokens | 2048 | Cap to prevent runaway generation. |
| enable_streaming | true | Disable only when clients do not support SSE. |
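One plausible way for the max_tokens cap to interact with a per-request max_tokens is a simple clamp, sketched below. effectiveMaxTokens is a hypothetical helper, and both the clamping and the fallback to the cap when the request omits the field are assumptions about the server's behaviour.

```zig
const std = @import("std");

/// Hypothetical helper: resolve the token budget for one request.
/// A missing max_tokens falls back to the server cap; larger values are clamped.
fn effectiveMaxTokens(requested: ?u32, server_cap: u32) u32 {
    return @min(requested orelse server_cap, server_cap);
}

pub fn main() void {
    const cap: u32 = 2048; // ServerConfig.max_tokens default
    std.debug.print("{d} {d} {d}\n", .{
        effectiveMaxTokens(null, cap),
        effectiveMaxTokens(256, cap),
        effectiveMaxTokens(100_000, cap),
    });
}
```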
Memory planning:
The server's memory footprint is dominated by the loaded model. A 7B-parameter model in Q4_0 quantisation requires approximately 3.5 GB at a nominal 4 bits per weight, and somewhat more once the per-block scale factors are counted. The HTTP layer adds little overhead by comparison -- each in-flight request allocates at most 1 MB for the request body and a few kilobytes for response serialisation.
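The arithmetic behind these figures can be worked through in a few lines. This sketch assumes the GGML Q4_0 layout (32 weights per block, stored as 16 bytes of 4-bit values plus a 2-byte scale) and treats 1 MB as 1 MiB; the function names are hypothetical. Real model files also keep some tensors at higher precision, so treat the result as an estimate.

```zig
const std = @import("std");

// Assumed Q4_0 block layout: 32 weights in 18 bytes (16 data + 2 scale).
const q4_0_block_weights: u64 = 32;
const q4_0_block_bytes: u64 = 18;

/// Estimated size in bytes of `params` weights stored as Q4_0 blocks.
fn q4_0Bytes(params: u64) u64 {
    const blocks = (params + q4_0_block_weights - 1) / q4_0_block_weights;
    return blocks * q4_0_block_bytes;
}

/// Worst-case request-body buffers: 1 MiB per in-flight request.
fn requestBufferBytes(max_connections: u32) u64 {
    return @as(u64, max_connections) * 1024 * 1024;
}

pub fn main() void {
    const model = q4_0Bytes(7_000_000_000);
    const http = requestBufferBytes(100); // default max_connections
    std.debug.print("model ~{d} MB, request buffers <= {d} MB\n", .{
        model / 1_000_000,
        http / 1_000_000,
    });
}
```

With the default max_connections of 100, the worst-case request buffers stay around 100 MB, which is small next to the model weights.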
Source Reference¶
| File | Key Types |
|---|---|
| src/server/http_server.zig | ZigLlamaServer, ServerConfig, ChatCompletionRequest, ChatCompletionResponse, StreamChunk |
| src/server/cli.zig | CLI argument parser, startup banner |