Tutorial: Building a Chatbot¶
This tutorial combines ZigLlama's HTTP server, chat-template system, and streaming generation into a working chatbot endpoint. You will configure the server, choose a chat template, send requests with curl, and observe token-by-token streaming output.
Prerequisites: Completed Your First Inference. A GGUF model file is recommended but not required (demo mode works without one).
Estimated time: 20 minutes.
Architecture Overview¶
sequenceDiagram
participant Client as curl / UI
participant Server as ZigLlamaServer
participant Template as ChatTemplateManager
participant Engine as TextGenerator
Client->>Server: POST /v1/chat/completions
Server->>Server: Validate API key
Server->>Template: convertChatToPrompt(messages)
Template-->>Server: formatted prompt string
Server->>Engine: generateText(prompt, config)
loop For each token
Engine-->>Server: StreamChunk
Server-->>Client: SSE data frame
end
Server-->>Client: data: [DONE] Step 1: Configure the Server¶
Create a ServerConfig with the settings appropriate for local development:
const http_server = @import("server/http_server.zig");
var config = http_server.ServerConfig{
.host = "127.0.0.1",
.port = 8080,
.cors_enabled = true,
.api_key = "sk-dev-chatbot", // require auth even in dev
.max_tokens = 1024,
.enable_streaming = true,
};
Binding to all interfaces
Change .host to "0.0.0.0" if you want to test from another machine on the same network. Remember to keep the API key set.
Step 2: Initialise the Server and Load a Model¶
const std = @import("std");
const models = @import("models/llama.zig");
pub fn main() !void {
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
defer _ = gpa.deinit();
const allocator = gpa.allocator();
// 1. Create server
var server = try http_server.ZigLlamaServer.init(allocator, config);
defer server.deinit();
// 2. Load model
const model_config = models.LLaMAConfig.init(.LLaMA_7B);
var model = try models.LLaMAModel.init(allocator, model_config);
defer model.deinit();
// 3. Load tokenizer (simplified for demo)
const tokenizer_mod = @import("models/tokenizer.zig");
var tokenizer = try tokenizer_mod.SimpleTokenizer.init(
allocator, model_config.vocab_size,
);
defer tokenizer.deinit();
// 4. Register model with server -- auto-detects chat template
try server.loadModel(&model, @ptrCast(&tokenizer), "llama-7b");
// 5. Start serving
try server.start();
}
loadModel calls ChatTemplateManager.detectTemplate on the model name and selects the best-matching template. For "llama-7b" this resolves to the LLaMA 2 template. The server pre-loads LLaMA 2, LLaMA 3, ChatML, and Mistral templates at startup.
Step 3: Choose a Chat Template¶
ZigLlama supports multiple chat-template formats. Each wraps the conversation history in format-specific delimiters:
| Template | Typical Models | System Prompt Support |
|---|---|---|
Llama2 | LLaMA 2 7B/13B/70B | Yes (<<SYS>> block) |
Llama3 | LLaMA 3 / 3.1 | Yes (<\|begin_of_text\|> tags) |
ChatML | Qwen, Yi, many fine-tunes | Yes (<\|im_start\|>system) |
Mistral | Mistral 7B, Mixtral | Yes ([INST] blocks) |
Alpaca | Alpaca-style fine-tunes | Yes (### Instruction:) |
Claude | Claude-format datasets | Yes (\n\nHuman: / \n\nAssistant:) |
The server auto-detects the template from the model name, but you can override it:
ChatML format
Step 4: Test with curl -- Non-Streaming¶
Open a terminal and send a chat completion request:
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Authorization: Bearer sk-dev-chatbot" \
-H "Content-Type: application/json" \
-d '{
"model": "llama-7b",
"messages": [
{"role": "system", "content": "You are a helpful Zig programming tutor."},
{"role": "user", "content": "How does defer work in Zig?"}
],
"max_tokens": 256,
"temperature": 0.7
}' | python3 -m json.tool
Expected response:
{
"id": "req_65f1a2b_0",
"object": "chat.completion",
"created": 1700000000,
"model": "llama-7b",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! I'm ZigLlama, an educational AI assistant."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 10,
"completion_tokens": 13,
"total_tokens": 23
}
}
Demo responses
Without a real model loaded from a GGUF file, the server returns placeholder text. The request/response protocol is fully functional.
Step 5: Test with curl -- Streaming¶
Add "stream": true to receive Server-Sent Events:
curl -N http://127.0.0.1:8080/v1/chat/completions \
-H "Authorization: Bearer sk-dev-chatbot" \
-H "Content-Type: application/json" \
-d '{
"model": "llama-7b",
"messages": [
{"role": "user", "content": "Tell me about transformers."}
],
"max_tokens": 100,
"stream": true
}'
Output arrives as SSE frames, one per token:
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1677652288,
"model":"llama-7b","choices":[{"index":0,"delta":{"role":"assistant",
"content":"Hello"},"finish_reason":null}]}
data: [DONE]
The -N flag disables buffering in curl so tokens appear in real time.
Step 6: Use Advanced Sampling¶
ZigLlama extends the OpenAI API with sampling_strategy, mirostat_tau, and grammar:
Mirostat -- Adaptive Temperature¶
Mirostat V2 maintains a target "surprise" level, dynamically adjusting temperature to produce consistently creative but coherent text:
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Authorization: Bearer sk-dev-chatbot" \
-H "Content-Type: application/json" \
-d '{
"model": "llama-7b",
"messages": [{"role": "user", "content": "Write a haiku about Zig."}],
"sampling_strategy": "mirostat",
"mirostat_tau": 5.0,
"max_tokens": 50
}'
Grammar-Constrained JSON¶
Force the model to produce valid JSON matching a schema:
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Authorization: Bearer sk-dev-chatbot" \
-H "Content-Type: application/json" \
-d '{
"model": "llama-7b",
"messages": [{"role": "user", "content": "Give me three colours."}],
"grammar": "{\"type\": \"array\", \"items\": {\"type\": \"string\"}}",
"max_tokens": 100
}'
Step 7: Integration with Python Clients¶
Because the API is OpenAI-compatible, the official Python client works out of the box:
from openai import OpenAI
client = OpenAI(
base_url="http://127.0.0.1:8080/v1",
api_key="sk-dev-chatbot",
)
response = client.chat.completions.create(
model="llama-7b",
messages=[
{"role": "system", "content": "You are a Zig expert."},
{"role": "user", "content": "What is comptime?"},
],
max_tokens=200,
stream=True,
)
for chunk in response:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="", flush=True)
Deployment Checklist¶
Before exposing the chatbot beyond localhost:
- Load a real GGUF model (
--model path/to/model.gguf). - Set a strong, random API key (
--api-key $(openssl rand -hex 32)). - Place behind a TLS-terminating reverse proxy (NGINX, Caddy).
- Configure
--max-requeststo match your hardware's throughput. - Set
--timeoutto cover the worst-case generation latency. - Monitor the
/healthendpoint from your uptime checker.
Security
The built-in HTTP server does not support TLS. Always terminate TLS at a reverse proxy when accepting connections from untrusted networks.
What to Try Next¶
- Add a multi-turn conversation by extending the
messagesarray with alternating user/assistant turns. - Experiment with different
temperatureandtop_pvalues to see their effect on response style. - Explore the Demo Walkthroughs for more example programs, including multi-modal and threading demos.