Production Deployment¶
This guide covers deploying the Mullama daemon in production environments with proper process management, security, monitoring, and high availability.
Systemd Service¶
Service File¶
Create /etc/systemd/system/mullama.service:
[Unit]
Description=Mullama LLM Inference Daemon
Documentation=https://docs.mullama.cognisoc.com/daemon/
After=network.target
Wants=network-online.target
[Service]
Type=simple
User=mullama
Group=mullama
WorkingDirectory=/opt/mullama
# Binary and arguments
ExecStart=/opt/mullama/bin/mullama serve \
--model llama3.2:1b \
--model qwen2.5:7b \
--http-port 8080 \
--http-addr 127.0.0.1 \
--gpu-layers 35 \
--context-size 8192 \
--socket ipc:///var/run/mullama/mullama.sock
# Graceful shutdown
ExecStop=/opt/mullama/bin/mullama daemon stop --socket ipc:///var/run/mullama/mullama.sock
TimeoutStopSec=30
KillMode=mixed
KillSignal=SIGTERM
# Restart policy
Restart=on-failure
RestartSec=10
StartLimitIntervalSec=300
StartLimitBurst=5
# Environment
Environment=MULLAMA_CACHE_DIR=/opt/mullama/models
Environment=MULLAMA_LOG_LEVEL=info
Environment=HF_TOKEN=
# Security hardening
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/opt/mullama/models /var/run/mullama /var/log/mullama
PrivateTmp=true
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
RestrictSUIDSGID=true
RestrictNamespaces=true
# Resource limits
LimitNOFILE=65536
MemoryMax=32G
CPUQuota=800%
# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=mullama
# Runtime directory
RuntimeDirectory=mullama
RuntimeDirectoryMode=0755
[Install]
WantedBy=multi-user.target
Setup Commands¶
# Create user and directories
sudo useradd --system --shell /bin/false --home-dir /opt/mullama mullama
sudo mkdir -p /opt/mullama/{bin,models}
sudo mkdir -p /var/run/mullama
sudo mkdir -p /var/log/mullama
# Copy binary
sudo cp target/release/mullama /opt/mullama/bin/
sudo chown -R mullama:mullama /opt/mullama
# Pre-download models
sudo -u mullama MULLAMA_CACHE_DIR=/opt/mullama/models /opt/mullama/bin/mullama pull llama3.2:1b
sudo -u mullama MULLAMA_CACHE_DIR=/opt/mullama/models /opt/mullama/bin/mullama pull qwen2.5:7b
# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable mullama
sudo systemctl start mullama
# Check status
sudo systemctl status mullama
sudo journalctl -u mullama -f
GPU Access¶
For NVIDIA GPU access, add the user to the appropriate group and allow device access:
# Add to [Service] section
SupplementaryGroups=video render
DeviceAllow=/dev/nvidia0 rw
DeviceAllow=/dev/nvidiactl rw
DeviceAllow=/dev/nvidia-uvm rw
DeviceAllow=/dev/nvidia-uvm-tools rw
Docker Deployment¶
Dockerfile¶
FROM rust:1.75-bookworm AS builder
# Install dependencies
RUN apt-get update && apt-get install -y \
cmake \
pkg-config \
libssl-dev \
&& rm -rf /var/lib/apt/lists/*
# Build mullama
WORKDIR /src
COPY . .
RUN git submodule update --init --recursive
RUN cargo build --release --features daemon
# Runtime image
FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y \
libssl3 \
ca-certificates \
&& rm -rf /var/lib/apt/lists/*
RUN useradd --system --shell /bin/false mullama
RUN mkdir -p /opt/mullama/models && chown -R mullama:mullama /opt/mullama
COPY --from=builder /src/target/release/mullama /usr/local/bin/mullama
USER mullama
WORKDIR /opt/mullama
ENV MULLAMA_CACHE_DIR=/opt/mullama/models
ENV MULLAMA_LOG_LEVEL=info
EXPOSE 8080
HEALTHCHECK --interval=30s --timeout=5s --start-period=60s --retries=3 \
CMD curl -sf http://localhost:8080/health || exit 1
ENTRYPOINT ["mullama"]
CMD ["serve", "--http-addr", "0.0.0.0", "--http-port", "8080"]
Dockerfile with CUDA¶
FROM nvidia/cuda:12.4.0-devel-ubuntu22.04 AS builder
RUN apt-get update && apt-get install -y \
curl \
cmake \
pkg-config \
libssl-dev \
&& rm -rf /var/lib/apt/lists/*
# Install Rust
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"
WORKDIR /src
COPY . .
RUN git submodule update --init --recursive
ENV LLAMA_CUDA=1
RUN cargo build --release --features daemon
# Runtime
FROM nvidia/cuda:12.4.0-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y \
libssl3 \
ca-certificates \
curl \
&& rm -rf /var/lib/apt/lists/*
RUN useradd --system --shell /bin/false mullama
RUN mkdir -p /opt/mullama/models && chown -R mullama:mullama /opt/mullama
COPY --from=builder /src/target/release/mullama /usr/local/bin/mullama
USER mullama
WORKDIR /opt/mullama
ENV MULLAMA_CACHE_DIR=/opt/mullama/models
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
EXPOSE 8080
HEALTHCHECK --interval=30s --timeout=5s --start-period=60s --retries=3 \
CMD curl -sf http://localhost:8080/health || exit 1
ENTRYPOINT ["mullama"]
CMD ["serve", "--http-addr", "0.0.0.0", "--http-port", "8080", "--gpu-layers", "99"]
Docker Run¶
# CPU only
docker build -t mullama .
docker run -d \
--name mullama \
-p 8080:8080 \
-v mullama-models:/opt/mullama/models \
mullama \
serve --model llama3.2:1b --http-addr 0.0.0.0
# With NVIDIA GPU
docker build -f Dockerfile.cuda -t mullama-gpu .
docker run -d \
--name mullama-gpu \
--gpus all \
-p 8080:8080 \
-v mullama-models:/opt/mullama/models \
mullama-gpu \
serve --model llama3.2:1b --gpu-layers 35 --http-addr 0.0.0.0
Docker Compose¶
Basic Setup¶
version: '3.8'
services:
mullama:
build: .
container_name: mullama
ports:
- "8080:8080"
volumes:
- mullama-models:/opt/mullama/models
environment:
- MULLAMA_CACHE_DIR=/opt/mullama/models
- MULLAMA_LOG_LEVEL=info
command: serve --model llama3.2:1b --http-addr 0.0.0.0 --http-port 8080
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-sf", "http://localhost:8080/health"]
interval: 30s
timeout: 5s
start_period: 60s
retries: 3
deploy:
resources:
limits:
memory: 16G
cpus: '8'
volumes:
mullama-models:
With NVIDIA GPU¶
version: '3.8'
services:
mullama:
build:
context: .
dockerfile: Dockerfile.cuda
container_name: mullama-gpu
ports:
- "8080:8080"
volumes:
- mullama-models:/opt/mullama/models
environment:
- MULLAMA_CACHE_DIR=/opt/mullama/models
- MULLAMA_LOG_LEVEL=info
- NVIDIA_VISIBLE_DEVICES=all
command: >
serve
--model llama3.2:1b
--model qwen2.5:7b
--gpu-layers 35
--context-size 8192
--http-addr 0.0.0.0
--http-port 8080
restart: unless-stopped
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
limits:
memory: 32G
healthcheck:
test: ["CMD", "curl", "-sf", "http://localhost:8080/health"]
interval: 30s
timeout: 5s
start_period: 120s
retries: 3
volumes:
mullama-models:
driver: local
Full Stack (with Monitoring)¶
version: '3.8'
services:
mullama:
build:
context: .
dockerfile: Dockerfile.cuda
container_name: mullama
ports:
- "8080:8080"
volumes:
- mullama-models:/opt/mullama/models
environment:
- MULLAMA_CACHE_DIR=/opt/mullama/models
- MULLAMA_LOG_LEVEL=info
command: >
serve
--model llama3.2:1b
--gpu-layers 35
--http-addr 0.0.0.0
restart: unless-stopped
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
healthcheck:
test: ["CMD", "curl", "-sf", "http://localhost:8080/health"]
interval: 30s
timeout: 5s
start_period: 120s
retries: 3
prometheus:
image: prom/prometheus:latest
container_name: prometheus
ports:
- "9090:9090"
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- prometheus-data:/prometheus
restart: unless-stopped
grafana:
image: grafana/grafana:latest
container_name: grafana
ports:
- "3000:3000"
volumes:
- grafana-data:/var/lib/grafana
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin
restart: unless-stopped
volumes:
mullama-models:
prometheus-data:
grafana-data:
Reverse Proxy¶
Nginx (HTTPS Termination)¶
upstream mullama_backend {
server 127.0.0.1:8080;
keepalive 32;
}
server {
listen 443 ssl http2;
server_name llm.example.com;
ssl_certificate /etc/letsencrypt/live/llm.example.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/llm.example.com/privkey.pem;
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers HIGH:!aNULL:!MD5;
# Security headers
add_header X-Frame-Options DENY;
add_header X-Content-Type-Options nosniff;
add_header X-XSS-Protection "1; mode=block";
add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
# Request size limit (for base64 images in vision requests)
client_max_body_size 50M;
# Timeouts for long-running generation
proxy_connect_timeout 10s;
proxy_send_timeout 120s;
proxy_read_timeout 300s;
# API endpoints
location / {
proxy_pass http://mullama_backend;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
# SSE streaming support
proxy_buffering off;
proxy_cache off;
proxy_set_header Connection '';
proxy_http_version 1.1;
chunked_transfer_encoding off;
}
# Health check (no auth required)
location /health {
proxy_pass http://mullama_backend/health;
access_log off;
}
# Metrics (restrict access)
location /metrics {
proxy_pass http://mullama_backend/metrics;
allow 10.0.0.0/8;
allow 172.16.0.0/12;
allow 192.168.0.0/16;
deny all;
}
# Optional: Basic auth for API
# location /v1/ {
# auth_basic "Mullama API";
# auth_basic_user_file /etc/nginx/htpasswd;
# proxy_pass http://mullama_backend;
# proxy_buffering off;
# proxy_set_header Connection '';
# proxy_http_version 1.1;
# }
}
# Redirect HTTP to HTTPS
server {
listen 80;
server_name llm.example.com;
return 301 https://$server_name$request_uri;
}
Nginx (Load Balancing Multiple Instances)¶
upstream mullama_pool {
least_conn;
server 127.0.0.1:8080 weight=1;
server 127.0.0.1:8081 weight=1;
server 127.0.0.1:8082 weight=1;
keepalive 64;
}
server {
listen 443 ssl http2;
server_name llm.example.com;
ssl_certificate /etc/letsencrypt/live/llm.example.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/llm.example.com/privkey.pem;
client_max_body_size 50M;
proxy_read_timeout 300s;
location / {
proxy_pass http://mullama_pool;
proxy_buffering off;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header Connection '';
proxy_http_version 1.1;
}
}
Load Balancing¶
Multiple Instances¶
Run multiple Mullama instances with different models or configurations:
# Instance 1: Fast models (low latency)
mullama serve --model llama3.2:1b --http-port 8080 \
--socket ipc:///var/run/mullama-1.sock --gpu-layers 35
# Instance 2: Large models (high quality)
mullama serve --model qwen2.5:7b --http-port 8081 \
--socket ipc:///var/run/mullama-2.sock --gpu-layers 35
# Instance 3: Embedding models
mullama serve --model nomic-embed --http-port 8082 \
--socket ipc:///var/run/mullama-3.sock
HAProxy Configuration¶
global
maxconn 4096
log /dev/log local0
defaults
mode http
timeout connect 10s
timeout client 300s
timeout server 300s
option httplog
frontend mullama_front
bind *:443 ssl crt /etc/haproxy/certs/llm.pem
default_backend mullama_back
# Route embeddings to dedicated instance
acl is_embeddings path_beg /v1/embeddings
use_backend mullama_embed if is_embeddings
backend mullama_back
balance leastconn
option httpchk GET /health
http-check expect status 200
server mullama1 127.0.0.1:8080 check inter 10s fall 3 rise 2
server mullama2 127.0.0.1:8081 check inter 10s fall 3 rise 2
backend mullama_embed
server embed1 127.0.0.1:8082 check inter 10s fall 3 rise 2
Monitoring¶
Prometheus Scraping Configuration¶
Add to prometheus.yml:
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'mullama'
metrics_path: '/metrics'
static_configs:
- targets:
- 'localhost:8080'
- 'localhost:8081'
labels:
service: 'mullama'
scrape_interval: 10s
Key Metrics to Monitor¶
| Metric | Type | Alert Threshold | Description |
|---|---|---|---|
mullama_requests_active |
gauge | > 10 | Currently processing requests |
mullama_models_loaded |
gauge | = 0 | No models loaded |
mullama_memory_used_bytes |
gauge | > 90% system | Memory pressure |
mullama_tokens_per_second |
gauge | < 5 | Slow generation |
mullama_request_duration_seconds |
histogram | p99 > 30s | Slow requests |
mullama_uptime_seconds |
counter | resets | Unexpected restarts |
Alerting Rules¶
groups:
- name: mullama_alerts
rules:
- alert: MullamaDown
expr: up{job="mullama"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Mullama daemon is down"
- alert: MullamaNoModels
expr: mullama_models_loaded == 0
for: 5m
labels:
severity: warning
annotations:
summary: "No models loaded in Mullama"
- alert: MullamaHighLatency
expr: histogram_quantile(0.99, mullama_request_duration_seconds_bucket) > 30
for: 5m
labels:
severity: warning
annotations:
summary: "Mullama p99 latency exceeds 30s"
- alert: MullamaHighMemory
expr: mullama_memory_used_bytes / node_memory_MemTotal_bytes > 0.9
for: 5m
labels:
severity: critical
annotations:
summary: "Mullama using >90% system memory"
Grafana Dashboard¶
Import the following dashboard JSON or create panels for:
- Request rate (requests/second by endpoint)
- Token generation speed (tokens/second by model)
- Active requests gauge
- Memory usage by model
- Request duration histogram
- Model load status
- Error rate
Health Checks¶
HTTP Health Check¶
# Simple health check
curl -sf http://localhost:8080/health || echo "UNHEALTHY"
# Detailed status check
curl -sf http://localhost:8080/api/system/status | jq '{
status: "ok",
models: .models_loaded,
uptime: .uptime_secs,
active: .stats.active_requests
}'
Kubernetes Probes¶
apiVersion: v1
kind: Pod
spec:
containers:
- name: mullama
image: mullama:latest
ports:
- containerPort: 8080
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 120
periodSeconds: 30
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 60
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
startupProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
failureThreshold: 30
Script-Based Health Check¶
#!/bin/bash
# /opt/mullama/healthcheck.sh
set -e
ENDPOINT="${MULLAMA_HEALTH_URL:-http://localhost:8080/health}"
TIMEOUT=5
response=$(curl -sf --max-time $TIMEOUT "$ENDPOINT" 2>/dev/null)
if [ $? -ne 0 ]; then
echo "CRITICAL: Mullama daemon not responding"
exit 2
fi
status=$(echo "$response" | jq -r '.status')
if [ "$status" != "ok" ]; then
echo "WARNING: Mullama status is $status"
exit 1
fi
echo "OK: Mullama daemon healthy"
exit 0
Log Rotation¶
Logrotate Configuration¶
Create /etc/logrotate.d/mullama:
/var/log/mullama/*.log {
daily
missingok
rotate 14
compress
delaycompress
notifempty
create 0640 mullama mullama
sharedscripts
postrotate
systemctl reload mullama 2>/dev/null || true
endscript
}
Journald Configuration¶
For systemd-managed logging, configure /etc/systemd/journald.conf:
Security¶
Running as Non-Root¶
Always run Mullama as a dedicated non-root user:
# Create dedicated user
sudo useradd --system --shell /bin/false --home-dir /opt/mullama mullama
# Set file ownership
sudo chown -R mullama:mullama /opt/mullama
Firewall Rules (UFW)¶
# Allow only specific IPs to access the API
sudo ufw allow from 10.0.0.0/8 to any port 8080 proto tcp
sudo ufw allow from 192.168.1.0/24 to any port 8080 proto tcp
# Allow Prometheus scraping from monitoring network
sudo ufw allow from 10.0.10.0/24 to any port 8080 proto tcp
# Deny all other access to port 8080
sudo ufw deny 8080/tcp
Firewall Rules (iptables)¶
# Allow from specific subnet
iptables -A INPUT -p tcp --dport 8080 -s 10.0.0.0/8 -j ACCEPT
iptables -A INPUT -p tcp --dport 8080 -s 192.168.1.0/24 -j ACCEPT
# Drop all other traffic to port 8080
iptables -A INPUT -p tcp --dport 8080 -j DROP
API Key Authentication¶
Configure an API key requirement in the daemon:
# Set API key via environment variable
export MULLAMA_API_KEY="your-secret-api-key-here"
mullama serve --model llama3.2:1b
Clients must include the key in requests:
curl http://localhost:8080/v1/chat/completions \
-H "Authorization: Bearer your-secret-api-key-here" \
-H "Content-Type: application/json" \
-d '{"model": "llama3.2:1b", "messages": [{"role": "user", "content": "Hello!"}]}'
TLS (Without Reverse Proxy)¶
For environments without a reverse proxy, the daemon can serve HTTPS directly (when built with TLS support):
mullama serve \
--model llama3.2:1b \
--tls-cert /etc/mullama/cert.pem \
--tls-key /etc/mullama/key.pem \
--http-port 8443
Resource Limits¶
Memory Management¶
# Set maximum memory for the systemd service
# In mullama.service [Service] section:
MemoryMax=32G
MemoryHigh=28G # Trigger memory pressure handling at 28G
# Or via cgroups directly
echo "32G" > /sys/fs/cgroup/mullama/memory.max
CPU Pinning¶
For consistent performance, pin the process to specific CPU cores:
# Using taskset
taskset -c 0-7 mullama serve --model llama3.2:1b --threads 8
# In systemd service
# [Service]
CPUAffinity=0 1 2 3 4 5 6 7
GPU Memory¶
Monitor and limit GPU memory usage:
# Check GPU memory before loading
nvidia-smi --query-gpu=memory.free --format=csv,noheader
# Limit GPU layers based on available VRAM
# ~500 MB per layer for 7B models at Q4_K_M
mullama serve --model qwen2.5:7b --gpu-layers 20 # ~10 GB VRAM
Graceful Shutdown¶
The daemon handles graceful shutdown with the following behavior:
- Stops accepting new requests
- Waits for active requests to complete (up to timeout)
- Unloads all models (frees GPU/CPU memory)
- Closes IPC and HTTP sockets
- Exits with code 0
# Graceful stop (waits for active requests)
mullama daemon stop
# Force stop after 10 seconds
mullama daemon stop --timeout 10
# Immediate kill (not recommended)
mullama daemon stop --force
For systemd, configure appropriate timeouts:
Backup and Recovery¶
Model Cache Backup¶
# Backup model cache
tar -czf mullama-models-backup.tar.gz -C /opt/mullama models/
# Restore
tar -xzf mullama-models-backup.tar.gz -C /opt/mullama/
chown -R mullama:mullama /opt/mullama/models
Session Backup¶
# Backup sessions and custom models
tar -czf mullama-config-backup.tar.gz \
/opt/mullama/models/*.json \
/home/mullama/.mullama/sessions/
# Restore
tar -xzf mullama-config-backup.tar.gz -C /
Disaster Recovery Procedure¶
- Install Mullama binary on new host
- Restore model cache from backup (or re-pull from HuggingFace)
- Restore custom model configurations
- Start the daemon with the same configuration
- Verify health:
curl http://localhost:8080/health - Verify models loaded:
curl http://localhost:8080/api/models
Production Checklist¶
Pre-Deployment Checklist
- [ ] Build with release optimizations:
cargo build --release --features daemon - [ ] Pre-download all required models
- [ ] Configure systemd service with resource limits
- [ ] Set up reverse proxy with TLS termination
- [ ] Configure firewall rules
- [ ] Set up Prometheus metrics scraping
- [ ] Configure alerting rules
- [ ] Set up log rotation
- [ ] Run as non-root user with minimal permissions
- [ ] Test health check endpoint
- [ ] Verify graceful shutdown behavior
- [ ] Set appropriate GPU layers for available VRAM
- [ ] Configure backup schedule for model cache
- [ ] Document API key and distribute to authorized users
- [ ] Load test with expected request patterns
- [ ] Set up monitoring dashboard