
Edge Deployment

Deploy Mullama on resource-constrained devices for offline, private AI inference. This tutorial covers hardware selection, model optimization, memory management, service configuration, and monitoring.


What You'll Build

An edge inference system that:

  • Runs quantized models on limited hardware (2-8 GB RAM)
  • Optimizes CPU-only inference with tuned thread counts
  • Minimizes memory usage with small contexts and KV cache quantization
  • Starts automatically via systemd service
  • Serves an API for local network access
  • Monitors resource usage on constrained hardware

Prerequisites

  • A supported edge device (see hardware table below)
  • Linux (Ubuntu/Debian or Raspberry Pi OS)
  • Python 3.8+ (pip install mullama)
  • A small quantized GGUF model (Q4_K_M or Q2_K)

Supported Hardware

| Device | RAM | CPU | Expected Speed | Best Model Size |
| --- | --- | --- | --- | --- |
| Raspberry Pi 5 | 8 GB | Cortex-A76 (4c) | 3-8 tok/s | 1B-3B Q4_K_M |
| Raspberry Pi 4 | 4-8 GB | Cortex-A72 (4c) | 1-4 tok/s | 1B Q4_K_M |
| Jetson Nano | 4 GB | Cortex-A57 (4c) + GPU | 5-15 tok/s | 1B-3B Q4_K_M |
| Intel NUC i5 | 16 GB | i5-1240P (12c) | 15-30 tok/s | 3B-7B Q4_K_M |
| Intel NUC i3 | 8 GB | i3-1115G4 (2c/4t) | 8-15 tok/s | 1B-3B Q4_K_M |
| Orange Pi 5 | 8 GB | RK3588 (8c) | 4-10 tok/s | 1B-3B Q4_K_M |

Step 1: Model Selection for Edge

Choose the right model size and quantization for your hardware.

Quantization Levels

| Quantization | Size (1B model) | Size (3B model) | Quality | Speed |
| --- | --- | --- | --- | --- |
| Q2_K | 0.4 GB | 1.2 GB | Low | Fastest |
| Q3_K_M | 0.5 GB | 1.5 GB | Medium-Low | Fast |
| Q4_K_M | 0.7 GB | 2.0 GB | Medium | Good |
| Q5_K_M | 0.8 GB | 2.4 GB | Medium-High | Moderate |
| Q6_K | 0.9 GB | 2.8 GB | High | Slower |
| Q8_0 | 1.1 GB | 3.3 GB | Very High | Slowest |
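
As a rough illustration of how the table above can drive the choice, the sketch below picks the highest-quality quantization whose file fits a given size budget. The sizes are copied from the table; the name `choose_quantization` and the data layout are illustrative assumptions, not part of Mullama.

```python
# Approximate file sizes in GB, copied from the quantization table,
# ordered from highest to lowest quality.
QUANT_SIZES_GB = [
    ("Q8_0",   {"1B": 1.1, "3B": 3.3}),
    ("Q6_K",   {"1B": 0.9, "3B": 2.8}),
    ("Q5_K_M", {"1B": 0.8, "3B": 2.4}),
    ("Q4_K_M", {"1B": 0.7, "3B": 2.0}),
    ("Q3_K_M", {"1B": 0.5, "3B": 1.5}),
    ("Q2_K",   {"1B": 0.4, "3B": 1.2}),
]


def choose_quantization(budget_gb, model="1B"):
    """Return the highest-quality quantization that fits, or None if none do."""
    for name, sizes in QUANT_SIZES_GB:
        if sizes[model] <= budget_gb:
            return name
    return None


print(choose_quantization(2.0, "3B"))  # Q4_K_M
print(choose_quantization(0.3, "1B"))  # None
```

A `None` result means even Q2_K is too large for the budget, so a smaller model is needed.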

Model Size Guidelines

Available RAM - OS overhead (1 GB) - Context memory = Max model size

Example: Raspberry Pi 5 (8 GB)
  8 GB - 1 GB (OS) - 0.5 GB (context) = 6.5 GB max model
  Recommended: 1B-3B models in Q4_K_M (0.7-2.0 GB)
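
The budget formula above can be written as a small helper. This is a minimal sketch: the function name is hypothetical, and the 1 GB OS overhead and 0.5 GB context defaults are simply the figures used in the worked example.

```python
def max_model_size_gb(total_ram_gb, os_overhead_gb=1.0, context_gb=0.5):
    """Available RAM minus OS overhead minus context memory, floored at zero."""
    return max(0.0, total_ram_gb - os_overhead_gb - context_gb)


print(max_model_size_gb(8))  # 6.5 (Raspberry Pi 5, 8 GB)
print(max_model_size_gb(4))  # 2.5 (4 GB device)
```

Treat the result as an upper bound; leaving headroom below it keeps the device responsive.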

# Download a small, optimized model for edge
# Option 1: Via daemon
# mullama pull llama3.2:1b

# Option 2: Direct download (if daemon not available)
import os
import urllib.request

model_url = "https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf"
model_path = "/opt/mullama/models/llama3.2-1b-Q4_K_M.gguf"

# Create the target directory first; writing under /opt usually requires root
os.makedirs(os.path.dirname(model_path), exist_ok=True)

print(f"Downloading model to {model_path}...")
urllib.request.urlretrieve(model_url, model_path)
print("Done!")

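# Create model directory and make it writable by the current user,
# otherwise the wget commands below fail with "Permission denied"
sudo mkdir -p /opt/mullama/models
sudo chown "$USER" /opt/mullama/models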

# Download a small model (Q4_K_M quantization, ~700 MB for 1B model)
wget -O /opt/mullama/models/llama3.2-1b-Q4_K_M.gguf \
    "https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf"

# For extremely constrained devices, use Q2_K (~400 MB)
wget -O /opt/mullama/models/llama3.2-1b-Q2_K.gguf \
    "https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q2_K.gguf"

Step 2: Memory Optimization

Configure Mullama for minimal memory footprint.

from mullama import Model, Context, SamplerParams

# Edge-optimized model loading
model = Model.load(
    "/opt/mullama/models/llama3.2-1b-Q4_K_M.gguf",
    n_gpu_layers=0,      # CPU only (no GPU on most edge devices)
    use_mmap=True,        # Memory-map the model file (reduces RSS)
    use_mlock=False,      # Don't lock in RAM (allow OS to page)
)

# Small context to minimize KV cache memory
ctx = Context(
    model,
    n_ctx=512,            # Small context window (KV cache ~50 MB vs ~400 MB at 4096)
    n_batch=256,          # Smaller batch for lower memory peak
    n_threads=4,          # Match physical core count
)

print(f"Model: {model.name}")
print(f"Size: {model.size / 1e6:.0f} MB")
print(f"Context: {ctx.n_ctx} tokens")

// Node.js equivalent of the Python example above
const { JsModel, JsContext } = require('mullama');

// Edge-optimized model loading
const model = JsModel.load('/opt/mullama/models/llama3.2-1b-Q4_K_M.gguf', {
    nGpuLayers: 0,       // CPU only
    useMmap: true,        // Memory-map the model file
    useMlock: false,      // Don't lock in RAM
});

// Small context to minimize KV cache memory
const ctx = new JsContext(model, {
    nCtx: 512,            // Small context window
    nBatch: 256,          // Smaller batch for lower memory peak
    nThreads: 4,          // Match physical core count
});

console.log(`Model: ${model.name}`);
console.log(`Size: ${(model.size / 1e6).toFixed(0)} MB`);
console.log(`Context: ${ctx.nCtx} tokens`);

Memory Budget Breakdown

| Component | 512 ctx | 2048 ctx | 4096 ctx |
| --- | --- | --- | --- |
| Model (1B Q4) | 700 MB | 700 MB | 700 MB |
| KV Cache | ~50 MB | ~200 MB | ~400 MB |
| Working memory | ~50 MB | ~100 MB | ~150 MB |
| Total | ~800 MB | ~1000 MB | ~1250 MB |

Step 3: CPU Optimization

Tune inference for CPU-bound execution.

import os
import platform

from mullama import Context

def get_optimal_threads():
    """Estimate the physical core count to use as the thread count."""
    # os.cpu_count() reports logical CPUs, which includes hyperthreads
    logical_cpus = os.cpu_count() or 1

    # On ARM boards (RPi), every logical CPU is a physical core.
    # On x86, halve the count as a rough hyperthreading heuristic;
    # CPUs without SMT or with hybrid P/E cores need manual tuning.
    if platform.machine() in ('x86_64', 'AMD64'):
        return max(1, logical_cpus // 2)

    return logical_cpus

def create_edge_context(model, context_size=512):
    """Create an optimized context for edge inference."""
    n_threads = get_optimal_threads()
    n_batch = min(context_size, 256)  # Batch <= context

    ctx = Context(
        model,
        n_ctx=context_size,
        n_batch=n_batch,
        n_threads=n_threads,
    )

    print(f"Edge context: {context_size} ctx, {n_threads} threads, "
          f"batch={n_batch}")
    return ctx

# Usage (model is the Model loaded in Step 2)
ctx = create_edge_context(model, context_size=512)
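
Halving the logical CPU count is only a heuristic. On x86 Linux, a more direct estimate is to count the unique (physical id, core id) pairs in /proc/cpuinfo. The sketch below assumes those fields are present, which is typical on x86 but not on ARM boards, so it falls back to `os.cpu_count()` when they are missing. Both function names are illustrative.

```python
import os


def count_physical_cores(cpuinfo_text):
    """Count unique (physical id, core id) pairs in /proc/cpuinfo text.

    Returns 0 when the fields are absent, which is common on ARM boards.
    """
    cores = set()
    physical_id = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "physical id":
            physical_id = value
        elif key == "core id":
            cores.add((physical_id, value))
    return len(cores)


def physical_core_count(path="/proc/cpuinfo"):
    """Read cpuinfo if possible; otherwise fall back to the logical CPU count."""
    try:
        with open(path) as f:
            found = count_physical_cores(f.read())
    except OSError:
        found = 0
    return found or (os.cpu_count() or 1)
```

The result of `physical_core_count()` could be passed as `n_threads` in place of the heuristic, but it should still be confirmed by benchmarking.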

Thread Count Guidelines

| Device | Physical Cores | Recommended Threads | Notes |
| --- | --- | --- | --- |
| RPi 5 | 4 | 4 | All cores, no HT |
| RPi 4 | 4 | 4 | All cores, no HT |
| Jetson Nano | 4 | 4 | CPU cores only |
| Intel NUC i5 | 4P+8E | 4-8 | Performance cores preferred |
| Intel NUC i3 | 2+2HT | 2 | Physical cores only |

Over-threading

Setting threads higher than physical core count usually hurts performance on edge devices due to cache thrashing and context switching overhead. Always benchmark with different thread counts.
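
One way to run that benchmark is a small timing harness. The sketch below is library-agnostic: `run` is a hypothetical callback that you supply, for example a function that builds a Context with `n_threads=n` and generates a fixed prompt. The harness only measures wall-clock time and keeps the fastest of several repeats per thread count.

```python
import time


def benchmark_thread_counts(run, candidates, repeats=3):
    """Time run(n_threads) for each candidate thread count.

    Returns (best_n, timings), where timings maps each thread count to its
    fastest wall-clock time in seconds across the repeats.
    """
    timings = {}
    for n in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            run(n)
            best = min(best, time.perf_counter() - start)
        timings[n] = best
    best_n = min(timings, key=timings.get)
    return best_n, timings
```

Taking the minimum rather than the mean reduces the influence of background activity, which can be significant on small boards.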


Step 4: Systemd Service

Set up automatic start on boot.

# Create a dedicated user
sudo useradd -r -s /bin/false mullama

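# Create the log directory (ReadWritePaths in the unit below requires it to exist) and set ownership
sudo mkdir -p /opt/mullama/logs
sudo chown -R mullama:mullama /opt/mullama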

# Create the service file
sudo tee /etc/systemd/system/mullama-edge.service << 'EOF'
[Unit]
Description=Mullama Edge Inference Server
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=mullama
Group=mullama
WorkingDirectory=/opt/mullama

# Environment
Environment=MODEL_PATH=/opt/mullama/models/llama3.2-1b-Q4_K_M.gguf
Environment=PORT=8080
Environment=N_CTX=512
Environment=N_THREADS=4

# Run the server
ExecStart=/usr/bin/python3 /opt/mullama/server.py

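# Resource limits
MemoryMax=2G
# CPUQuota is relative to one CPU: 400% allows all 4 cores (100% would cap the service at a single core)
CPUQuota=400%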

# Restart on failure
Restart=on-failure
RestartSec=10

# Security hardening
NoNewPrivileges=yes
ProtectSystem=strict
ReadWritePaths=/opt/mullama/logs

[Install]
WantedBy=multi-user.target
EOF

# Enable and start (save the Step 5 server as /opt/mullama/server.py first, or the service will fail)
sudo systemctl daemon-reload
sudo systemctl enable mullama-edge
sudo systemctl start mullama-edge

# Check status
sudo systemctl status mullama-edge

Step 5: Edge API Server

A lightweight API server optimized for edge hardware.

#!/usr/bin/env python3
"""Mullama Edge API Server - optimized for resource-constrained devices."""

import os, time, json
from http.server import HTTPServer, BaseHTTPRequestHandler
from mullama import Model, Context, SamplerParams

# --- Configuration from environment ---
MODEL_PATH = os.environ.get("MODEL_PATH", "/opt/mullama/models/llama3.2-1b-Q4_K_M.gguf")
PORT = int(os.environ.get("PORT", 8080))
N_CTX = int(os.environ.get("N_CTX", 512))
N_THREADS = int(os.environ.get("N_THREADS", 4))

# --- Load model ---
print(f"Loading model: {MODEL_PATH}")
model = Model.load(MODEL_PATH, n_gpu_layers=0, use_mmap=True)
ctx = Context(model, n_ctx=N_CTX, n_batch=256, n_threads=N_THREADS)
print(f"Ready: {model.name} | ctx={N_CTX} | threads={N_THREADS}")

class EdgeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            self.send_json(200, {
                "status": "ok", "model": model.name or "unknown",
                "context_size": N_CTX, "threads": N_THREADS,
            })
        else:
            self.send_json(404, {"error": "Not found"})

    def do_POST(self):
        if self.path == "/generate":
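            length = int(self.headers.get("Content-Length", 0))
            try:
                body = json.loads(self.rfile.read(length)) if length else {}
            except ValueError:
                # Malformed JSON should produce a 400, not an unhandled exception
                return self.send_json(400, {"error": "invalid JSON"})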

            prompt = body.get("prompt", "")
            max_tokens = min(body.get("max_tokens", 100), N_CTX - 50)
            temperature = body.get("temperature", 0.7)

            if not prompt:
                return self.send_json(400, {"error": "prompt required"})

            start = time.time()
            params = SamplerParams(temperature=temperature)
            text = ctx.generate(prompt, max_tokens=max_tokens, params=params)
            elapsed = time.time() - start
            tokens = len(model.tokenize(text, add_bos=False))
            ctx.clear_cache()

            self.send_json(200, {
                "text": text.strip(),
                "tokens": tokens,
                "time_ms": int(elapsed * 1000),
                "tokens_per_sec": tokens / elapsed if elapsed > 0 else 0,
            })
        else:
            self.send_json(404, {"error": "Not found"})

    def send_json(self, code, data):
        payload = json.dumps(data).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # Suppress default logging for performance

if __name__ == "__main__":
    server = HTTPServer(("0.0.0.0", PORT), EdgeHandler)
    print(f"Edge server listening on port {PORT}")
    try:
        server.serve_forever()
    except KeyboardInterrupt:
        print("\nShutting down...")

// Node.js equivalent of the Python server above
const http = require('http');
const { JsModel, JsContext } = require('mullama');

const MODEL_PATH = process.env.MODEL_PATH || '/opt/mullama/models/llama3.2-1b-Q4_K_M.gguf';
const PORT = parseInt(process.env.PORT || '8080');
const N_CTX = parseInt(process.env.N_CTX || '512');
const N_THREADS = parseInt(process.env.N_THREADS || '4');

console.log(`Loading model: ${MODEL_PATH}`);
const model = JsModel.load(MODEL_PATH, { nGpuLayers: 0, useMmap: true });
const ctx = new JsContext(model, { nCtx: N_CTX, nBatch: 256, nThreads: N_THREADS });
console.log(`Ready: ${model.name} | ctx=${N_CTX} | threads=${N_THREADS}`);

const server = http.createServer((req, res) => {
    if (req.method === 'GET' && req.url === '/health') {
        res.writeHead(200, { 'Content-Type': 'application/json' });
        res.end(JSON.stringify({ status: 'ok', model: model.name, context: N_CTX }));
        return;
    }

    if (req.method === 'POST' && req.url === '/generate') {
        let body = '';
        req.on('data', chunk => body += chunk);
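        req.on('end', () => {
            // Malformed JSON must not crash the server process
            let payload;
            try {
                payload = JSON.parse(body || '{}') || {};
            } catch (err) {
                res.writeHead(400, { 'Content-Type': 'application/json' });
                res.end(JSON.stringify({ error: 'invalid JSON' }));
                return;
            }
            const { prompt, max_tokens = 100, temperature = 0.7 } = payload;
            if (!prompt) {
                res.writeHead(400, { 'Content-Type': 'application/json' });
                res.end(JSON.stringify({ error: 'prompt required' }));
                return;
            }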

            const start = Date.now();
            const text = ctx.generate(prompt, Math.min(max_tokens, N_CTX - 50), { temperature });
            const elapsed = (Date.now() - start) / 1000;
            const tokens = model.tokenize(text, false).length;
            ctx.clearCache();

            res.writeHead(200, { 'Content-Type': 'application/json' });
            res.end(JSON.stringify({
                text: text.trim(), tokens, time_ms: Math.round(elapsed * 1000),
                tokens_per_sec: elapsed > 0 ? tokens / elapsed : 0
            }));
        });
        return;
    }

    res.writeHead(404);
    res.end('Not found');
});

server.listen(PORT, () => console.log(`Edge server on port ${PORT}`));
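
To exercise the server from another machine on the network, a small client is enough. The sketch below uses only the Python standard library and targets the /generate endpoint and JSON fields defined by the server above. The function names are illustrative, and the 120-second timeout is an assumption chosen to tolerate slow edge hardware.

```python
import json
import urllib.request


def build_generate_request(base_url, prompt, max_tokens=100, temperature=0.7):
    """Build the POST request for the /generate endpoint."""
    payload = json.dumps({
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url, prompt, timeout=120, **kwargs):
    """Send the request and return the decoded JSON response as a dict."""
    req = build_generate_request(base_url, prompt, **kwargs)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example (requires a running server):
# result = generate("http://localhost:8080", "Hello!", max_tokens=32)
# print(result["text"], result["time_ms"])
```

Replace `localhost` with the device's address when calling from elsewhere on the local network.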

Step 6: Power and Thermal Considerations

Edge devices have power and thermal limits. Configure accordingly.

| Device | TDP | Idle Power | Inference Power | Notes |
| --- | --- | --- | --- | --- |
| RPi 5 | 12W | 3W | 8-10W | Active cooling recommended |
| RPi 4 | 6W | 2.5W | 5-6W | Passive cooling sufficient |
| Jetson Nano | 10W | 3W | 8-10W | 5W mode available |
| Intel NUC | 28W | 5W | 15-25W | Fan-cooled |

# Monitor CPU temperature (Linux); the value is in millidegrees Celsius
watch -n 1 cat /sys/class/thermal/thermal_zone0/temp

# Limit CPU frequency if overheating (RPi); the value is in kHz, so 1500000 = 1.5 GHz
echo 1500000 | sudo tee /sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq

# Check core voltage and throttling flags (Raspberry Pi only)
vcgencmd measure_volts core
vcgencmd get_throttled
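
The output of `vcgencmd get_throttled` is a hexadecimal bit field such as `throttled=0x50005`, which is hard to read at a glance. The sketch below decodes it using the bit meanings given in the Raspberry Pi documentation; the function name is illustrative.

```python
# Bit positions and meanings as documented for `vcgencmd get_throttled`
THROTTLE_FLAGS = {
    0: "under-voltage detected",
    1: "ARM frequency capped",
    2: "currently throttled",
    3: "soft temperature limit active",
    16: "under-voltage has occurred",
    17: "ARM frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}


def decode_throttled(output):
    """Turn 'throttled=0x...' into a list of human-readable flags."""
    value = int(output.strip().split("=", 1)[1], 16)
    return [label for bit, label in sorted(THROTTLE_FLAGS.items())
            if value & (1 << bit)]


print(decode_throttled("throttled=0x50005"))
# ['under-voltage detected', 'currently throttled',
#  'under-voltage has occurred', 'throttling has occurred']
```

An empty list corresponds to `throttled=0x0`, meaning no power or thermal events since boot.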

Step 7: Monitoring

Monitor resource usage on constrained hardware.

import os, time, threading

class EdgeMonitor:
    def __init__(self, interval=5):
        self.interval = interval
        self.running = False
        self._thread = None

    def start(self):
        self.running = True
        self._thread = threading.Thread(target=self._monitor_loop, daemon=True)
        self._thread.start()

    def stop(self):
        self.running = False

    def _monitor_loop(self):
        while self.running:
            stats = self.get_stats()
            if stats["memory_percent"] > 85:
                print(f"WARNING: Memory at {stats['memory_percent']:.0f}%")
            if stats.get("cpu_temp", 0) > 80:
                print(f"WARNING: CPU temp at {stats['cpu_temp']:.0f}C")
            time.sleep(self.interval)

    def get_stats(self):
        # Memory
        with open("/proc/meminfo") as f:
            # Split on the first colon only, and skip any line without one
            meminfo = dict(line.split(":", 1) for line in f if ":" in line)
        total = int(meminfo["MemTotal"].strip().split()[0]) / 1024
        available = int(meminfo["MemAvailable"].strip().split()[0]) / 1024
        used = total - available

        stats = {
            "memory_total_mb": total,
            "memory_used_mb": used,
            "memory_percent": used / total * 100,
        }

        # CPU temperature
        try:
            with open("/sys/class/thermal/thermal_zone0/temp") as f:
                stats["cpu_temp"] = int(f.read().strip()) / 1000
        except FileNotFoundError:
            pass

        # Load average
        load1, load5, _ = os.getloadavg()  # 15-minute average is not reported
        stats["load_1m"] = load1
        stats["load_5m"] = load5

        return stats

# Usage
monitor = EdgeMonitor(interval=10)
monitor.start()

#!/bin/bash
# edge-monitor.sh - Simple resource monitoring for edge devices (shell alternative to the Python monitor)

while true; do
    TIMESTAMP=$(date +%H:%M:%S)
    MEM_USED=$(free -m | awk '/Mem:/ {print $3}')
    MEM_TOTAL=$(free -m | awk '/Mem:/ {print $2}')
    MEM_PCT=$((MEM_USED * 100 / MEM_TOTAL))
    LOAD=$(cat /proc/loadavg | cut -d' ' -f1)

    # CPU temperature (RPi/ARM)
    TEMP="N/A"
    if [ -f /sys/class/thermal/thermal_zone0/temp ]; then
        # awk avoids a dependency on bc, which minimal images may not include
        TEMP=$(awk '{printf "%.1f", $1 / 1000}' /sys/class/thermal/thermal_zone0/temp)
    fi

    echo "[$TIMESTAMP] Mem: ${MEM_USED}/${MEM_TOTAL}MB (${MEM_PCT}%) | Load: ${LOAD} | Temp: ${TEMP}C"

    # Alert on high usage
    if [ "$MEM_PCT" -gt 90 ]; then
        echo "ALERT: Memory critical!"
    fi

    sleep 5
done

Complete Deployment Script

#!/bin/bash
# deploy-edge.sh - Complete edge deployment setup

set -e

echo "=== Mullama Edge Deployment ==="

# 1. Install dependencies
echo "[1/6] Installing dependencies..."
sudo apt update -qq
sudo apt install -y python3 python3-pip wget curl
pip3 install mullama --break-system-packages 2>/dev/null || pip3 install mullama

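# 2. Create directories (owned by the current user so the download and server steps can write to them)
echo "[2/6] Setting up directories..."
sudo mkdir -p /opt/mullama/{models,logs}
sudo chown -R "$(whoami)" /opt/mullama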

# 3. Download model
MODEL_URL="${MODEL_URL:-https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf}"
MODEL_PATH="/opt/mullama/models/llama3.2-1b-Q4_K_M.gguf"

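if [ ! -f "$MODEL_PATH" ]; then
    echo "[3/6] Downloading model (this may take a while)..."
    # Download under a temporary name so an interrupted transfer is not mistaken for a complete model
    wget -q --show-progress -O "$MODEL_PATH.part" "$MODEL_URL"
    mv "$MODEL_PATH.part" "$MODEL_PATH"
else
    echo "[3/6] Model already exists, skipping download."
fi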

# 4. Deploy server script
echo "[4/6] Deploying server..."
cat > /opt/mullama/server.py << 'PYEOF'
#!/usr/bin/env python3
import os, time, json
from http.server import HTTPServer, BaseHTTPRequestHandler
from mullama import Model, Context, SamplerParams

MODEL_PATH = os.environ.get("MODEL_PATH", "/opt/mullama/models/llama3.2-1b-Q4_K_M.gguf")
PORT = int(os.environ.get("PORT", 8080))
N_CTX = int(os.environ.get("N_CTX", 512))
N_THREADS = int(os.environ.get("N_THREADS", 4))

model = Model.load(MODEL_PATH, n_gpu_layers=0, use_mmap=True)
ctx = Context(model, n_ctx=N_CTX, n_batch=256, n_threads=N_THREADS)
print(f"Ready: {model.name} | ctx={N_CTX} | threads={N_THREADS} | port={PORT}")

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            self.respond(200, {"status": "ok", "model": model.name})
        else:
            self.respond(404, {"error": "not found"})
    def do_POST(self):
        if self.path == "/generate":
            body = json.loads(self.rfile.read(int(self.headers.get("Content-Length", 0))))
            prompt = body.get("prompt", "")
            if not prompt:
                return self.respond(400, {"error": "prompt required"})
            start = time.time()
            text = ctx.generate(prompt, max_tokens=min(body.get("max_tokens", 100), N_CTX-50),
                                params=SamplerParams(temperature=body.get("temperature", 0.7)))
            elapsed = time.time() - start
            ctx.clear_cache()
            self.respond(200, {"text": text.strip(), "time_ms": int(elapsed*1000)})
        else:
            self.respond(404, {"error": "not found"})
    def respond(self, code, data):
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps(data).encode())
    def log_message(self, *args): pass

HTTPServer(("0.0.0.0", PORT), Handler).serve_forever()
PYEOF

# 5. Create systemd service
echo "[5/6] Creating systemd service..."
sudo tee /etc/systemd/system/mullama-edge.service > /dev/null << EOF
[Unit]
Description=Mullama Edge Server
After=network.target

[Service]
Type=simple
User=$(whoami)
Environment=MODEL_PATH=$MODEL_PATH
Environment=PORT=8080
Environment=N_CTX=512
Environment=N_THREADS=$(nproc)
ExecStart=/usr/bin/python3 /opt/mullama/server.py
Restart=on-failure
RestartSec=10
MemoryMax=2G

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable mullama-edge
sudo systemctl start mullama-edge

# 6. Verify
echo "[6/6] Verifying deployment..."
# Model loading can take longer than a few seconds on slow storage, so poll for up to 60 s
READY=0
for attempt in $(seq 1 30); do
    if curl -s http://localhost:8080/health | grep -q "ok"; then
        READY=1
        break
    fi
    sleep 2
done

if [ "$READY" -eq 1 ]; then
    echo ""
    echo "=== Deployment successful! ==="
    echo "  Health: curl http://localhost:8080/health"
    echo "  Generate: curl -X POST http://localhost:8080/generate -H 'Content-Type: application/json' -d '{\"prompt\": \"Hello!\"}'"
    echo "  Logs: journalctl -u mullama-edge -f"
else
    echo "ERROR: Server not responding. Check: journalctl -u mullama-edge"
    exit 1
fi

Latency Expectations

Realistic performance for different configurations:

| Config | Prompt (50 tok) | Generation (100 tok) | Total |
| --- | --- | --- | --- |
| RPi 5, 1B Q4 | ~2s | ~15s | ~17s |
| RPi 5, 1B Q2 | ~1.5s | ~12s | ~13.5s |
| Jetson Nano, 1B Q4 | ~1s | ~10s | ~11s |
| NUC i5, 3B Q4 | ~1s | ~5s | ~6s |
| NUC i5, 1B Q4 | ~0.5s | ~3s | ~3.5s |
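
Each row in the table is the sum of prompt processing time and generation time. The sketch below makes that arithmetic explicit so you can estimate latency for other prompt and output lengths. The function name is illustrative, and the rates in the example are derived from the RPi 5, 1B Q4 row rather than measured separately.

```python
def estimate_latency_s(prompt_tokens, gen_tokens, prompt_tok_per_s, gen_tok_per_s):
    """Total latency = prompt processing time + token generation time."""
    return prompt_tokens / prompt_tok_per_s + gen_tokens / gen_tok_per_s


# RPi 5, 1B Q4 row: 50 prompt tokens in ~2 s is ~25 tok/s,
# and 100 generated tokens in ~15 s is ~6.7 tok/s.
print(round(estimate_latency_s(50, 100, 25.0, 100 / 15), 1))  # 17.0
print(round(estimate_latency_s(50, 200, 25.0, 100 / 15), 1))  # 32.0
```

Generation dominates the total on these devices, which is why capping the output length has the largest effect.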

Reducing Latency

  • Use Q2_K for faster inference, accepting the lower output quality listed under Quantization Levels
  • Reduce max_tokens to minimum needed
  • Keep prompts short (fewer tokens to process)
  • Use greedy sampling (temperature=0) to skip sampling work; this is a small share of per-token cost, so expect only a minor gain
  • Pre-warm the model with a dummy generation on startup
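
The pre-warming step from the list above could look like the sketch below. It assumes a context object exposing `generate(prompt, max_tokens=..., params=...)` and `clear_cache()`, as used in the Step 5 server; the `warm_up` helper itself is hypothetical and not part of Mullama.

```python
import time


def warm_up(ctx, params, prompt="Hello", max_tokens=1):
    """Run one tiny generation so the first real request is not the slow one.

    Returns the warm-up duration in seconds, which is worth logging at startup.
    """
    start = time.perf_counter()
    ctx.generate(prompt, max_tokens=max_tokens, params=params)
    ctx.clear_cache()  # Discard the dummy prompt before serving real requests
    return time.perf_counter() - start
```

With a memory-mapped model, the first generation also pulls model pages from storage into RAM, so calling this once before the server starts accepting requests moves that cost out of the first user-facing request.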

What's Next