Benchmarks¶

This page presents performance data for Mullama across key dimensions: inference throughput, call overhead, memory efficiency, GPU utilization, and batch processing. Where applicable, comparisons are made against HTTP-based inference to quantify the advantages of native bindings.

Benchmark Status

Entries marked (pending) mean the measurement framework is in place but final numbers have not yet been published. They will be filled in as formal benchmarking is completed. The methodology sections below describe how each measurement is collected so that results can be reproduced.

Methodology¶

Hardware Reference Configurations¶

Benchmarks are collected across multiple hardware configurations to represent common deployment scenarios:

Configuration	CPU	GPU	RAM	Use Case
Desktop	AMD Ryzen 9 7950X (16C/32T)	NVIDIA RTX 4090 (24GB)	64GB DDR5	Development, high-throughput
Workstation	Intel Xeon w7-2495X (24C/48T)	NVIDIA A6000 (48GB)	128GB DDR5	Production inference
Laptop	Apple M3 Max (16C)	Integrated (40-core GPU)	48GB unified	Mobile development
Edge	Raspberry Pi 5	None (CPU only)	8GB	Edge deployment

Measurement Approach¶

Warm-up: 10 inference runs discarded before measurement
Iterations: Minimum 100 runs per data point (1000 for latency measurements)
Timing: std::time::Instant for Rust, process.hrtime() for Node.js, time.perf_counter() for Python
Token counting: Exact token count from tokenizer, not estimated
Statistical reporting: Median (p50), p95, and p99 values reported
Isolation: Single process, no background workloads, CPU governor set to performance
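
The measurement steps above can be sketched as a small harness. This is a minimal illustration, not the project's benchmark code: `measure` and `percentile` are hypothetical helpers, the workload is a placeholder closure, and the nearest-rank percentile method is an assumption, since the source does not say which method is used.

```rust
use std::time::{Duration, Instant};

/// Nearest-rank percentile over an ascending-sorted, non-empty slice.
fn percentile(sorted: &[Duration], p: f64) -> Duration {
    let rank = (p * sorted.len() as f64 / 100.0).ceil() as usize;
    sorted[rank.clamp(1, sorted.len()) - 1]
}

/// Runs `op` `warmup` times untimed, then times `iterations` runs and
/// returns the samples sorted ascending.
fn measure<F: FnMut()>(mut op: F, warmup: usize, iterations: usize) -> Vec<Duration> {
    for _ in 0..warmup {
        op(); // warm-up runs are discarded
    }
    let mut samples = Vec::with_capacity(iterations);
    for _ in 0..iterations {
        let start = Instant::now();
        op();
        samples.push(start.elapsed());
    }
    samples.sort();
    samples
}

fn main() {
    // Placeholder workload; a real run would call into the binding here.
    let mut sink = 0u64;
    let samples = measure(
        || sink = sink.wrapping_add(std::hint::black_box(1)),
        10,   // warm-up runs
        1000, // timed runs (latency measurements)
    );
    std::hint::black_box(sink);
    println!(
        "p50={:?} p95={:?} p99={:?}",
        percentile(&samples, 50.0),
        percentile(&samples, 95.0),
        percentile(&samples, 99.0)
    );
}
```

Sorting once and indexing by rank keeps the reporting step cheap even at 1000 samples per data point.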

Models Used¶

Model	Parameters	Quantization	Context	Purpose
Llama 3.2 1B Instruct	1.24B	Q4_K_M	4096	Small model baseline
Llama 3.2 3B Instruct	3.21B	Q4_K_M	4096	Medium model
Qwen 2.5 7B Instruct	7.62B	Q4_K_M	8192	Standard model
Llama 3.1 13B	13.02B	Q4_K_M	8192	Large model

Inference Throughput¶

Text Generation (tokens/second)¶

Generation throughput measured with a fixed 128-token prompt and 256-token output.

GPU (RTX 4090)

Model	Mullama (Native)	Mullama (Daemon)	HTTP Baseline	Notes
Llama 3.2 1B	(pending)	(pending)	(pending)	All layers on GPU
Llama 3.2 3B	(pending)	(pending)	(pending)	All layers on GPU
Qwen 2.5 7B	(pending)	(pending)	(pending)	All layers on GPU
Llama 3.1 13B	(pending)	(pending)	(pending)	All layers on GPU

GPU (Apple M3 Max)

Model	Mullama (Native)	Mullama (Daemon)	HTTP Baseline	Notes
Llama 3.2 1B	(pending)	(pending)	(pending)	Metal, unified memory
Llama 3.2 3B	(pending)	(pending)	(pending)	Metal, unified memory
Qwen 2.5 7B	(pending)	(pending)	(pending)	Metal, unified memory
Llama 3.1 13B	(pending)	(pending)	(pending)	Metal, unified memory

CPU Only (Ryzen 9 7950X)

Model	Mullama (Native)	Mullama (Daemon)	HTTP Baseline	Notes
Llama 3.2 1B	(pending)	(pending)	(pending)	AVX-512, 32 threads
Llama 3.2 3B	(pending)	(pending)	(pending)	AVX-512, 32 threads
Qwen 2.5 7B	(pending)	(pending)	(pending)	AVX-512, 32 threads
Llama 3.1 13B	(pending)	(pending)	(pending)	AVX-512, 32 threads

Throughput Parity

For sustained text generation, throughput (tokens/second) is expected to be nearly identical between native and daemon modes, since the bottleneck is the model computation itself, not the call mechanism. The difference shows up in latency and overhead measurements below.
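
A short worked calculation shows why a fixed per-request overhead barely moves sustained throughput. The 10 ms per-token cost is a hypothetical figure chosen for illustration, not a measurement. The overheads are the approximate per-call values quoted in the overhead section, and `tokens_per_second` is a helper introduced only for this sketch.

```rust
/// Tokens per second when a fixed per-request overhead is added to generation time.
fn tokens_per_second(tokens: u32, per_token_ms: f64, overhead_ms: f64) -> f64 {
    let total_ms = tokens as f64 * per_token_ms + overhead_ms;
    tokens as f64 / (total_ms / 1000.0)
}

fn main() {
    // Hypothetical: 10 ms per generated token, 256-token output.
    let native = tokens_per_second(256, 10.0, 0.003); // ~3 us call overhead
    let daemon = tokens_per_second(256, 10.0, 2.0); // ~2 ms HTTP overhead
    println!("native: {native:.2} tok/s, daemon: {daemon:.2} tok/s");
    println!("relative gap: {:.3}%", (native - daemon) / native * 100.0);
}
```

Under these assumptions the gap stays well under 1%, because the overhead is paid once per request while the per-token cost is paid 256 times.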

First Token Latency¶

Time from request initiation to first generated token (Time To First Token, TTFT). This is where call overhead is most visible.

Model	Native Binding	Daemon (HTTP)	Overhead Delta	Notes
Llama 3.2 1B	(pending)	(pending)	(pending)	GPU, warm cache
Llama 3.2 3B	(pending)	(pending)	(pending)	GPU, warm cache
Qwen 2.5 7B	(pending)	(pending)	(pending)	GPU, warm cache
Llama 3.1 13B	(pending)	(pending)	(pending)	GPU, warm cache

Why TTFT Matters

For interactive applications, TTFT determines perceived responsiveness. A native binding eliminates the HTTP round-trip, connection setup, and JSON parsing that add latency before any tokens are generated.

Cold Start vs Warm Start¶

Scenario	Native	Daemon (HTTP)	Notes
Cold start (model load + first inference)	(pending)	(pending)	Includes mmap
Warm start (model cached, first inference)	(pending)	(pending)	Subsequent calls
Hot path (repeated inference, same context)	(pending)	(pending)	Steady state

Binding Overhead Comparison¶

This section isolates the overhead of the call mechanism itself, independent of model computation. It is measured by timing the round-trip from application code to the Mullama core and back, using a minimal operation (tokenizing a short string).

Measured Call Overhead¶

Binding Type	Median Latency	p95 Latency	p99 Latency	Notes
Rust (native, in-process)	~0.5 us	~1 us	~2 us	Direct function call
C/C++ (FFI)	~1 us	~2 us	~3 us	Minimal FFI boundary
Node.js (NAPI-RS)	~3 us	~5 us	~8 us	Napi thread-safe function
Python (PyO3)	~5 us	~8 us	~12 us	GIL acquisition included
Go (cgo)	~4 us	~7 us	~10 us	Goroutine scheduling
PHP (FFI)	~8 us	~12 us	~18 us	FFI call overhead
HTTP localhost	~1,500 us	~3,000 us	~5,000 us	Full round-trip

The 100-1000x Difference Visualized¶

Call overhead (log scale, microseconds):

Native Rust  |######                                    | ~0.5 us
C/C++ FFI    |########                                  | ~1 us
Node.js NAPI |############                              | ~3 us
Go cgo       |#############                             | ~4 us
Python PyO3  |##############                            | ~5 us
PHP FFI      |###############                           | ~8 us
HTTP local   |#################################         | ~1,500 us
              0.1     1       10      100     1,000   10,000 us

Key Insight

Native bindings operate at the microsecond level. HTTP operates at the millisecond level. This is a fundamental architectural difference -- not an optimization gap that can be closed through better HTTP implementations. The network stack, serialization, and process boundaries impose inherent costs.

Cumulative Impact¶

Per-call overhead accumulates linearly with the number of operations. The table assumes roughly 3 us per native call and roughly 2 ms per HTTP call:

Operations	Native (total overhead)	HTTP (total overhead)	Time Saved
1	~3 us	~2 ms	2 ms
10	~30 us	~20 ms	20 ms
100	~300 us	~200 ms	200 ms
1,000	~3 ms	~2,000 ms	2 seconds
10,000	~30 ms	~20,000 ms	20 seconds
100,000	~300 ms	~200,000 ms	3.3 minutes
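
The table rows follow from simple multiplication. The sketch below reproduces them, taking the per-call figures from the first row (3 us native, 2 ms HTTP); `total_overhead_us` is a helper introduced only for this illustration.

```rust
/// Total call overhead in microseconds for `ops` calls at `per_call_us` each.
fn total_overhead_us(ops: u64, per_call_us: u64) -> u64 {
    ops * per_call_us
}

fn main() {
    const NATIVE_US: u64 = 3; // per-call figure used in the table
    const HTTP_US: u64 = 2_000; // ~2 ms per HTTP round-trip
    for ops in [1u64, 10, 100, 1_000, 10_000, 100_000] {
        let native = total_overhead_us(ops, NATIVE_US);
        let http = total_overhead_us(ops, HTTP_US);
        let saved_s = (http - native) as f64 / 1e6;
        println!("{ops:>7} ops: native {native} us, HTTP {http} us, saved {saved_s:.4} s");
    }
}
```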

Memory Efficiency¶

Model Loading with mmap¶

Mullama uses memory-mapped file I/O (mmap) for model loading, which provides significant advantages:

Metric	With mmap	Without mmap	Benefit
Load time (7B model)	(pending)	(pending)	Near-instant "loading"
RSS at load	(pending)	(pending)	Pages loaded on demand
Shared memory (multi-process)	Yes	No	Multiple instances share pages
Swap efficiency	Excellent	Poor	OS manages paging

Memory Usage by Model Size¶

Model	Model File Size	Peak RSS (GPU offload)	Peak RSS (CPU only)
Llama 3.2 1B (Q4_K_M)	~0.7 GB	(pending)	(pending)
Llama 3.2 3B (Q4_K_M)	~1.9 GB	(pending)	(pending)
Qwen 2.5 7B (Q4_K_M)	~4.4 GB	(pending)	(pending)
Llama 3.1 13B (Q4_K_M)	~7.4 GB	(pending)	(pending)

KV Cache Memory¶

Context window size directly impacts KV cache memory. Measured for a 7B model:

Context Size	KV Cache (FP16)	KV Cache (Q8)	KV Cache (Q4)
2048	(pending)	(pending)	(pending)
4096	(pending)	(pending)	(pending)
8192	(pending)	(pending)	(pending)
16384	(pending)	(pending)	(pending)
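
Until measured values are published, KV cache size can be estimated from the model shape. The sketch below is a back-of-envelope calculation, not a measurement. It assumes a hypothetical 7B-class shape (32 layers, 32 KV heads, head dimension 128, no grouped-query attention) and assumes that "Q8" and "Q4" correspond to llama.cpp's `q8_0` (34 bytes per 32 elements) and `q4_0` (18 bytes per 32 elements) block formats. Models that use grouped-query attention have fewer KV heads and need proportionally less.

```rust
/// Estimated KV cache size in bytes.
/// `block_bytes` / `block_elems` describe the storage format
/// (for example 2 bytes per 1 element for FP16).
fn kv_cache_bytes(
    n_layers: u64,
    n_kv_heads: u64,
    head_dim: u64,
    ctx: u64,
    block_bytes: u64,
    block_elems: u64,
) -> u64 {
    let elements = 2 * n_layers * n_kv_heads * head_dim * ctx; // 2 = keys + values
    elements * block_bytes / block_elems
}

fn main() {
    // Hypothetical model shape; substitute the values for your model.
    let (layers, kv_heads, head_dim) = (32, 32, 128);
    let mib = |bytes: u64| bytes as f64 / (1024.0 * 1024.0);
    for ctx in [2048u64, 4096, 8192, 16384] {
        println!(
            "ctx {ctx:>5}: f16 {:.0} MiB, q8_0 {:.0} MiB, q4_0 {:.0} MiB",
            mib(kv_cache_bytes(layers, kv_heads, head_dim, ctx, 2, 1)),
            mib(kv_cache_bytes(layers, kv_heads, head_dim, ctx, 34, 32)),
            mib(kv_cache_bytes(layers, kv_heads, head_dim, ctx, 18, 32)),
        );
    }
}
```

The estimate grows linearly with context size, which is the relationship the table is designed to confirm.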

GPU Utilization¶

Layer Offloading Performance Curve¶

GPU acceleration in Mullama (and llama.cpp generally) works by offloading transformer layers to the GPU. Performance scales with the number of offloaded layers:

Tokens/sec vs GPU Layers Offloaded (7B model, RTX 4090):

tok/s
 |
 |                                          ___________
 |                                     ____/
 |                                ____/
 |                           ____/
 |                      ____/
 |                 ____/
 |            ____/
 |       ____/
 |  ____/
 | /
 |/
 +----+----+----+----+----+----+----+----+
 0    5   10   15   20   25   30   35  All
                GPU Layers

Layers Offloaded	Throughput (tok/s)	VRAM Used	Notes
0 (CPU only)	(pending)	0 GB	Baseline
10 (~30%)	(pending)	(pending)	Partial offload
20 (~60%)	(pending)	(pending)	Majority offloaded
30 (~90%)	(pending)	(pending)	Near-full offload
All (100%)	(pending)	(pending)	Maximum performance

Optimal Layer Count

The optimal number of layers to offload depends on your available VRAM. Mullama's --gpu-layers flag or Modelfile GPU_LAYERS directive lets you tune this. Use mullama show <model> to see the total layer count for a model.
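
A rough starting value for the layer count can be derived from the model file size and free VRAM. This is a sketch under loud assumptions: layers are treated as equally sized, a fixed reserve is set aside for the KV cache and scratch buffers, and the 32-layer count and 2 GB reserve are hypothetical. `layers_that_fit` is not a Mullama API.

```rust
/// Rough count of layers that fit in VRAM, assuming equally sized layers.
/// `model_bytes` and `total_layers` must be non-zero.
fn layers_that_fit(
    model_bytes: u64,
    total_layers: u64,
    vram_bytes: u64,
    reserve_bytes: u64,
) -> u64 {
    let per_layer = model_bytes / total_layers;
    let usable = vram_bytes.saturating_sub(reserve_bytes);
    (usable / per_layer).min(total_layers)
}

fn main() {
    const GB: u64 = 1_000_000_000;
    let model_bytes = 4_400_000_000; // ~4.4 GB file, as listed for the 7B model
    let total_layers = 32; // hypothetical; check with `mullama show <model>`
    for vram_gb in [4u64, 8, 24] {
        let n = layers_that_fit(model_bytes, total_layers, vram_gb * GB, 2 * GB);
        println!("{vram_gb:>2} GB VRAM: try offloading {n} layers");
    }
}
```

Treat the result as a first guess for `--gpu-layers` and adjust it based on observed VRAM use.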

Multi-GPU Scaling¶

For models that exceed single-GPU VRAM, Mullama supports layer splitting across GPUs:

Configuration	7B Model	13B Model	Notes
Single GPU (24GB)	(pending)	(pending)	RTX 4090
Dual GPU (2x24GB)	(pending)	(pending)	NVLink not required

Batch Processing Throughput¶

Rayon Parallel Processing¶

Mullama uses Rayon for CPU-parallel batch operations. The measurements below cover multiple independent prompts processed simultaneously:

Batch Size	Sequential (1 thread)	Rayon (8 threads)	Rayon (16 threads)	Rayon (32 threads)
10 prompts	(pending)	(pending)	(pending)	(pending)
50 prompts	(pending)	(pending)	(pending)	(pending)
100 prompts	(pending)	(pending)	(pending)	(pending)
500 prompts	(pending)	(pending)	(pending)	(pending)

Scaling Efficiency¶

Speedup vs Thread Count (100 prompts, Llama 3.2 1B, CPU):

Speedup
  |
8 |                              *
  |                         *
6 |                    *
  |               *
4 |          *
  |     *
2 | *
  |*
1 +--+--+--+--+--+--+--+--+
  1  2  4  6  8  12 16 32
           Threads

* = Measured    --- = Linear (ideal)

Sub-Linear Scaling

Batch processing shows sub-linear scaling due to memory bandwidth limitations and cache contention. The optimal thread count depends on the model size and available memory bandwidth. For most configurations, 8-16 threads provide the best efficiency.
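
Scaling efficiency can be quantified as speedup divided by thread count. The sketch below uses hypothetical wall-clock times, not measured data, to show how efficiency falls as threads are added; `scaling` is a helper introduced only for this illustration.

```rust
/// Speedup relative to the single-thread time, and efficiency per thread.
fn scaling(t1_s: f64, tn_s: f64, threads: u32) -> (f64, f64) {
    let speedup = t1_s / tn_s;
    (speedup, speedup / threads as f64)
}

fn main() {
    // Hypothetical wall-clock times (seconds) for one batch; not measured data.
    let runs = [(1u32, 80.0), (8, 16.0), (16, 11.0), (32, 10.0)];
    let t1 = runs[0].1;
    for (threads, secs) in runs {
        let (speedup, eff) = scaling(t1, secs, threads);
        println!(
            "{threads:>2} threads: {speedup:.2}x speedup, {:.0}% efficiency",
            eff * 100.0
        );
    }
}
```

An efficiency near 100% indicates linear scaling. A steep drop between two thread counts suggests the lower count is the better operating point.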

Embedding Generation¶

Embeddings per Second¶

Measured generating embeddings for sentences of varying length:

Input Length	Native Binding	Daemon (HTTP)	Speedup (overhead)	Notes
16 tokens	(pending)	(pending)	(pending)	Short phrases
64 tokens	(pending)	(pending)	(pending)	Sentences
256 tokens	(pending)	(pending)	(pending)	Paragraphs
512 tokens	(pending)	(pending)	(pending)	Documents

Batch Embedding with Parallel Processing¶

Documents	Sequential	Rayon (8 threads)	Throughput
100	(pending)	(pending)	(pending) embeddings/sec
1,000	(pending)	(pending)	(pending) embeddings/sec
10,000	(pending)	(pending)	(pending) embeddings/sec

ColBERT Scoring Performance¶

Late interaction (MaxSim) scoring throughput:

Corpus Size	Sequential	Parallel (Rayon)	Queries/sec
1,000 docs	(pending)	(pending)	(pending)
10,000 docs	(pending)	(pending)	(pending)
100,000 docs	(pending)	(pending)	(pending)
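
For reference, MaxSim scores a document by taking, for each query token embedding, the maximum similarity against all document token embeddings, then summing those maxima. The sketch below is a plain sequential version using dot-product similarity. It illustrates the scoring rule only and is not Mullama's implementation; `dot` and `maxsim` are names introduced here.

```rust
/// Dot product of two equal-length vectors.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// MaxSim: for each query token embedding take the best-matching document
/// token embedding, then sum those maxima.
fn maxsim(query: &[Vec<f32>], doc: &[Vec<f32>]) -> f32 {
    query
        .iter()
        .map(|q| {
            doc.iter()
                .map(|d| dot(q, d))
                .fold(f32::NEG_INFINITY, f32::max)
        })
        .sum()
}

fn main() {
    // Toy 2-dimensional embeddings, chosen for easy hand-checking.
    let query = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    let doc = vec![vec![1.0, 0.0], vec![0.5, 0.5]];
    println!("MaxSim score: {}", maxsim(&query, &doc));
}
```

Each document is scored independently of the others, which is what makes corpus-level scoring a natural fit for data-parallel execution.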

Streaming Performance¶

Token Delivery Latency¶

Time between consecutive tokens reaching the application layer:

Mode	Median Inter-Token	p95 Inter-Token	Jitter	Notes
Native callback	(pending)	(pending)	(pending)	Direct callback
Native channel	(pending)	(pending)	(pending)	Tokio mpsc
WebSocket	(pending)	(pending)	(pending)	Daemon mode
SSE (HTTP)	(pending)	(pending)	(pending)	Daemon mode

Streaming Consistency

Lower jitter means more consistent token delivery, which translates to smoother text rendering in user interfaces. Native callbacks and channels provide the most consistent delivery due to fewer intermediary layers.
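
Inter-token statistics can be derived from per-token arrival timestamps. The sketch below assumes jitter is defined as p95 minus median, which the source does not specify, and uses hypothetical arrival offsets. `inter_token_gaps` and `summarize` are names introduced for this illustration.

```rust
use std::time::Duration;

/// Gaps between consecutive token arrival times (offsets from request start).
/// Arrival times must be non-decreasing.
fn inter_token_gaps(arrivals: &[Duration]) -> Vec<Duration> {
    arrivals.windows(2).map(|w| w[1] - w[0]).collect()
}

/// Returns (median, p95, jitter) using nearest-rank percentiles, where
/// jitter is taken as p95 minus median. `gaps` must be non-empty.
fn summarize(mut gaps: Vec<Duration>) -> (Duration, Duration, Duration) {
    gaps.sort();
    let at = |p: usize| gaps[((p * gaps.len() + 99) / 100).max(1) - 1];
    let (median, p95) = (at(50), at(95));
    (median, p95, p95 - median)
}

fn main() {
    // Hypothetical arrival offsets in milliseconds; not measured data.
    let arrivals: Vec<Duration> = [0u64, 10, 21, 30, 45]
        .into_iter()
        .map(Duration::from_millis)
        .collect();
    let (median, p95, jitter) = summarize(inter_token_gaps(&arrivals));
    println!("median={median:?} p95={p95:?} jitter={jitter:?}");
}
```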

Reproducing Benchmarks¶

Running the Benchmark Suite¶

# Clone the repository
git clone https://github.com/cognisoc/mullama.git
cd mullama
git submodule update --init --recursive

# Build with benchmark support
cargo build --release --features "full"

# Run the full benchmark suite
cargo bench

# Run specific benchmark groups
cargo bench -- throughput
cargo bench -- latency
cargo bench -- overhead
cargo bench -- embedding
cargo bench -- batch

Individual Benchmark Scripts¶

Throughput Benchmark

# Measure tokens/second for a specific model
cargo run --release --features "full" --example bench_throughput -- \
  --model path/to/model.gguf \
  --prompt-tokens 128 \
  --generate-tokens 256 \
  --iterations 100 \
  --gpu-layers -1 \
  --warmup 10

Overhead Benchmark

# Measure binding call overhead (tokenization round-trip)
cargo run --release --features "full" --example bench_overhead -- \
  --model path/to/model.gguf \
  --iterations 10000 \
  --operation tokenize

Embedding Benchmark

# Measure embedding generation throughput
cargo run --release --features "full" --example bench_embeddings -- \
  --model path/to/embedding-model.gguf \
  --corpus path/to/corpus.txt \
  --batch-size 32 \
  --threads 8

Batch Benchmark

# Measure parallel batch processing
cargo run --release --features "full" --example bench_batch -- \
  --model path/to/model.gguf \
  --prompts path/to/prompts.json \
  --threads 1,2,4,8,16,32 \
  --iterations 10

Environment Setup for Reproducible Results¶

# Linux: Set CPU governor to performance
sudo cpupower frequency-set -g performance

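# Linux: Disable turbo boost for consistent results
# (this path applies to the acpi-cpufreq driver; with intel_pstate, write 1 to
# /sys/devices/system/cpu/intel_pstate/no_turbo instead)
echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost

# Linux: Enable persistence mode and lock GPU clocks (NVIDIA)
# (2520 MHz is the RTX 4090 boost clock; substitute the value for your GPU)
sudo nvidia-smi -pm 1
sudo nvidia-smi --lock-gpu-clocks=2520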

# Verify configuration
mullama --version
nvidia-smi  # GPU info
lscpu       # CPU info
free -h     # Memory info

Reporting Your Results¶

We welcome community benchmark contributions. To submit results:

Run benchmarks with the standard configuration above
Include full hardware specifications
Report OS version and kernel
Include Mullama version and llama.cpp backend version
Submit as a GitHub issue with the benchmark label

Benchmark Integrity

All published benchmarks will include full reproduction instructions, raw data, and statistical analysis scripts. We believe in transparent, reproducible performance claims.