Feature Matrix

This page provides a comprehensive view of Mullama's feature status across all dimensions: implementation maturity, platform support, language binding parity, GPU backends, and quantization formats.

Current Version

Mullama v0.3.0 | Built on llama.cpp b7542 | Last updated: January 2026

Feature Status Overview

Core Features

Feature Status Version Notes
Model loading (GGUF) Stable v0.1.0 All GGUF versions supported
Tokenization Stable v0.1.0 SPM, BPE, WPM, UGM, RWKV, PLaMo-2
Text generation Stable v0.1.0 Full pipeline with callbacks
Context management Stable v0.1.0 Create, evaluate, reset
KV cache operations Stable v0.1.0 Save, load, clear, defragment
Session persistence Stable v0.1.0 State save/restore to file
Batch processing Stable v0.1.0 Multi-sequence evaluation
Memory management Stable v0.1.0 RAII, mmap, custom allocators
Model metadata access Stable v0.1.0 All GGUF metadata fields
Automatic batch chunking Stable v0.1.0 Long prompt handling
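
The "Automatic batch chunking" row refers to splitting a prompt that exceeds the batch size into successive batches. The following is a minimal sketch of that idea only. The `chunk_prompt` function, the `n_batch` parameter, and the `i32` token type are illustrative assumptions, not Mullama's API.

```rust
/// Split a token sequence into batches of at most `n_batch` tokens,
/// pairing each batch with the position of its first token.
/// `n_batch` and the `i32` token type are assumptions for illustration.
fn chunk_prompt(tokens: &[i32], n_batch: usize) -> Vec<(usize, &[i32])> {
    assert!(n_batch > 0, "n_batch must be positive");
    tokens
        .chunks(n_batch) // the last chunk may be shorter than n_batch
        .enumerate()
        .map(|(i, chunk)| (i * n_batch, chunk))
        .collect()
}
```

Each `(start, chunk)` pair would then be evaluated in order, so that later batches see the context produced by earlier ones.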

Streaming

Feature Status Version Notes
Token-by-token callbacks Stable v0.1.0 Synchronous callbacks
Channel-based streaming Stable v0.1.0 Tokio mpsc channels
Backpressure handling Stable v0.1.0 Configurable buffer sizes
Stream cancellation Stable v0.1.0 Graceful stop
SSE (Server-Sent Events) Stable v0.1.0 OpenAI-compatible format
WebSocket streaming Stable v0.1.0 Bidirectional
Streaming configuration Stable v0.1.0 StreamConfig builder
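
Channel-based streaming with backpressure and cancellation can be illustrated with a bounded channel. The table lists Tokio mpsc channels; to stay dependency-free, this sketch swaps in the standard library's `std::sync::mpsc::sync_channel`, which shows the same blocking-when-full behaviour. The `stream_tokens` function and its parameters are hypothetical and are not part of Mullama.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Stream `tokens` through a bounded channel holding at most `buffer` items.
/// The producer blocks when the buffer is full (backpressure), and stops
/// as soon as the receiver is dropped (cancellation).
fn stream_tokens(tokens: Vec<String>, buffer: usize, take: usize) -> Vec<String> {
    let (tx, rx) = sync_channel::<String>(buffer);
    let producer = thread::spawn(move || {
        for tok in tokens {
            // `send` returns Err once the receiver is gone: stop generating.
            if tx.send(tok).is_err() {
                break;
            }
        }
    });
    // Consume at most `take` tokens, then drop the receiver to cancel.
    let received: Vec<String> = rx.iter().take(take).collect();
    drop(rx);
    producer.join().expect("producer thread panicked");
    received
}
```

A small `buffer` keeps the producer close to the consumer's pace. A larger one smooths out bursts at the cost of memory and of tokens generated after the consumer has lost interest.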

Async Support

Feature Status Version Notes
Tokio runtime integration Stable v0.1.0 Multi-threaded runtime
Async model operations Stable v0.1.0 Non-blocking load/unload
Async generation Stable v0.1.0 Spawn on blocking pool
Async streaming Stable v0.1.0 Stream trait implementation
Task cancellation Stable v0.1.0 CancellationToken support

Sampling Strategies

Feature Status Version Notes
Temperature Stable v0.1.0 Range: 0.0 - 2.0
Top-K Stable v0.1.0 Token count filtering
Top-P (Nucleus) Stable v0.1.0 Cumulative probability
Min-P Stable v0.1.0 Minimum probability threshold
Typical-P Stable v0.1.0 Information-theoretic
Mirostat v1 Stable v0.1.0 Perplexity-controlled
Mirostat v2 Stable v0.1.0 Improved perplexity control
Tail-Free Sampling Stable v0.1.0 Second derivative filtering
Frequency penalty Stable v0.1.0 Token frequency reduction
Presence penalty Stable v0.1.0 Token presence penalty
Repeat penalty Stable v0.1.0 N-gram repetition control
Logit bias Stable v0.1.0 Per-token logit adjustment
DRY sampling Stable v0.1.0 Diversity-promoting
Sampler chains Stable v0.1.0 Composable pipelines
Custom stopping criteria Stable v0.1.0 User-defined stop conditions
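
To make the "Sampler chains" row concrete, here is a sketch of how temperature, Top-K, and Top-P stages might compose into one pipeline. Every name in it (`Candidate`, `Stage`, `run_chain`, and the stage constructors) is invented for illustration. It shows the general technique, not Mullama's sampler API, and it stops before the final random draw.

```rust
/// One candidate token with its current logit.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    id: usize,
    logit: f32,
}

/// A sampler stage transforms the candidate list in place.
type Stage = Box<dyn Fn(&mut Vec<Candidate>)>;

/// Divide every logit by the temperature (assumed to be > 0 here).
fn temperature(t: f32) -> Stage {
    Box::new(move |c: &mut Vec<Candidate>| c.iter_mut().for_each(|x| x.logit /= t))
}

/// Keep only the `k` highest-logit candidates (at least one).
fn top_k(k: usize) -> Stage {
    Box::new(move |c: &mut Vec<Candidate>| {
        c.sort_by(|a, b| b.logit.total_cmp(&a.logit));
        c.truncate(k.max(1));
    })
}

/// Numerically stable softmax over the candidates' logits.
fn softmax(c: &[Candidate]) -> Vec<f32> {
    let max = c.iter().map(|x| x.logit).fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = c.iter().map(|x| (x.logit - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Keep the shortest high-probability prefix whose cumulative mass reaches `p`.
fn top_p(p: f32) -> Stage {
    Box::new(move |c: &mut Vec<Candidate>| {
        c.sort_by(|a, b| b.logit.total_cmp(&a.logit));
        let probs = softmax(c);
        let mut cum = 0.0f32;
        let mut keep = c.len();
        for (i, pr) in probs.iter().enumerate() {
            cum += *pr;
            if cum >= p {
                keep = i + 1;
                break;
            }
        }
        c.truncate(keep);
    })
}

/// Run the stages in order and return the surviving candidates.
fn run_chain(stages: &[Stage], mut c: Vec<Candidate>) -> Vec<Candidate> {
    for stage in stages {
        stage(&mut c);
    }
    c
}
```

Because each stage only sees the output of the previous one, the order matters: applying Top-K before Top-P means the cumulative probability is computed over the already-truncated set.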

Web & API

Feature Status Version Notes
Axum HTTP server Stable v0.1.0 High-performance async
OpenAI API (/v1/chat/completions) Stable v0.1.0 Full compatibility
OpenAI API (/v1/completions) Stable v0.1.0 Text completions
OpenAI API (/v1/embeddings) Stable v0.1.0 Embedding generation
OpenAI API (/v1/models) Stable v0.1.0 Model listing
Anthropic API (/v1/messages) Stable v0.3.0 Messages format
WebSocket support Stable v0.1.0 Real-time bidirectional
Prometheus metrics Stable v0.3.0 /metrics endpoint
Health check Stable v0.1.0 /health endpoint
CORS configuration Stable v0.1.0 Configurable origins

Multimodal

Feature Status Version Notes
MultimodalProcessor Stable v0.1.0 Unified pipeline
Image input (JPEG) Stable v0.1.0 Via image crate
Image input (PNG) Stable v0.1.0 Via image crate
Image input (WebP) Stable v0.1.0 Via image crate
Image format conversion Stable v0.1.0 Between supported formats
Vision projector support Stable v0.1.0 mmproj models
Text + image combined Stable v0.1.0 Interleaved processing

Audio

Feature Status Version Notes
StreamingAudioProcessor Stable v0.1.0 Real-time capture
Voice Activity Detection Stable v0.1.0 Energy + zero-crossing
Noise reduction Stable v0.1.0 Spectral subtraction
Ring buffer architecture Stable v0.1.0 Low-latency processing
WAV format Stable v0.1.0 Read/write
MP3 format Stable v0.1.0 Read (decode)
FLAC format Stable v0.1.0 Read/write
Opus format Stable v0.1.0 Read (decode)
Audio format conversion Stable v0.1.0 Between supported formats
AudioStreamConfig builder Stable v0.1.0 Configurable pipeline
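
The Voice Activity Detection row names two signals, frame energy and zero-crossing rate. A minimal sketch of how the two can be combined is shown below. The function names and both thresholds are assumptions chosen for illustration, not Mullama's implementation or defaults.

```rust
/// Mean squared amplitude of a frame of samples in [-1.0, 1.0].
fn frame_energy(frame: &[f32]) -> f32 {
    if frame.is_empty() {
        return 0.0;
    }
    frame.iter().map(|s| s * s).sum::<f32>() / frame.len() as f32
}

/// Fraction of adjacent sample pairs whose signs differ.
fn zero_crossing_rate(frame: &[f32]) -> f32 {
    if frame.len() < 2 {
        return 0.0;
    }
    let crossings = frame
        .windows(2)
        .filter(|w| (w[0] >= 0.0) != (w[1] >= 0.0))
        .count();
    crossings as f32 / (frame.len() - 1) as f32
}

/// Treat a frame as speech when it is loud enough (energy) and not
/// noise-like (low zero-crossing rate). Thresholds are hypothetical.
fn is_speech(frame: &[f32], energy_min: f32, zcr_max: f32) -> bool {
    frame_energy(frame) >= energy_min && zero_crossing_rate(frame) <= zcr_max
}
```

Energy alone misfires on loud broadband noise, and zero-crossing rate alone misfires on quiet hum, which is why detectors of this kind typically require both conditions.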

Embeddings & Retrieval

Feature Status Version Notes
Single embedding generation Stable v0.1.0 Per-text embedding
Batch embedding Stable v0.1.0 Multiple texts
Cosine similarity Stable v0.1.0 Standard metric
Dot product similarity Stable v0.1.0 Standard metric
ColBERT multi-vector Stable v0.3.0 Per-token embeddings
MaxSim scoring Stable v0.3.0 Late interaction
Top-k retrieval Stable v0.3.0 Parallel ranking
Token-level analysis Stable v0.3.0 Token similarity maps
Normalized scoring Stable v0.3.0 Unit-normalized vectors
Symmetric scoring Stable v0.3.0 Bidirectional similarity
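
The ColBERT rows rest on late-interaction scoring: each query token embedding is matched against its most similar document token embedding, and those per-token maxima are summed (MaxSim). The sketch below illustrates cosine similarity, MaxSim, and a simple top-k ranking. All function names are hypothetical, and the ranking here is sequential, whereas the table describes Mullama's ranking as parallel.

```rust
/// Dot product of two equal-length vectors.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Cosine similarity; returns 0.0 if either vector has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let denom = dot(a, a).sqrt() * dot(b, b).sqrt();
    if denom == 0.0 { 0.0 } else { dot(a, b) / denom }
}

/// MaxSim: for each query token, take its best-matching document token,
/// then sum those maxima over all query tokens.
fn max_sim(query: &[Vec<f32>], doc: &[Vec<f32>]) -> f32 {
    if doc.is_empty() {
        return 0.0;
    }
    query
        .iter()
        .map(|q| {
            doc.iter()
                .map(|d| cosine(q, d))
                .fold(f32::NEG_INFINITY, f32::max)
        })
        .sum()
}

/// Rank documents by MaxSim and return the `k` best as (index, score).
fn rank_top_k(query: &[Vec<f32>], docs: &[Vec<Vec<f32>>], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = docs
        .iter()
        .enumerate()
        .map(|(i, d)| (i, max_sim(query, d)))
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```

Dividing the MaxSim sum by the number of query tokens gives a score in [-1, 1], which is one plausible reading of the "Normalized scoring" row.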

Daemon & CLI

Feature Status Version Notes
mullama run Stable v0.1.0 One-shot generation
mullama serve Stable v0.1.0 Start HTTP server
mullama pull Stable v0.3.0 Download from registry
mullama list Stable v0.3.0 List available models
mullama show Stable v0.3.0 Model details
mullama create Stable v0.3.0 Create from Modelfile
mullama rm Stable v0.3.0 Remove model
mullama cp Stable v0.3.0 Copy/alias model
mullama ps Stable v0.3.0 Show running models
mullama chat Stable v0.3.0 TUI chat interface
mullama daemon start/stop/status Stable v0.3.0 Lifecycle management
mullama daemon logs Stable v0.3.0 Log viewing
Auto-spawn Stable v0.3.0 Daemon on demand
Model aliases (40+) Stable v0.3.0 Pre-configured
Modelfile support Stable v0.3.0 Ollama-compatible + extensions
Embedded Web UI Stable v0.3.0 Vue.js frontend

Model Adaptation

Feature Status Version Notes
LoRA adapter loading Stable v0.1.0 Single adapter
Multiple LoRA adapters Stable v0.1.0 Simultaneous
Dynamic LoRA scale Stable v0.1.0 Runtime weight adjustment
LoRA metadata access Stable v0.1.0 Parameter info
Control vectors (basic) Beta v0.3.0 Data structures + FFI
Control vectors (API) Beta -- High-level wrapper
Control vector loading Beta -- File format support

Structured Output

Feature Status Version Notes
JSON mode Stable v0.1.0 Force JSON output
GBNF grammar parsing Stable v0.1.0 Full parser
Grammar-constrained sampling Stable v0.1.0 Token-level constraints
Simple pattern grammars Stable v0.1.0 Regex-like patterns
Complex grammar validation Beta -- Edge case testing
Grammar composition Planned v0.2.0 Combine grammars
Dynamic grammar modification Planned v0.2.0 Runtime changes

Advanced Features

Feature Status Version Notes
Parallel processing (Rayon) Stable v0.1.0 Work-stealing scheduler
SIMD optimizations Stable v0.1.0 AVX2, AVX-512, NEON
Flash Attention Stable v0.1.0 Auto/enabled/disabled
Memory-mapped I/O Stable v0.1.0 Efficient model loading
Multi-GPU layer splitting Stable v0.1.0 Across devices
Performance timing Stable v0.1.0 Per-operation metrics
Speculative decoding Planned v0.2.0 Draft model acceleration
Runtime quantization Planned v0.2.0 Dynamic precision
Distributed inference Planned v0.3.0 Multi-node

Platform Support Matrix

Core Library

Feature Linux (x86_64) Linux (ARM64) macOS (Apple Silicon) macOS (Intel) Windows (x86_64)
Model loading
Text generation
Streaming
Async (Tokio)
Embeddings
ColBERT
LoRA adapters
Grammar constraints
Batch processing

Multimodal & Audio

Feature Linux (x86_64) Linux (ARM64) macOS (Apple Silicon) macOS (Intel) Windows (x86_64)
Image processing
Audio capture
Voice activity detection
Audio format conversion
Streaming audio

Web & Daemon

Feature Linux (x86_64) Linux (ARM64) macOS (Apple Silicon) macOS (Intel) Windows (x86_64)
HTTP server
WebSocket
Daemon mode
Web UI
TUI chat
Prometheus metrics
Auto-spawn daemon

Performance Optimizations

Feature Linux (x86_64) Linux (ARM64) macOS (Apple Silicon) macOS (Intel) Windows (x86_64)
AVX2 -- --
AVX-512 -- --
ARM NEON -- -- --
CUDA -- --
Metal -- -- --
ROCm -- -- --
OpenCL
Rayon parallelism
Memory mapping

Language Binding Feature Parity

This table shows which features are available in each language binding. The Rust core always has full feature access.

Feature Rust Node.js Python Go PHP C/C++
Model loading
Text generation
Tokenization
Streaming
Async generation
Embeddings
Batch embedding
ColBERT scoring
Sampling config
Grammar constraints
LoRA adapters
Image input
Audio input
Session save/load
GPU configuration
Performance metrics
Control vectors

Legend: Full support | Partial/limited | In development | Planned

Binding Technology Details

Language | Technology | Thread Safety | Async Model | Package
Rust | Native | Send+Sync | Tokio | mullama (crate)
Node.js | NAPI-RS | ThreadsafeFunction | Promises/async-await | mullama (npm)
Python | PyO3 | GIL-aware | asyncio compatible | mullama (pip)
Go | cgo | goroutine-safe | Goroutines | mullama-go (module)
PHP | FFI | Single-thread | N/A | mullama-php
C/C++ | Direct FFI | User-managed | User-managed | libmullama.h

GPU Backend Support

Backend Availability

Backend Linux macOS Windows Min Driver Notes
CUDA -- 525.60+ NVIDIA GPUs
Metal -- -- macOS 13+ Apple Silicon + Intel
ROCm -- 5.5+ AMD GPUs
OpenCL 1.2+ Cross-vendor
Vulkan -- Planned

GPU Feature Support by Backend

Feature CUDA Metal ROCm OpenCL
Full layer offloading
Partial layer offloading
Multi-GPU --
Flash Attention
FP16 compute
BF16 compute
INT8 tensor cores -- --

Environment Variables for GPU

# Enable specific backends (set before building)
export LLAMA_CUDA=1        # NVIDIA CUDA
export LLAMA_METAL=1       # Apple Metal
export LLAMA_HIPBLAS=1     # AMD ROCm (HIP)
export LLAMA_CLBLAST=1     # OpenCL (CLBlast)

# CUDA-specific
export CUDA_VISIBLE_DEVICES=0,1    # Select GPUs
export LLAMA_CUDA_F16=1            # Force FP16

# Metal-specific (auto-detected on macOS)
# No additional configuration needed

# ROCm-specific
export HIP_VISIBLE_DEVICES=0,1     # Select GPUs

Quantization Format Support

Mullama supports all GGUF quantization types provided by llama.cpp:

Integer Quantization

Format Bits/Weight Quality Speed Memory (7B) Notes
Q2_K 2.63 Low Fastest ~2.6 GB Extreme compression
Q3_K_S 3.44 Low-Med Fast ~3.0 GB Small K-quant
Q3_K_M 3.91 Medium Fast ~3.3 GB Medium K-quant
Q3_K_L 4.27 Med-High Fast ~3.6 GB Large K-quant
Q4_0 4.50 Medium Fast ~3.8 GB Legacy, uniform
Q4_1 5.00 Medium Fast ~4.2 GB Legacy, with offset
Q4_K_S 4.58 Med-High Fast ~3.9 GB Small K-quant
Q4_K_M 4.85 High Balanced ~4.1 GB Recommended default
Q5_0 5.50 High Balanced ~4.6 GB Legacy, uniform
Q5_1 6.00 High Balanced ~5.0 GB Legacy, with offset
Q5_K_S 5.54 High Balanced ~4.7 GB Small K-quant
Q5_K_M 5.69 Very High Balanced ~4.8 GB Medium K-quant
Q6_K 6.56 Very High Slower ~5.5 GB Highest integer quality
Q8_0 8.50 Near-FP16 Slower ~7.1 GB High quality

Floating Point Formats

Format Bits/Weight Quality Speed Memory (7B) Notes
F16 16.00 Reference Slow ~13.4 GB Half precision
F32 32.00 Maximum Slowest ~26.8 GB Full precision
BF16 16.00 Reference Slow ~13.4 GB Brain floating point

I-Quant Formats

Format Bits/Weight Quality Speed Memory (7B) Notes
IQ1_S 1.56 Very Low Fastest ~1.6 GB Extreme compression
IQ1_M 1.75 Low Fastest ~1.8 GB Improved 1-bit
IQ2_XXS 2.06 Low Very Fast ~2.0 GB Ultra-small
IQ2_XS 2.31 Low Very Fast ~2.2 GB Extra-small
IQ2_S 2.50 Low-Med Very Fast ~2.4 GB Small
IQ2_M 2.70 Medium Fast ~2.5 GB Medium
IQ3_XXS 3.06 Medium Fast ~2.8 GB Ultra-small 3-bit
IQ3_XS 3.30 Medium Fast ~3.0 GB Extra-small 3-bit
IQ3_S 3.44 Med-High Fast ~3.2 GB Small 3-bit
IQ3_M 3.70 Med-High Fast ~3.4 GB Medium 3-bit
IQ4_NL 4.50 High Balanced ~3.8 GB Non-linear 4-bit
IQ4_XS 4.25 High Balanced ~3.6 GB Extra-small 4-bit

Choosing a Quantization

For most use cases, Q4_K_M (4.85 bits per weight, about 4.1 GB for a 7B model) offers the best balance of quality, speed, and memory usage. Use Q5_K_M or Q6_K when output quality matters more than memory. Use Q2_K or the IQ2 variants for severely memory-constrained environments such as mobile and edge devices.
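
A rough way to reason about the "Memory (7B)" column is parameters × bits per weight ÷ 8. The sketch below applies that formula and picks the largest format that fits a memory budget. It assumes about 6.74 billion parameters for a "7B" model and decimal gigabytes, under which it roughly reproduces rows such as Q4_K_M and Q8_0. Other rows, notably the lowest-bit formats, differ from the formula, and the estimate ignores the KV cache and activations. `weights_gb` and `pick_format` are hypothetical helpers.

```rust
/// (format, bits per weight) copied from the integer quantization table,
/// ordered from smallest to largest.
const FORMATS: &[(&str, f64)] = &[
    ("Q2_K", 2.63),
    ("Q3_K_M", 3.91),
    ("Q4_K_M", 4.85),
    ("Q5_K_M", 5.69),
    ("Q6_K", 6.56),
    ("Q8_0", 8.50),
];

/// Approximate weight storage in decimal gigabytes: params * bits / 8.
/// Excludes the KV cache, activations, and file metadata.
fn weights_gb(n_params: f64, bits_per_weight: f64) -> f64 {
    n_params * bits_per_weight / 8.0 / 1e9
}

/// Pick the largest listed format whose weights fit within `budget_gb`.
fn pick_format(n_params: f64, budget_gb: f64) -> Option<&'static str> {
    FORMATS
        .iter()
        .rev() // try the largest formats first
        .find(|(_, bpw)| weights_gb(n_params, *bpw) <= budget_gb)
        .map(|(name, _)| *name)
}
```

With these assumptions, a 4.5 GB budget for the weights selects Q4_K_M, because the Q5_K_M estimate of about 4.8 GB no longer fits.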

Quantization Selection Guide

Quality vs Memory Tradeoff:

Quality
  |
  |               F16/F32 ●
  |
  |          Q8_0 ●
  |       Q6_K ●
  |     Q5_K_M ●
  |    Q5_K_S ●
  |   Q4_K_M ●  ← Recommended
  |  Q4_K_S ●
  | Q3_K_M ●
  |Q3_K_S ●
  |Q2_K ●
  +──────────────────────── Memory Usage
  Low                      High

Feature Flags (Cargo)

Mullama uses Cargo feature flags to control compilation. Features can be combined:

Feature | Description | Dependencies
async | Tokio async runtime support | tokio
streaming | Real-time token streaming | --
multimodal | Image processing pipeline | image crate
streaming-audio | Real-time audio capture + VAD | multimodal, audio libs
format-conversion | Audio/image format conversion | multimodal
web | Axum HTTP server framework | async, axum
websockets | WebSocket support | web, tokio-tungstenite
parallel | Rayon batch processing | rayon
daemon | CLI daemon mode | web, websockets
embedded-ui | Embedded Vue.js Web UI | daemon
full | All features enabled | all of the above

Feature Dependency Chain

full
├── embedded-ui
│   └── daemon
│       ├── web
│       │   └── async
│       └── websockets
│           └── web
├── streaming-audio
│   └── multimodal
├── format-conversion
│   └── multimodal
├── parallel
└── streaming
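
Because features depend on other features, enabling one flag pulls in others transitively. The sketch below computes that closure from the Dependencies column of the feature table, leaving out external crates such as tokio and axum. It only illustrates the relationship. Cargo performs this resolution itself, and `direct_deps` and `resolve` are hypothetical helpers rather than part of Mullama.

```rust
use std::collections::BTreeSet;

/// Direct feature-to-feature dependencies, transcribed from the
/// Dependencies column of the feature table (external crates omitted).
fn direct_deps(feature: &str) -> &'static [&'static str] {
    match feature {
        "streaming-audio" | "format-conversion" => &["multimodal"],
        "web" => &["async"],
        "websockets" => &["web"],
        "daemon" => &["web", "websockets"],
        "embedded-ui" => &["daemon"],
        _ => &[],
    }
}

/// Return every feature enabled by requesting `feature`, including itself.
fn resolve(feature: &str) -> BTreeSet<String> {
    let mut enabled = BTreeSet::new();
    let mut stack = vec![feature.to_string()];
    while let Some(f) = stack.pop() {
        // `insert` returns false for features already visited.
        if enabled.insert(f.clone()) {
            stack.extend(direct_deps(&f).iter().map(|d| d.to_string()));
        }
    }
    enabled
}
```

Under this reading of the table, requesting `embedded-ui` alone also enables `daemon`, `web`, `websockets`, and `async`.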

Build Examples

# Minimal (core inference only)
cargo build --release --no-default-features

# Library with async and streaming
cargo build --release --features "async,streaming"

# Multimodal with audio
cargo build --release --features "multimodal,streaming-audio"

# Web server
cargo build --release --features "web,websockets"

# Full daemon with Web UI
cargo build --release --features "daemon,embedded-ui"

# Everything
cargo build --release --features full