Feature Matrix
This page summarizes Mullama's feature status along five dimensions: implementation maturity, platform support, language binding parity, GPU backends, and quantization formats.
Current Version
Mullama v0.3.0 | Built on llama.cpp b7542 | Last updated: January 2026
Feature Status Overview
Core Features
| Feature | Status | Version | Notes |
| --- | --- | --- | --- |
| Model loading (GGUF) | Stable | v0.1.0 | All GGUF versions supported |
| Tokenization | Stable | v0.1.0 | SPM, BPE, WPM, UGM, RWKV, PLaMo-2 |
| Text generation | Stable | v0.1.0 | Full pipeline with callbacks |
| Context management | Stable | v0.1.0 | Create, evaluate, reset |
| KV cache operations | Stable | v0.1.0 | Save, load, clear, defragment |
| Session persistence | Stable | v0.1.0 | State save/restore to file |
| Batch processing | Stable | v0.1.0 | Multi-sequence evaluation |
| Memory management | Stable | v0.1.0 | RAII, mmap, custom allocators |
| Model metadata access | Stable | v0.1.0 | All GGUF metadata fields |
| Automatic batch chunking | Stable | v0.1.0 | Long prompt handling |
Streaming
| Feature | Status | Version | Notes |
| --- | --- | --- | --- |
| Token-by-token callbacks | Stable | v0.1.0 | Synchronous callbacks |
| Channel-based streaming | Stable | v0.1.0 | Tokio mpsc channels |
| Backpressure handling | Stable | v0.1.0 | Configurable buffer sizes |
| Stream cancellation | Stable | v0.1.0 | Graceful stop |
| SSE (Server-Sent Events) | Stable | v0.1.0 | OpenAI-compatible format |
| WebSocket streaming | Stable | v0.1.0 | Bidirectional |
| Streaming configuration | Stable | v0.1.0 | StreamConfig builder |
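To illustrate how a bounded channel provides backpressure, here is a minimal sketch using only the Rust standard library. It is not Mullama's streaming API: `std::sync::mpsc::sync_channel` stands in for the Tokio mpsc channels listed above, and the function name `stream_tokens` and its `capacity` parameter are hypothetical.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Streams `tokens` through a bounded channel with `capacity` slots and
/// returns them in the order the consumer received them.
/// (Hypothetical helper for illustration; not part of Mullama.)
fn stream_tokens(tokens: Vec<String>, capacity: usize) -> Vec<String> {
    // Bounded channel: `send` blocks once `capacity` tokens are queued,
    // which slows the producer down to the consumer's pace.
    let (tx, rx) = sync_channel::<String>(capacity);

    let producer = thread::spawn(move || {
        for token in tokens {
            // A send error means the receiver was dropped (stream cancelled),
            // so the producer stops generating.
            if tx.send(token).is_err() {
                break;
            }
        }
        // `tx` is dropped here, which closes the stream.
    });

    // The consumer drains tokens until the producer closes its end.
    let received: Vec<String> = rx.iter().collect();
    producer.join().expect("producer thread panicked");
    received
}
```

A smaller `capacity` bounds memory use more tightly at the cost of stalling the producer more often; a larger one smooths over bursts from a slow consumer.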
Async Support
| Feature | Status | Version | Notes |
| --- | --- | --- | --- |
| Tokio runtime integration | Stable | v0.1.0 | Multi-threaded runtime |
| Async model operations | Stable | v0.1.0 | Non-blocking load/unload |
| Async generation | Stable | v0.1.0 | Spawn on blocking pool |
| Async streaming | Stable | v0.1.0 | Stream trait implementation |
| Task cancellation | Stable | v0.1.0 | CancellationToken support |
Sampling Strategies
| Feature | Status | Version | Notes |
| --- | --- | --- | --- |
| Temperature | Stable | v0.1.0 | Range: 0.0 - 2.0 |
| Top-K | Stable | v0.1.0 | Token count filtering |
| Top-P (Nucleus) | Stable | v0.1.0 | Cumulative probability |
| Min-P | Stable | v0.1.0 | Minimum probability threshold |
| Typical-P | Stable | v0.1.0 | Information-theoretic |
| Mirostat v1 | Stable | v0.1.0 | Perplexity-controlled |
| Mirostat v2 | Stable | v0.1.0 | Improved perplexity control |
| Tail-Free Sampling | Stable | v0.1.0 | Second derivative filtering |
| Frequency penalty | Stable | v0.1.0 | Token frequency reduction |
| Presence penalty | Stable | v0.1.0 | Token presence penalty |
| Repeat penalty | Stable | v0.1.0 | N-gram repetition control |
| Logit bias | Stable | v0.1.0 | Per-token logit adjustment |
| Dry sampling | Stable | v0.1.0 | Diversity-promoting |
| Sampler chains | Stable | v0.1.0 | Composable pipelines |
| Custom stopping criteria | Stable | v0.1.0 | User-defined stop conditions |
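As a rough illustration of what a composable sampler pipeline does, the sketch below chains temperature scaling, Top-K, and Top-P over raw logits in plain Rust. It is not Mullama's sampler API: the `Candidates` alias and every function name here are assumptions made for the example, and the chain stops at the filtered distribution rather than drawing a random token.

```rust
/// (token id, probability) pairs. Hypothetical type for illustration.
type Candidates = Vec<(usize, f32)>;

/// Softmax over logits divided by `temperature` (must be > 0).
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Candidates {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|l| ((l - max) / temperature).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().enumerate().map(|(id, e)| (id, e / sum)).collect()
}

/// Keeps the `k` most probable candidates (at least one).
fn top_k(mut c: Candidates, k: usize) -> Candidates {
    c.sort_by(|a, b| b.1.total_cmp(&a.1));
    c.truncate(k.max(1));
    c
}

/// Keeps the smallest prefix whose cumulative probability reaches `p`.
fn top_p(mut c: Candidates, p: f32) -> Candidates {
    c.sort_by(|a, b| b.1.total_cmp(&a.1));
    let mut cumulative = 0.0;
    let mut keep = 0;
    for (_, prob) in &c {
        cumulative += prob;
        keep += 1;
        if cumulative >= p {
            break;
        }
    }
    c.truncate(keep);
    c
}

/// Rescales the remaining probabilities so they sum to 1.
fn renormalize(mut c: Candidates) -> Candidates {
    let sum: f32 = c.iter().map(|(_, p)| p).sum();
    for (_, p) in c.iter_mut() {
        *p /= sum;
    }
    c
}

/// Chain: temperature -> Top-K -> Top-P, renormalizing after each filter.
fn run_chain(logits: &[f32], temperature: f32, k: usize, p: f32) -> Candidates {
    let c = softmax_with_temperature(logits, temperature);
    let c = renormalize(top_k(c, k));
    renormalize(top_p(c, p))
}
```

Each stage takes and returns the same candidate list, which is what makes the stages reorderable; a real sampler would finish by drawing a token from the returned distribution.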
Web & API
| Feature | Status | Version | Notes |
| --- | --- | --- | --- |
| Axum HTTP server | Stable | v0.1.0 | High-performance async |
| OpenAI API (/v1/chat/completions) | Stable | v0.1.0 | Full compatibility |
| OpenAI API (/v1/completions) | Stable | v0.1.0 | Text completions |
| OpenAI API (/v1/embeddings) | Stable | v0.1.0 | Embedding generation |
| OpenAI API (/v1/models) | Stable | v0.1.0 | Model listing |
| Anthropic API (/v1/messages) | Stable | v0.3.0 | Messages format |
| WebSocket support | Stable | v0.1.0 | Real-time bidirectional |
| Prometheus metrics | Stable | v0.3.0 | /metrics endpoint |
| Health check | Stable | v0.1.0 | /health endpoint |
| CORS configuration | Stable | v0.1.0 | Configurable origins |
Multimodal
| Feature | Status | Version | Notes |
| --- | --- | --- | --- |
| MultimodalProcessor | Stable | v0.1.0 | Unified pipeline |
| Image input (JPEG) | Stable | v0.1.0 | Via image crate |
| Image input (PNG) | Stable | v0.1.0 | Via image crate |
| Image input (WebP) | Stable | v0.1.0 | Via image crate |
| Image format conversion | Stable | v0.1.0 | Between supported formats |
| Vision projector support | Stable | v0.1.0 | mmproj models |
| Text + image combined | Stable | v0.1.0 | Interleaved processing |
Audio
| Feature | Status | Version | Notes |
| --- | --- | --- | --- |
| StreamingAudioProcessor | Stable | v0.1.0 | Real-time capture |
| Voice Activity Detection | Stable | v0.1.0 | Energy + zero-crossing |
| Noise reduction | Stable | v0.1.0 | Spectral subtraction |
| Ring buffer architecture | Stable | v0.1.0 | Low-latency processing |
| WAV format | Stable | v0.1.0 | Read/write |
| MP3 format | Stable | v0.1.0 | Read (decode) |
| FLAC format | Stable | v0.1.0 | Read/write |
| Opus format | Stable | v0.1.0 | Read (decode) |
| Audio format conversion | Stable | v0.1.0 | Between supported formats |
| AudioStreamConfig builder | Stable | v0.1.0 | Configurable pipeline |
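The "Energy + zero-crossing" note above refers to a classic voice activity heuristic: speech frames tend to have high energy and a comparatively low zero-crossing rate. The following is a minimal sketch of that idea in plain Rust, not Mullama's implementation. The function names and both thresholds are hypothetical and would need tuning for real audio.

```rust
/// Mean squared amplitude of one audio frame.
fn frame_energy(frame: &[f32]) -> f32 {
    if frame.is_empty() {
        return 0.0;
    }
    frame.iter().map(|s| s * s).sum::<f32>() / frame.len() as f32
}

/// Fraction of adjacent sample pairs whose signs differ.
fn zero_crossing_rate(frame: &[f32]) -> f32 {
    if frame.len() < 2 {
        return 0.0;
    }
    let crossings = frame
        .windows(2)
        .filter(|w| (w[0] >= 0.0) != (w[1] >= 0.0))
        .count();
    crossings as f32 / (frame.len() - 1) as f32
}

/// Flags a frame as speech when it is loud enough and not noise-like.
/// `energy_threshold` and `max_zcr` are assumed tuning parameters.
fn is_speech(frame: &[f32], energy_threshold: f32, max_zcr: f32) -> bool {
    frame_energy(frame) > energy_threshold && zero_crossing_rate(frame) < max_zcr
}
```

Combining the two measures lets the detector reject both silence (low energy) and broadband hiss (high energy but a very high crossing rate).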
Embeddings & Retrieval
| Feature | Status | Version | Notes |
| --- | --- | --- | --- |
| Single embedding generation | Stable | v0.1.0 | Per-text embedding |
| Batch embedding | Stable | v0.1.0 | Multiple texts |
| Cosine similarity | Stable | v0.1.0 | Standard metric |
| Dot product similarity | Stable | v0.1.0 | Standard metric |
| ColBERT multi-vector | Stable | v0.3.0 | Per-token embeddings |
| MaxSim scoring | Stable | v0.3.0 | Late interaction |
| Top-k retrieval | Stable | v0.3.0 | Parallel ranking |
| Token-level analysis | Stable | v0.3.0 | Token similarity maps |
| Normalized scoring | Stable | v0.3.0 | Unit-normalized vectors |
| Symmetric scoring | Stable | v0.3.0 | Bidirectional similarity |
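For readers unfamiliar with late interaction, the sketch below shows cosine similarity and ColBERT-style MaxSim scoring in plain Rust: each query token vector is matched to its most similar document token vector, and the per-token maxima are summed. The function names are assumptions for this example and do not reflect Mullama's actual API.

```rust
/// Dot product of two equal-length vectors.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Cosine similarity; returns 0.0 if either vector has zero length.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let norm = dot(a, a).sqrt() * dot(b, b).sqrt();
    if norm == 0.0 {
        0.0
    } else {
        dot(a, b) / norm
    }
}

/// MaxSim: for each query token vector, take the best cosine match among
/// the document token vectors, then sum those maxima over the query.
fn max_sim(query: &[Vec<f32>], doc: &[Vec<f32>]) -> f32 {
    if doc.is_empty() {
        return 0.0;
    }
    query
        .iter()
        .map(|q| {
            doc.iter()
                .map(|d| cosine_similarity(q, d))
                .fold(f32::NEG_INFINITY, f32::max)
        })
        .sum()
}
```

Note that this score is not symmetric: swapping `query` and `doc` sums over a different set of tokens, which is why a separate bidirectional variant is useful.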
Daemon & CLI
| Feature | Status | Version | Notes |
| --- | --- | --- | --- |
| mullama run | Stable | v0.1.0 | One-shot generation |
| mullama serve | Stable | v0.1.0 | Start HTTP server |
| mullama pull | Stable | v0.3.0 | Download from registry |
| mullama list | Stable | v0.3.0 | List available models |
| mullama show | Stable | v0.3.0 | Model details |
| mullama create | Stable | v0.3.0 | Create from Modelfile |
| mullama rm | Stable | v0.3.0 | Remove model |
| mullama cp | Stable | v0.3.0 | Copy/alias model |
| mullama ps | Stable | v0.3.0 | Show running models |
| mullama chat | Stable | v0.3.0 | TUI chat interface |
| mullama daemon start/stop/status | Stable | v0.3.0 | Lifecycle management |
| mullama daemon logs | Stable | v0.3.0 | Log viewing |
| Auto-spawn | Stable | v0.3.0 | Daemon on demand |
| Model aliases (40+) | Stable | v0.3.0 | Pre-configured |
| Modelfile support | Stable | v0.3.0 | Ollama-compatible + extensions |
| Embedded Web UI | Stable | v0.3.0 | Vue.js frontend |
Model Adaptation
| Feature | Status | Version | Notes |
| --- | --- | --- | --- |
| LoRA adapter loading | Stable | v0.1.0 | Single adapter |
| Multiple LoRA adapters | Stable | v0.1.0 | Simultaneous |
| Dynamic LoRA scale | Stable | v0.1.0 | Runtime weight adjustment |
| LoRA metadata access | Stable | v0.1.0 | Parameter info |
| Control vectors (basic) | Beta | v0.3.0 | Data structures + FFI |
| Control vectors (API) | Beta | -- | High-level wrapper |
| Control vector loading | Beta | -- | File format support |
Structured Output
| Feature | Status | Version | Notes |
| --- | --- | --- | --- |
| JSON mode | Stable | v0.1.0 | Force JSON output |
| GBNF grammar parsing | Stable | v0.1.0 | Full parser |
| Grammar-constrained sampling | Stable | v0.1.0 | Token-level constraints |
| Simple pattern grammars | Stable | v0.1.0 | Regex-like patterns |
| Complex grammar validation | Beta | -- | Edge case testing |
| Grammar composition | Planned | v0.2.0 | Combine grammars |
| Dynamic grammar modification | Planned | v0.2.0 | Runtime changes |
Advanced Features
| Feature | Status | Version | Notes |
| --- | --- | --- | --- |
| Parallel processing (Rayon) | Stable | v0.1.0 | Work-stealing scheduler |
| SIMD optimizations | Stable | v0.1.0 | AVX2, AVX-512, NEON |
| Flash Attention | Stable | v0.1.0 | Auto/enabled/disabled |
| Memory-mapped I/O | Stable | v0.1.0 | Efficient model loading |
| Multi-GPU layer splitting | Stable | v0.1.0 | Across devices |
| Performance timing | Stable | v0.1.0 | Per-operation metrics |
| Speculative decoding | Planned | v0.2.0 | Draft model acceleration |
| Runtime quantization | Planned | v0.2.0 | Dynamic precision |
| Distributed inference | Planned | v0.3.0 | Multi-node |
Core Library
Feature
Linux (x86_64)
Linux (ARM64)
macOS (Apple Silicon)
macOS (Intel)
Windows (x86_64)
Model loading
Text generation
Streaming
Async (Tokio)
Embeddings
ColBERT
LoRA adapters
Grammar constraints
Batch processing
Multimodal & Audio
Feature
Linux (x86_64)
Linux (ARM64)
macOS (Apple Silicon)
macOS (Intel)
Windows (x86_64)
Image processing
Audio capture
Voice activity detection
Audio format conversion
Streaming audio
Web & Daemon
Feature
Linux (x86_64)
Linux (ARM64)
macOS (Apple Silicon)
macOS (Intel)
Windows (x86_64)
HTTP server
WebSocket
Daemon mode
Web UI
TUI chat
Prometheus metrics
Auto-spawn daemon
Feature
Linux (x86_64)
Linux (ARM64)
macOS (Apple Silicon)
macOS (Intel)
Windows (x86_64)
AVX2
--
--
AVX-512
--
--
ARM NEON
--
--
--
CUDA
--
--
Metal
--
--
--
ROCm
--
--
--
OpenCL
Rayon parallelism
Memory mapping
Language Binding Feature Parity
This table shows which features are available in each language binding. The Rust core always has full feature access.
Feature
Rust
Node.js
Python
Go
PHP
C/C++
Model loading
Text generation
Tokenization
Streaming
Async generation
Embeddings
Batch embedding
ColBERT scoring
Sampling config
Grammar constraints
LoRA adapters
Image input
Audio input
Session save/load
GPU configuration
Performance metrics
Control vectors
Legend : Full support | Partial/limited | In development | Planned
Binding Technology Details
| Language | Technology | Thread Safety | Async Model | Package |
| --- | --- | --- | --- | --- |
| Rust | Native | Send+Sync | Tokio | mullama (crate) |
| Node.js | NAPI-RS | ThreadsafeFunction | Promises/async-await | mullama (npm) |
| Python | PyO3 | GIL-aware | asyncio compatible | mullama (pip) |
| Go | cgo | goroutine-safe | Goroutines | mullama-go (module) |
| PHP | FFI | Single-thread | N/A | mullama-php |
| C/C++ | Direct FFI | User-managed | User-managed | libmullama.h |
GPU Backend Support
Backend Availability
Backend
Linux
macOS
Windows
Min Driver
Notes
CUDA
--
525.60+
NVIDIA GPUs
Metal
--
--
macOS 13+
Apple Silicon + Intel
ROCm
--
5.5+
AMD GPUs
OpenCL
1.2+
Cross-vendor
Vulkan
--
Planned
GPU Feature Support by Backend
Feature
CUDA
Metal
ROCm
OpenCL
Full layer offloading
Partial layer offloading
Multi-GPU
--
Flash Attention
FP16 compute
BF16 compute
INT8 tensor cores
--
--
Environment Variables for GPU
# Enable specific backends (set before building)
export LLAMA_CUDA=1                # NVIDIA CUDA
export LLAMA_METAL=1               # Apple Metal
export LLAMA_HIPBLAS=1             # AMD ROCm (HIP)
export LLAMA_CLBLAST=1             # OpenCL (CLBlast)
# CUDA-specific
export CUDA_VISIBLE_DEVICES=0,1    # Select GPUs
export LLAMA_CUDA_F16=1            # Force FP16
# Metal-specific (auto-detected on macOS)
# No additional configuration needed
# ROCm-specific
export HIP_VISIBLE_DEVICES=0,1     # Select GPUs
Quantization Formats
Mullama supports all GGUF quantization types provided by llama.cpp:
Integer Quantization
| Format | Bits/Weight | Quality | Speed | Memory (7B) | Notes |
| --- | --- | --- | --- | --- | --- |
| Q2_K | 2.63 | Low | Fastest | ~2.6 GB | Extreme compression |
| Q3_K_S | 3.44 | Low-Med | Fast | ~3.0 GB | Small K-quant |
| Q3_K_M | 3.91 | Medium | Fast | ~3.3 GB | Medium K-quant |
| Q3_K_L | 4.27 | Med-High | Fast | ~3.6 GB | Large K-quant |
| Q4_0 | 4.50 | Medium | Fast | ~3.8 GB | Legacy, uniform |
| Q4_1 | 5.00 | Medium | Fast | ~4.2 GB | Legacy, with offset |
| Q4_K_S | 4.58 | Med-High | Fast | ~3.9 GB | Small K-quant |
| Q4_K_M | 4.85 | High | Balanced | ~4.1 GB | Recommended default |
| Q5_0 | 5.50 | High | Balanced | ~4.6 GB | Legacy, uniform |
| Q5_1 | 6.00 | High | Balanced | ~5.0 GB | Legacy, with offset |
| Q5_K_S | 5.54 | High | Balanced | ~4.7 GB | Small K-quant |
| Q5_K_M | 5.69 | Very High | Balanced | ~4.8 GB | Medium K-quant |
| Q6_K | 6.56 | Very High | Slower | ~5.5 GB | Highest integer quality |
| Q8_0 | 8.50 | Near-FP16 | Slower | ~7.1 GB | High quality |
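Floating-Point Formats
| Format | Bits/Weight | Quality | Speed | Memory (7B) | Notes |
| --- | --- | --- | --- | --- | --- |
| F16 | 16.00 | Reference | Slow | ~13.4 GB | Half precision |
| F32 | 32.00 | Maximum | Slowest | ~26.8 GB | Full precision |
| BF16 | 16.00 | Reference | Slow | ~13.4 GB | Brain floating point |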
IQ Formats
| Format | Bits/Weight | Quality | Speed | Memory (7B) | Notes |
| --- | --- | --- | --- | --- | --- |
| IQ1_S | 1.56 | Very Low | Fastest | ~1.6 GB | Extreme compression |
| IQ1_M | 1.75 | Low | Fastest | ~1.8 GB | Improved 1-bit |
| IQ2_XXS | 2.06 | Low | Very Fast | ~2.0 GB | Ultra-small |
| IQ2_XS | 2.31 | Low | Very Fast | ~2.2 GB | Extra-small |
| IQ2_S | 2.50 | Low-Med | Very Fast | ~2.4 GB | Small |
| IQ2_M | 2.70 | Medium | Fast | ~2.5 GB | Medium |
| IQ3_XXS | 3.06 | Medium | Fast | ~2.8 GB | Ultra-small 3-bit |
| IQ3_XS | 3.30 | Medium | Fast | ~3.0 GB | Extra-small 3-bit |
| IQ3_S | 3.44 | Med-High | Fast | ~3.2 GB | Small 3-bit |
| IQ3_M | 3.70 | Med-High | Fast | ~3.4 GB | Medium 3-bit |
| IQ4_NL | 4.50 | High | Balanced | ~3.8 GB | Non-linear 4-bit |
| IQ4_XS | 4.25 | High | Balanced | ~3.6 GB | Extra-small 4-bit |
Choosing a Quantization
For most use cases, Q4_K_M provides the best balance of quality, speed, and memory usage. Use Q5_K_M or Q6_K when quality is paramount. Use Q2_K or IQ2 variants for extremely memory-constrained environments (mobile, edge devices).
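As a back-of-the-envelope aid for this choice, weight storage can be approximated as parameters × bits per weight / 8. The sketch below applies that formula to a few bits-per-weight values from the tables and picks the largest format that fits a memory budget. The function names are hypothetical, the 6.74 billion parameter count used in the usage note is an assumption for a typical "7B" model, and real files also carry metadata and mixed-precision tensors, so the table's figures need not match this formula exactly.

```rust
/// Approximate weight storage in GB (10^9 bytes):
/// parameters * bits_per_weight / 8.
fn estimated_size_gb(parameters: f64, bits_per_weight: f64) -> f64 {
    parameters * bits_per_weight / 8.0 / 1e9
}

/// A subset of formats, ordered by increasing bits per weight
/// (values copied from the quantization tables).
const FORMATS: [(&str, f64); 6] = [
    ("Q2_K", 2.63),
    ("Q3_K_M", 3.91),
    ("Q4_K_M", 4.85),
    ("Q5_K_M", 5.69),
    ("Q6_K", 6.56),
    ("Q8_0", 8.50),
];

/// Returns the highest-precision listed format whose estimated size fits
/// within `budget_gb`, or `None` if even the smallest one does not fit.
fn largest_fitting_format(parameters: f64, budget_gb: f64) -> Option<&'static str> {
    FORMATS
        .iter()
        .rev()
        .find(|(_, bpw)| estimated_size_gb(parameters, *bpw) <= budget_gb)
        .map(|(name, _)| *name)
}
```

For example, `largest_fitting_format(6.74e9, 4.5)` selects Q4_K_M, since the Q5_K_M estimate of roughly 4.8 GB exceeds the budget. Runtime memory is higher than this estimate because the KV cache and compute buffers come on top of the weights.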
Quantization Selection Guide
Quality vs. memory tradeoff (chart of quality against memory usage), from highest quality and memory usage to lowest: F16/F32, Q8_0, Q6_K, Q5_K_M, Q5_K_S, Q4_K_M (recommended), Q4_K_S, Q3_K_M, Q3_K_S, Q2_K.
Feature Flags (Cargo)
Mullama uses Cargo feature flags to control compilation. Features can be combined:
| Feature | Description | Dependencies |
| --- | --- | --- |
| async | Tokio async runtime support | tokio |
| streaming | Real-time token streaming | -- |
| multimodal | Image processing pipeline | image crate |
| streaming-audio | Real-time audio capture + VAD | multimodal, audio libs |
| format-conversion | Audio/image format conversion | multimodal |
| web | Axum HTTP server framework | async, axum |
| websockets | WebSocket support | web, tokio-tungstenite |
| parallel | Rayon batch processing | rayon |
| daemon | CLI daemon mode | web, websockets |
| embedded-ui | Embedded Vue.js Web UI | daemon |
| full | All features enabled | all of the above |
Feature Dependency Chain
full
├── daemon
│   ├── embedded-ui
│   ├── web
│   │   ├── async
│   │   └── websockets
│   └── parallel
├── streaming-audio
│   └── multimodal
├── format-conversion
│   └── multimodal
└── streaming
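To make the transitive effect of these dependencies concrete, the sketch below computes which Mullama features end up enabled when one feature is requested. It encodes only the feature-to-feature entries from the Dependencies column of the feature flag table (external crates such as tokio or axum are left out), and the function names `direct_deps` and `resolve` are hypothetical. Cargo performs this resolution itself; the code is only an illustration.

```rust
use std::collections::BTreeSet;

/// Direct feature-to-feature dependencies, taken from the feature flag
/// table. External crate dependencies are intentionally omitted.
fn direct_deps(feature: &str) -> &'static [&'static str] {
    match feature {
        "streaming-audio" => &["multimodal"],
        "format-conversion" => &["multimodal"],
        "web" => &["async"],
        "websockets" => &["web"],
        "daemon" => &["web", "websockets"],
        "embedded-ui" => &["daemon"],
        _ => &[],
    }
}

/// Returns `feature` plus everything it enables transitively.
fn resolve(feature: &'static str) -> BTreeSet<&'static str> {
    let mut enabled = BTreeSet::new();
    let mut pending = vec![feature];
    while let Some(f) = pending.pop() {
        // `insert` returns false for features already visited, which
        // keeps shared dependencies from being expanded twice.
        if enabled.insert(f) {
            pending.extend(direct_deps(f).iter().copied());
        }
    }
    enabled
}
```

Under these assumptions, requesting `embedded-ui` also enables `daemon`, `web`, `websockets`, and `async`, so listing only the top-level feature on the command line is sufficient.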
Build Examples
# Minimal (core inference only)
cargo build --release --no-default-features
# Library with async and streaming
cargo build --release --features "async,streaming"
# Multimodal with audio
cargo build --release --features "multimodal,streaming-audio"
# Web server
cargo build --release --features "web,websockets"
# Full daemon with Web UI
cargo build --release --features "daemon,embedded-ui"
# Everything
cargo build --release --features full