API Reference¶

Complete API documentation for all Mullama types, traits, and functions. This reference covers the full public API surface of the library, including core types, feature-gated modules, and utility functions.

How to Read This Reference¶

Each API page follows a consistent structure:

Type signatures are shown as Rust code blocks with full generic bounds
Parameter tables use the format: Name | Type | Default | Description
Error conditions are listed under each method with the specific MullamaError variant returned
Feature gates are indicated with admonitions at the top of each section
Examples demonstrate typical usage patterns with compilable code

Conventions¶

impl AsRef<Path> indicates any type convertible to a filesystem path (e.g., &str, String, PathBuf)
Arc<Model> indicates shared ownership -- the caller retains a reference
&mut self indicates the method mutates internal state
Result<T, MullamaError> is the standard return type for fallible operations

Core Types Diagram¶

                    +------------------+
                    |   MullamaConfig  |
                    | (ModelConfig,    |
                    |  ContextConfig,  |
                    |  SamplerConfig)  |
                    +--------+---------+
                             |
                             v
+------------+       +-------+--------+       +----------------+
| ModelParams| ----> |     Model      | <---- | ModelBuilder    |
+------------+       | (Arc<Inner>)   |       +----------------+
                     | Send + Sync    |
                     +---+-----+------+
                         |     |
            tokenize()   |     |  create_context()
                         v     v
              +----------+     +------------+       +---------------+
              | Vec<Token>     |   Context   | <---- | ContextParams |
              +----------+     | (!Send)     |       +---------------+
                               +---+----+---+
                                   |    |
                        decode()   |    |  generate()
                                   v    v
                    +-----------+  +----+------+
                    |   Batch   |  |  String   |
                    | (SmallVec)|  +-----------+
                    +-----------+
                                   +---------------+
                                   | SamplerChain  |
                                   | (Send + Sync) |
                                   +---+-----------+
                                       |
                                       v
                              +--------+--------+
                              |    Sampler      |
                              | (Send + Sync)   |
                              +-----------------+

Module Organization¶

Module	Description	Feature Gate
`model`	Model loading, metadata, and tokenization	Core
`context`	Inference context, generation, and KV-cache	Core
`sampling`	Sampling strategies and composable chains	Core
`batch`	Efficient multi-token batch processing	Core
`embeddings`	Text embedding generation and similarity	Core
`config`	Configuration management with serde	Core
`error`	Error types and handling patterns	Core
`multimodal`	Text, image, and audio processing	`multimodal`
`async_support`	Async model, context, and runtime	`async`
`streaming`	Real-time token streaming	`streaming`

Core Types Quick Reference¶

Type	Module	Description
`Model`	`model`	Loaded LLM model (Arc-based, Clone, Send+Sync)
`ModelParams`	`model`	Parameters for model loading
`ModelBuilder`	`model`	Fluent API for model configuration
`Context`	`context`	Inference context with KV-cache (!Send)
`ContextParams`	`context`	Context configuration parameters
`KvCacheType`	`context`	KV-cache quantization level
`SamplerParams`	`sampling`	High-level sampling configuration
`SamplerChain`	`sampling`	Composable chain of samplers
`Sampler`	`sampling`	Individual sampling strategy
`Batch`	`batch`	Token batch with SmallVec optimization
`Embeddings`	`embedding`	Generated embedding vectors
`EmbeddingGenerator`	`embedding`	Embedding generation utility
`EmbeddingConfig`	`embedding`	Embedding configuration
`PoolingStrategy`	`embedding`	Embedding pooling method
`MullamaConfig`	`config`	Top-level configuration struct
`MullamaError`	`error`	Unified error type

Feature-Gated Types¶

Types that require specific Cargo features to be enabled:

Type	Required Feature	Description
`AsyncModel`	`async`	Non-blocking model wrapper
`AsyncContext`	`async`	Non-blocking context wrapper
`MullamaRuntime`	`tokio-runtime`	Tokio runtime manager
`ModelPool`	`tokio-runtime`	Connection pool for models
`TaskManager`	`tokio-runtime`	Async task coordinator
`TokenStream`	`streaming`	Real-time token stream
`StreamConfig`	`streaming`	Streaming configuration
`MultimodalProcessor`	`multimodal`	Cross-modal processing
`VisionEncoder`	`multimodal`	Image encoding pipeline
`ImageInput`	`multimodal`	Image data container
`AudioInput`	`multimodal`	Audio data container
`StreamingAudioProcessor`	`streaming-audio`	Real-time audio capture
`AudioStreamConfig`	`streaming-audio`	Audio stream settings
`AppState`	`web`	Axum application state
`RouterBuilder`	`web`	Web route configuration
`WebSocketServer`	`websockets`	WebSocket server
`WebSocketConfig`	`websockets`	WebSocket settings

Thread Safety Overview¶

Mullama types are designed with clear thread safety semantics:

Type	Send	Sync	Clone	Notes
`Model`	Yes	Yes	Yes (cheap)	Arc-based sharing, reference count increment
`Context`	No	No	No	Bound to creating thread; use per-thread instances
`SamplerChain`	Yes	Yes	No	Can be moved between threads
`Sampler`	Yes	Yes	No	Can be moved between threads
`Batch`	Yes	No	Yes	Move between threads, not shared references
`Embeddings`	Yes	Yes	Yes	Plain data, freely shareable
`AsyncModel`	Yes	Yes	Yes (cheap)	Arc-based, designed for concurrent use

Context Thread Safety

Context is intentionally not Send. It holds mutable state (KV-cache, position counters) that is not safe for concurrent access. Each thread must create its own Context from a shared Arc<Model>. For async contexts, use AsyncContext which handles thread-safety internally via spawn_blocking.

Typical Multi-Threaded Pattern¶

use mullama::{Model, Context, ContextParams};
use std::sync::Arc;
use std::thread;

let model = Arc::new(Model::load("model.gguf")?);

let handles: Vec<_> = (0..4).map(|i| {
    let model = model.clone(); // Cheap Arc clone
    thread::spawn(move || {
        // Each thread creates its own context
        let mut ctx = Context::new(model, ContextParams::default()).unwrap();
        ctx.generate(&[1, 2, 3], 50).unwrap()
    })
}).collect();

for handle in handles {
    let result = handle.join().unwrap();
    println!("{}", result);
}

Versioning and Stability Guarantees¶

Mullama follows Semantic Versioning:

Major version (0.x): The library is in pre-1.0 development. Breaking changes may occur between minor versions.
Minor version: May include new features and non-breaking API additions.
Patch version: Bug fixes and documentation improvements only.

Stability Tiers¶

Tier	Guarantee	Types
Stable	No breaking changes without major bump	`Model`, `Context`, `SamplerParams`, `Batch`, `Embeddings`, `MullamaError`
Unstable	May change between minor versions	Feature-gated types (`AsyncModel`, `TokenStream`, `MultimodalProcessor`)
Internal	No stability guarantee	Types in `sys` module, FFI bindings

Feature Flags¶

[dependencies.mullama]
version = "0.3"
features = [
    "async",            # AsyncModel, AsyncContext
    "streaming",        # TokenStream, StreamConfig (requires "async")
    "multimodal",       # MultimodalProcessor, ImageInput, AudioInput
    "streaming-audio",  # StreamingAudioProcessor (requires "multimodal")
    "format-conversion",# AudioConverter, ImageConverter (requires "multimodal")
    "web",              # Axum integration (requires "async")
    "websockets",       # WebSocket server (requires "async")
    "parallel",         # ParallelProcessor, batch operations
    "tokio-runtime",    # MullamaRuntime, TaskManager
    "late-interaction", # ColBERT-style multi-vector embeddings
    "daemon",           # Background daemon mode
    "full",             # All features enabled
]

Dependency Chains¶

When enabling features, these dependencies are automatically resolved:

streaming requires async
streaming-audio requires multimodal
format-conversion requires multimodal
web requires async
websockets requires async
full enables all features

Cross-Language Equivalents¶

Where applicable, API pages note equivalent APIs in other language bindings:

Rust	Node.js	Python
`Model::load(path)`	`Model.load(path)`	`Model.load(path)`
`model.tokenize(text, true, false)`	`model.tokenize(text)`	`model.tokenize(text)`
`ctx.generate(&tokens, max)`	`ctx.generate(tokens, max)`	`ctx.generate(tokens, max)`
`SamplerParams::default()`	`new SamplerParams()`	`SamplerParams()`

See the Bindings documentation for complete language-specific guides.