Library Guide
This guide covers library-first LLM inference with Mullama. Embed a high-performance inference engine directly in your application -- no daemon, no HTTP, no separate process.
Library vs Daemon
This guide is for library usage (direct function calls in your code). If you need a server with OpenAI-compatible APIs, see the Daemon & CLI docs.
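Library advantages: no HTTP or serialization overhead and in-process inference, so call overhead is on the order of microseconds rather than a network round trip. Generation time itself still depends on the model and hardware.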
The APIs documented here work across all 6 supported languages: Node.js, Python, Rust, Go, PHP, and C/C++.
Architecture Overview
Mullama is organized in three layers, each building on the one below:
```text
┌─────────────────────────────────────────────────────────────────┐
│                        Integration Layer                        │
│       Async │ Streaming │ Web/WS │ Multimodal │ Parallel        │
├─────────────────────────────────────────────────────────────────┤
│                         Core API Layer                          │
│          Model │ Context │ Sampler │ Batch │ Embedding          │
├─────────────────────────────────────────────────────────────────┤
│                        Foundation Layer                         │
│       sys.rs (FFI bindings) │ build.rs (platform config)        │
│       llama.cpp C++ library │ GPU acceleration backends         │
└─────────────────────────────────────────────────────────────────┘
```
Foundation Layer (sys.rs, build.rs): Low-level FFI bindings to the llama.cpp C++ library. Handles platform-specific build configuration, GPU acceleration detection, and memory-safe wrappers around unsafe C operations. You never interact with this layer directly.
Core API Layer: The primary API surface that applications interact with. Provides safe, ergonomic abstractions over the foundation layer. Available in all supported languages: Node.js, Python, Rust, Go, PHP, and C/C++.
Integration Layer: Optional, feature-gated modules that extend core functionality for specific use cases like async runtimes, streaming, web services, and multimodal processing.
Core Concepts
Model
The Model represents a loaded GGUF model file. It provides access to the model's vocabulary, architecture parameters, and tokenization capabilities. Models are designed for shared ownership -- a single loaded model can serve many concurrent inference contexts.
Context
The Context holds the inference state, including the KV cache and generation position. You create a context from a model and use it to run inference.
Sampler
The Sampler controls how the next token is selected from the model's probability distribution. Mullama supports all llama.cpp sampling strategies including temperature, top-k, top-p, min-p, mirostat, and grammar-constrained sampling.
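To make the idea of token selection concrete, here is a minimal sketch of two of the strategies named above: temperature scaling followed by top-k filtering. It is a textbook illustration written with the standard library only, not Mullama's or llama.cpp's implementation. The function names `top_k_probs` and `sample` are hypothetical, and the caller is assumed to supply a temperature greater than zero and a uniform random number `u`.

```rust
/// Turn raw logits into probabilities after temperature scaling and top-k
/// filtering. Returns (token_id, probability) pairs, most likely first.
/// Assumes `temperature > 0.0`. Hypothetical helper, not a Mullama API.
fn top_k_probs(logits: &[f32], temperature: f32, k: usize) -> Vec<(usize, f32)> {
    if logits.is_empty() {
        return Vec::new();
    }

    // Pair each logit with its token id and divide by the temperature.
    let mut scaled: Vec<(usize, f32)> = logits
        .iter()
        .enumerate()
        .map(|(id, &logit)| (id, logit / temperature))
        .collect();

    // Keep only the k highest-scoring candidates (at least one).
    scaled.sort_by(|a, b| b.1.total_cmp(&a.1));
    scaled.truncate(k.max(1));

    // Softmax over the survivors; subtracting the max avoids overflow.
    let max = scaled[0].1;
    let sum: f32 = scaled.iter().map(|(_, l)| (l - max).exp()).sum();
    scaled
        .into_iter()
        .map(|(id, l)| (id, (l - max).exp() / sum))
        .collect()
}

/// Pick a token by walking the cumulative distribution until it exceeds `u`,
/// where `u` is a uniform random number in [0, 1) supplied by the caller.
fn sample(probs: &[(usize, f32)], u: f32) -> Option<usize> {
    let mut cumulative = 0.0;
    for &(id, p) in probs {
        cumulative += p;
        if u < cumulative {
            return Some(id);
        }
    }
    // Rounding can leave a tiny gap below 1.0; fall back to the last candidate.
    probs.last().map(|&(id, _)| id)
}
```

Lower temperatures sharpen the distribution toward the top candidate, while a smaller `k` removes unlikely tokens from consideration entirely. Passing `u` in explicitly keeps the sketch deterministic and free of external crates.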
Batch
The Batch enables efficient processing of multiple token sequences simultaneously, which is essential for high-throughput applications.
Embedding
The Embedding module generates vector representations of text, useful for semantic search, RAG pipelines, and similarity calculations.
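As an illustration of the similarity calculations mentioned above, the sketch below compares embedding vectors with cosine similarity and picks the closest candidate. It operates on plain `f32` slices and does not call Mullama. The helper names `cosine_similarity` and `most_similar` are hypothetical, and the vectors are assumed to have been produced elsewhere.

```rust
/// Cosine similarity between two equal-length vectors.
/// Returns None if the lengths differ or either vector has zero magnitude.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Index of the candidate most similar to `query`, skipping candidates
/// that cannot be compared (wrong length or zero vector).
fn most_similar(query: &[f32], candidates: &[Vec<f32>]) -> Option<usize> {
    candidates
        .iter()
        .enumerate()
        .filter_map(|(i, c)| cosine_similarity(query, c).map(|score| (i, score)))
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(i, _)| i)
}
```

In a semantic search or RAG pipeline, `query` would be the embedding of the user's question and `candidates` the embeddings of stored documents. A linear scan like this is only reasonable for small collections.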
Design Patterns
Builder Pattern
Mullama uses the builder pattern extensively for complex configuration. All builders follow the same pattern: chain configuration calls, then finish with .build().
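The following is a minimal sketch of what a builder with a `.build()` completion step can look like in Rust. The types `ContextConfig` and `ContextConfigBuilder`, their fields, and the default values are hypothetical and chosen for illustration; they are not Mullama's actual types.

```rust
/// Hypothetical settings for an inference context (illustrative only).
#[derive(Debug, Clone, PartialEq)]
struct ContextConfig {
    n_ctx: u32,
    n_threads: u32,
    temperature: f32,
}

/// Builder that starts from defaults and validates in `build()`.
#[derive(Debug, Clone)]
struct ContextConfigBuilder {
    n_ctx: u32,
    n_threads: u32,
    temperature: f32,
}

impl ContextConfigBuilder {
    fn new() -> Self {
        // Assumed defaults, chosen only for this example.
        Self { n_ctx: 2048, n_threads: 4, temperature: 0.8 }
    }

    fn n_ctx(mut self, n_ctx: u32) -> Self {
        self.n_ctx = n_ctx;
        self
    }

    fn n_threads(mut self, n_threads: u32) -> Self {
        self.n_threads = n_threads;
        self
    }

    fn temperature(mut self, temperature: f32) -> Self {
        self.temperature = temperature;
        self
    }

    /// Consume the builder, rejecting invalid settings in one place.
    fn build(self) -> Result<ContextConfig, String> {
        if self.n_ctx == 0 {
            return Err("n_ctx must be greater than zero".to_string());
        }
        if self.temperature < 0.0 {
            return Err("temperature must not be negative".to_string());
        }
        Ok(ContextConfig {
            n_ctx: self.n_ctx,
            n_threads: self.n_threads,
            temperature: self.temperature,
        })
    }
}
```

With this shape, `ContextConfigBuilder::new().n_ctx(4096).temperature(0.7).build()` overrides two settings and keeps the default for the third. Validating inside `build()` means an invalid configuration is reported before any expensive resource is created.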
RAII (Resource Acquisition Is Initialization)
All resources (models, contexts, samplers, LoRA adapters) are cleaned up automatically. In Rust, the Drop trait releases them deterministically when they go out of scope. In Node.js and Python, they are released when the garbage collector reclaims the owning object.
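A small sketch of scope-based cleanup in Rust, under the assumption that a native resource is wrapped in a struct implementing `Drop`. The `Handle` type and the shared counter are hypothetical stand-ins used to make the cleanup order observable; a real wrapper would call the C library's free function instead.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical stand-in for a native resource such as a model or context.
struct Handle {
    name: &'static str,
    released: Rc<Cell<u32>>, // counts how many handles have been cleaned up
}

impl Drop for Handle {
    fn drop(&mut self) {
        // A real wrapper would free the underlying native resource here.
        self.released.set(self.released.get() + 1);
        println!("released {}", self.name);
    }
}

/// Create two handles in nested scopes and let Rust drop them.
fn run_scoped(released: Rc<Cell<u32>>) {
    let _model = Handle { name: "model", released: Rc::clone(&released) };
    {
        let _context = Handle { name: "context", released: Rc::clone(&released) };
        // `_context` is dropped at the end of this inner block.
    }
    // Only the context has been released so far.
    assert_eq!(released.get(), 1);
    // `_model` is dropped when the function returns.
}
```

No explicit free call appears anywhere in `run_scoped`: the inner handle is released when its block ends and the outer one when the function returns, which is the deterministic behaviour the `Drop` trait provides.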
Result-Based Error Handling
All fallible operations return errors in a language-appropriate way: exceptions in Node.js/Python, Result<T, MullamaError> in Rust.
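The sketch below shows the general shape of Result-based error handling in Rust, including propagation with the `?` operator. The `SketchError` enum and both functions are hypothetical and exist only for this example; the variants of the real `MullamaError` are not documented here and may differ.

```rust
use std::fmt;

/// Hypothetical error type for illustration; not the real MullamaError.
#[derive(Debug, PartialEq)]
enum SketchError {
    EmptyPath,
    UnsupportedFormat(String),
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchError::EmptyPath => write!(f, "model path is empty"),
            SketchError::UnsupportedFormat(p) => write!(f, "not a GGUF file: {p}"),
        }
    }
}

impl std::error::Error for SketchError {}

/// Check a path without touching the filesystem.
fn validate_model_path(path: &str) -> Result<&str, SketchError> {
    if path.is_empty() {
        return Err(SketchError::EmptyPath);
    }
    if !path.ends_with(".gguf") {
        return Err(SketchError::UnsupportedFormat(path.to_string()));
    }
    Ok(path)
}

/// `?` returns early with the error, so the happy path stays linear.
fn describe(path: &str) -> Result<String, SketchError> {
    let valid = validate_model_path(path)?;
    Ok(format!("would load {valid}"))
}
```

Callers can `match` on the returned value to handle individual variants, or propagate the error further with `?`. In Node.js and Python the equivalent failure would surface as an exception.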
Arc for Shared Ownership
Models can be shared across multiple contexts and threads. In Rust, wrap models in Arc. In Node.js and Python, reference counting is handled automatically.
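Here is a minimal sketch of sharing one read-only value across threads with `std::sync::Arc`. The `Model` struct and the `vocab_sizes_from_workers` function are hypothetical stand-ins; the sketch assumes only that the shared value is safe to read concurrently.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical read-only data standing in for a loaded model.
struct Model {
    vocab: Vec<String>,
}

/// Spawn one worker per "context"; each owns an Arc clone of the same model.
fn vocab_sizes_from_workers(model: Arc<Model>, workers: usize) -> Vec<usize> {
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            // Cloning the Arc bumps a reference count; the model is not copied.
            let model = Arc::clone(&model);
            thread::spawn(move || model.vocab.len())
        })
        .collect();

    // Wait for every worker and collect its result in spawn order.
    handles
        .into_iter()
        .map(|h| h.join().expect("worker thread panicked"))
        .collect()
}
```

Each worker drops its clone when it finishes, and the underlying data is freed once the last `Arc` goes away. This is why a single loaded model can back many contexts without being loaded more than once.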
Feature Flags
Mullama uses Cargo feature flags (Rust) to keep the core library lightweight. In Node.js and Python, all features are bundled in the native binary.
| Feature | Description | Dependencies |
|---|---|---|
| async | AsyncModel, AsyncContext, Tokio integration | tokio |
| streaming | TokenStream, StreamConfig | async |
| web | Axum REST API integration | async |
| websockets | Real-time bidirectional communication | async |
| multimodal | Image and audio processing | -- |
| streaming-audio | Real-time audio capture | multimodal |
| format-conversion | Audio/image format conversion | multimodal |
| parallel | Rayon-based batch processing | -- |
| tokio-runtime | MullamaRuntime, TaskManager | async |
| full | All features enabled | all |
```toml
# Rust: enable specific features in Cargo.toml
[dependencies]
mullama = { version = "0.3", features = ["async", "streaming"] }
```
Node.js and Python
The Node.js (@mullama/node) and Python (mullama) packages include all features by default. No additional configuration is needed.
How to Read This Guide
The guide is organized from fundamental to advanced topics:
- Loading Models -- Start here. Learn how to load and configure GGUF models.
- Text Generation -- Core inference: context parameters, sampling basics, and chat templates.
- Streaming -- Real-time token output for responsive applications.
- Async Support -- Non-blocking and concurrent inference.
- Embeddings -- Vector representations for search and RAG.
- Sampling Strategies -- Deep dive into all sampling methods and configurations.
- Structured Output -- JSON Schema-constrained generation.
- Grammar Constraints -- GBNF grammars for arbitrary output formats.
- LoRA Adapters -- Fine-tuned adapter loading and hot-swapping.
- Multimodal -- Vision-language and audio-language model support.
- Sessions & State -- Saving and restoring inference state.
- Memory Management -- KV cache management, monitoring, and optimization.
Quick Start
If you are new to Mullama, start with Loading Models and Text Generation to get a working example running quickly.
See Also
- Getting Started -- Installation and platform setup
- API Reference -- Complete API documentation
- Tutorials & Examples -- End-to-end application examples
- Language Bindings -- Language-specific binding documentation