Language Bindings
This is Mullama's key differentiator: native LLM inference in 6 languages with near-zero overhead.
In-process. Zero IPC. Zero HTTP.
Unlike Ollama (HTTP-only), Mullama bindings call directly into compiled Rust code. No serialization, no network calls, no separate process.
Result: 100-1000x faster call initiation. Microseconds instead of milliseconds.
Mullama provides official bindings for Node.js, Python, Go, PHP, Rust, and C/C++. All bindings share the same high-performance Rust core through a stable C ABI.
Architecture
All bindings share a common architecture that ensures consistent behavior and performance across languages:
graph TD
A["<b>Rust Core</b><br/>Model loading, inference engine,<br/>sampling, embeddings, KV cache"] --> B["<b>FFI Layer (C ABI)</b><br/>~50 exported functions,<br/>handle management, thread-local errors,<br/>streaming callbacks"]
B --> C["<b>Node.js</b><br/>napi-rs"]
B --> D["<b>Python</b><br/>PyO3"]
B --> E["<b>Go</b><br/>cgo"]
B --> F["<b>PHP</b><br/>PHP FFI"]
B --> G["<b>C/C++</b><br/>Direct linking"]
+-----------------------------------------------+
|              mullama (Rust core)              |
|      Model loading, inference, sampling,      |
|      embeddings, KV cache, tokenization       |
+-----------------------+-----------------------+
                        |
+-----------------------+-----------------------+
|              mullama-ffi (C ABI)              |
|     ~50 FFI functions, handle management,     |
|   thread-local errors, streaming callbacks    |
+---+---------+---------+---------+---------+---+
    |         |         |         |         |
+---+---+ +---+---+ +---+---+ +---+---+ +---+---+
|napi-rs| | PyO3  | |  PHP  | |  cgo  | |   C   |
|Node.js| |Python | |  FFI  | |  Go   | |  C++  |
+-------+ +-------+ +-------+ +-------+ +-------+
The Rust core provides the model loading, inference engine, and sampling algorithms. The FFI layer exposes a stable C ABI with memory-safe handle management, thread-local error messages, and callback-based streaming. Each language binding wraps this C ABI using idiomatic patterns for that language.
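The three FFI conventions named above (opaque handles, thread-local error messages, and callback-based streaming) can be illustrated with a small pure-Python simulation. This is only a sketch of the pattern: every name in it (`model_open`, `generate_stream`, `last_error`, `model_close`) is hypothetical and does not come from the Mullama FFI, and the "inference" is a stand-in that splits the prompt on whitespace.

```python
import threading
from typing import Callable, Dict, Optional

# All names below are hypothetical; they mimic the conventions, not Mullama's API.
_tls = threading.local()            # one error slot per thread
_handles: Dict[int, str] = {}       # opaque handle -> internal state
_lock = threading.Lock()
_next_id = 1


def last_error() -> Optional[str]:
    """Return the most recent error recorded on the calling thread, if any."""
    return getattr(_tls, "message", None)


def model_open(path: str) -> int:
    """Return a nonzero handle, or 0 on failure (details via last_error())."""
    global _next_id
    if not path.endswith(".gguf"):
        _tls.message = "unsupported model file: " + path
        return 0
    with _lock:
        handle = _next_id
        _next_id += 1
        _handles[handle] = path
    return handle


def model_close(handle: int) -> None:
    """Release a handle; closing an unknown handle is a no-op."""
    with _lock:
        _handles.pop(handle, None)


def generate_stream(handle: int, prompt: str,
                    on_token: Callable[[str], bool]) -> int:
    """Deliver tokens through a callback; the callback returns False to cancel.

    Returns the number of tokens delivered, or -1 on error.
    """
    with _lock:
        known = handle in _handles
    if not known:
        _tls.message = "invalid model handle"
        return -1
    delivered = 0
    for token in prompt.split():    # stand-in for real inference
        delivered += 1
        if not on_token(token):
            break
    return delivered


def show(token: str) -> bool:
    print(token)
    return True                     # keep streaming


if __name__ == "__main__":
    h = model_open("demo.gguf")
    generate_stream(h, "hello from the core", show)
    model_close(h)
```

Returning a sentinel (0 or -1) and storing the message per thread keeps the boundary limited to primitive types, which is what lets each language wrap the same calls with its own idiomatic error handling.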
Near-Zero Overhead
All language bindings call directly into compiled Rust code through the C FFI layer. There is no serialization, no IPC, and no intermediate process. The only overhead is the function call boundary itself, which is negligible compared to the inference computation.
Feature Comparison
| Feature | Node.js | Python | Go | PHP | C/C++ |
|---|---|---|---|---|---|
| Model Loading | |||||
| Text Generation | |||||
| Streaming Generation | |||||
| Async Support | |||||
| Tokenization | |||||
| Embeddings | |||||
| Batch Embeddings | |||||
| Chat Templates | |||||
| Structured Output | |||||
| GPU Offload | |||||
| Cosine Similarity | |||||
| Cancellation | |||||
| TypeScript Types | | N/A | N/A | N/A | N/A |
| Type Stubs | N/A | | N/A | N/A | N/A |
Streaming in PHP
PHP's streaming support currently returns results as an array rather than providing real-time token callbacks. The C/C++ FFI layer does support true streaming via callbacks, but PHP FFI does not support callback functions.
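The difference between the two delivery styles can be sketched in a few lines. This is an illustration under stated assumptions, not the PHP binding's code: `stream_tokens` is a hypothetical stand-in for a callback-based streaming call, and `generate_buffered` shows how a wrapper can collect everything and return it as one array once generation finishes.

```python
from typing import Callable, List


def stream_tokens(prompt: str, on_token: Callable[[str], None]) -> None:
    """Hypothetical stand-in for a callback-based streaming call."""
    for token in prompt.split():
        on_token(token)             # caller sees each token as it is produced


def generate_buffered(prompt: str) -> List[str]:
    """Collect all tokens first, then hand them back in one array."""
    tokens: List[str] = []
    stream_tokens(prompt, tokens.append)
    return tokens                   # nothing is visible until generation ends


if __name__ == "__main__":
    stream_tokens("one two three", print)       # real-time style
    print(generate_buffered("one two three"))   # buffered style
```

The buffered form yields the same tokens, but the caller cannot display partial output or stop early.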
Platform Support
Pre-built binaries are available for the following platforms:
| Platform | Architecture | CPU | CUDA | Metal | Node.js | Python | Go | PHP | C/C++ |
|---|---|---|---|---|---|---|---|---|---|
| Linux | x64 | | | - | | | | | |
| Linux | ARM64 | | - | - | | | | | |
| macOS | x64 (Intel) | | - | - | | | | | |
| macOS | ARM64 (Apple Silicon) | | - | | | | | | |
| Windows | x64 | | | - | | | | | |
Installation Summary
Requires Node.js >= 16. Pre-built native addons via napi-rs for all major platforms. TypeScript definitions included.
Requires Python >= 3.8. Pre-built wheels via PyO3 and maturin. NumPy integration for embeddings.
Requires Go >= 1.21 and CGO_ENABLED=1. The libmullama_ffi shared library must be available at link time.
Requires PHP >= 8.1 with the FFI extension enabled. Uses the pre-built libmullama_ffi shared library.
Common API Patterns
All bindings follow a consistent pattern regardless of language. Here is the same operation shown across all supported languages:
1. Load a Model
2. Create a Context
3. Generate Text
4. Generate Embeddings
Sampler Presets
All bindings provide the same four sampler presets for consistent behavior across languages. These presets cover common use cases without requiring manual tuning of sampling parameters:
| Preset | Temperature | Top-K | Top-P | Min-P | Repeat Penalty | Use Case |
|---|---|---|---|---|---|---|
| greedy | 0.0 | 1 | 1.0 | 0.0 | 1.0 | Deterministic, factual output |
| precise | 0.3 | 20 | 0.8 | 0.05 | 1.1 | Focused, low-variance responses |
| default | 0.8 | 40 | 0.95 | 0.05 | 1.1 | Balanced generation |
| creative | 1.2 | 100 | 0.95 | 0.0 | 1.0 | Creative writing, brainstorming |
Choosing a Preset
- Use greedy for tasks with a single correct answer (math, code completion, factual Q&A).
- Use precise for structured tasks that benefit from slight variation (summaries, translations).
- Use default for general-purpose generation (chatbots, assistants).
- Use creative for open-ended tasks (story writing, brainstorming, poetry).
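To make the preset values concrete, the sketch below encodes the table as data and applies temperature, top-k, top-p, and min-p to a toy logit table. It is a simplified illustration, not Mullama's sampler: the `SamplerPreset` type, the `sample` function, and the order of the filtering steps are assumptions made for this example, and the repeat penalty is carried as data but not applied.

```python
import math
import random
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SamplerPreset:
    temperature: float
    top_k: int
    top_p: float
    min_p: float
    repeat_penalty: float   # stored for completeness; not applied in this sketch


# Values copied from the preset table above.
PRESETS = {
    "greedy": SamplerPreset(0.0, 1, 1.0, 0.0, 1.0),
    "precise": SamplerPreset(0.3, 20, 0.8, 0.05, 1.1),
    "default": SamplerPreset(0.8, 40, 0.95, 0.05, 1.1),
    "creative": SamplerPreset(1.2, 100, 0.95, 0.0, 1.0),
}


def sample(logits: Dict[str, float], preset: SamplerPreset,
           rng: random.Random) -> str:
    """Pick one token from a non-empty logit table (assumed step order)."""
    if preset.temperature <= 0.0:
        return max(logits, key=logits.get)      # greedy: always the top token

    # Temperature-scaled softmax, shifted by the peak for numerical stability.
    scaled = {t: v / preset.temperature for t, v in logits.items()}
    peak = max(scaled.values())
    weights = {t: math.exp(v - peak) for t, v in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((w / total, t) for t, w in weights.items()), reverse=True)

    ranked = ranked[:preset.top_k]              # top-k: keep the k most likely

    kept, cumulative = [], 0.0                  # top-p: smallest prefix >= top_p
    for prob, token in ranked:
        kept.append((prob, token))
        cumulative += prob
        if cumulative >= preset.top_p:
            break

    floor = preset.min_p * kept[0][0]           # min-p: relative to best token
    kept = [(p, t) for p, t in kept if p >= floor]

    tokens = [t for _, t in kept]
    probs = [p for p, _ in kept]
    return rng.choices(tokens, weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    toy_logits = {"Paris": 3.0, "Lyon": 1.0, "Nice": 0.5}
    for name, preset in PRESETS.items():
        print(name, sample(toy_logits, preset, rng))
```

Lower temperatures and tighter top-k/top-p values concentrate probability on the most likely tokens, which is why `greedy` and `precise` suit tasks with a narrow set of acceptable answers.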
Building from Source
All bindings can be built from source. Prerequisites:
- Rust toolchain (1.75+)
- System dependencies (see Platform Setup)
- Language-specific tools (listed below)
# Run every command from the repository root; each subshell returns there afterwards.
# Build the FFI shared library (required by Go, PHP, and C/C++)
cargo build --release -p mullama-ffi
# Build Node.js native module
(cd bindings/node && npm install && npm run build)
# Build Python wheel
(cd bindings/python && pip install maturin && maturin build --release)
# Build Go bindings (compilation check; libmullama_ffi must be available at link time)
(cd bindings/go && CGO_ENABLED=1 go build ./...)
# Install PHP dependencies and run the tests
(cd bindings/php && composer install && composer test)
| Language | Build Tool | Build Command |
|---|---|---|
| Node.js | napi-rs, @napi-rs/cli | npm run build |
| Python | PyO3, maturin | maturin build --release |
| Go | cgo | go build ./... |
| PHP | Composer | composer install |
| C/C++ | Cargo | cargo build --release -p mullama-ffi |
Performance Notes
All bindings achieve near-native performance because they call directly into compiled Rust code:
- No serialization -- data is passed through the C ABI as raw pointers and primitive types.
- No IPC -- everything runs in-process, in the same address space.
- No intermediate runtime -- the FFI layer is a thin C wrapper around Rust functions.
- Zero-copy where possible -- embeddings and token arrays are returned as direct memory views when the language supports it (e.g., NumPy arrays in Python).
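The zero-copy point can be demonstrated with the Python standard library alone. In this sketch an `array` of 32-bit floats stands in for an embedding buffer owned by native code (an assumption for illustration; no Mullama call is involved). A `memoryview` exposes the same memory without copying, while `tolist()` produces an independent copy.

```python
from array import array
from typing import Tuple


def zero_copy_demo() -> Tuple[float, float, int]:
    # Stand-in for an embedding buffer owned by the native side (float32).
    buffer = array("f", [0.25, 0.5, 0.75])

    view = memoryview(buffer)   # zero-copy: shares the underlying memory
    copied = buffer.tolist()    # copy: independent Python list

    buffer[0] = 1.0             # mutate the original storage

    # The view observes the change; the copy does not.
    return view[0], copied[0], view.nbytes


if __name__ == "__main__":
    print(zero_copy_demo())
```

A view avoids duplicating the data, which matters when embeddings are large or produced in batches. The trade-off is that the view is only valid while the underlying buffer is alive.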
The primary performance factors are:
- GPU offloading -- set GPU layers to -1 for maximum throughput.
- Batch size -- larger batches improve prompt processing speed.
- Context size -- smaller contexts use less memory and can be faster.
- Model quantization -- use Q4_K_M or Q5_K_M for the best speed/quality balance.
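The effect of context size on memory can be estimated with the standard key/value cache formula for transformer models: two tensors (keys and values) per layer, each holding one vector per context position and KV head. The helper below is a back-of-the-envelope sketch; the function name and the example model shape (32 layers, 32 KV heads, head dimension 128, 16-bit cache values) are illustrative assumptions rather than figures from Mullama.

```python
def kv_cache_bytes(n_layers: int, n_ctx: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Estimate KV cache size: keys + values for every layer and position."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_value


if __name__ == "__main__":
    gib = 1024 ** 3
    for n_ctx in (2048, 4096, 8192):
        size = kv_cache_bytes(n_layers=32, n_ctx=n_ctx,
                              n_kv_heads=32, head_dim=128)
        print(f"context {n_ctx}: {size / gib:.1f} GiB")
```

Because the estimate is linear in `n_ctx`, halving the context halves the cache, independent of how the model weights are quantized.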
Next Steps
Choose the binding for your language:
- Node.js: High-performance bindings via napi-rs with full TypeScript support, async patterns, and framework integrations for Express, Fastify, and Next.js.
- Python: PyO3-based bindings with NumPy integration for embeddings, type stubs, and framework integrations for FastAPI, Flask, and Django.
- Go: cgo-based bindings with idiomatic Go error handling, goroutine-safe model sharing, and net/http server examples.
- PHP: FFI-based bindings for PHP 8.1+ with Laravel and Symfony integrations and Composer distribution.
- C/C++: Direct FFI layer access with complete API reference, RAII wrappers for C++, CMake integration, and cancellation support.