Drop-in Ollama replacement. Native language bindings. Production-ready.

Mullama is a local LLM server and library that works just like Ollama -- same CLI commands, same Modelfile syntax -- but with native bindings for Python, Node.js, Go, PHP, Rust, and C/C++.

curl -fsSL https://mullama.cognisoc.com/install.sh | sh
mullama run llama3.2:1b "Hello!"


Two Ways to Use Mullama

Use as a Library

Embed LLM inference directly in your application with native bindings. No HTTP overhead, no separate process -- just import and generate.

Supported: Node.js, Python, Rust, Go, PHP, C/C++

Library Guide

Use as a Server

Run Mullama as a daemon with OpenAI- and Anthropic-compatible APIs, a web UI, and multi-model management. It is a drop-in replacement for Ollama: same CLI commands, same Modelfile syntax.

Compatible: OpenAI SDK, Anthropic SDK, curl, any HTTP client

Daemon & CLI

Quick Start

The same task in Node.js, Python, Rust, and the CLI:

Node.js

npm install mullama

const { Model, Context } = require('mullama');

async function main() {
  const model = await Model.load('llama3.2-1b.gguf', { gpuLayers: 32 });
  const ctx = new Context(model, { contextSize: 4096 });

  const response = await ctx.generate('Explain quantum computing in one sentence.');
  console.log(response);
}

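// Surface load or generation failures instead of leaving an unhandled rejection.
main().catch((err) => {
  console.error(err);
  process.exit(1);
});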

Python

pip install mullama

from mullama import Model, Context

model = Model.load('llama3.2-1b.gguf', n_gpu_layers=32)
ctx = Context(model, n_ctx=4096)

response = ctx.generate('Explain quantum computing in one sentence.')
print(response)

Rust

# Cargo.toml
[dependencies]
mullama = { version = "0.3", features = ["async", "streaming"] }

use mullama::{Model, Context, ContextParams};
use std::sync::Arc;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let model = Arc::new(Model::load("llama3.2-1b.gguf")?);
    let params = ContextParams { n_ctx: 4096, ..Default::default() };
    let mut ctx = Context::new(model, params)?;

    let response = ctx.generate("Explain quantum computing in one sentence.", 256)?;
    println!("{}", response);
    Ok(())
}

CLI

# Run a one-off prompt in a single command
mullama run llama3.2:1b "Explain quantum computing in one sentence."

# Or start a server with an OpenAI-compatible API
mullama serve --model llama3.2:1b

# Query it with curl, or point any OpenAI SDK at the same base URL
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:1b", "messages": [{"role": "user", "content": "Hello!"}]}'
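The curl call above can also be made from a script without any SDK. The following is a minimal sketch using only the Python standard library. It assumes the server is listening on localhost:8080 as in the curl example and that the response follows the OpenAI chat completions shape; the helper names `build_chat_request` and `chat` are illustrative and not part of Mullama.

```python
import json
import urllib.request

# Assumption: `mullama serve` is listening here, as in the curl example.
BASE_URL = "http://localhost:8080/v1"


def build_chat_request(prompt, model="llama3.2:1b", base_url=BASE_URL):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the text of the first choice.

    Assumes an OpenAI-style response body: choices[0].message.content.
    """
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (requires a running server):
# print(chat("Hello!"))
```

Keeping request construction separate from sending makes the payload easy to inspect or log before it reaches the server.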

Why Mullama?

Native Performance

Direct function calls instead of HTTP round trips. Microseconds of overhead instead of milliseconds. Your LLM runs in-process, not in a separate server.

Multi-Language Bindings

First-class support for Node.js, Python, Go, PHP, and C/C++, alongside the native Rust API. All bindings share the same high-performance Rust core via a unified FFI layer.

GPU Accelerated

NVIDIA CUDA, Apple Metal, AMD ROCm, OpenCL, Vulkan, SYCL, and RPC backends. Automatic detection and configuration. Full GPU offload or partial layer offloading.

Production Ready

Memory-safe Rust core. Comprehensive error handling. Prometheus metrics. Graceful shutdown. Session persistence. No unsafe code in the public API.

Multimodal

Process text, images, and audio in a unified pipeline. Real-time audio capture with voice activity detection. Vision-language model support with CLIP and DINOv2.

API Compatible

OpenAI- and Anthropic-compatible API endpoints. Use your existing SDKs and tools without changes. Drop-in replacement for cloud APIs in development.

How It Compares

Capability	Mullama	Ollama	Raw llama.cpp
Native language bindings	Node.js, Python, Go, PHP, Rust, C/C++	--	C/C++ only
Embed in your application	Yes	No (HTTP only)	Yes (C API)
OpenAI-compatible API	Yes	Yes	--
Anthropic-compatible API	Yes	--	--
Streaming generation	Native + SSE	SSE only	Callback
Async/await support	Native	--	--
Real-time audio input	Yes (VAD)	--	--
Web framework integration	Axum built-in	--	--
WebSocket server	Built-in	--	--
Grammar constraints	GBNF	--	GBNF
JSON structured output	Schema-based	JSON mode	--
Embeddings	Multi-strategy	Basic	Basic
LoRA adapters	Hot-swap	Modelfile	CLI flag
ColBERT late interaction	Yes	--	--
SIMD-accelerated sampling	AVX2/512, NEON	--	Inference only
Batch parallel processing	Rayon	--	--
Memory-mapped models	Yes	Yes	Yes
Web UI	Built-in	--	--
TUI chat interface	Built-in	--	--
Model aliases	40+ pre-configured	Yes	--
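The "ColBERT late interaction" row refers to a scoring scheme in which every query token embedding is matched against its most similar document token embedding, and those per-token maxima are summed (often called MaxSim). The sketch below shows that computation in plain Python on made-up two-dimensional vectors. It illustrates the general technique only; the function names are hypothetical and it says nothing about how Mullama implements or exposes it.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def maxsim_score(query_vecs, doc_vecs):
    """ColBERT-style late interaction score.

    For each query token embedding, take the maximum cosine similarity
    over all document token embeddings, then sum those maxima.
    """
    if not query_vecs or not doc_vecs:
        return 0.0
    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    return sum(
        max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
        for qv in q
    )


# Toy embeddings, invented for illustration only.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0]]  # has a close match for each query token
doc_b = [[1.0, 1.0]]              # one token, halfway between both

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a > score_b)  # prints True
```

Because each query token is scored independently, a document only ranks highly when it covers all parts of the query, which is the property that distinguishes late interaction from comparing a single pooled vector per text.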

Full comparison with Ollama

Built for Real Applications

Chatbot

Build conversational AI with streaming responses, multi-turn context, and chat templates for any model format.

Tutorial

RAG Pipeline

Semantic search with embeddings, ColBERT late interaction scoring, and grammar-constrained generation for structured answers.

Tutorial

Voice Assistant

Real-time audio capture with voice activity detection, speech-to-text processing, and streaming text generation.

Tutorial

API Server

Production API server with OpenAI compatibility, streaming SSE, rate limiting, and Prometheus metrics.

Tutorial
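For the API Server use case, a client consuming streamed SSE output has to reassemble text from individual events. The sketch below parses an OpenAI-style stream, assuming each event is a `data: <json>` line carrying `choices[].delta.content` and that the stream ends with `data: [DONE]`. This page does not confirm that Mullama emits exactly this chunk format, so treat the shape as an assumption; `iter_stream_text` is an illustrative helper name.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from an OpenAI-style SSE stream.

    `lines` is any iterable of decoded lines, for example an HTTP
    response body split on newlines.
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comment lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample events, not captured from a real server.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo!"}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))  # prints "Hello!"
```

Writing the parser as a generator lets the caller display fragments as they arrive instead of waiting for the full response.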

Architecture

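Mullama is built in three layers (Foundation, Core API, and Integration), each providing progressively higher-level abstractions on top of llama.cpp. The language bindings sit above the Integration layer: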

┌──────────────────────────────────────────────────────────┐
│                     Your Application                     │
├──────────┬──────────┬─────────┬──────────┬───────────────┤
│ Node.js  │  Python  │   Go    │   PHP    │     C/C++     │
│  (NAPI)  │  (PyO3)  │  (cgo)  │  (FFI)   │   (Header)    │
├──────────┴──────────┴─────────┴──────────┴───────────────┤
│                                                          │
│   Integration Layer                                      │
│   Async | Streaming | Web | WebSocket | Multimodal       │
│                                                          │
├──────────────────────────────────────────────────────────┤
│                                                          │
│   Core API Layer                                         │
│   Model | Context | Sampler | Batch | Embedding          │
│                                                          │
├──────────────────────────────────────────────────────────┤
│                                                          │
│   Foundation Layer                                       │
│   FFI Bindings | Memory Management | Platform Detection  │
│                                                          │
├──────────────────────────────────────────────────────────┤
│                     llama.cpp (C++)                      │
├──────────────────────────────────────────────────────────┤
│       CUDA | Metal | ROCm | OpenCL | Vulkan | SYCL | RPC │
└──────────────────────────────────────────────────────────┘

By the Numbers

41,000+	Lines of Rust integration code
6	Native language bindings
7	GPU acceleration backends (CUDA, Metal, ROCm, OpenCL, Vulkan, SYCL, RPC)
40+	Pre-configured model aliases
7	Hardware presets with auto-detection
2	API compatibility layers (OpenAI + Anthropic)
10+	Sampling strategies with SIMD acceleration

Get Started

Install

Get Mullama running on your platform in under 2 minutes.

Installation

First Project

Build a working chatbot from scratch in 15 minutes.

Tutorial

Why Mullama?

See how Mullama compares to Ollama and when to use each.

Comparison

API Reference

Complete reference for all types, methods, and configuration options.

API Docs