Drop-in Ollama replacement. Native language bindings. Production-ready.

Mullama is a local LLM server and library that works just like Ollama -- same CLI commands, same Modelfile syntax -- but with native bindings for Python, Node.js, Go, PHP, Rust, and C/C++.

curl -fsSL https://mullama.cognisoc.com/install.sh | sh
mullama run llama3.2:1b "Hello!"

Two Ways to Use Mullama

  • Use as a Library


    Embed LLM inference directly in your application with native bindings. No HTTP overhead, no separate process -- just import and generate.

    Supported: Node.js, Python, Rust, Go, PHP, C/C++

    Library Guide

  • Use as a Server


    Run mullama as a daemon with OpenAI- and Anthropic-compatible APIs, a web UI, and multi-model management. It works as a drop-in replacement for Ollama.

    Compatible: OpenAI SDK, Anthropic SDK, curl, any HTTP client

    Daemon & CLI


Quick Start

Node.js

npm install mullama

const { Model, Context } = require('mullama');

async function main() {
  // Load a GGUF model file, offloading 32 layers to the GPU
  const model = await Model.load('llama3.2-1b.gguf', { gpuLayers: 32 });
  const ctx = new Context(model, { contextSize: 4096 });

  const response = await ctx.generate('Explain quantum computing in one sentence.');
  console.log(response);
}

// Report load or generation failures instead of leaving the rejection unhandled
main().catch((err) => {
  console.error(err);
  process.exit(1);
});

Python

pip install mullama

from mullama import Model, Context

# Load a GGUF model file, offloading 32 layers to the GPU
model = Model.load('llama3.2-1b.gguf', n_gpu_layers=32)
ctx = Context(model, n_ctx=4096)

response = ctx.generate('Explain quantum computing in one sentence.')
print(response)

Rust

# Cargo.toml
[dependencies]
mullama = { version = "0.3", features = ["async", "streaming"] }

use mullama::{Model, Context, ContextParams};
use std::sync::Arc;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let model = Arc::new(Model::load("llama3.2-1b.gguf")?);
    let params = ContextParams { n_ctx: 4096, ..Default::default() };
    let mut ctx = Context::new(model, params)?;

    let response = ctx.generate("Explain quantum computing in one sentence.", 256)?;
    println!("{}", response);
    Ok(())
}

CLI

# Install and run in one command
mullama run llama3.2:1b "Explain quantum computing in one sentence."

# Or start a server with OpenAI-compatible API
mullama serve --model llama3.2:1b

# Query the endpoint with curl (any OpenAI SDK works the same way)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:1b", "messages": [{"role": "user", "content": "Hello!"}]}'
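The same chat completion request can be made from Python using only the standard library. This is a minimal sketch: it assumes a server started with `mullama serve` is listening on localhost:8080, and that the response follows the OpenAI chat completion shape. The helper names `build_chat_request` and `chat` are illustrative and not part of Mullama.

```python
import json
import urllib.request

# Assumption: host, port, and path match the curl example in Quick Start.
BASE_URL = "http://localhost:8080/v1"


def build_chat_request(prompt, model="llama3.2:1b", base_url=BASE_URL):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-style body: {"choices": [{"message": {"content": ...}}]}.
    """
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello!"))` should print the model's reply. Keeping request construction separate from sending makes the payload easy to inspect without a live server.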

Why Mullama?

  • Native Performance


    Direct function calls instead of HTTP roundtrips. Microseconds of overhead instead of milliseconds. Your LLM runs in-process, not in a separate server.

  • Multi-Language Bindings


    First-class support for Node.js, Python, Go, PHP, and C/C++. All bindings share the same high-performance Rust core via a unified FFI layer.

  • GPU Accelerated


    NVIDIA CUDA, Apple Metal, AMD ROCm, OpenCL, Vulkan, SYCL, and RPC. Automatic detection and configuration. Full GPU offload or partial layer offloading.

  • Production Ready


    Memory-safe Rust core. Comprehensive error handling. Prometheus metrics. Graceful shutdown. Session persistence. Zero unsafe in the public API.

  • Multimodal


    Process text, images, and audio in a unified pipeline. Real-time audio capture with voice activity detection. Vision-language model support with CLIP and DINOv2.

  • API Compatible


    OpenAI and Anthropic-compatible API endpoints. Use your existing SDKs and tools without changes. Drop-in replacement for cloud APIs in development.


How It Compares

| Capability | Mullama | Ollama | Raw llama.cpp |
|---|---|---|---|
| Native language bindings | Node.js, Python, Go, PHP, C | -- | C/C++ only |
| Embed in your application | Yes | No (HTTP only) | Yes (C API) |
| OpenAI-compatible API | Yes | Yes | -- |
| Anthropic-compatible API | Yes | -- | -- |
| Streaming generation | Native + SSE | SSE only | Callback |
| Async/await support | Native | -- | -- |
| Real-time audio input | Yes (VAD) | -- | -- |
| Web framework integration | Axum built-in | -- | -- |
| WebSocket server | Built-in | -- | -- |
| Grammar constraints | GBNF | -- | GBNF |
| JSON structured output | Schema-based | JSON mode | -- |
| Embeddings | Multi-strategy | Basic | Basic |
| LoRA adapters | Hot-swap | Modelfile | CLI flag |
| ColBERT late interaction | Yes | -- | -- |
| SIMD-accelerated sampling | AVX2/512, NEON | -- | Inference only |
| Batch parallel processing | Rayon | -- | -- |
| Memory-mapped models | Yes | Yes | Yes |
| Web UI | Built-in | -- | -- |
| TUI chat interface | Built-in | -- | -- |
| Model aliases | 40+ pre-configured | Yes | -- |

Full comparison with Ollama
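The "Streaming generation" row lists SSE for the server. As a rough illustration of what consuming such a stream involves, the sketch below parses server-sent event lines in the OpenAI streaming format (`data: {...}` chunks ending with `data: [DONE]`). It assumes Mullama's endpoint emits that same format, which this page does not spell out. The function name `iter_sse_content` and the sample lines are made up for the example.

```python
import json


def iter_sse_content(lines):
    """Yield text deltas from an OpenAI-style SSE stream.

    `lines` is any iterable of decoded lines, for example an HTTP
    response body read line by line.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # sentinel that ends the stream
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample in the OpenAI streaming format (not captured from Mullama).
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo!"}}]}',
    "",
    "data: [DONE]",
]
reply = "".join(iter_sse_content(sample))
print(reply)  # prints "Hello!"
```

Because the parser is a generator, each fragment can be shown to the user as soon as it arrives rather than after the whole reply is complete.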


Built for Real Applications

  • Chatbot

    Build conversational AI with streaming responses, multi-turn context, and chat templates for any model format.

    Tutorial

  • RAG Pipeline

    Semantic search with embeddings, ColBERT late interaction scoring, and grammar-constrained generation for structured answers.

    Tutorial

  • Voice Assistant

    Real-time audio capture with voice activity detection, speech-to-text processing, and streaming text generation.

    Tutorial

  • API Server

    Production API server with OpenAI compatibility, streaming SSE, rate limiting, and Prometheus metrics.

    Tutorial
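The ColBERT late interaction scoring mentioned under RAG Pipeline can be summarized in a few lines. In the standard MaxSim formulation, each query token embedding is matched with its most similar document token embedding, and those maxima are summed. The sketch below is a generic pure-Python illustration, not Mullama's implementation. It assumes embeddings are already normalized, so a dot product acts as cosine similarity, and the toy vectors are invented.

```python
def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for each query token embedding, take its best
    dot-product match among the document token embeddings, then sum."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy 2-D token embeddings, invented for illustration.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # has a close match for both query tokens
doc_b = [[1.0, 0.0], [0.5, 0.5]]              # has a close match for the first token only

# Rank documents by descending MaxSim score.
ranked = sorted(
    {"doc_a": doc_a, "doc_b": doc_b}.items(),
    key=lambda item: maxsim_score(query, item[1]),
    reverse=True,
)
print([name for name, _ in ranked])  # ['doc_a', 'doc_b']
```

Unlike single-vector embedding search, this keeps one vector per token, so a document can score well by matching different query terms in different places.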


Architecture

Mullama is built in three layers, each providing progressively higher-level abstractions:

┌──────────────────────────────────────────────────────────┐
│                     Your Application                     │
├──────────┬──────────┬─────────┬──────────┬───────────────┤
│ Node.js  │  Python  │   Go    │   PHP    │     C/C++     │
│  (NAPI)  │  (PyO3)  │  (cgo)  │  (FFI)   │   (Header)    │
├──────────┴──────────┴─────────┴──────────┴───────────────┤
│                                                          │
│   Integration Layer                                      │
│   Async | Streaming | Web | WebSocket | Multimodal       │
│                                                          │
├──────────────────────────────────────────────────────────┤
│                                                          │
│   Core API Layer                                         │
│   Model | Context | Sampler | Batch | Embedding          │
│                                                          │
├──────────────────────────────────────────────────────────┤
│                                                          │
│   Foundation Layer                                       │
│   FFI Bindings | Memory Management | Platform Detection  │
│                                                          │
├──────────────────────────────────────────────────────────┤
│                     llama.cpp (C++)                      │
├──────────────────────────────────────────────────────────┤
│    CUDA | Metal | ROCm | OpenCL | Vulkan | SYCL | RPC    │
└──────────────────────────────────────────────────────────┘

By the Numbers

41,000+ Lines of Rust integration code
6 Native language bindings
7 GPU acceleration backends (CUDA, Metal, ROCm, OpenCL, Vulkan, SYCL, RPC)
40+ Pre-configured model aliases
7 Hardware presets with auto-detection
2 API compatibility layers (OpenAI + Anthropic)
10+ Sampling strategies with SIMD acceleration

Get Started

  • Install

    Get mullama running on your platform in under 2 minutes.

    Installation

  • First Project

    Build a working chatbot from scratch in 15 minutes.

    Tutorial

  • Why Mullama?

    See how mullama compares to Ollama and when to use each.

    Comparison

  • API Reference

    Complete reference for all types, methods, and configuration options.

    API Docs