Getting Started¶

Get up and running with Mullama in under 5 minutes. Choose the path that matches your use case.

Choose Your Path¶

Library Developer

Use Mullama from Node.js, Python, Rust, Go, PHP, or C/C++ for native LLM inference in your applications.

1. Install the package for your language
2. Download a model (mullama pull llama3.2:1b)
3. Load and generate in 5 lines of code

Installation
CLI / Server User

Use Mullama as a local AI server with an OpenAI-compatible API, model management, and TUI chat.

1. Install the daemon (cargo install mullama --features daemon)
2. Pull a model (mullama pull llama3.2:1b)
3. Start serving (mullama serve --port 8080)

Daemon & CLI
Migrating from Ollama

Drop-in replacement: same CLI commands, same Modelfile format, same GGUF models. Zero changes required.

What you gain:
Native bindings (no HTTP overhead)
Anthropic API + Web UI + TUI
Embed in your app (6 languages)

Migration Guide

Quick Start¶

Node.jsPythonRustGoPHPC/C++CLI

npm install mullama

index.js

const { Model } = require('mullama');

async function main() {
    const model = new Model('models/llama3.2-1b.gguf', { contextSize: 2048 });
    const response = await model.generate('Explain quantum computing in one sentence.', {
        maxTokens: 100
    });
    console.log(response);
}

main();

pip install mullama

main.py

from mullama import Model

model = Model("models/llama3.2-1b.gguf", context_size=2048)
response = model.generate("Explain quantum computing in one sentence.", max_tokens=100)
print(response)

Cargo.toml

[dependencies]
mullama = { version = "0.3.0", features = ["async", "streaming"] }
tokio = { version = "1", features = ["full"] }

src/main.rs

use mullama::prelude::*;

#[tokio::main]
async fn main() -> Result<(), MullamaError> {
    let model = ModelBuilder::new()
        .path("models/llama3.2-1b.gguf")
        .context_size(2048)
        .build().await?;

    let response = model.generate("Explain quantum computing in one sentence.", 100).await?;
    println!("{}", response);
    Ok(())
}

go get github.com/cognisoc/mullama

main.go

package main

import (
    "fmt"
    "github.com/cognisoc/mullama"
)

func main() {
    model, _ := mullama.NewModel("models/llama3.2-1b.gguf", mullama.WithContextSize(2048))
    defer model.Close()

    response, _ := model.Generate("Explain quantum computing in one sentence.", mullama.MaxTokens(100))
    fmt.Println(response)
}

composer require mullama/mullama

main.php

<?php
require_once 'vendor/autoload.php';
use Mullama\Model;

$model = new Model('models/llama3.2-1b.gguf', contextSize: 2048);
$response = $model->generate('Explain quantum computing in one sentence.', maxTokens: 100);
echo $response . "\n";

main.c

#include "mullama.h"

int main() {
    mullama_config_t config = mullama_config_default();
    config.context_size = 2048;

    mullama_model_t* model = mullama_model_load("models/llama3.2-1b.gguf", &config);
    char* response = mullama_generate(model, "Explain quantum computing in one sentence.", 100);
    printf("%s\n", response);

    mullama_free(response);
    mullama_model_free(model);
    return 0;
}

# Install the daemon
cargo install mullama --features daemon

# Pull and run
mullama pull llama3.2:1b
mullama run llama3.2:1b "Explain quantum computing in one sentence."

Download a Model¶

You need a GGUF model file to use Mullama. There are several ways to obtain one.

Using the CLI (Recommended)¶

# Pull a model by alias (auto-downloads from registry)
mullama pull llama3.2:1b

# List downloaded models
mullama list

# Show model details
mullama show llama3.2:1b --modelfile

From HuggingFace¶

Download GGUF models directly from HuggingFace repositories:

# Using wget
wget https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
    -O models/llama3.2-1b.gguf

# Using the HuggingFace CLI
pip install huggingface-hub
huggingface-cli download bartowski/Llama-3.2-1B-Instruct-GGUF \
    Llama-3.2-1B-Instruct-Q4_K_M.gguf \
    --local-dir models/

Direct Download¶

For air-gapped environments, download the .gguf file directly and place it in your models directory:

mkdir -p models/
# Copy or download your .gguf file into this directory
ls models/*.gguf

Model Aliases¶

The Mullama daemon supports convenient aliases that map to specific model files and quantizations:

Alias	Model	Quantization	Size	Best For
`llama3.2:1b`	Llama 3.2 1B Instruct	Q4_K_M	~1.3 GB	Fast prototyping, edge devices
`llama3.2:3b`	Llama 3.2 3B Instruct	Q4_K_M	~2.0 GB	Balanced speed/quality
`qwen2.5:7b`	Qwen 2.5 7B Instruct	Q4_K_M	~4.4 GB	Multilingual, coding
`deepseek-r1:7b`	DeepSeek R1 7B	Q4_K_M	~4.5 GB	Reasoning, chain-of-thought
`phi-4:14b`	Phi-4 14B	Q4_K_M	~8.4 GB	Strong reasoning, compact
`mistral:7b`	Mistral 7B Instruct	Q4_K_M	~4.1 GB	General purpose
`codellama:7b`	Code Llama 7B	Q4_K_M	~4.1 GB	Code generation
`llama3.1:8b`	Llama 3.1 8B Instruct	Q4_K_M	~4.9 GB	Long context (128K)
`gemma2:9b`	Gemma 2 9B	Q4_K_M	~5.5 GB	Google's latest
`llama3.3:70b`	Llama 3.3 70B Instruct	Q4_K_M	~40 GB	Maximum quality

Quantization Levels

GGUF models come in various quantization levels that trade quality for size:

Q4_K_M -- Best balance of quality and size (recommended for most users)
Q5_K_M -- Higher quality, ~20% larger than Q4
Q6_K -- Near-original quality, good for critical applications
Q8_0 -- Highest practical quality, 2x size of Q4
F16 -- Full precision, largest files (usually unnecessary)

Custom Models

You can create custom model configurations using Modelfiles:

mullama create my-assistant -f Modelfile
mullama run my-assistant "Hello!"

See the Modelfile Format documentation for details.

Prerequisites¶

System Requirements

RAM: 8 GB minimum (16 GB recommended for 7B models)
Disk: 2-50 GB depending on model size
OS: Linux (x86_64, aarch64), macOS (Apple Silicon, Intel), Windows (x86_64)
Rust: 1.75+ (for building from source or using the Rust library)
CMake: 3.12+ (for building llama.cpp)
C++ Compiler: GCC 9+, Clang 10+, or MSVC 2019+ (C++17 support required)

Next Steps¶

Installation

Detailed installation for all languages, feature flags, and build options.
Platform Setup

Install OS-specific dependencies for audio, image, and video processing.
:material-gpu: GPU Acceleration

Configure CUDA, Metal, ROCm, or OpenCL for faster inference.
Your First Project

Build a complete chatbot from scratch with streaming and multi-turn conversation.
Library Guide

Deep dive into models, contexts, sampling, streaming, and more.
Ollama Migration

Switch from Ollama with zero code changes using API compatibility.