Python Bindings¶

High-performance Python bindings for the Mullama LLM library, built with PyO3 for native-speed inference with a Pythonic API. Embeddings are returned as NumPy arrays for seamless integration with the scientific Python ecosystem.

Installation¶

pip install mullama

Pre-built wheels are available for:

Platform	Architecture	Python Versions
Linux	x64 (manylinux2014)	3.8 - 3.12
Linux	arm64	3.8 - 3.12
macOS	x64 (Intel)	3.8 - 3.12
macOS	arm64 (Apple Silicon)	3.8 - 3.12
Windows	x64	3.8 - 3.12

Building from Source¶

# Install maturin
pip install maturin

# Build and install in development mode
cd bindings/python
maturin develop --release

# Or build a wheel for distribution
maturin build --release
pip install target/wheels/mullama-*.whl

Build Requirements

Python >= 3.8
Rust toolchain (1.75+)
NumPy (for embedding operations)
System dependencies (see Platform Setup)
For GPU support, set LLAMA_CUDA=1, LLAMA_METAL=1, etc. before building

Type Stubs¶

The mullama package ships with inline type annotations (PEP 561 compatible). Type checkers like mypy, pyright, and Pylance will recognize the types automatically without any additional packages.

# mypy and pyright work out of the box
from mullama import Model, Context, SamplerParams
model: Model = Model.load("./model.gguf")

Quick Start¶

from mullama import Model, Context, SamplerParams

# Load a model with GPU acceleration
model = Model.load("./llama-3.2-1b.Q4_K_M.gguf", n_gpu_layers=-1)

# Create an inference context
ctx = Context(model, n_ctx=2048)

# Generate text
text = ctx.generate("Once upon a time", max_tokens=100)
print(text)

API Reference¶

Model¶

The Model class handles model loading and provides access to tokenization, metadata, and model properties.

`Model.load(path, ...)`¶

Load a model from a GGUF file.

@staticmethod
def load(
    path: str,
    n_gpu_layers: int = 0,
    use_mmap: bool = True,
    use_mlock: bool = False,
    vocab_only: bool = False,
) -> Model

Parameter	Type	Default	Description
`path`	`str`	(required)	Path to the GGUF model file
`n_gpu_layers`	`int`	`0`	Layers to offload to GPU (0 = CPU, -1 = all)
`use_mmap`	`bool`	`True`	Use memory mapping for loading
`use_mlock`	`bool`	`False`	Lock model in memory (prevents paging)
`vocab_only`	`bool`	`False`	Only load vocabulary (for tokenization)

Returns: Model instance

Raises: RuntimeError if loading fails

# CPU-only
model = Model.load("./model.gguf")

# Full GPU offloading
model = Model.load("./model.gguf", n_gpu_layers=-1)

# Partial GPU offloading (first 20 layers)
model = Model.load("./model.gguf", n_gpu_layers=20)

# Vocabulary only (for tokenization tasks)
model = Model.load("./model.gguf", vocab_only=True)

`model.tokenize(text, add_bos=True, special=False)`¶

Convert text to token IDs.

def tokenize(self, text: str, add_bos: bool = True, special: bool = False) -> list[int]

Parameter	Type	Default	Description
`text`	`str`	(required)	Text to tokenize
`add_bos`	`bool`	`True`	Add beginning-of-sequence token
`special`	`bool`	`False`	Parse special tokens

tokens = model.tokenize("Hello, world!")
print(tokens)  # [1, 10994, 29892, 3186, 29991]
print(f"Token count: {len(tokens)}")

`model.detokenize(tokens, remove_special=False, unparse_special=False)`¶

Convert token IDs back to text.

def detokenize(
    self,
    tokens: list[int],
    remove_special: bool = False,
    unparse_special: bool = False,
) -> str

text = model.detokenize([1, 10994, 29892, 3186, 29991])
print(text)  # "Hello, world!"

`model.apply_chat_template(messages, add_generation_prompt=True)`¶

Format chat messages using the model's built-in template (e.g., ChatML, Llama-3 format).

def apply_chat_template(
    self,
    messages: list[tuple[str, str]],
    add_generation_prompt: bool = True,
) -> str

Parameter	Type	Default	Description
`messages`	`list[tuple[str, str]]`	(required)	List of (role, content) tuples
`add_generation_prompt`	`bool`	`True`	Add generation prompt at the end

messages = [
    ("system", "You are a helpful assistant."),
    ("user", "What is Python?"),
]
prompt = model.apply_chat_template(messages)
response = ctx.generate(prompt, max_tokens=300)

`model.metadata()`¶

Get all model metadata as a dictionary.

def metadata(self) -> dict[str, str]

meta = model.metadata()
for key, value in meta.items():
    print(f"{key}: {value}")

`model.token_is_eog(token)`¶

Check if a token is an end-of-generation token.

def token_is_eog(self, token: int) -> bool

Properties¶

Property	Type	Description
`n_ctx_train`	`int`	Training context size
`n_embd`	`int`	Embedding dimension
`n_vocab`	`int`	Vocabulary size
`n_layer`	`int`	Number of layers
`n_head`	`int`	Number of attention heads
`token_bos`	`int`	BOS token ID
`token_eos`	`int`	EOS token ID
`size`	`int`	Model size in bytes
`n_params`	`int`	Number of parameters
`description`	`str`	Model description
`architecture`	`str \\| None`	Architecture name (e.g., "llama")
`name`	`str \\| None`	Model name from metadata

model = Model.load("model.gguf")
print(f"Model: {model.name}")
print(f"Architecture: {model.architecture}")
print(f"Parameters: {model.n_params:,}")
print(f"Layers: {model.n_layer}")
print(f"Embedding dim: {model.n_embd}")
print(f"Size: {model.size / 1e9:.2f} GB")
print(repr(model))
# Model(name='Llama-3.2-1B', arch='llama', params=1234567890, size=1234MB)

Context¶

The Context class provides the inference context for text generation.

`Context(model, ...)`¶

Create a new inference context.

def __init__(
    self,
    model: Model,
    n_ctx: int = 0,
    n_batch: int = 2048,
    n_threads: int = 0,
    embeddings: bool = False,
) -> None

Parameter	Type	Default	Description
`model`	`Model`	(required)	Model to use
`n_ctx`	`int`	`0`	Context size (0 = model default)
`n_batch`	`int`	`2048`	Batch size for prompt processing
`n_threads`	`int`	`0`	Thread count (0 = auto-detect)
`embeddings`	`bool`	`False`	Enable embeddings mode

# Default context
ctx = Context(model)

# Custom context with 4096 token window
ctx = Context(model, n_ctx=4096, n_batch=512)

# Explicitly limit thread count
ctx = Context(model, n_ctx=2048, n_threads=4)

`ctx.generate(prompt, max_tokens=100, params=None)`¶

Generate text from a prompt. The prompt can be a string or a list of token IDs.

def generate(
    self,
    prompt: str | list[int],
    max_tokens: int = 100,
    params: SamplerParams | None = None,
) -> str

Parameter	Type	Default	Description
`prompt`	`str \\| list[int]`	(required)	Text prompt or token IDs
`max_tokens`	`int`	`100`	Maximum tokens to generate
`params`	`SamplerParams \\| None`	`None`	Sampling parameters (None = defaults)

# From string
text = ctx.generate("Hello, AI!", max_tokens=50)

# From token IDs
tokens = model.tokenize("Hello, AI!")
text = ctx.generate(tokens, max_tokens=50)

# With custom sampling
params = SamplerParams(temperature=0.7, top_p=0.9)
text = ctx.generate("Write a poem:", max_tokens=200, params=params)

`ctx.generate_stream(prompt, max_tokens=100, params=None)`¶

Generate text and return token pieces as a list. Useful for streaming-style output.

def generate_stream(
    self,
    prompt: str | list[int],
    max_tokens: int = 100,
    params: SamplerParams | None = None,
) -> list[str]

pieces = ctx.generate_stream("Once upon a time", max_tokens=200)
for piece in pieces:
    print(piece, end="", flush=True)
print()

`ctx.decode(tokens)`¶

Process tokens through the model (for advanced use with embeddings).

def decode(self, tokens: list[int]) -> None

`ctx.clear_cache()`¶

Clear the KV cache. Call this when starting a new conversation or switching between unrelated prompts.

def clear_cache(self) -> None

`ctx.get_embeddings()`¶

Get embeddings for the last decoded tokens (requires embeddings=True in constructor).

def get_embeddings(self) -> numpy.ndarray | None

Returns: NumPy array of float32 values, or None if embeddings are not available.

Properties¶

Property	Type	Description
`n_ctx`	`int`	Current context size
`n_batch`	`int`	Current batch size

SamplerParams¶

The SamplerParams class configures text generation sampling behavior.

`SamplerParams(...)`¶

def __init__(
    self,
    temperature: float = 0.8,
    top_k: int = 40,
    top_p: float = 0.95,
    min_p: float = 0.05,
    typical_p: float = 1.0,
    penalty_repeat: float = 1.1,
    penalty_freq: float = 0.0,
    penalty_present: float = 0.0,
    penalty_last_n: int = 64,
    seed: int = 0,
) -> None

Parameter	Type	Default	Description
`temperature`	`float`	`0.8`	Randomness (0.0 = deterministic)
`top_k`	`int`	`40`	Top-k sampling (0 = disabled)
`top_p`	`float`	`0.95`	Nucleus sampling (1.0 = disabled)
`min_p`	`float`	`0.05`	Min-p sampling (0.0 = disabled)
`typical_p`	`float`	`1.0`	Typical sampling (1.0 = disabled)
`penalty_repeat`	`float`	`1.1`	Repeat penalty (1.0 = disabled)
`penalty_freq`	`float`	`0.0`	Frequency penalty
`penalty_present`	`float`	`0.0`	Presence penalty
`penalty_last_n`	`int`	`64`	Token window for penalties
`seed`	`int`	`0`	Random seed (0 = random)

All properties are readable and writable:

params = SamplerParams(temperature=0.7)
params.top_p = 0.9
params.top_k = 50
print(params.temperature)  # 0.7

Preset Class Methods¶

@staticmethod
def greedy() -> SamplerParams

Create deterministic sampler parameters (temperature=0.0, top_k=1).

@staticmethod
def creative() -> SamplerParams

Create high-randomness sampler parameters (temperature=1.2, top_k=100).

@staticmethod
def precise() -> SamplerParams

Create low-randomness sampler parameters (temperature=0.3, top_k=20).

# Usage
text = ctx.generate("2+2=", max_tokens=10, params=SamplerParams.greedy())
poem = ctx.generate("Write a poem:", max_tokens=200, params=SamplerParams.creative())
summary = ctx.generate("Summarize:", max_tokens=150, params=SamplerParams.precise())

EmbeddingGenerator¶

The EmbeddingGenerator class creates text embeddings using a model. Embeddings are returned as NumPy arrays for seamless integration with scientific Python tools.

`EmbeddingGenerator(model, n_ctx=512, normalize=True)`¶

def __init__(
    self,
    model: Model,
    n_ctx: int = 512,
    normalize: bool = True,
) -> None

Parameter	Type	Default	Description
`model`	`Model`	(required)	Model for embeddings
`n_ctx`	`int`	`512`	Context size for embedding computation
`normalize`	`bool`	`True`	Normalize embeddings to unit length

`gen.embed(text)`¶

Generate an embedding vector for text.

def embed(self, text: str) -> numpy.ndarray

Returns: NumPy ndarray of float32 values (shape: (n_embd,))

import numpy as np

embedding = gen.embed("Hello, world!")
print(f"Shape: {embedding.shape}")       # (4096,)
print(f"Dtype: {embedding.dtype}")       # float32
print(f"Norm: {np.linalg.norm(embedding):.4f}")  # ~1.0 if normalized

`gen.embed_batch(texts)`¶

Generate embeddings for multiple texts efficiently.

def embed_batch(self, texts: list[str]) -> list[numpy.ndarray]

Returns: List of NumPy ndarray embeddings

texts = ["Hello", "World", "Mullama"]
embeddings = gen.embed_batch(texts)
print(f"Count: {len(embeddings)}")       # 3
print(f"Dims: {embeddings[0].shape}")    # (4096,)

Properties¶

Property	Type	Description
`n_embd`	`int`	Embedding dimension

Utility Functions¶

`cosine_similarity(a, b)`¶

Compute cosine similarity between two NumPy embedding vectors.

def cosine_similarity(a: numpy.ndarray, b: numpy.ndarray) -> float

Raises: ValueError if vectors have different lengths.

from mullama import cosine_similarity

sim = cosine_similarity(emb1, emb2)
print(f"Similarity: {sim:.4f}")  # Value between -1 and 1

`backend_init()`¶

Initialize the backend (called automatically on first model load).

def backend_init() -> None

`backend_free()`¶

Free backend resources.

def backend_free() -> None

`supports_gpu_offload()`¶

Check if GPU offloading is available.

def supports_gpu_offload() -> bool

`system_info()`¶

Get system information string (CPU features, GPU support).

def system_info() -> str

`max_devices()`¶

Get maximum number of compute devices.

def max_devices() -> int

Examples¶

Streaming Output¶

from mullama import Model, Context, SamplerParams

model = Model.load("./model.gguf", n_gpu_layers=-1)
ctx = Context(model, n_ctx=4096)

# Stream tokens to console
pieces = ctx.generate_stream(
    "Write a haiku about programming:\n\n",
    max_tokens=50,
    params=SamplerParams.creative(),
)

for piece in pieces:
    print(piece, end="", flush=True)
print()

Embeddings and Semantic Search¶

import numpy as np
from mullama import Model, EmbeddingGenerator, cosine_similarity

model = Model.load("./nomic-embed-text-v1.5.Q4_K_M.gguf")
gen = EmbeddingGenerator(model)

# Build a document index
documents = [
    "Python is a high-level programming language",
    "Machine learning uses statistical methods to learn from data",
    "The weather forecast predicts rain tomorrow",
    "Neural networks are inspired by biological neurons",
    "JavaScript runs in web browsers and on servers",
    "Docker containers package applications with their dependencies",
]

doc_embeddings = gen.embed_batch(documents)

# Convert to NumPy matrix for efficient operations
embedding_matrix = np.array(doc_embeddings)
print(f"Embedding matrix shape: {embedding_matrix.shape}")  # (6, 4096)

# Query
query_emb = gen.embed("What programming languages are popular?")

# Rank by similarity using NumPy
scores = embedding_matrix @ query_emb  # dot product (embeddings are normalized)

ranked_indices = np.argsort(scores)[::-1]
print("Search results:")
for idx in ranked_indices:
    print(f"  [{scores[idx]:.4f}] {documents[idx]}")

NumPy Integration for Embeddings¶

import numpy as np
from mullama import Model, EmbeddingGenerator

model = Model.load("./embedding-model.gguf")
gen = EmbeddingGenerator(model)

# Embeddings are already NumPy arrays
emb = gen.embed("Hello, world!")
print(type(emb))         # <class 'numpy.ndarray'>
print(emb.dtype)         # float32
print(emb.shape)         # (4096,)

# Direct NumPy operations
norm = np.linalg.norm(emb)
print(f"L2 norm: {norm:.4f}")  # ~1.0 for normalized embeddings

# Batch: stack into matrix
texts = ["doc1", "doc2", "doc3"]
embeddings = gen.embed_batch(texts)
matrix = np.stack(embeddings)  # shape: (3, 4096)

# Pairwise cosine similarity matrix
similarity_matrix = matrix @ matrix.T
print(f"Similarity matrix shape: {similarity_matrix.shape}")  # (3, 3)

# Save/load embeddings
np.save("embeddings.npy", matrix)
loaded = np.load("embeddings.npy")

Chat Conversations¶

from mullama import Model, Context, SamplerParams

model = Model.load("./llama-3.2-1b-instruct.gguf", n_gpu_layers=-1)
ctx = Context(model, n_ctx=4096)

messages: list[tuple[str, str]] = [
    ("system", "You are a helpful Python tutor. Be concise and provide code examples."),
]


def chat(user_message: str) -> str:
    """Send a message and get a response."""
    messages.append(("user", user_message))
    prompt = model.apply_chat_template(messages)

    # Check context length
    prompt_tokens = model.tokenize(prompt, add_bos=False)
    if len(prompt_tokens) > ctx.n_ctx - 500:
        # Trim old messages but keep system prompt
        del messages[1:-4]
        prompt = model.apply_chat_template(messages)

    ctx.clear_cache()
    response = ctx.generate(prompt, max_tokens=400, params=SamplerParams.precise())
    messages.append(("assistant", response))
    return response


# Interactive conversation
print(chat("What are list comprehensions?"))
print(chat("Show me an example with filtering."))
print(chat("How about nested list comprehensions?"))

FastAPI Server with Streaming SSE¶

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from mullama import Model, Context, SamplerParams, EmbeddingGenerator
import json

app = FastAPI(title="Mullama API")

# Load model at startup
model = Model.load("./model.gguf", n_gpu_layers=-1)


class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 200
    temperature: float = 0.8
    stream: bool = False


class EmbedRequest(BaseModel):
    texts: list[str]


class GenerateResponse(BaseModel):
    text: str
    tokens_generated: int


@app.post("/generate", response_model=GenerateResponse)
def generate(req: GenerateRequest):
    ctx = Context(model, n_ctx=2048)
    params = SamplerParams(temperature=req.temperature)
    text = ctx.generate(req.prompt, max_tokens=req.max_tokens, params=params)
    tokens = model.tokenize(text, add_bos=False)
    return GenerateResponse(text=text, tokens_generated=len(tokens))


@app.post("/generate/stream")
def generate_stream(req: GenerateRequest):
    def event_stream():
        ctx = Context(model, n_ctx=2048)
        params = SamplerParams(temperature=req.temperature)
        pieces = ctx.generate_stream(
            req.prompt, max_tokens=req.max_tokens, params=params
        )
        for piece in pieces:
            yield f"data: {json.dumps({'token': piece})}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")


@app.post("/embed")
def embed(req: EmbedRequest):
    gen = EmbeddingGenerator(model)
    embeddings = gen.embed_batch(req.texts)
    return {"embeddings": [emb.tolist() for emb in embeddings]}

Run with:

uvicorn server:app --host 0.0.0.0 --port 8000

Flask Server¶

from flask import Flask, request, jsonify, Response
from mullama import Model, Context, SamplerParams, EmbeddingGenerator
import json

app = Flask(__name__)

model = Model.load("./model.gguf", n_gpu_layers=-1)


@app.route("/generate", methods=["POST"])
def generate():
    data = request.json
    ctx = Context(model, n_ctx=2048)
    params = SamplerParams(temperature=data.get("temperature", 0.8))
    text = ctx.generate(
        data["prompt"],
        max_tokens=data.get("max_tokens", 100),
        params=params,
    )
    return jsonify({"text": text})


@app.route("/generate/stream", methods=["POST"])
def generate_stream():
    data = request.json

    def event_stream():
        ctx = Context(model, n_ctx=2048)
        params = SamplerParams(temperature=data.get("temperature", 0.8))
        pieces = ctx.generate_stream(
            data["prompt"],
            max_tokens=data.get("max_tokens", 200),
            params=params,
        )
        for piece in pieces:
            yield f"data: {json.dumps({'token': piece})}\n\n"
        yield "data: [DONE]\n\n"

    return Response(event_stream(), mimetype="text/event-stream")


@app.route("/embed", methods=["POST"])
def embed():
    data = request.json
    gen = EmbeddingGenerator(model)
    embeddings = gen.embed_batch(data["texts"])
    return jsonify({"embeddings": [emb.tolist() for emb in embeddings]})


if __name__ == "__main__":
    app.run(port=8000)

Django Integration¶

# myapp/services.py
from mullama import Model, Context, SamplerParams, EmbeddingGenerator
from django.conf import settings

_model = None


def get_model() -> Model:
    """Lazy-load the model as a singleton."""
    global _model
    if _model is None:
        _model = Model.load(
            settings.MULLAMA_MODEL_PATH,
            n_gpu_layers=getattr(settings, "MULLAMA_GPU_LAYERS", -1),
        )
    return _model


def generate_text(prompt: str, max_tokens: int = 200, temperature: float = 0.8) -> str:
    model = get_model()
    ctx = Context(model, n_ctx=2048)
    params = SamplerParams(temperature=temperature)
    return ctx.generate(prompt, max_tokens=max_tokens, params=params)


def get_embeddings(texts: list[str]) -> list:
    model = get_model()
    gen = EmbeddingGenerator(model)
    return gen.embed_batch(texts)

# myapp/views.py
from django.http import JsonResponse
from django.views.decorators.http import require_POST
import json
from .services import generate_text, get_embeddings


@require_POST
def generate_view(request):
    data = json.loads(request.body)
    text = generate_text(
        prompt=data["prompt"],
        max_tokens=data.get("max_tokens", 200),
        temperature=data.get("temperature", 0.8),
    )
    return JsonResponse({"text": text})


@require_POST
def embed_view(request):
    data = json.loads(request.body)
    embeddings = get_embeddings(data["texts"])
    return JsonResponse({"embeddings": [emb.tolist() for emb in embeddings]})

Complete Production Chatbot¶

from mullama import Model, Context, SamplerParams, supports_gpu_offload

class Chatbot:
    def __init__(
        self,
        model_path: str,
        system_prompt: str = "You are a helpful assistant.",
        max_context: int = 4096,
    ):
        gpu_layers = -1 if supports_gpu_offload() else 0
        self.model = Model.load(model_path, n_gpu_layers=gpu_layers)
        self.ctx = Context(self.model, n_ctx=max_context)
        self.system_prompt = system_prompt
        self.messages: list[tuple[str, str]] = [("system", system_prompt)]
        self.params = SamplerParams.precise()

        print(f"Model: {self.model.name or self.model.description}")
        print(f"Architecture: {self.model.architecture}")
        print(f"Parameters: {self.model.n_params:,}")
        print(f"GPU: {'enabled' if supports_gpu_offload() else 'CPU only'}")
        print(f"Context: {self.ctx.n_ctx} tokens")

    def chat(self, user_message: str) -> str:
        """Send a message and get a response."""
        self.messages.append(("user", user_message))

        prompt = self.model.apply_chat_template(self.messages)
        prompt_tokens = self.model.tokenize(prompt, add_bos=False)

        # Trim if approaching context limit
        while len(prompt_tokens) > self.ctx.n_ctx - 500 and len(self.messages) > 2:
            self.messages.pop(1)  # Remove oldest non-system message
            prompt = self.model.apply_chat_template(self.messages)
            prompt_tokens = self.model.tokenize(prompt, add_bos=False)

        self.ctx.clear_cache()
        response = self.ctx.generate(prompt, max_tokens=500, params=self.params)
        self.messages.append(("assistant", response))
        return response

    def reset(self):
        """Reset the conversation history."""
        self.messages = [("system", self.system_prompt)]
        self.ctx.clear_cache()

    @property
    def history(self) -> list[tuple[str, str]]:
        """Get conversation history (excluding system prompt)."""
        return self.messages[1:]


def main():
    bot = Chatbot("./llama-3.2-1b-instruct.Q4_K_M.gguf")
    print("---")
    print("Type /quit to exit, /reset to clear history.\n")

    while True:
        try:
            user_input = input("You: ").strip()
        except (EOFError, KeyboardInterrupt):
            break

        if not user_input:
            continue
        if user_input == "/quit":
            break
        if user_input == "/reset":
            bot.reset()
            print("[Conversation reset]\n")
            continue

        response = bot.chat(user_input)
        print(f"\nAssistant: {response}\n")


if __name__ == "__main__":
    main()

Complete RAG Example with NumPy¶

import numpy as np
from mullama import (
    Model,
    Context,
    EmbeddingGenerator,
    SamplerParams,
    cosine_similarity,
)


class RAGPipeline:
    def __init__(self, embedding_model_path: str, generation_model_path: str):
        # Separate models for embedding and generation
        self.emb_model = Model.load(embedding_model_path)
        self.gen_model = Model.load(generation_model_path, n_gpu_layers=-1)
        self.embed_gen = EmbeddingGenerator(self.emb_model, n_ctx=512, normalize=True)

        self.documents: list[str] = []
        self.embeddings: np.ndarray | None = None

    def add_documents(self, docs: list[str]) -> None:
        """Add documents to the index."""
        new_embeddings = self.embed_gen.embed_batch(docs)
        new_matrix = np.stack(new_embeddings)

        self.documents.extend(docs)
        if self.embeddings is None:
            self.embeddings = new_matrix
        else:
            self.embeddings = np.vstack([self.embeddings, new_matrix])

        print(f"Indexed {len(docs)} documents (total: {len(self.documents)})")

    def retrieve(self, query: str, top_k: int = 3) -> list[dict]:
        """Retrieve the most relevant documents for a query."""
        if self.embeddings is None:
            return []

        query_emb = self.embed_gen.embed(query)

        # Efficient similarity computation with NumPy
        scores = self.embeddings @ query_emb  # dot product (normalized = cosine sim)

        top_indices = np.argsort(scores)[::-1][:top_k]
        return [
            {"text": self.documents[i], "score": float(scores[i])}
            for i in top_indices
        ]

    def answer(self, question: str, top_k: int = 3) -> dict:
        """Answer a question using retrieved context."""
        relevant = self.retrieve(question, top_k)

        context = "\n".join(
            f"[{i+1}] {r['text']}" for i, r in enumerate(relevant)
        )

        messages = [
            ("system", "Answer the question based only on the provided context. "
                       "Cite sources using [n] notation. If the context does not "
                       "contain the answer, say so."),
            ("user", f"Context:\n{context}\n\nQuestion: {question}"),
        ]

        prompt = self.gen_model.apply_chat_template(messages)
        ctx = Context(self.gen_model, n_ctx=2048)
        answer = ctx.generate(prompt, max_tokens=300, params=SamplerParams.precise())

        return {
            "answer": answer,
            "sources": relevant,
        }

    def save_index(self, path: str) -> None:
        """Save the embedding index to disk."""
        if self.embeddings is not None:
            np.save(path, self.embeddings)

    def load_index(self, path: str, documents: list[str]) -> None:
        """Load a pre-computed embedding index."""
        self.embeddings = np.load(path)
        self.documents = documents


# Usage
rag = RAGPipeline(
    "./nomic-embed-text-v1.5.Q4_K_M.gguf",
    "./llama-3.2-1b-instruct.Q4_K_M.gguf",
)

rag.add_documents([
    "Mullama is a Rust library for running LLMs locally with native performance.",
    "Mullama supports the GGUF model format developed by the llama.cpp project.",
    "GPU offloading accelerates inference by running model layers on the GPU.",
    "The Python bindings use PyO3 for zero-overhead native integration.",
    "Embeddings can be used for semantic search and retrieval-augmented generation.",
    "Streaming generation returns tokens one by one as they are produced.",
    "Models can be quantized to Q4_K_M for the best balance of speed and quality.",
])

result = rag.answer("How do I use GPU acceleration with Mullama?")
print(f"Answer: {result['answer']}")
print("\nSources:")
for src in result["sources"]:
    print(f"  [{src['score']:.4f}] {src['text']}")

# Save for later
rag.save_index("mullama_docs.npy")

Async Support¶

While the core mullama API is synchronous, you can use it with asyncio by running generation in a thread pool:

import asyncio
from concurrent.futures import ThreadPoolExecutor
from mullama import Model, Context, SamplerParams

model = Model.load("./model.gguf", n_gpu_layers=-1)
executor = ThreadPoolExecutor(max_workers=4)


async def generate_async(prompt: str, max_tokens: int = 200) -> str:
    """Run generation in a thread pool to avoid blocking the event loop."""
    loop = asyncio.get_event_loop()
    ctx = Context(model, n_ctx=2048)
    params = SamplerParams(temperature=0.8)
    return await loop.run_in_executor(
        executor,
        lambda: ctx.generate(prompt, max_tokens=max_tokens, params=params),
    )


async def main():
    # Run multiple generations concurrently
    tasks = [
        generate_async("What is Python?"),
        generate_async("What is Rust?"),
        generate_async("What is Go?"),
    ]
    results = await asyncio.gather(*tasks)
    for result in results:
        print(result[:100], "...\n")


asyncio.run(main())

Error Handling¶

Python bindings raise standard exceptions:

Exception	When
`RuntimeError`	Model loading, context creation, generation, embedding failures
`ValueError`	Invalid input (wrong types, mismatched vector lengths)
`TypeError`	Wrong argument types

from mullama import Model, Context, SamplerParams, cosine_similarity
import numpy as np

# Model loading error
try:
    model = Model.load("./nonexistent.gguf")
except RuntimeError as e:
    print(f"Failed to load: {e}")
    # "Failed to load model: No such file or directory: ./nonexistent.gguf"

# Generation error
try:
    model = Model.load("./model.gguf")
    ctx = Context(model)
    result = ctx.generate("", max_tokens=0)
except RuntimeError as e:
    print(f"Generation error: {e}")

# Vector mismatch
try:
    a = np.array([1.0, 2.0, 3.0], dtype=np.float32)
    b = np.array([1.0, 2.0], dtype=np.float32)
    cosine_similarity(a, b)
except ValueError as e:
    print(f"Similarity error: {e}")
    # "Vectors must have the same length"

Memory Management¶

Context Manager Pattern¶

from mullama import Model, Context, SamplerParams

# Models persist for the lifetime of the variable
model = Model.load("./model.gguf", n_gpu_layers=-1)

# Contexts can be created and discarded freely
def generate_response(prompt: str) -> str:
    ctx = Context(model, n_ctx=2048)
    result = ctx.generate(prompt, max_tokens=200)
    # ctx is garbage collected here
    return result

Explicit Cleanup¶

from mullama import backend_free

# For long-running applications, explicitly free resources at shutdown
import atexit
atexit.register(backend_free)

Memory Tips¶

Model objects are the largest memory consumers. Load once and share across threads.
Context objects use memory proportional to n_ctx. Use the smallest context size that fits your use case.
Embeddings are returned as NumPy arrays that can be freed by letting them go out of scope or calling del.
The Python garbage collector handles all cleanup automatically, but backend_free() ensures deterministic release.

Performance Tips¶

GPU offloading -- set n_gpu_layers=-1 to offload all layers for maximum speed.
Reuse models -- model loading is expensive (seconds). Load once at application startup and create contexts as needed.
NumPy integration -- embedding results are already NumPy arrays. Use NumPy/SciPy operations directly without conversion for maximum efficiency.
Batch embeddings -- embed_batch() is significantly more efficient than calling embed() in a loop due to reduced Python-to-Rust boundary crossings.
Thread count -- the default n_threads=0 auto-detects CPU cores. Override only if you need to limit CPU usage for other workloads.
Context size -- use the smallest n_ctx that fits your use case to reduce memory usage.
Memory mapping -- keep use_mmap=True (default) for faster loading and efficient memory sharing between processes.
Thread pool for async -- wrap synchronous calls in asyncio.to_thread() or ThreadPoolExecutor when using with async frameworks.
NumPy matrix operations -- for large-scale similarity search, stack embeddings into a matrix and use matrix multiplication instead of per-vector cosine similarity.

Python Bindings¶

Installation¶

Building from Source¶

Type Stubs¶

Quick Start¶

API Reference¶

Model¶

Model.load(path, ...)¶

model.tokenize(text, add_bos=True, special=False)¶

model.detokenize(tokens, remove_special=False, unparse_special=False)¶

model.apply_chat_template(messages, add_generation_prompt=True)¶

model.metadata()¶

model.token_is_eog(token)¶

Properties¶

Context¶

Context(model, ...)¶

ctx.generate(prompt, max_tokens=100, params=None)¶

ctx.generate_stream(prompt, max_tokens=100, params=None)¶

ctx.decode(tokens)¶

ctx.clear_cache()¶

ctx.get_embeddings()¶

Properties¶

SamplerParams¶

SamplerParams(...)¶

Preset Class Methods¶

EmbeddingGenerator¶

EmbeddingGenerator(model, n_ctx=512, normalize=True)¶

gen.embed(text)¶

gen.embed_batch(texts)¶

Properties¶

Utility Functions¶

cosine_similarity(a, b)¶

backend_init()¶

backend_free()¶

supports_gpu_offload()¶

system_info()¶

max_devices()¶

Examples¶

Streaming Output¶

Embeddings and Semantic Search¶

NumPy Integration for Embeddings¶

Chat Conversations¶

FastAPI Server with Streaming SSE¶

Flask Server¶

Django Integration¶

Complete Production Chatbot¶

Complete RAG Example with NumPy¶

Async Support¶

Error Handling¶

Memory Management¶

Context Manager Pattern¶

Explicit Cleanup¶

Memory Tips¶

Performance Tips¶

`Model.load(path, ...)`¶

`model.tokenize(text, add_bos=True, special=False)`¶

`model.detokenize(tokens, remove_special=False, unparse_special=False)`¶

`model.apply_chat_template(messages, add_generation_prompt=True)`¶

`model.metadata()`¶

`model.token_is_eog(token)`¶

`Context(model, ...)`¶

`ctx.generate(prompt, max_tokens=100, params=None)`¶

`ctx.generate_stream(prompt, max_tokens=100, params=None)`¶

`ctx.decode(tokens)`¶

`ctx.clear_cache()`¶

`ctx.get_embeddings()`¶

`SamplerParams(...)`¶

`EmbeddingGenerator(model, n_ctx=512, normalize=True)`¶

`gen.embed(text)`¶

`gen.embed_batch(texts)`¶

`cosine_similarity(a, b)`¶

`backend_init()`¶

`backend_free()`¶

`supports_gpu_offload()`¶

`system_info()`¶

`max_devices()`¶