Use Cases¶

Mullama's modular feature system enables a wide range of AI-powered applications. This guide covers common deployment patterns, each with a problem statement, solution architecture, code snippet, required features, and links to related tutorials.

1. Conversational AI¶

Problem: Build a responsive chatbot with streaming responses that maintains conversation context across turns, with real-time token delivery for a natural chat experience.

Solution: Use Mullama's streaming generation with KV-cache management for multi-turn conversations. Chat templates handle proper prompt formatting for instruction-tuned models.

Architecture:

use mullama::{Model, Context, ContextParams, SamplerParams, ChatMessage};
use mullama::streaming::StreamConfig;
use std::sync::Arc;

let model = Arc::new(Model::load("llama-3-8b-instruct.gguf")?);
let mut ctx = Context::new(model.clone(), ContextParams {
    n_ctx: 4096,
    ..Default::default()
})?;

let mut history: Vec<ChatMessage> = vec![
    ChatMessage { role: "system".into(), content: "You are a helpful assistant.".into() },
];

// Each turn: format prompt, tokenize, decode, stream response
loop {
    let user_input = get_user_input();
    history.push(ChatMessage { role: "user".into(), content: user_input });

    let prompt = model.apply_chat_template(&history, true)?;
    let tokens = model.tokenize(&prompt, false, true)?;

    ctx.clear_cache(); // Reset for new conversation context
    ctx.decode(&tokens)?;

    let mut response = String::new();
    let mut sampler = SamplerParams::default().build_chain(model.clone())?;
    for _ in 0..512 {
        let token = sampler.sample(&mut ctx, -1);
        sampler.accept(token);
        if model.token_is_eog(token) { break; }
        let text = model.token_to_str(token, 0, false)?;
        print!("{}", text); // Stream to user
        response.push_str(&text);
        ctx.decode_single(token)?;
    }

    history.push(ChatMessage { role: "assistant".into(), content: response });
}

Features used: Core (no additional features required for basic chat; add streaming for async streaming, web for HTTP API)

Related: Streaming Guide | Chatbot Example

2. Document Q&A (RAG)¶

Problem: Answer questions about a large document corpus by finding relevant passages and generating grounded answers, avoiding hallucination by providing source context.

Solution: Use Mullama's embedding API to index documents, find similar passages via cosine similarity, then generate answers with retrieved context inserted into the prompt.

Architecture:

use mullama::{Model, Context, ContextParams};
use mullama::embedding::{EmbeddingGenerator, EmbeddingConfig, PoolingStrategy, cosine_similarity};
use std::sync::Arc;

// Step 1: Index documents with embedding model
let embed_model = Arc::new(Model::load("nomic-embed.gguf")?);
let config = EmbeddingConfig {
    pooling: PoolingStrategy::Mean,
    normalize: true,
    batch_size: 32,
};
let mut embedder = EmbeddingGenerator::new(embed_model, config)?;

let documents = load_documents("./docs/")?;
let doc_embeddings: Vec<Vec<f32>> = documents.iter()
    .map(|doc| embedder.embed_text(doc))
    .collect::<Result<_, _>>()?;

// Step 2: Query with embedding similarity
let query = "How does memory management work?";
let query_embedding = embedder.embed_text(query)?;

let mut ranked: Vec<(usize, f32)> = doc_embeddings.iter()
    .enumerate()
    .map(|(i, emb)| (i, cosine_similarity(&query_embedding, emb)))
    .collect();
ranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());

// Step 3: Generate answer with top-3 context
let context_docs: String = ranked.iter().take(3)
    .map(|(i, _)| documents[*i].as_str())
    .collect::<Vec<_>>()
    .join("\n---\n");

let gen_model = Arc::new(Model::load("llama-3-8b-instruct.gguf")?);
let mut ctx = Context::new(gen_model.clone(), ContextParams::default())?;

let prompt = format!(
    "Based on the following context, answer the question.\n\n\
     Context:\n{}\n\nQuestion: {}\n\nAnswer:",
    context_docs, query
);
let tokens = gen_model.tokenize(&prompt, true, false)?;
let answer = ctx.generate(&tokens, 300)?;

Features used: Core (embeddings are in the core module)

Related: Embeddings Guide | RAG Example

3. Voice Interface¶

Problem: Build a voice-activated AI assistant that listens for speech, processes audio input, and generates spoken or text responses in real time.

Solution: Use Mullama's streaming audio processor for real-time voice capture with voice activity detection, then feed audio features into a multimodal model for transcription and response generation.

Architecture:

use mullama::streaming_audio::{StreamingAudioProcessor, AudioStreamConfig};
use mullama::multimodal::{MultimodalProcessor, AudioInput};
use mullama::{AsyncModel, StreamConfig};

// Configure audio capture
let audio_config = AudioStreamConfig {
    sample_rate: 16000,
    channels: 1,
    buffer_size: 4096,
    vad_enabled: true,
    noise_reduction: true,
    ..Default::default()
};

let mut audio_processor = StreamingAudioProcessor::new(audio_config)?;
let model = AsyncModel::load("whisper-llama.gguf").await?;

// Listen for speech
loop {
    let audio_chunk = audio_processor.capture_utterance().await?;

    // Process audio to text
    let audio_input = AudioInput::from_samples(audio_chunk.samples, 16000);
    let transcript = model.generate(&format!(
        "Transcribe: <audio>{}</audio>", audio_input.to_features()?
    ), 200).await?;

    println!("You said: {}", transcript);

    // Generate response
    let response = model.generate(&transcript, 200).await?;
    println!("Assistant: {}", response);
}

Features used: multimodal, streaming-audio, async

4. Local API Server¶

Problem: Replace cloud-based LLM APIs (OpenAI, Anthropic) with a local server that offers the same REST interface but runs entirely on-premise, ensuring data privacy and eliminating per-token costs.

Solution: Use Mullama's web integration to serve an OpenAI-compatible API with Axum, supporting both synchronous and streaming completions.

Architecture:

use mullama::{AsyncModel, AsyncConfig};
use mullama::web::{AppState, RouterBuilder, GenerateRequest, GenerateResponse};
use std::sync::Arc;

#[tokio::main]
async fn main() -> Result<(), mullama::MullamaError> {
    let model = AsyncModel::load_with_config(AsyncConfig {
        model_path: "llama-3-8b-instruct.gguf".to_string(),
        gpu_layers: 32,
        context_size: 4096,
        max_concurrent: 4,
        ..Default::default()
    }).await?;

    let state = AppState::new(Arc::new(model));

    let app = RouterBuilder::new(state)
        .add_completions()          // POST /v1/completions
        .add_chat_completions()     // POST /v1/chat/completions
        .add_embeddings()           // POST /v1/embeddings
        .add_models()               // GET /v1/models
        .add_health()               // GET /health
        .cors_permissive()
        .build();

    println!("Serving on http://127.0.0.1:8080");
    axum::serve(
        tokio::net::TcpListener::bind("127.0.0.1:8080").await?,
        app,
    ).await?;

    Ok(())
}

Features used: async, web, streaming

Related: API Server Example | Streaming API

5. Edge AI (Raspberry Pi / Embedded)¶

Problem: Deploy an AI model on resource-constrained edge devices (Raspberry Pi, Jetson Nano) for offline inference with minimal memory footprint and no cloud dependency.

Solution: Use Mullama's CPU-only mode with aggressive quantization (Q4_0 model, Q4_0 KV-cache) and minimal thread count to fit within device constraints.

Architecture:

use mullama::{Model, Context, ContextParams, ModelParams, SamplerParams, KvCacheType};
use std::sync::Arc;

// Load with minimal resource usage
let params = ModelParams {
    n_gpu_layers: 0,       // CPU only
    use_mmap: true,        // Memory-mapped for low RSS
    use_mlock: false,      // Don't lock RAM
    ..Default::default()
};

let model = Arc::new(Model::load_with_params("tinyllama-1.1b-q4_0.gguf", params)?);

// Minimal context for edge deployment
let ctx_params = ContextParams {
    n_ctx: 512,            // Short context saves memory
    n_batch: 128,          // Small batch for low latency
    n_threads: 4,          // Match Pi's core count
    type_k: KvCacheType::Q4_0,  // 75% KV-cache savings
    type_v: KvCacheType::Q4_0,
    ..Default::default()
};

let mut ctx = Context::new(model.clone(), ctx_params)?;

// Generate with constrained resources
let tokens = model.tokenize("Sensor reading: 23.5C. Status:", true, false)?;
let sampler = SamplerParams::precise(); // Low temperature for factual output
let output = ctx.generate_with_params(&tokens, 50, &sampler)?;
println!("AI: {}", output);

Features used: Core only (no additional features for minimal binary size)

Related: Models Guide | Platform Setup

6. Content Generation (Batch Processing)¶

Problem: Generate large volumes of content (product descriptions, article summaries, translations) efficiently by processing many prompts in parallel.

Solution: Use Mullama's multi-threaded architecture with shared model and per-thread contexts to process prompts concurrently, maximizing throughput.

Architecture:

use mullama::{Model, Context, ContextParams, SamplerParams};
use std::sync::Arc;
use std::thread;

let model = Arc::new(Model::load("llama-3-8b-instruct.gguf")?);

let prompts = vec![
    "Write a product description for: wireless headphones",
    "Write a product description for: mechanical keyboard",
    "Write a product description for: ergonomic mouse",
    "Write a product description for: USB-C hub",
    // ... hundreds more
];

// Process in parallel using thread pool
let chunk_size = prompts.len() / num_cpus::get();
let handles: Vec<_> = prompts.chunks(chunk_size)
    .map(|chunk| {
        let model = model.clone();
        let chunk = chunk.to_vec();
        thread::spawn(move || {
            let mut ctx = Context::new(model.clone(), ContextParams {
                n_ctx: 1024,
                n_threads: 2, // Fewer threads per context when parallelizing
                ..Default::default()
            }).unwrap();

            let sampler = SamplerParams {
                temperature: 0.7,
                penalty_repeat: 1.15,
                ..Default::default()
            };

            chunk.iter().map(|prompt| {
                ctx.clear_cache();
                let tokens = model.tokenize(prompt, true, false).unwrap();
                ctx.generate_with_params(&tokens, 200, &sampler).unwrap()
            }).collect::<Vec<String>>()
        })
    })
    .collect();

let results: Vec<String> = handles.into_iter()
    .flat_map(|h| h.join().unwrap())
    .collect();

println!("Generated {} descriptions", results.len());

Features used: Core (multi-threading via standard library; add parallel for Rayon integration)

Related: Batch Example | Generation Guide

7. Semantic Search (Embeddings + ColBERT)¶

Problem: Build a high-quality semantic search engine that goes beyond keyword matching, using dense retrieval with optional late-interaction (ColBERT-style) for better ranking precision.

Solution: Use Mullama's embedding API with Mean pooling for passage retrieval, and late-interaction multi-vector embeddings for re-ranking the top candidates.

Architecture:

use mullama::{Model, embedding::{EmbeddingGenerator, EmbeddingConfig, PoolingStrategy, cosine_similarity}};
use std::sync::Arc;

// Dense retrieval with single-vector embeddings
let embed_model = Arc::new(Model::load("nomic-embed.gguf")?);
let config = EmbeddingConfig {
    pooling: PoolingStrategy::Mean,
    normalize: true,
    batch_size: 64,
};
let mut embedder = EmbeddingGenerator::new(embed_model.clone(), config)?;

// Index corpus
let corpus = load_corpus("./data/documents.jsonl")?;
let corpus_embeddings = embedder.embed_batch(
    &corpus.iter().map(|s| s.as_str()).collect::<Vec<_>>()
)?;

// Search function
fn search(
    query: &str,
    embedder: &mut EmbeddingGenerator,
    corpus: &[String],
    corpus_embeddings: &[Vec<f32>],
    top_k: usize,
) -> Vec<(usize, f32)> {
    let query_emb = embedder.embed_text(query).unwrap();

    let mut scores: Vec<(usize, f32)> = corpus_embeddings.iter()
        .enumerate()
        .map(|(i, doc_emb)| (i, cosine_similarity(&query_emb, doc_emb)))
        .collect();

    scores.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scores.truncate(top_k);
    scores
}

// Execute search
let results = search("rust memory safety", &mut embedder, &corpus, &corpus_embeddings, 10);
for (idx, score) in &results {
    println!("{:.4}: {}", score, &corpus[*idx][..80]);
}

Features used: Core (embeddings); add late-interaction for ColBERT-style multi-vector retrieval

Related: Embeddings Guide | Embeddings API

8. Code Assistant (Structured Output)¶

Problem: Build a code generation assistant that produces syntactically valid, structured output (JSON, SQL, code) by constraining the model's generation to a formal grammar.

Solution: Use Mullama's grammar-constrained sampling with GBNF grammars to guarantee output conformance, combined with code-specific models and infill support.

Architecture:

use mullama::{Model, Context, ContextParams, SamplerParams};
use mullama::sampling::{SamplerChain, Sampler};
use std::sync::Arc;

let model = Arc::new(Model::load("codellama-7b-instruct.gguf")?);
let mut ctx = Context::new(model.clone(), ContextParams {
    n_ctx: 4096,
    ..Default::default()
})?;

// Define JSON schema grammar
let json_schema_grammar = r#"
root        ::= object
object      ::= "{" ws pairs ws "}"
pairs       ::= pair ("," ws pair)*
pair        ::= string ":" ws value
value       ::= string | number | object | array | "true" | "false" | "null"
array       ::= "[" ws (value ("," ws value)*)? ws "]"
string      ::= "\"" ([^"\\] | "\\" .)* "\""
number      ::= "-"? [0-9]+ ("." [0-9]+)? ([eE] [+-]? [0-9]+)?
ws          ::= [ \t\n]*
"#;

// Build sampler chain with grammar constraint
let mut chain = SamplerChain::with_defaults();
chain.add(Sampler::grammar(model.clone(), json_schema_grammar, "root")?);
chain.add(Sampler::temperature(0.3)?); // Low temp for precise code
chain.add(Sampler::dist(42)?);

// Generate structured output
let prompt = "Generate a JSON config for a web server with host, port, and routes:";
let tokens = model.tokenize(prompt, true, false)?;
ctx.decode(&tokens)?;

let mut output = String::new();
for _ in 0..500 {
    let token = chain.sample(&mut ctx, -1);
    chain.accept(token);
    if model.token_is_eog(token) { break; }
    let text = model.token_to_str(token, 0, false)?;
    output.push_str(&text);
    ctx.decode_single(token)?;
}

// Output is guaranteed to be valid JSON
let parsed: serde_json::Value = serde_json::from_str(&output)?;
println!("{}", serde_json::to_string_pretty(&parsed)?);

Features used: Core (grammar sampling is part of the sampling module)

Related: Structured Output Guide | Sampling API

Feature Decision Diagram¶

Use this guide to determine which features to enable based on your use case:

What You Need	Feature to Enable	Key Types
Basic text generation	(none - core)	`Model`, `Context`, `SamplerParams`
Text embeddings	(none - core)	`EmbeddingGenerator`, `PoolingStrategy`
Grammar-constrained output	(none - core)	`Sampler::grammar`
Non-blocking inference	`async`	`AsyncModel`, `AsyncContext`
Real-time token streaming	`streaming`	`TokenStream`, `StreamConfig`
Image understanding	`multimodal`	`VisionEncoder`, `MultimodalProcessor`
Audio processing	`multimodal`	`AudioInput`, `AudioFeatures`
Live microphone input	`streaming-audio`	`StreamingAudioProcessor`
REST API server	`web`	`AppState`, `RouterBuilder`
WebSocket connections	`websockets`	`WebSocketServer`
Parallel batch processing	`parallel`	Rayon-based operations
ColBERT retrieval	`late-interaction`	Multi-vector embeddings
Background model daemon	`daemon`	Daemon CLI commands
All features	`full`	Everything above

Quick Start by Use Case¶

# Chatbot
mullama = { version = "0.3", features = ["streaming"] }

# RAG / Semantic search
mullama = { version = "0.3" }  # Core is sufficient

# Voice assistant
mullama = { version = "0.3", features = ["multimodal", "streaming-audio"] }

# API server
mullama = { version = "0.3", features = ["web", "streaming"] }

# Edge deployment (minimal binary)
mullama = { version = "0.3", default-features = false }

# Batch processing
mullama = { version = "0.3", features = ["parallel"] }

# Everything
mullama = { version = "0.3", features = ["full"] }