Skip to content

Your First Project

Build a working chatbot from scratch using Mullama. This tutorial covers model loading, text generation, streaming output, multi-turn conversation, and deploying as an API.


Prerequisites

Before starting, ensure you have:

Time to Complete

This tutorial takes approximately 10-15 minutes to complete.


Step 1: Download a Model

First, download a small model for development. The Llama 3.2 1B model is fast and requires minimal resources.

# Install the daemon if you have not already
cargo install mullama --features daemon

# Pull a small, fast model
mullama pull llama3.2:1b

# Verify it downloaded
mullama list
mkdir -p models/
wget https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
    -O models/llama3.2-1b.gguf

Model Path

Throughout this tutorial, we use models/llama3.2-1b.gguf as the model path. If you used mullama pull, the model is stored in ~/.mullama/models/ -- adjust the path accordingly or create a symlink:

ln -s ~/.mullama/models/llama3.2-1b.gguf models/llama3.2-1b.gguf

Step 2: Create Your Project

mkdir my-chatbot && cd my-chatbot
npm init -y
npm install mullama
mkdir my-chatbot && cd my-chatbot
python -m venv .venv
source .venv/bin/activate
pip install mullama
cargo new my-chatbot && cd my-chatbot
cargo add mullama --features "async,streaming"
cargo add tokio --features full

Step 3: Simple Generation

Start with a basic one-shot generation to verify everything works.

index.js
const { Model } = require('mullama');

async function main() {
    // Load the model
    console.log('Loading model...');
    const model = new Model('models/llama3.2-1b.gguf', {
        contextSize: 2048,
        nGpuLayers: -1  // Use GPU if available
    });
    console.log('Model loaded!');

    // Generate a response
    const response = await model.generate(
        'What is the capital of France? Answer in one sentence.',
        { maxTokens: 100 }
    );

    console.log('Response:', response);

    // Clean up
    model.close();
}

main().catch(console.error);

Run it:

node index.js
main.py
from mullama import Model

def main():
    # Load the model
    print("Loading model...")
    model = Model("models/llama3.2-1b.gguf",
        context_size=2048,
        n_gpu_layers=-1  # Use GPU if available
    )
    print("Model loaded!")

    # Generate a response
    response = model.generate(
        "What is the capital of France? Answer in one sentence.",
        max_tokens=100
    )

    print(f"Response: {response}")

if __name__ == "__main__":
    main()

Run it:

python main.py
src/main.rs
use mullama::prelude::*;

#[tokio::main]
async fn main() -> Result<(), MullamaError> {
    // Load the model
    println!("Loading model...");
    let model = ModelBuilder::new()
        .path("models/llama3.2-1b.gguf")
        .context_size(2048)
        .n_gpu_layers(-1)  // Use GPU if available
        .build().await?;
    println!("Model loaded!");

    // Generate a response
    let response = model.generate(
        "What is the capital of France? Answer in one sentence.",
        100
    ).await?;

    println!("Response: {}", response);
    Ok(())
}

Run it:

cargo run

Expected output:

Loading model...
Model loaded!
Response: The capital of France is Paris, a city renowned for its art, culture, and history.

Step 4: Add Streaming Output

Instead of waiting for the entire response, stream tokens as they are generated for a more interactive experience.

streaming.js
const { Model } = require('mullama');

async function main() {
    const model = new Model('models/llama3.2-1b.gguf', {
        contextSize: 2048,
        nGpuLayers: -1
    });

    console.log('Generating with streaming...\n');

    // Stream tokens one by one
    const stream = model.stream(
        'Write a haiku about programming.',
        { maxTokens: 100 }
    );

    for await (const token of stream) {
        process.stdout.write(token);
    }

    console.log('\n\nDone!');
    model.close();
}

main().catch(console.error);

Run it:

node streaming.js
streaming.py
from mullama import Model

def main():
    model = Model("models/llama3.2-1b.gguf",
        context_size=2048,
        n_gpu_layers=-1
    )

    print("Generating with streaming...\n")

    # Stream tokens one by one
    for token in model.stream(
        "Write a haiku about programming.",
        max_tokens=100
    ):
        print(token, end="", flush=True)

    print("\n\nDone!")

if __name__ == "__main__":
    main()

Run it:

python streaming.py
src/main.rs
use mullama::prelude::*;
use tokio_stream::StreamExt;

#[tokio::main]
async fn main() -> Result<(), MullamaError> {
    let model = ModelBuilder::new()
        .path("models/llama3.2-1b.gguf")
        .context_size(2048)
        .n_gpu_layers(-1)
        .build().await?;

    println!("Generating with streaming...\n");

    // Stream tokens one by one
    let mut stream = model.stream(
        "Write a haiku about programming.",
        100
    ).await?;

    while let Some(token) = stream.next().await {
        let token = token?;
        print!("{}", token);
    }

    println!("\n\nDone!");
    Ok(())
}

Run it:

cargo run

You should see tokens appearing one by one in your terminal, creating a typewriter effect.


Step 5: Multi-Turn Conversation

Build an interactive chatbot that maintains conversation history across multiple turns.

chatbot.js
const { Model } = require('mullama');
const readline = require('readline');

async function main() {
    const model = new Model('models/llama3.2-1b.gguf', {
        contextSize: 4096,
        nGpuLayers: -1
    });

    const rl = readline.createInterface({
        input: process.stdin,
        output: process.stdout
    });

    // Conversation history
    const messages = [
        { role: 'system', content: 'You are a helpful assistant. Be concise and friendly.' }
    ];

    console.log('Chatbot ready! Type "quit" to exit.\n');

    const askQuestion = () => {
        rl.question('You: ', async (input) => {
            if (input.toLowerCase() === 'quit') {
                console.log('Goodbye!');
                model.close();
                rl.close();
                return;
            }

            // Add user message to history
            messages.push({ role: 'user', content: input });

            // Generate response with streaming
            process.stdout.write('Assistant: ');
            let fullResponse = '';

            const stream = model.chat(messages, {
                maxTokens: 500,
                stream: true
            });

            for await (const token of stream) {
                process.stdout.write(token);
                fullResponse += token;
            }

            console.log('\n');

            // Add assistant response to history
            messages.push({ role: 'assistant', content: fullResponse });

            askQuestion();
        });
    };

    askQuestion();
}

main().catch(console.error);

Run it:

node chatbot.js
chatbot.py
from mullama import Model

def main():
    model = Model("models/llama3.2-1b.gguf",
        context_size=4096,
        n_gpu_layers=-1
    )

    # Conversation history
    messages = [
        {"role": "system", "content": "You are a helpful assistant. Be concise and friendly."}
    ]

    print("Chatbot ready! Type 'quit' to exit.\n")

    while True:
        user_input = input("You: ")
        if user_input.lower() == "quit":
            print("Goodbye!")
            break

        # Add user message to history
        messages.append({"role": "user", "content": user_input})

        # Generate response with streaming
        print("Assistant: ", end="", flush=True)
        full_response = ""

        for token in model.chat(messages, max_tokens=500, stream=True):
            print(token, end="", flush=True)
            full_response += token

        print("\n")

        # Add assistant response to history
        messages.append({"role": "assistant", "content": full_response})

if __name__ == "__main__":
    main()

Run it:

python chatbot.py
src/main.rs
use mullama::prelude::*;
use tokio_stream::StreamExt;
use std::io::{self, Write, BufRead};

#[tokio::main]
async fn main() -> Result<(), MullamaError> {
    let model = ModelBuilder::new()
        .path("models/llama3.2-1b.gguf")
        .context_size(4096)
        .n_gpu_layers(-1)
        .build().await?;

    // Conversation history
    let mut messages = vec![
        ChatMessage::system("You are a helpful assistant. Be concise and friendly.")
    ];

    println!("Chatbot ready! Type \"quit\" to exit.\n");

    let stdin = io::stdin();
    loop {
        print!("You: ");
        io::stdout().flush()?;

        let mut input = String::new();
        stdin.lock().read_line(&mut input)?;
        let input = input.trim();

        if input.eq_ignore_ascii_case("quit") {
            println!("Goodbye!");
            break;
        }

        // Add user message to history
        messages.push(ChatMessage::user(input));

        // Generate response with streaming
        print!("Assistant: ");
        io::stdout().flush()?;

        let mut full_response = String::new();
        let mut stream = model.chat_stream(&messages, 500).await?;

        while let Some(token) = stream.next().await {
            let token = token?;
            print!("{}", token);
            io::stdout().flush()?;
            full_response.push_str(&token);
        }

        println!("\n");

        // Add assistant response to history
        messages.push(ChatMessage::assistant(&full_response));
    }

    Ok(())
}

Run it:

cargo run

Example conversation:

Chatbot ready! Type "quit" to exit.

You: What is Rust?
Assistant: Rust is a systems programming language focused on safety, speed, and concurrency.
It prevents common bugs like null pointer dereferences and data races at compile time.

You: How does it compare to C++?
Assistant: Compared to C++, Rust offers memory safety without a garbage collector through
its ownership system. It has a steeper learning curve initially but catches more bugs at
compile time. Performance is comparable to C++ in most benchmarks.

You: quit
Goodbye!

Step 6: Deploy as an API Server

Turn your chatbot into a REST API that other applications can consume.

npm install express
server.js
const express = require('express');
const { Model } = require('mullama');

const app = express();
app.use(express.json());

// Load model once at startup
const model = new Model('models/llama3.2-1b.gguf', {
    contextSize: 4096,
    nGpuLayers: -1
});

// Simple generation endpoint
app.post('/api/generate', async (req, res) => {
    const { prompt, maxTokens = 200 } = req.body;

    if (!prompt) {
        return res.status(400).json({ error: 'prompt is required' });
    }

    try {
        const response = await model.generate(prompt, { maxTokens });
        res.json({ response });
    } catch (error) {
        res.status(500).json({ error: error.message });
    }
});

// Chat endpoint with conversation history
app.post('/api/chat', async (req, res) => {
    const { messages, maxTokens = 500 } = req.body;

    if (!messages || !Array.isArray(messages)) {
        return res.status(400).json({ error: 'messages array is required' });
    }

    try {
        const response = await model.chat(messages, { maxTokens });
        res.json({
            message: { role: 'assistant', content: response }
        });
    } catch (error) {
        res.status(500).json({ error: error.message });
    }
});

// Streaming endpoint using Server-Sent Events
app.post('/api/chat/stream', async (req, res) => {
    const { messages, maxTokens = 500 } = req.body;

    res.setHeader('Content-Type', 'text/event-stream');
    res.setHeader('Cache-Control', 'no-cache');
    res.setHeader('Connection', 'keep-alive');

    try {
        const stream = model.chat(messages, {
            maxTokens,
            stream: true
        });

        for await (const token of stream) {
            res.write(`data: ${JSON.stringify({ token })}\n\n`);
        }

        res.write('data: [DONE]\n\n');
        res.end();
    } catch (error) {
        res.write(`data: ${JSON.stringify({ error: error.message })}\n\n`);
        res.end();
    }
});

const PORT = process.env.PORT || 3000;
app.listen(PORT, () => {
    console.log(`Mullama API server running on http://localhost:${PORT}`);
    console.log('Endpoints:');
    console.log('  POST /api/generate  - Simple text generation');
    console.log('  POST /api/chat      - Chat with conversation history');
    console.log('  POST /api/chat/stream - Streaming chat (SSE)');
});

Run the server:

node server.js

Test it:

# Simple generation
curl -X POST http://localhost:3000/api/generate \
    -H "Content-Type: application/json" \
    -d '{"prompt": "What is 2+2?", "maxTokens": 50}'

# Chat with history
curl -X POST http://localhost:3000/api/chat \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {"role": "user", "content": "Hello, who are you?"}
        ]
    }'
pip install fastapi uvicorn
server.py
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from mullama import Model
import json

app = FastAPI(title="Mullama Chat API")

# Load model once at startup
model = Model("models/llama3.2-1b.gguf",
    context_size=4096,
    n_gpu_layers=-1
)

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 200

class Message(BaseModel):
    role: str
    content: str

class ChatRequest(BaseModel):
    messages: list[Message]
    max_tokens: int = 500
    stream: bool = False

@app.post("/api/generate")
async def generate(req: GenerateRequest):
    """Simple text generation."""
    response = model.generate(req.prompt, max_tokens=req.max_tokens)
    return {"response": response}

@app.post("/api/chat")
async def chat(req: ChatRequest):
    """Chat with conversation history."""
    messages = [{"role": m.role, "content": m.content} for m in req.messages]

    if req.stream:
        # Return Server-Sent Events
        async def event_stream():
            for token in model.chat(messages, max_tokens=req.max_tokens, stream=True):
                yield f"data: {json.dumps({'token': token})}\n\n"
            yield "data: [DONE]\n\n"

        return StreamingResponse(
            event_stream(),
            media_type="text/event-stream"
        )

    response = model.chat(messages, max_tokens=req.max_tokens)
    return {"message": {"role": "assistant", "content": response}}

if __name__ == "__main__":
    import uvicorn
    print("Mullama API server running on http://localhost:8000")
    print("Docs available at http://localhost:8000/docs")
    uvicorn.run(app, host="0.0.0.0", port=8000)

Run the server:

python server.py

Test it:

# Simple generation
curl -X POST http://localhost:8000/api/generate \
    -H "Content-Type: application/json" \
    -d '{"prompt": "What is 2+2?", "max_tokens": 50}'

# Chat with history
curl -X POST http://localhost:8000/api/chat \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {"role": "user", "content": "Hello, who are you?"}
        ]
    }'

# Interactive docs
open http://localhost:8000/docs
Cargo.toml
[dependencies]
mullama = { version = "0.3.0", features = ["web", "streaming"] }
tokio = { version = "1", features = ["full"] }
axum = "0.7"
serde = { version = "1", features = ["derive"] }
serde_json = "1"
tower-http = { version = "0.5", features = ["cors"] }
src/main.rs
use axum::{
    extract::State,
    response::sse::{Event, Sse},
    routing::post,
    Json, Router,
};
use mullama::prelude::*;
use serde::{Deserialize, Serialize};
use std::sync::Arc;
use tokio_stream::StreamExt;

#[derive(Clone)]
struct AppState {
    model: Arc<Model>,
}

#[derive(Deserialize)]
struct GenerateRequest {
    prompt: String,
    #[serde(default = "default_max_tokens")]
    max_tokens: usize,
}

#[derive(Deserialize)]
struct ChatRequest {
    messages: Vec<ChatMsg>,
    #[serde(default = "default_max_tokens")]
    max_tokens: usize,
}

#[derive(Deserialize, Serialize)]
struct ChatMsg {
    role: String,
    content: String,
}

#[derive(Serialize)]
struct GenerateResponse {
    response: String,
}

fn default_max_tokens() -> usize { 200 }

async fn generate(
    State(state): State<AppState>,
    Json(req): Json<GenerateRequest>,
) -> Json<GenerateResponse> {
    let response = state.model
        .generate(&req.prompt, req.max_tokens)
        .await
        .unwrap_or_else(|e| format!("Error: {}", e));

    Json(GenerateResponse { response })
}

async fn chat_stream(
    State(state): State<AppState>,
    Json(req): Json<ChatRequest>,
) -> Sse<impl tokio_stream::Stream<Item = Result<Event, std::convert::Infallible>>> {
    let messages: Vec<ChatMessage> = req.messages.iter().map(|m| {
        match m.role.as_str() {
            "system" => ChatMessage::system(&m.content),
            "assistant" => ChatMessage::assistant(&m.content),
            _ => ChatMessage::user(&m.content),
        }
    }).collect();

    let stream = state.model
        .chat_stream(&messages, req.max_tokens)
        .await
        .unwrap();

    let event_stream = stream.map(|token| {
        let token = token.unwrap_or_default();
        Ok(Event::default().data(token))
    });

    Sse::new(event_stream)
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    println!("Loading model...");
    let model = ModelBuilder::new()
        .path("models/llama3.2-1b.gguf")
        .context_size(4096)
        .n_gpu_layers(-1)
        .build().await?;

    let state = AppState {
        model: Arc::new(model),
    };

    let app = Router::new()
        .route("/api/generate", post(generate))
        .route("/api/chat/stream", post(chat_stream))
        .with_state(state);

    let listener = tokio::net::TcpListener::bind("0.0.0.0:3000").await?;
    println!("Mullama API server running on http://localhost:3000");
    axum::serve(listener, app).await?;
    Ok(())
}

Run the server:

cargo run --release

Complete Working Examples

Here are the complete, production-ready versions of everything we built.

complete-chatbot.js
const { Model } = require('mullama');
const readline = require('readline');

async function main() {
    // Configuration
    const MODEL_PATH = process.env.MODEL_PATH || 'models/llama3.2-1b.gguf';
    const CONTEXT_SIZE = parseInt(process.env.CONTEXT_SIZE || '4096');
    const MAX_TOKENS = parseInt(process.env.MAX_TOKENS || '500');

    // Load model
    console.log(`Loading model from ${MODEL_PATH}...`);
    const model = new Model(MODEL_PATH, {
        contextSize: CONTEXT_SIZE,
        nGpuLayers: -1
    });
    console.log('Model loaded successfully!\n');

    // Setup readline
    const rl = readline.createInterface({
        input: process.stdin,
        output: process.stdout
    });

    // System prompt
    const messages = [{
        role: 'system',
        content: 'You are a helpful, concise assistant. Answer questions clearly and directly.'
    }];

    console.log('=== Mullama Chatbot ===');
    console.log('Commands: "quit" to exit, "clear" to reset history\n');

    const prompt = () => {
        rl.question('You: ', async (input) => {
            const trimmed = input.trim();

            if (!trimmed) { prompt(); return; }
            if (trimmed === 'quit') { model.close(); rl.close(); return; }
            if (trimmed === 'clear') {
                messages.length = 1;  // Keep system prompt
                console.log('[History cleared]\n');
                prompt();
                return;
            }

            messages.push({ role: 'user', content: trimmed });

            process.stdout.write('Assistant: ');
            let response = '';

            try {
                for await (const token of model.chat(messages, {
                    maxTokens: MAX_TOKENS,
                    stream: true
                })) {
                    process.stdout.write(token);
                    response += token;
                }
            } catch (err) {
                console.error(`\n[Error: ${err.message}]`);
            }

            console.log('\n');
            if (response) {
                messages.push({ role: 'assistant', content: response });
            }

            prompt();
        });
    };

    prompt();
}

main().catch(console.error);
complete_chatbot.py
import os
import sys
from mullama import Model

def main():
    # Configuration
    model_path = os.environ.get("MODEL_PATH", "models/llama3.2-1b.gguf")
    context_size = int(os.environ.get("CONTEXT_SIZE", "4096"))
    max_tokens = int(os.environ.get("MAX_TOKENS", "500"))

    # Load model
    print(f"Loading model from {model_path}...")
    model = Model(model_path,
        context_size=context_size,
        n_gpu_layers=-1
    )
    print("Model loaded successfully!\n")

    # System prompt
    messages = [{
        "role": "system",
        "content": "You are a helpful, concise assistant. Answer questions clearly and directly."
    }]

    print("=== Mullama Chatbot ===")
    print('Commands: "quit" to exit, "clear" to reset history\n')

    while True:
        try:
            user_input = input("You: ").strip()
        except (EOFError, KeyboardInterrupt):
            print("\nGoodbye!")
            break

        if not user_input:
            continue
        if user_input == "quit":
            print("Goodbye!")
            break
        if user_input == "clear":
            messages = messages[:1]  # Keep system prompt
            print("[History cleared]\n")
            continue

        messages.append({"role": "user", "content": user_input})

        print("Assistant: ", end="", flush=True)
        response = ""

        try:
            for token in model.chat(messages, max_tokens=max_tokens, stream=True):
                print(token, end="", flush=True)
                response += token
        except Exception as e:
            print(f"\n[Error: {e}]")

        print("\n")
        if response:
            messages.append({"role": "assistant", "content": response})

if __name__ == "__main__":
    main()

What You Built

In this tutorial you learned how to:

  1. Download and load a GGUF model with Mullama
  2. Generate text with configurable parameters
  3. Stream tokens for real-time output
  4. Maintain conversation history for multi-turn chat
  5. Deploy as an API using Express (Node.js), FastAPI (Python), or Axum (Rust)

What's Next

  • Streaming Guide


    Advanced streaming with backpressure, cancellation, and custom callbacks.

  • Sampling Strategies


    Control generation quality with temperature, top-p, top-k, and repetition penalties.

  • Multimodal


    Process images and audio alongside text for vision and voice models.

  • Embeddings


    Generate text embeddings for semantic search and RAG pipelines.

  • Web Framework


    Build production REST APIs with Axum integration, middleware, and auth.

  • Language Bindings


    Deep dive into Node.js, Python, Go, PHP, and C/C++ binding APIs.

Explore More Examples

Check out the Tutorials & Examples section for complete projects including RAG pipelines, voice assistants, batch processing, and edge deployment.