Multimodal Processing¶

Process images alongside text using vision-language models (VLMs). This tutorial covers image captioning, visual question answering, supported formats, image preprocessing, and batch processing.

What You'll Build¶

A multimodal processing pipeline that:

Loads and processes images in various formats (JPEG, PNG, WebP)
Generates image captions and descriptions
Answers questions about image content
Preprocesses images for optimal model performance
Processes multiple images in batch
Handles vision model loading and configuration

Prerequisites¶

Mullama with multimodal feature enabled
A vision-capable GGUF model (e.g., LLaVA, BakLLaVA)
Image processing dependencies:

Linux (Ubuntu/Debian)macOS

sudo apt install -y libpng-dev libjpeg-dev libtiff-dev libwebp-dev

brew install libpng jpeg-turbo webp

Rust toolchain (multimodal is primarily a Rust-native feature)
Features: multimodal, optionally format-conversion

# Pull a vision-capable model
mullama pull llava:7b

Vision Model Requirements¶

Not all models support multimodal input. You need a vision-language model:

Model	Size	Description
`llava:7b`	4.7 GB	LLaVA 1.5 -- general-purpose vision
`llava:13b`	8.1 GB	LLaVA 1.5 -- higher quality
`bakllava:7b`	4.7 GB	BakLLaVA -- improved architecture
`llava-llama3:8b`	5.0 GB	LLaVA with Llama 3 backbone

Vision models consist of two components:

Vision encoder -- Processes images into embeddings (e.g., CLIP ViT)
Language model -- Generates text conditioned on vision embeddings

Step 1: Load a Vision Model¶

use mullama::{MultimodalProcessor, ImageInput};
use std::path::Path;

// Initialize the multimodal processor with a vision model
let processor = MultimodalProcessor::new()
    .model_path("./llava-7b.Q4_K_M.gguf")
    .enable_image_processing()
    .n_ctx(2048)
    .n_gpu_layers(-1)  // GPU acceleration
    .build()?;

println!("Vision model loaded: {}", processor.model_name());
println!("Image support: {}", processor.supports_images());

Step 2: Image Loading and Preprocessing¶

Load images and prepare them for the vision encoder.

use mullama::{ImageInput, ImageFormat};

// Load image from file
let image = ImageInput::from_file("./photo.jpg")?;
println!("Image size: {}x{}", image.width(), image.height());
println!("Format: {:?}", image.format());

// Load from bytes (e.g., from HTTP request)
let bytes = std::fs::read("./photo.png")?;
let image = ImageInput::from_bytes(&bytes, ImageFormat::Png)?;

// Load from URL (requires format-conversion feature)
let image = ImageInput::from_url("https://example.com/image.jpg").await?;

// Preprocessing: resize for optimal model input
let preprocessed = image
    .resize(336, 336)         // Standard CLIP input size
    .normalize()              // Normalize pixel values
    .to_rgb()?;               // Ensure RGB format

println!("Preprocessed: {}x{} RGB", preprocessed.width(), preprocessed.height());

Supported Image Formats¶

Format	Extension	Notes
JPEG	`.jpg`, `.jpeg`	Lossy, most common
PNG	`.png`	Lossless, supports transparency
WebP	`.webp`	Modern format, good compression
TIFF	`.tif`, `.tiff`	High quality, scientific imaging
BMP	`.bmp`	Uncompressed, large files

Step 3: Image Captioning¶

Generate descriptive captions for images.

use mullama::{MultimodalProcessor, ImageInput, SamplerParams};

fn caption_image(processor: &MultimodalProcessor, image_path: &str) -> Result<String, MullamaError> {
    let image = ImageInput::from_file(image_path)?;

    // Simple captioning prompt
    let caption = processor.generate_with_image(
        &image,
        "Describe this image in detail.",
        200,  // max_tokens
        Some(SamplerParams {
            temperature: 0.3,    // Low temperature for factual descriptions
            top_p: 0.9,
            ..Default::default()
        }),
    )?;

    println!("Caption: {}", caption.trim());
    Ok(caption.trim().to_string())
}

// Usage
let caption = caption_image(&processor, "./sunset.jpg")?;

Step 4: Visual Question Answering¶

Ask questions about image content.

fn visual_qa(
    processor: &MultimodalProcessor,
    image_path: &str,
    question: &str,
) -> Result<String, MullamaError> {
    let image = ImageInput::from_file(image_path)?;

    let answer = processor.generate_with_image(
        &image,
        question,
        150,
        Some(SamplerParams {
            temperature: 0.2,    // Very focused for factual answers
            top_k: 20,
            ..Default::default()
        }),
    )?;

    println!("Q: {}", question);
    println!("A: {}", answer.trim());
    Ok(answer.trim().to_string())
}

// Ask different questions about the same image
let image_path = "./street_scene.jpg";
visual_qa(&processor, image_path, "How many people are in this image?")?;
visual_qa(&processor, image_path, "What is the weather like?")?;
visual_qa(&processor, image_path, "What colors are prominent?")?;
visual_qa(&processor, image_path, "Is this indoors or outdoors?")?;

Step 5: Streaming Image Responses¶

Stream generated text for longer descriptions.

use mullama::{MultimodalProcessor, ImageInput, StreamConfig, TokenStream};
use futures::StreamExt;
use std::io::Write;

async fn stream_image_description(
    processor: &MultimodalProcessor,
    image_path: &str,
) -> Result<String, MullamaError> {
    let image = ImageInput::from_file(image_path)?;

    let config = StreamConfig::default()
        .max_tokens(300)
        .temperature(0.5);

    let prompt = "Provide a detailed description of this image, including colors, \
                  objects, people, setting, and mood.";

    let mut stream = processor.stream_with_image(&image, prompt, config).await?;
    let mut response = String::new();

    print!("Description: ");
    while let Some(token) = stream.next().await {
        let token = token?;
        print!("{}", token.text);
        std::io::stdout().flush().unwrap();
        response.push_str(&token.text);
        if token.is_final { break; }
    }
    println!();

    Ok(response)
}

Step 6: Batch Image Processing¶

Process multiple images efficiently.

use mullama::{MultimodalProcessor, ImageInput, MullamaError};
use std::path::PathBuf;

fn batch_caption_images(
    processor: &MultimodalProcessor,
    image_paths: &[PathBuf],
    prompt: &str,
) -> Vec<Result<(PathBuf, String), MullamaError>> {
    let start = std::time::Instant::now();
    let mut results = Vec::new();

    for (i, path) in image_paths.iter().enumerate() {
        print!("[{}/{}] Processing: {} ... ",
            i + 1, image_paths.len(), path.display());
        std::io::stdout().flush().unwrap();

        let result = (|| {
            let image = ImageInput::from_file(path)?;
            let caption = processor.generate_with_image(
                &image, prompt, 100,
                Some(SamplerParams { temperature: 0.3, ..Default::default() }),
            )?;
            Ok((path.clone(), caption.trim().to_string()))
        })();

        match &result {
            Ok((_, caption)) => println!("\"{}\"", &caption[..caption.len().min(60)]),
            Err(e) => println!("Error: {}", e),
        }
        results.push(result);
    }

    let elapsed = start.elapsed();
    let successful = results.iter().filter(|r| r.is_ok()).count();
    println!("\nProcessed: {}/{} images in {:.2}s",
        successful, image_paths.len(), elapsed.as_secs_f64());

    results
}

// Usage
let paths: Vec<PathBuf> = std::fs::read_dir("./images/")?
    .filter_map(|e| e.ok())
    .filter(|e| {
        let ext = e.path().extension().and_then(|e| e.to_str()).unwrap_or("");
        matches!(ext, "jpg" | "jpeg" | "png" | "webp")
    })
    .map(|e| e.path())
    .collect();

let results = batch_caption_images(&processor, &paths, "Describe this image briefly:");

Complete Working Example¶

use mullama::prelude::*;
use mullama::{MultimodalProcessor, ImageInput, SamplerParams, MullamaError};
use std::path::PathBuf;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    println!("Mullama Multimodal Demo");
    println!("=======================\n");

    #[cfg(feature = "multimodal")]
    {
        let model_path = std::env::args().nth(1)
            .unwrap_or_else(|| "llava-7b.Q4_K_M.gguf".to_string());
        let image_path = std::env::args().nth(2)
            .unwrap_or_else(|| "sample.jpg".to_string());

        // Initialize processor
        println!("Loading vision model...");
        let processor = MultimodalProcessor::new()
            .model_path(&model_path)
            .enable_image_processing()
            .n_ctx(2048)
            .n_gpu_layers(-1)
            .build()?;
        println!("Model ready!\n");

        // Load image
        let image = ImageInput::from_file(&image_path)?;
        println!("Image: {} ({}x{})\n", image_path, image.width(), image.height());

        // Caption
        println!("--- Captioning ---");
        let caption = processor.generate_with_image(
            &image,
            "Describe this image in one sentence.",
            100,
            Some(SamplerParams { temperature: 0.3, ..Default::default() }),
        )?;
        println!("Caption: {}\n", caption.trim());

        // Visual QA
        println!("--- Visual QA ---");
        let questions = [
            "What objects can you see?",
            "What colors are prominent?",
            "Describe the mood or atmosphere.",
        ];

        for question in &questions {
            let answer = processor.generate_with_image(
                &image, question, 80,
                Some(SamplerParams { temperature: 0.2, ..Default::default() }),
            )?;
            println!("Q: {}", question);
            println!("A: {}\n", answer.trim());
        }

        // Detailed description
        println!("--- Detailed Description ---");
        let description = processor.generate_with_image(
            &image,
            "Provide a detailed, structured description of this image.",
            300,
            Some(SamplerParams { temperature: 0.5, ..Default::default() }),
        )?;
        println!("{}", description.trim());
    }

    #[cfg(not(feature = "multimodal"))]
    {
        println!("This example requires the 'multimodal' feature.");
        println!("Run with: cargo run --example multimodal --features multimodal");
    }

    Ok(())
}

Python Bindings (Experimental)¶

Multimodal support in Python bindings is under development. Currently you can use the daemon for multimodal processing:

import requests
import base64

# Using the Mullama daemon's multimodal endpoint
def describe_image(image_path: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = requests.post("http://localhost:8080/v1/chat/completions", json={
        "model": "llava:7b",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}}
            ]
        }],
        "max_tokens": 200
    })

    return response.json()["choices"][0]["message"]["content"]

# Usage
caption = describe_image("./photo.jpg")
print(f"Caption: {caption}")

Image Preprocessing Tips¶

Parameter	Recommendation	Reason
Resolution	336x336 or 672x672	Matches CLIP encoder training
Format	RGB	Vision encoders expect 3 channels
Normalization	ImageNet mean/std	Standard for CLIP-based models
Aspect ratio	Preserve with padding	Avoids distortion
File size	< 10 MB	Memory efficiency during loading

Model Compatibility

Not all GGUF models support multimodal input. You need specifically trained vision-language models. Standard text-only models will produce errors when given image input. Check model documentation or metadata for vision or multimodal capabilities.

What's Next¶

Voice Assistant -- Combine vision with audio processing
Batch Processing -- Process image collections at scale
API Server -- Serve multimodal endpoints over HTTP
Guide: Multimodal -- In-depth multimodal architecture