Process images and audio alongside text using vision-language models (VLMs) and audio-language models. Mullama supports multimodal inference with a unified API across image and audio inputs.
Feature Gate
In Rust, enable the multimodal feature flag:
    [dependencies]
    mullama = { version = "0.3", features = ["multimodal"] }

    # For audio processing, also enable streaming-audio (use this line instead of the one above)
    mullama = { version = "0.3", features = ["multimodal", "streaming-audio"] }
Node.js and Python include multimodal support by default.
The MultimodalProcessor is the central API for processing text, images, and audio together. It handles:
Loading and preprocessing images into the format expected by vision encoders
Converting audio into the format expected by audio encoders
Combining multimodal inputs with text prompts
Routing to the appropriate encoder based on input type
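As a rough illustration of the routing step, the sketch below classifies input paths by the MIME type guessed from their file extension. This is not Mullama's actual dispatch logic, which this page does not describe; `classify_input` is a hypothetical helper that uses only the Python standard library.

```python
import mimetypes


def classify_input(path: str) -> str:
    """Return "image", "audio", or "unknown" for a file path (hypothetical helper)."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        return "unknown"
    if mime.startswith("image/"):
        return "image"
    if mime.startswith("audio/"):
        return "audio"
    return "unknown"


# Group a mixed list of inputs by the encoder family they would need.
inputs = ["./photo.jpg", "./recording.wav", "./notes.xyz123"]
routed = {path: classify_input(path) for path in inputs}
```

Extension-based detection is only a heuristic. Inspecting file headers would be more robust for uploads whose names cannot be trusted.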
Node.js:

    import { Model, MultimodalProcessor } from 'mullama';

    const model = await Model.load('./llava-v1.6.gguf');
    const processor = new MultimodalProcessor(model);

    const response = await processor.generate({
      text: "Describe this image in detail.",
      images: ['./photo.jpg'],
    });
    console.log(response);

Python:

    from mullama import Model, MultimodalProcessor

    model = Model.load("./llava-v1.6.gguf")
    processor = MultimodalProcessor(model)

    response = processor.generate(
        text="Describe this image in detail.",
        images=["./photo.jpg"],
    )
    print(response)

Rust:

    use mullama::{Model, MultimodalProcessor};
    use std::sync::Arc;

    let model = Arc::new(Model::load("llava-v1.6.gguf")?);
    let mut processor = MultimodalProcessor::new(model)?;

    let response = processor.generate_with_image(
        "Describe this image in detail.",
        "photo.jpg",
        200,
    )?;
    println!("{}", response);

CLI:

    mullama run llava:13b "Describe this image in detail." --image ./photo.jpg
Process images from memory (useful for web applications receiving uploads):
Node.js:

    import { readFileSync } from 'fs';

    const imageBuffer = readFileSync('./photo.jpg');
    const response = await processor.generate({
      text: "Describe this image.",
      imageBuffers: [imageBuffer],
    });

Python:

    with open("./photo.jpg", "rb") as f:
        image_buffer = f.read()

    response = processor.generate(
        text="Describe this image.",
        image_buffers=[image_buffer],
    )

Rust:

    let image_data = std::fs::read("photo.jpg")?;
    let response = processor.generate_with_image_buffer(
        "Describe this image.",
        &image_data,
        200,
    )?;

CLI:

    # Pipe image data via stdin
    cat photo.jpg | mullama run llava:13b "Describe this image." --image -
Process raw pixel data (useful for video frames or generated images):
In the snippets below, `width` and `height` are the frame dimensions in pixels and are assumed to be defined elsewhere.

Node.js:

    import { RawImage } from 'mullama';

    // Raw RGBA pixels (4 bytes per pixel)
    const pixels = new Uint8Array(width * height * 4);
    const image = new RawImage(pixels, width, height, 'rgba');

    const response = await processor.generate({
      text: "What do you see?",
      rawImages: [image],
    });

Python:

    import numpy as np
    from mullama import RawImage

    # Raw RGBA pixels (numpy array, 4 bytes per pixel)
    pixels = np.zeros((height, width, 4), dtype=np.uint8)
    image = RawImage(pixels, width, height, "rgba")

    response = processor.generate(
        text="What do you see?",
        raw_images=[image],
    )

Rust:

    use mullama::multimodal::RawImage;

    // Raw pixels with 4 channels per pixel
    let pixels: Vec<u8> = vec![0; width * height * 4];
    let image = RawImage::new(&pixels, width, height, 4)?;

    let response = processor.generate_with_raw_image(
        "What do you see?",
        &image,
        200,
    )?;

CLI:

    # Raw pixel input is not available via CLI
    # Use file path or buffer instead
    mullama run llava:13b "What do you see?" --image ./image.png
Images are automatically resized and normalized to match the vision encoder's expected input dimensions. The original aspect ratio is preserved with padding. No manual preprocessing is required.
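To make the "aspect ratio preserved with padding" behavior concrete, here is a minimal sketch of the usual letterbox arithmetic. It is an illustration under assumptions, not Mullama's implementation: the `letterbox_geometry` helper is hypothetical, the 336x336 target is just an example encoder size, and centering the image inside the padding is an assumed choice.

```python
def letterbox_geometry(src_w: int, src_h: int, dst_w: int, dst_h: int) -> dict:
    """Compute the scaled size and per-side padding needed to fit a source
    image into a target canvas without distorting its aspect ratio."""
    if min(src_w, src_h, dst_w, dst_h) <= 0:
        raise ValueError("all dimensions must be positive")

    # Use the smaller scale factor so the whole image fits inside the canvas.
    scale = min(dst_w / src_w, dst_h / src_h)
    new_w = min(dst_w, max(1, round(src_w * scale)))
    new_h = min(dst_h, max(1, round(src_h * scale)))

    # Split the leftover space evenly; any odd pixel goes to the right/bottom.
    pad_x = dst_w - new_w
    pad_y = dst_h - new_h
    left, top = pad_x // 2, pad_y // 2
    return {
        "width": new_w,
        "height": new_h,
        "left": left,
        "top": top,
        "right": pad_x - left,
        "bottom": pad_y - top,
    }


# A 1920x1080 frame scaled into an assumed 336x336 encoder input.
geometry = letterbox_geometry(1920, 1080, 336, 336)
```

For the 1920x1080 example the image is scaled to 336x189, leaving 73 rows of padding above and 74 below.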
Process audio alongside text for speech understanding tasks:
Node.js:

    import { Model, MultimodalProcessor } from 'mullama';

    const model = await Model.load('./audio-model.gguf');
    const processor = new MultimodalProcessor(model);

    const response = await processor.generate({
      text: "Transcribe this audio.",
      audio: ['./recording.wav'],
    });
    console.log(response);

Python:

    from mullama import Model, MultimodalProcessor

    model = Model.load("./audio-model.gguf")
    processor = MultimodalProcessor(model)

    response = processor.generate(
        text="Transcribe this audio.",
        audio=["./recording.wav"],
    )
    print(response)

Rust:

    use mullama::{Model, MultimodalProcessor};
    use std::sync::Arc;

    let model = Arc::new(Model::load("audio-model.gguf")?);
    let mut processor = MultimodalProcessor::new(model)?;

    let response = processor.generate_with_audio(
        "Transcribe this audio.",
        "recording.wav",
        500,
    )?;
    println!("{}", response);

CLI:

    mullama run audio-model "Transcribe this audio." --audio ./recording.wav
Audio is automatically resampled to the model's expected sample rate (typically 16 kHz) and converted to mono if needed. The format-conversion feature handles all necessary format transformations.
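The two conversions mentioned above, downmixing to mono and resampling to 16 kHz, can be sketched in plain Python. This is only a conceptual illustration: `to_mono` and `resample_linear` are hypothetical helpers, and linear interpolation is a stand-in for whatever resampler Mullama actually uses, which is not documented here.

```python
def to_mono(frames: list[list[float]]) -> list[float]:
    """Average the channels of each frame (one inner list per frame)."""
    return [sum(frame) / len(frame) for frame in frames]


def resample_linear(samples: list[float], src_rate: int, dst_rate: int) -> list[float]:
    """Resample by linear interpolation. Adequate for illustration; production
    resamplers low-pass filter first to avoid aliasing when downsampling."""
    if src_rate <= 0 or dst_rate <= 0:
        raise ValueError("sample rates must be positive")
    if not samples or src_rate == dst_rate:
        return list(samples)

    out_len = max(1, len(samples) * dst_rate // src_rate)
    step = src_rate / dst_rate  # source samples advanced per output sample
    last = len(samples) - 1
    out = []
    for i in range(out_len):
        pos = i * step
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


# One second of 48 kHz stereo silence becomes one second of 16 kHz mono.
stereo = [[0.0, 0.0] for _ in range(48000)]
mono_16k = resample_linear(to_mono(stereo), 48000, 16000)
```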
Process audio from memory (e.g., from a microphone stream):
Node.js:

    import { AudioBuffer } from 'mullama';

    // Float32 PCM samples at 16kHz
    const samples = new Float32Array(16000 * 5); // 5 seconds
    const audio = new AudioBuffer(samples, 16000, 1);

    const response = await processor.generate({
      text: "What was said?",
      audioBuffers: [audio],
    });

Python:

    import numpy as np
    from mullama import AudioBuffer

    # Float32 PCM samples at 16kHz
    samples = np.zeros(16000 * 5, dtype=np.float32)  # 5 seconds
    audio = AudioBuffer(samples, sample_rate=16000, channels=1)

    response = processor.generate(
        text="What was said?",
        audio_buffers=[audio],
    )

Rust:

    use mullama::multimodal::AudioBuffer;

    let samples: Vec<f32> = vec![0.0; 16000 * 5]; // 5 seconds
    let audio = AudioBuffer::new(&samples, 16000, 1)?;

    let response = processor.generate_with_audio_buffer(
        "What was said?",
        &audio,
        500,
    )?;

CLI:

    # Record and process
    mullama run audio-model "What was said?" --audio-device default --duration 5
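The in-memory examples above expect Float32 PCM samples, while many capture APIs deliver signed 16-bit PCM. The sketch below shows one common way to convert between the two using only the standard library. `pcm16_to_float32` is a hypothetical helper, and little-endian byte order is an assumption about the capture source.

```python
import struct
from array import array


def pcm16_to_float32(raw: bytes) -> array:
    """Convert little-endian signed 16-bit PCM bytes to float32 samples
    in the range [-1.0, 1.0)."""
    if len(raw) % 2 != 0:
        raise ValueError("PCM16 data must contain an even number of bytes")
    count = len(raw) // 2
    ints = struct.unpack(f"<{count}h", raw)
    # Dividing by 32768 maps the int16 range onto [-1.0, 1.0).
    return array("f", (value / 32768.0 for value in ints))


# Four example samples: silence, half scale, and the two int16 extremes.
raw = struct.pack("<4h", 0, 16384, -32768, 32767)
samples = pcm16_to_float32(raw)
```

Whether `AudioBuffer` accepts an `array` directly is not confirmed by this page. With NumPy available, `np.frombuffer(raw, dtype="<i2").astype(np.float32) / 32768.0` produces the same values as a float32 array like the one in the Python example above.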
Node.js:

    const response = await processor.generate({
      text: "Provide a detailed description of this image including colors, objects, and composition.",
      images: ['./landscape.jpg'],
      maxTokens: 500,
    });

Python:

    response = processor.generate(
        text="Provide a detailed description of this image including colors, objects, and composition.",
        images=["./landscape.jpg"],
        max_tokens=500,
    )

Rust:

    let response = processor.generate_with_image(
        "Provide a detailed description of this image including colors, objects, and composition.",
        "landscape.jpg",
        500,
    )?;

CLI:

    mullama run llava:13b \
      "Provide a detailed description of this image including colors, objects, and composition." \
      --image ./landscape.jpg --max-tokens 500
Node.js:

    const questions = [
      "How many people are in this image?",
      "What is the dominant color?",
      "Is this indoors or outdoors?",
    ];

    for (const question of questions) {
      const answer = await processor.generate({
        text: question,
        images: ['./photo.jpg'],
        maxTokens: 100,
      });
      console.log(`Q: ${question}\nA: ${answer}\n`);
    }

Python:

    questions = [
        "How many people are in this image?",
        "What is the dominant color?",
        "Is this indoors or outdoors?",
    ]

    for question in questions:
        answer = processor.generate(
            text=question,
            images=["./photo.jpg"],
            max_tokens=100,
        )
        print(f"Q: {question}\nA: {answer}\n")

Rust:

    let questions = vec![
        "How many people are in this image?",
        "What is the dominant color?",
        "Is this indoors or outdoors?",
    ];

    for question in &questions {
        let answer = processor.generate_with_image(question, "photo.jpg", 100)?;
        println!("Q: {}\nA: {}\n", question, answer);
    }

CLI:

    mullama run llava:13b "How many people are in this image?" --image ./photo.jpg
    mullama run llava:13b "What is the dominant color?" --image ./photo.jpg
    mullama run llava:13b "Is this indoors or outdoors?" --image ./photo.jpg
Node.js:

    const response = await processor.generate({
      text: "Extract all text visible in this image. Format it as a list.",
      images: ['./document.png'],
      maxTokens: 1000,
    });
    console.log(response);

Python:

    response = processor.generate(
        text="Extract all text visible in this image. Format it as a list.",
        images=["./document.png"],
        max_tokens=1000,
    )
    print(response)

Rust:

    let response = processor.generate_with_image(
        "Extract all text visible in this image. Format it as a list.",
        "document.png",
        1000,
    )?;

CLI:

    mullama run llava:13b \
      "Extract all text visible in this image. Format it as a list." \
      --image ./document.png --max-tokens 1000
Node.js:

    const transcription = await processor.generate({
      text: "Transcribe the following audio word for word.",
      audio: ['./speech.wav'],
      maxTokens: 1000,
    });
    console.log(transcription);

Python:

    transcription = processor.generate(
        text="Transcribe the following audio word for word.",
        audio=["./speech.wav"],
        max_tokens=1000,
    )
    print(transcription)

Rust:

    let transcription = processor.generate_with_audio(
        "Transcribe the following audio word for word.",
        "speech.wav",
        1000,
    )?;

CLI:

    mullama run audio-model "Transcribe the following audio word for word." \
      --audio ./speech.wav --max-tokens 1000