Voice Assistant¶

Build a real-time voice assistant that captures audio, detects speech, processes it through a local LLM, and streams responses -- all running privately with no cloud dependencies.

What You'll Build¶

A voice-to-text-to-response pipeline that:

Captures audio in real-time from the microphone
Detects speech using Voice Activity Detection (VAD)
Processes audio through the multimodal pipeline
Generates streaming LLM responses
Manages conversation context across turns
Handles errors gracefully with automatic recovery

Prerequisites¶

A chat-capable GGUF model (instruct-tuned recommended)
Audio system dependencies:

Linux (Ubuntu/Debian)macOS

sudo apt install -y libasound2-dev libpulse-dev libflac-dev libvorbis-dev libopus-dev

# CoreAudio is available by default

Rust toolchain (this tutorial uses Rust primarily since audio capture requires native access)
Features: streaming-audio, multimodal, async, streaming

mullama pull llama3.2:1b  # Pull a model for generation

Architecture Overview¶

Microphone --> Audio Capture --> Ring Buffer --> VAD (Voice Activity Detection)
                                                      |
                                                      v (speech detected)
                                               Audio Processing --> Multimodal Processor
                                                                          |
                                                                          v
                                                                   Transcription
                                                                          |
                                                                          v
                                                                  LLM Generation --> Streaming Output

Key components:

StreamingAudioProcessor -- Real-time audio capture with ring buffer
Voice Activity Detection (VAD) -- Detects when the user is speaking
MultimodalProcessor -- Converts audio to text (speech-to-text)
TokenStream -- Generates and streams the LLM response

Step 1: Audio Capture Configuration¶

Configure audio capture parameters for speech recognition.

use mullama::{StreamingAudioProcessor, AudioStreamConfig};

// Configure audio for speech recognition
let audio_config = AudioStreamConfig::new()
    .sample_rate(16000)              // 16kHz -- standard for speech
    .channels(1)                      // Mono audio
    .chunk_duration_ms(100)           // 100ms chunks for low latency
    .enable_voice_detection(true)     // Built-in VAD
    .vad_threshold(0.2)              // Sensitivity (0.0-1.0, lower = more sensitive)
    .enable_noise_reduction(true)     // Filter background noise
    .silence_timeout_ms(1500);        // End utterance after 1.5s silence

let mut audio_processor = StreamingAudioProcessor::new(audio_config)?;

Configuration Parameters¶

Parameter	Default	Description
`sample_rate`	16000	Audio sample rate in Hz (16kHz for speech)
`channels`	1	Number of audio channels (mono for speech)
`chunk_duration_ms`	100	Size of each audio chunk in milliseconds
`enable_voice_detection`	`true`	Enable Voice Activity Detection
`vad_threshold`	0.2	VAD sensitivity (lower = more sensitive)
`enable_noise_reduction`	`true`	Filter background noise
`silence_timeout_ms`	1500	Silence duration to end an utterance

Step 2: Voice Activity Detection¶

Listen for speech and capture complete utterances.

use mullama::{StreamingAudioProcessor, AudioStreamConfig, MullamaError};
use futures::StreamExt;

async fn listen_for_speech(
    audio_processor: &mut StreamingAudioProcessor,
) -> Result<Vec<f32>, MullamaError> {
    let mut audio_stream = audio_processor.start_capture().await?;
    let mut speech_buffer: Vec<f32> = Vec::new();
    let mut is_speaking = false;

    println!("Listening...");

    while let Some(chunk) = audio_stream.next().await {
        let processed = audio_processor.process_chunk(&chunk).await?;

        if processed.voice_detected && processed.signal_level > 0.1 {
            if !is_speaking {
                println!("[Speech detected]");
                is_speaking = true;
            }
            speech_buffer.extend_from_slice(&processed.audio_data);
        } else if is_speaking {
            // Silence after speech -- utterance complete
            println!("[End of utterance: {} samples]", speech_buffer.len());
            break;
        }
    }

    Ok(speech_buffer)
}

Step 3: Speech-to-Text¶

Convert captured audio to text using the multimodal processor.

use mullama::{MultimodalProcessor, AudioInput};

async fn transcribe_audio(
    multimodal: &MultimodalProcessor,
    audio_data: &[f32],
    sample_rate: u32,
) -> Result<String, MullamaError> {
    // Create audio input from captured samples
    let audio_input = AudioInput::from_samples(audio_data, sample_rate, 1)?;

    // Process through the multimodal pipeline
    let result = multimodal.process_audio(&audio_input).await?;

    match result.transcript {
        Some(text) if !text.trim().is_empty() => {
            println!("Transcribed: \"{}\"", text);
            Ok(text)
        }
        _ => {
            println!("(no speech recognized)");
            Ok(String::new())
        }
    }
}

Step 4: Generate Streaming Response¶

Generate a response using the transcribed text.

use mullama::{AsyncModel, StreamConfig, TokenStream};
use futures::StreamExt;
use std::io::Write;

async fn generate_response(
    model: &AsyncModel,
    transcript: &str,
    history: &str,
) -> Result<String, MullamaError> {
    let prompt = format!("{}User: {}\nAssistant:", history, transcript);

    let config = StreamConfig::default()
        .max_tokens(200)       // Keep responses concise for voice
        .temperature(0.7)
        .top_k(40);

    let mut stream = TokenStream::new(model.clone(), &prompt, config).await?;
    let mut response = String::new();

    print!("Assistant: ");
    while let Some(token) = stream.next().await {
        let token = token?;
        if response.contains("\nUser:") { break; }
        print!("{}", token.text);
        std::io::stdout().flush().unwrap();
        response.push_str(&token.text);
        if token.is_final { break; }
    }
    println!();

    Ok(response.trim().to_string())
}

Step 5: Conversation Loop¶

Tie all components together in a continuous voice interaction loop.

use mullama::prelude::*;
use mullama::{
    AsyncModel, StreamingAudioProcessor, AudioStreamConfig,
    MultimodalProcessor, AudioInput, StreamConfig, TokenStream,
};
use futures::StreamExt;

async fn run_voice_assistant() -> Result<(), MullamaError> {
    // Initialize model
    let model = AsyncModel::load("path/to/model.gguf").await?;

    // Initialize audio
    let audio_config = AudioStreamConfig::new()
        .sample_rate(16000)
        .enable_voice_detection(true)
        .vad_threshold(0.2)
        .enable_noise_reduction(true)
        .silence_timeout_ms(1500);
    let mut audio_processor = StreamingAudioProcessor::new(audio_config)?;

    // Initialize multimodal processor
    let multimodal = MultimodalProcessor::new()
        .enable_audio_processing()
        .build();

    // Conversation state
    let mut history = String::from(
        "System: You are a helpful voice assistant. Keep responses concise \
         and conversational since they will be spoken aloud.\n\n"
    );

    println!("Voice Assistant ready! Start speaking...");
    println!("(Press Ctrl+C to exit)\n");

    loop {
        // Listen for speech
        let audio_data = listen_for_speech(&mut audio_processor).await?;
        if audio_data.is_empty() { continue; }

        // Transcribe
        let transcript = transcribe_audio(&multimodal, &audio_data, 16000).await?;
        if transcript.is_empty() { continue; }

        println!("You: {}", transcript);

        // Generate response
        let response = generate_response(&model, &transcript, &history).await?;

        // Update history (keep bounded)
        history.push_str(&format!("User: {}\nAssistant: {}\n", transcript, response));
        if history.len() > 3000 {
            let trim_point = history.len() - 2000;
            if let Some(pos) = history[trim_point..].find('\n') {
                history = history[trim_point + pos..].to_string();
            }
        }

        println!();
    }
}

Latency Optimization¶

For real-time voice interaction, minimize latency at each stage:

Stage	Target Latency	Tips
Audio capture	< 100ms	Use small chunk sizes (50-100ms)
VAD detection	< 50ms	Lower threshold for faster detection
Transcription	< 500ms	Use smaller/quantized speech models
LLM first token	< 200ms	Use small models, GPU acceleration
Total round-trip	< 1.5s	Overlap processing stages

// Optimize for low latency
let audio_config = AudioStreamConfig::new()
    .chunk_duration_ms(50)           // Smaller chunks = faster detection
    .vad_threshold(0.15)             // More sensitive = faster trigger
    .silence_timeout_ms(1000);       // Shorter timeout = faster response

let stream_config = StreamConfig::default()
    .max_tokens(100)                 // Shorter responses for voice
    .temperature(0.6);               // Slightly more focused

Error Handling¶

Handle audio and processing errors gracefully with retry logic.

async fn robust_voice_loop(
    audio_processor: &mut StreamingAudioProcessor,
    multimodal: &MultimodalProcessor,
    model: &AsyncModel,
) -> Result<(), MullamaError> {
    let mut consecutive_errors = 0;

    loop {
        match process_one_turn(audio_processor, multimodal, model).await {
            Ok(_) => consecutive_errors = 0,
            Err(MullamaError::AudioError(e)) => {
                eprintln!("Audio error: {}. Retrying...", e);
                consecutive_errors += 1;
                if consecutive_errors >= 5 {
                    eprintln!("Restarting audio system...");
                    audio_processor.restart().await?;
                    consecutive_errors = 0;
                }
                tokio::time::sleep(tokio::time::Duration::from_millis(500)).await;
            }
            Err(MullamaError::ModelError(e)) => {
                eprintln!("Model error: {}. Skipping turn.", e);
                consecutive_errors += 1;
            }
            Err(e) => return Err(e),
        }

        if consecutive_errors >= 10 {
            return Err(MullamaError::SystemError("Too many errors".into()));
        }
    }
}

Ring Buffer Architecture¶

The StreamingAudioProcessor uses a ring buffer for zero-allocation audio handling:

Audio Input (continuous stream from microphone)
    |
    v
+---+---+---+---+---+---+---+---+
| C | C | C | C | C | C | C | C |  Ring Buffer (pre-allocated chunks)
+---+---+---+---+---+---+---+---+
        ^               ^
        |               |
      Read            Write
      Pointer         Pointer

Benefits:

No allocations during capture (pre-allocated buffer)
Constant latency regardless of processing speed
Overflow protection -- oldest data is overwritten if processing falls behind

Complete Working Example¶

use mullama::prelude::*;
use mullama::{
    AsyncModel, StreamingAudioProcessor, AudioStreamConfig,
    MultimodalProcessor, AudioInput, StreamConfig, TokenStream,
};
use futures::StreamExt;
use std::io::Write;

#[tokio::main]
async fn main() -> Result<(), MullamaError> {
    println!("Mullama Voice Assistant");
    println!("=======================\n");

    #[cfg(all(
        feature = "streaming-audio",
        feature = "multimodal",
        feature = "async",
        feature = "streaming"
    ))]
    {
        let model_path = std::env::var("MODEL_PATH")
            .unwrap_or_else(|_| "path/to/model.gguf".to_string());

        // Load model
        println!("Loading model...");
        let model = AsyncModel::load(&model_path).await?;
        println!("Model loaded!");

        // Configure audio
        let audio_config = AudioStreamConfig::new()
            .sample_rate(16000)
            .channels(1)
            .enable_voice_detection(true)
            .vad_threshold(0.2)
            .enable_noise_reduction(true)
            .silence_timeout_ms(1500);
        let mut audio_processor = StreamingAudioProcessor::new(audio_config)?;

        // Configure multimodal processor
        let multimodal = MultimodalProcessor::new()
            .enable_audio_processing()
            .build();

        let mut history = String::from(
            "System: You are a helpful, concise voice assistant.\n\n"
        );

        println!("\n--- Ready. Speak into your microphone. Ctrl+C to exit. ---\n");

        loop {
            // Capture speech
            let mut audio_stream = audio_processor.start_capture().await?;
            let mut speech_buffer: Vec<f32> = Vec::new();
            let mut heard_speech = false;

            while let Some(chunk) = audio_stream.next().await {
                let processed = audio_processor.process_chunk(&chunk).await?;
                if processed.voice_detected && processed.signal_level > 0.1 {
                    if !heard_speech {
                        print!("[listening] ");
                        std::io::stdout().flush().unwrap();
                        heard_speech = true;
                    }
                    speech_buffer.extend_from_slice(&processed.audio_data);
                } else if heard_speech {
                    println!("done");
                    break;
                }
            }

            if speech_buffer.is_empty() { continue; }

            // Transcribe
            let audio_input = AudioInput::from_samples(&speech_buffer, 16000, 1)?;
            let result = multimodal.process_audio(&audio_input).await?;
            let transcript = match result.transcript {
                Some(text) if !text.trim().is_empty() => text,
                _ => { println!("(no speech recognized)"); continue; }
            };
            println!("You: {}", transcript);

            // Generate response
            let prompt = format!("{}User: {}\nAssistant:", history, transcript);
            let config = StreamConfig::default()
                .max_tokens(150)
                .temperature(0.7);

            let mut stream = TokenStream::new(model.clone(), &prompt, config).await?;
            let mut response = String::new();

            print!("Assistant: ");
            while let Some(token) = stream.next().await {
                let token = token?;
                if response.contains("\nUser:") { break; }
                print!("{}", token.text);
                std::io::stdout().flush().unwrap();
                response.push_str(&token.text);
                if token.is_final { break; }
            }
            println!("\n");

            // Update history
            history.push_str(&format!(
                "User: {}\nAssistant: {}\n", transcript, response.trim()
            ));
            if history.len() > 3000 {
                let trim = history.len() - 2000;
                if let Some(p) = history[trim..].find('\n') {
                    history = history[trim + p..].to_string();
                }
            }
        }
    }

    #[cfg(not(all(
        feature = "streaming-audio", feature = "multimodal",
        feature = "async", feature = "streaming"
    )))]
    {
        println!("This example requires features: streaming-audio, multimodal, async, streaming");
        println!("Run with:");
        println!("  cargo run --example voice_assistant --features \"streaming-audio,multimodal,async,streaming\"");
    }

    Ok(())
}

Extension: Text-to-Speech Output¶

Add TTS output so the assistant speaks its responses aloud. This requires an external TTS engine since Mullama focuses on inference:

// Example using system TTS (platform-specific)
fn speak_response(text: &str) {
    #[cfg(target_os = "macos")]
    {
        std::process::Command::new("say")
            .arg(text)
            .spawn()
            .expect("Failed to invoke macOS TTS");
    }

    #[cfg(target_os = "linux")]
    {
        std::process::Command::new("espeak-ng")
            .arg(text)
            .spawn()
            .expect("Failed to invoke espeak-ng. Install with: sudo apt install espeak-ng");
    }
}

What's Next¶

Multimodal Processing -- Understand the multimodal pipeline in depth
Streaming Generation -- Deep dive into token streaming patterns
API Server -- Expose voice capabilities over HTTP/WebSocket
Advanced: Streaming Audio -- Low-level audio architecture