Voice Assistant¶
Build a real-time voice assistant that captures audio, detects speech, processes it through a local LLM, and streams responses -- all running privately with no cloud dependencies.
What You'll Build¶
A voice-to-text-to-response pipeline that:
- Captures audio in real-time from the microphone
- Detects speech using Voice Activity Detection (VAD)
- Processes audio through the multimodal pipeline
- Generates streaming LLM responses
- Manages conversation context across turns
- Handles errors gracefully with automatic recovery
Prerequisites¶
- A chat-capable GGUF model (instruct-tuned recommended)
- Audio system dependencies:
- Rust toolchain (this tutorial uses Rust primarily since audio capture requires native access)
- Cargo features: streaming-audio,multimodal,async,streaming
Architecture Overview¶
Microphone --> Audio Capture --> Ring Buffer --> VAD (Voice Activity Detection)
                                                  |
                                                  v (speech detected)
                                          Audio Processing --> Multimodal Processor
                                                                         |
                                                                         v
                                                                   Transcription
                                                                         |
                                                                         v
                                                                  LLM Generation --> Streaming Output
Key components:
- StreamingAudioProcessor -- Real-time audio capture with ring buffer
- Voice Activity Detection (VAD) -- Detects when the user is speaking
- MultimodalProcessor -- Converts audio to text (speech-to-text)
- TokenStream -- Generates and streams the LLM response
Step 1: Audio Capture Configuration¶
Configure audio capture parameters for speech recognition.
use mullama::{StreamingAudioProcessor, AudioStreamConfig};

// Configure audio for speech recognition
let audio_config = AudioStreamConfig::new()
    .sample_rate(16000)            // 16kHz -- standard for speech
    .channels(1)                   // Mono audio
    .chunk_duration_ms(100)        // 100ms chunks for low latency
    .enable_voice_detection(true)  // Built-in VAD
    .vad_threshold(0.2)            // Sensitivity (0.0-1.0, lower = more sensitive)
    .enable_noise_reduction(true)  // Filter background noise
    .silence_timeout_ms(1500);     // End utterance after 1.5s silence

let mut audio_processor = StreamingAudioProcessor::new(audio_config)?;
Configuration Parameters¶
| Parameter | Default | Description |
|---|---|---|
| sample_rate | 16000 | Audio sample rate in Hz (16kHz for speech) |
| channels | 1 | Number of audio channels (mono for speech) |
| chunk_duration_ms | 100 | Size of each audio chunk in milliseconds |
| enable_voice_detection | true | Enable Voice Activity Detection |
| vad_threshold | 0.2 | VAD sensitivity (0.0-1.0, lower = more sensitive) |
| enable_noise_reduction | true | Filter background noise |
| silence_timeout_ms | 1500 | Silence duration in milliseconds that ends an utterance |
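To make the numbers in the table concrete, the sketch below converts the millisecond settings into sample and chunk counts. The helper names samples_per_chunk and chunks_for_timeout are hypothetical and not part of Mullama; they only illustrate the arithmetic.

```rust
/// Total number of samples in one chunk, across all channels.
/// (Hypothetical helper, not a Mullama API.)
fn samples_per_chunk(sample_rate: u32, channels: u16, chunk_duration_ms: u32) -> usize {
    (sample_rate as u64 * channels as u64 * chunk_duration_ms as u64 / 1000) as usize
}

/// Number of whole chunks needed to cover the silence timeout, rounded up.
/// (Hypothetical helper, not a Mullama API.)
fn chunks_for_timeout(silence_timeout_ms: u32, chunk_duration_ms: u32) -> u32 {
    assert!(chunk_duration_ms > 0, "chunk duration must be non-zero");
    (silence_timeout_ms + chunk_duration_ms - 1) / chunk_duration_ms
}
```

With the defaults above, one chunk holds 1600 samples, and 15 consecutive silent chunks span the 1.5s timeout.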
Step 2: Voice Activity Detection¶
Listen for speech and capture complete utterances.
use mullama::{StreamingAudioProcessor, AudioStreamConfig, MullamaError};
use futures::StreamExt;

async fn listen_for_speech(
    audio_processor: &mut StreamingAudioProcessor,
) -> Result<Vec<f32>, MullamaError> {
    let mut audio_stream = audio_processor.start_capture().await?;
    let mut speech_buffer: Vec<f32> = Vec::new();
    let mut is_speaking = false;

    println!("Listening...");
    while let Some(chunk) = audio_stream.next().await {
        let processed = audio_processor.process_chunk(&chunk).await?;
        if processed.voice_detected && processed.signal_level > 0.1 {
            if !is_speaking {
                println!("[Speech detected]");
                is_speaking = true;
            }
            speech_buffer.extend_from_slice(&processed.audio_data);
        } else if is_speaking {
            // Silence after speech -- utterance complete
            println!("[End of utterance: {} samples]", speech_buffer.len());
            break;
        }
    }
    Ok(speech_buffer)
}
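The loop above ends the utterance on the first chunk without detected voice. Whether the processor already applies silence_timeout_ms before clearing voice_detected is not stated here, so the sketch below shows one way to tolerate short pauses in application code by counting consecutive silent chunks. UtteranceTracker and VadEvent are hypothetical names, not Mullama types.

```rust
/// What the caller should do with the current chunk.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum VadEvent {
    /// No speech has started yet; discard the chunk.
    Idle,
    /// Speech is in progress, or the pause is still within the timeout; keep the chunk.
    Speech,
    /// The pause reached the timeout; the utterance is complete.
    End,
}

/// Hypothetical helper that turns per-chunk voice flags into utterance boundaries.
struct UtteranceTracker {
    max_silent_chunks: u32,
    silent_chunks: u32,
    speaking: bool,
}

impl UtteranceTracker {
    fn new(silence_timeout_ms: u32, chunk_duration_ms: u32) -> Self {
        // At least one silent chunk is always required to end an utterance.
        let max_silent_chunks = (silence_timeout_ms / chunk_duration_ms.max(1)).max(1);
        Self { max_silent_chunks, silent_chunks: 0, speaking: false }
    }

    fn update(&mut self, voice_detected: bool) -> VadEvent {
        if voice_detected {
            self.speaking = true;
            self.silent_chunks = 0;
            return VadEvent::Speech;
        }
        if !self.speaking {
            return VadEvent::Idle;
        }
        self.silent_chunks += 1;
        if self.silent_chunks >= self.max_silent_chunks {
            // Reset so the tracker can be reused for the next utterance.
            self.speaking = false;
            self.silent_chunks = 0;
            VadEvent::End
        } else {
            VadEvent::Speech
        }
    }
}
```

In the capture loop, the caller would append the chunk on Speech, skip it on Idle, and break on End.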
Step 3: Speech-to-Text¶
Convert captured audio to text using the multimodal processor.
use mullama::{MultimodalProcessor, AudioInput, MullamaError};

async fn transcribe_audio(
    multimodal: &MultimodalProcessor,
    audio_data: &[f32],
    sample_rate: u32,
) -> Result<String, MullamaError> {
    // Create audio input from captured samples (mono, so 1 channel)
    let audio_input = AudioInput::from_samples(audio_data, sample_rate, 1)?;

    // Process through the multimodal pipeline
    let result = multimodal.process_audio(&audio_input).await?;

    match result.transcript {
        Some(text) if !text.trim().is_empty() => {
            println!("Transcribed: \"{}\"", text);
            Ok(text)
        }
        _ => {
            // An empty string tells the caller to skip this turn
            println!("(no speech recognized)");
            Ok(String::new())
        }
    }
}
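AudioInput::from_samples above takes f32 samples with one channel. If audio reaches your application from another source as interleaved 16-bit PCM, it needs converting first. The sketch below assumes the pipeline expects samples normalized to roughly [-1.0, 1.0], which is a common convention but is not confirmed here; pcm_i16_to_f32 and downmix_to_mono are hypothetical helpers, not Mullama functions.

```rust
/// Convert signed 16-bit PCM to f32 in the range [-1.0, 1.0).
/// (Hypothetical helper; the normalized range is an assumption.)
fn pcm_i16_to_f32(samples: &[i16]) -> Vec<f32> {
    samples.iter().map(|&s| s as f32 / 32768.0).collect()
}

/// Average interleaved frames down to a single channel.
/// A trailing partial frame, if any, is dropped.
fn downmix_to_mono(interleaved: &[f32], channels: usize) -> Vec<f32> {
    assert!(channels > 0, "channel count must be non-zero");
    interleaved
        .chunks_exact(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}
```

The resulting mono buffer can then be passed as the audio_data argument of transcribe_audio.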
Step 4: Generate Streaming Response¶
Generate a response using the transcribed text.
use mullama::{AsyncModel, MullamaError, StreamConfig, TokenStream};
use futures::StreamExt;
use std::io::Write;

async fn generate_response(
    model: &AsyncModel,
    transcript: &str,
    history: &str,
) -> Result<String, MullamaError> {
    let prompt = format!("{}User: {}\nAssistant:", history, transcript);
    let config = StreamConfig::default()
        .max_tokens(200)   // Keep responses concise for voice
        .temperature(0.7)
        .top_k(40);

    let mut stream = TokenStream::new(model.clone(), &prompt, config).await?;
    let mut response = String::new();

    print!("Assistant: ");
    while let Some(token) = stream.next().await {
        let token = token?;
        print!("{}", token.text);
        std::io::stdout().flush().unwrap();
        response.push_str(&token.text);
        // Stop once the model starts writing the next user turn,
        // and drop that marker from the returned reply
        if let Some(idx) = response.find("\nUser:") {
            response.truncate(idx);
            break;
        }
        if token.is_final { break; }
    }
    println!();
    Ok(response.trim().to_string())
}
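The generation loop above watches for the "\nUser:" marker in text that has already been printed, and the marker can arrive split across several tokens. One way to keep it off the screen entirely is to hold back any tail of the output that could still turn into the marker. The sketch below is a standalone illustration; StopFilter is a hypothetical type, not part of Mullama.

```rust
/// Buffers streamed text so that a stop marker is never emitted,
/// even when it arrives split across several tokens. (Hypothetical helper.)
struct StopFilter {
    stop: String,
    pending: String,
    done: bool,
}

impl StopFilter {
    fn new(stop: &str) -> Self {
        assert!(!stop.is_empty(), "stop marker must not be empty");
        Self { stop: stop.to_string(), pending: String::new(), done: false }
    }

    /// Feed one token; returns the text that is now safe to print.
    fn push(&mut self, token: &str) -> String {
        if self.done {
            return String::new();
        }
        self.pending.push_str(token);
        if let Some(idx) = self.pending.find(self.stop.as_str()) {
            // Marker found: emit what precedes it and ignore everything after.
            self.done = true;
            let out = self.pending[..idx].to_string();
            self.pending.clear();
            return out;
        }
        // Hold back the longest suffix of `pending` that is a prefix of `stop`.
        let mut hold = 0;
        for (i, _) in self.pending.char_indices() {
            let suffix = &self.pending[i..];
            if self.stop.starts_with(suffix) {
                hold = suffix.len();
                break;
            }
        }
        let emit_len = self.pending.len() - hold;
        let rest = self.pending.split_off(emit_len);
        std::mem::replace(&mut self.pending, rest)
    }

    /// Flush held-back text at end of stream (it was not a marker after all).
    fn finish(&mut self) -> String {
        std::mem::take(&mut self.pending)
    }

    fn is_done(&self) -> bool {
        self.done
    }
}
```

In the loop, each token's text would go through push before printing, the loop would break when is_done returns true, and finish would be called once the stream ends.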
Step 5: Conversation Loop¶
Tie all components together in a continuous voice interaction loop.
use mullama::prelude::*;
use mullama::{
    AsyncModel, StreamingAudioProcessor, AudioStreamConfig,
    MultimodalProcessor, AudioInput, StreamConfig, TokenStream,
};
use futures::StreamExt;

async fn run_voice_assistant() -> Result<(), MullamaError> {
    // Initialize model
    let model = AsyncModel::load("path/to/model.gguf").await?;

    // Initialize audio
    let audio_config = AudioStreamConfig::new()
        .sample_rate(16000)
        .enable_voice_detection(true)
        .vad_threshold(0.2)
        .enable_noise_reduction(true)
        .silence_timeout_ms(1500);
    let mut audio_processor = StreamingAudioProcessor::new(audio_config)?;

    // Initialize multimodal processor
    let multimodal = MultimodalProcessor::new()
        .enable_audio_processing()
        .build();

    // Conversation state
    let mut history = String::from(
        "System: You are a helpful voice assistant. Keep responses concise \
         and conversational since they will be spoken aloud.\n\n"
    );

    println!("Voice Assistant ready! Start speaking...");
    println!("(Press Ctrl+C to exit)\n");

    loop {
        // Listen for speech
        let audio_data = listen_for_speech(&mut audio_processor).await?;
        if audio_data.is_empty() { continue; }

        // Transcribe
        let transcript = transcribe_audio(&multimodal, &audio_data, 16000).await?;
        if transcript.is_empty() { continue; }
        println!("You: {}", transcript);

        // Generate response
        let response = generate_response(&model, &transcript, &history).await?;

        // Update history (keep bounded)
        history.push_str(&format!("User: {}\nAssistant: {}\n", transcript, response));
        if history.len() > 3000 {
            // Keep roughly the last 2000 bytes, cutting at a line break.
            // Note: this also discards the system prompt once trimming kicks in.
            let mut trim_point = history.len() - 2000;
            // Slicing a String at a non-character boundary panics, so step forward to one
            while !history.is_char_boundary(trim_point) {
                trim_point += 1;
            }
            if let Some(pos) = history[trim_point..].find('\n') {
                history = history[trim_point + pos + 1..].to_string();
            }
        }
        println!();
    }
}
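The trimming above cuts the history by byte length, which eventually discards the system prompt along with the oldest turns. A possible alternative is to store the system prompt separately and evict whole turns. The sketch below is a minimal version of that idea using only the standard library; History is a hypothetical type and the byte budget is an arbitrary stand-in for a real token budget.

```rust
use std::collections::VecDeque;

/// Conversation history that always keeps the system prompt
/// and drops the oldest turns first. (Hypothetical helper.)
struct History {
    system: String,
    turns: VecDeque<String>,
    max_bytes: usize,
}

impl History {
    fn new(system: &str, max_bytes: usize) -> Self {
        Self { system: system.to_string(), turns: VecDeque::new(), max_bytes }
    }

    /// Record one completed exchange, then evict old turns until the budget fits.
    fn push_turn(&mut self, user: &str, assistant: &str) {
        self.turns
            .push_back(format!("User: {}\nAssistant: {}\n", user, assistant));
        // Always keep the most recent turn, even if it alone exceeds the budget.
        while self.turns.len() > 1 && self.len_bytes() > self.max_bytes {
            self.turns.pop_front();
        }
    }

    fn len_bytes(&self) -> usize {
        self.system.len() + self.turns.iter().map(|t| t.len()).sum::<usize>()
    }

    /// Render the prompt prefix in the same "User:/Assistant:" format used above.
    fn render(&self) -> String {
        let mut out = String::with_capacity(self.len_bytes());
        out.push_str(&self.system);
        for turn in &self.turns {
            out.push_str(turn);
        }
        out
    }
}
```

Because whole turns are evicted, the rendered history never starts mid-sentence and no string is sliced at an arbitrary byte offset.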
Latency Optimization¶
For real-time voice interaction, minimize latency at each stage:
| Stage | Target Latency | Tips |
|---|---|---|
| Audio capture | < 100ms | Use small chunk sizes (50-100ms) |
| VAD detection | < 50ms | Lower threshold for faster detection |
| Transcription | < 500ms | Use smaller/quantized speech models |
| LLM first token | < 200ms | Use small models, GPU acceleration |
| Total round-trip | < 1.5s | Overlap processing stages |
// Optimize for low latency
let audio_config = AudioStreamConfig::new()
    .chunk_duration_ms(50)       // Smaller chunks = faster detection
    .vad_threshold(0.15)         // More sensitive = faster trigger
    .silence_timeout_ms(1000);   // Shorter timeout = faster response

let stream_config = StreamConfig::default()
    .max_tokens(100)             // Shorter responses for voice
    .temperature(0.6);           // Slightly more focused
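Tuning against the targets in the table is easier with per-stage measurements. The sketch below records stage durations for one turn and reports which stages exceeded their budget. TurnTimer and the stage names are hypothetical; only std::time is used. For async stages, take an Instant before the .await and pass the elapsed time to record.

```rust
use std::time::{Duration, Instant};

/// Records how long each pipeline stage took during one turn. (Hypothetical helper.)
struct TurnTimer {
    stages: Vec<(&'static str, Duration)>,
}

impl TurnTimer {
    fn new() -> Self {
        Self { stages: Vec::new() }
    }

    /// Store a duration that was measured elsewhere (e.g. around an `.await`).
    fn record(&mut self, name: &'static str, took: Duration) {
        self.stages.push((name, took));
    }

    /// Run a synchronous closure, record its wall-clock time, and return its result.
    fn time<T>(&mut self, name: &'static str, f: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let out = f();
        self.record(name, start.elapsed());
        out
    }

    /// Sum of all recorded stage durations.
    fn total(&self) -> Duration {
        self.stages.iter().map(|(_, d)| *d).sum()
    }

    /// Names of recorded stages that took longer than their budget.
    /// Stages without a budget entry are never reported.
    fn over_budget(&self, budgets: &[(&str, Duration)]) -> Vec<&'static str> {
        let mut slow = Vec::new();
        for (name, took) in &self.stages {
            for (budget_name, limit) in budgets {
                if *budget_name == *name && *took > *limit {
                    slow.push(*name);
                }
            }
        }
        slow
    }
}
```

Logging the over_budget result after each turn shows which row of the table to tune first.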
Error Handling¶
Handle audio and processing errors gracefully with retry logic.
// process_one_turn is assumed to wrap Steps 2-4 (listen, transcribe, respond)
// for a single turn; it is not defined in this tutorial.
async fn robust_voice_loop(
    audio_processor: &mut StreamingAudioProcessor,
    multimodal: &MultimodalProcessor,
    model: &AsyncModel,
) -> Result<(), MullamaError> {
    let mut consecutive_errors = 0;

    loop {
        match process_one_turn(audio_processor, multimodal, model).await {
            Ok(_) => consecutive_errors = 0,
            Err(MullamaError::AudioError(e)) => {
                eprintln!("Audio error: {}. Retrying...", e);
                consecutive_errors += 1;
                // After 5 failures in a row, restart the audio system and reset the counter
                if consecutive_errors >= 5 {
                    eprintln!("Restarting audio system...");
                    audio_processor.restart().await?;
                    consecutive_errors = 0;
                }
                tokio::time::sleep(tokio::time::Duration::from_millis(500)).await;
            }
            Err(MullamaError::ModelError(e)) => {
                eprintln!("Model error: {}. Skipping turn.", e);
                consecutive_errors += 1;
            }
            // Any other error is treated as unrecoverable
            Err(e) => return Err(e),
        }

        if consecutive_errors >= 10 {
            return Err(MullamaError::SystemError("Too many errors".into()));
        }
    }
}
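The loop above waits a fixed 500ms after each audio error. A common alternative is exponential backoff, which retries quickly at first and then slows down if the device stays unavailable. The sketch below only computes the delay; backoff_delay is a hypothetical helper, and the base and cap values are illustrative.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (1-based): base * 2^(attempt - 1), capped at `max`.
/// (Hypothetical helper, not a Mullama API.)
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // Cap the exponent so the shift cannot overflow a u32.
    let exp = attempt.saturating_sub(1).min(16);
    let delay = base.saturating_mul(1u32 << exp);
    delay.min(max)
}
```

In the audio error arm, the fixed sleep could be replaced by a sleep for backoff_delay(consecutive_errors, base, max).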
Ring Buffer Architecture¶
The StreamingAudioProcessor uses a ring buffer for zero-allocation audio handling:
Audio Input (continuous stream from microphone)
        |
        v
+---+---+---+---+---+---+---+---+
| C | C | C | C | C | C | C | C |   Ring Buffer (pre-allocated chunks)
+---+---+---+---+---+---+---+---+
  ^               ^
  |               |
 Read           Write
Pointer        Pointer
Benefits:
- No allocations during capture (pre-allocated buffer)
- Constant capture latency regardless of processing speed (capture never waits on the consumer)
- Overflow protection -- oldest data is overwritten if processing falls behind
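To illustrate the overwrite-oldest behaviour described above, here is a minimal single-threaded ring buffer over f32 samples. It is a teaching sketch, not the StreamingAudioProcessor implementation: RingBuffer is a hypothetical type, it stores samples rather than chunks, and a real capture path would typically need a thread-safe design.

```rust
/// Fixed-capacity ring buffer that overwrites the oldest samples when full.
/// (Illustrative sketch only.)
struct RingBuffer {
    data: Vec<f32>,
    read: usize, // index of the oldest unread sample
    len: usize,  // number of unread samples
}

impl RingBuffer {
    /// Allocate the storage once, up front.
    fn with_capacity(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { data: vec![0.0; capacity], read: 0, len: 0 }
    }

    /// Write samples without allocating; when full, the oldest unread sample is lost.
    fn write(&mut self, samples: &[f32]) {
        let cap = self.data.len();
        for &s in samples {
            let write = (self.read + self.len) % cap;
            self.data[write] = s;
            if self.len == cap {
                // Full: the slot just overwritten held the oldest sample.
                self.read = (self.read + 1) % cap;
            } else {
                self.len += 1;
            }
        }
    }

    /// Copy up to `out.len()` samples in arrival order; returns how many were copied.
    fn read_into(&mut self, out: &mut [f32]) -> usize {
        let cap = self.data.len();
        let n = out.len().min(self.len);
        for slot in out.iter_mut().take(n) {
            *slot = self.data[self.read];
            self.read = (self.read + 1) % cap;
        }
        self.len -= n;
        n
    }
}
```

Writing more samples than the capacity leaves only the most recent ones readable, which is the overflow behaviour listed in the benefits.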
Complete Working Example¶
use mullama::prelude::*;
use mullama::{
    AsyncModel, StreamingAudioProcessor, AudioStreamConfig,
    MultimodalProcessor, AudioInput, StreamConfig, TokenStream,
};
use futures::StreamExt;
use std::io::Write;

#[tokio::main]
async fn main() -> Result<(), MullamaError> {
    println!("Mullama Voice Assistant");
    println!("=======================\n");

    #[cfg(all(
        feature = "streaming-audio",
        feature = "multimodal",
        feature = "async",
        feature = "streaming"
    ))]
    {
        let model_path = std::env::var("MODEL_PATH")
            .unwrap_or_else(|_| "path/to/model.gguf".to_string());

        // Load model
        println!("Loading model...");
        let model = AsyncModel::load(&model_path).await?;
        println!("Model loaded!");

        // Configure audio
        let audio_config = AudioStreamConfig::new()
            .sample_rate(16000)
            .channels(1)
            .enable_voice_detection(true)
            .vad_threshold(0.2)
            .enable_noise_reduction(true)
            .silence_timeout_ms(1500);
        let mut audio_processor = StreamingAudioProcessor::new(audio_config)?;

        // Configure multimodal processor
        let multimodal = MultimodalProcessor::new()
            .enable_audio_processing()
            .build();

        let mut history = String::from(
            "System: You are a helpful, concise voice assistant.\n\n"
        );

        println!("\n--- Ready. Speak into your microphone. Ctrl+C to exit. ---\n");

        loop {
            // Capture speech
            let mut audio_stream = audio_processor.start_capture().await?;
            let mut speech_buffer: Vec<f32> = Vec::new();
            let mut heard_speech = false;

            while let Some(chunk) = audio_stream.next().await {
                let processed = audio_processor.process_chunk(&chunk).await?;
                if processed.voice_detected && processed.signal_level > 0.1 {
                    if !heard_speech {
                        print!("[listening] ");
                        std::io::stdout().flush().unwrap();
                        heard_speech = true;
                    }
                    speech_buffer.extend_from_slice(&processed.audio_data);
                } else if heard_speech {
                    println!("done");
                    break;
                }
            }
            if speech_buffer.is_empty() { continue; }

            // Transcribe
            let audio_input = AudioInput::from_samples(&speech_buffer, 16000, 1)?;
            let result = multimodal.process_audio(&audio_input).await?;
            let transcript = match result.transcript {
                Some(text) if !text.trim().is_empty() => text,
                _ => { println!("(no speech recognized)"); continue; }
            };
            println!("You: {}", transcript);

            // Generate response
            let prompt = format!("{}User: {}\nAssistant:", history, transcript);
            let config = StreamConfig::default()
                .max_tokens(150)
                .temperature(0.7);
            let mut stream = TokenStream::new(model.clone(), &prompt, config).await?;
            let mut response = String::new();

            print!("Assistant: ");
            while let Some(token) = stream.next().await {
                let token = token?;
                print!("{}", token.text);
                std::io::stdout().flush().unwrap();
                response.push_str(&token.text);
                // Stop once the model starts writing the next user turn,
                // and drop that marker from the stored reply
                if let Some(idx) = response.find("\nUser:") {
                    response.truncate(idx);
                    break;
                }
                if token.is_final { break; }
            }
            println!("\n");

            // Update history (keep bounded)
            history.push_str(&format!(
                "User: {}\nAssistant: {}\n", transcript, response.trim()
            ));
            if history.len() > 3000 {
                // Keep roughly the last 2000 bytes, cutting at a line break.
                // Step forward to a character boundary so the slice cannot panic.
                let mut trim = history.len() - 2000;
                while !history.is_char_boundary(trim) {
                    trim += 1;
                }
                if let Some(p) = history[trim..].find('\n') {
                    history = history[trim + p + 1..].to_string();
                }
            }
        }
    }

    #[cfg(not(all(
        feature = "streaming-audio", feature = "multimodal",
        feature = "async", feature = "streaming"
    )))]
    {
        println!("This example requires features: streaming-audio, multimodal, async, streaming");
        println!("Run with:");
        println!("  cargo run --example voice_assistant --features \"streaming-audio,multimodal,async,streaming\"");
    }

    Ok(())
}
Extension: Text-to-Speech Output¶
Add TTS output so the assistant speaks its responses aloud. This requires an external TTS engine since Mullama focuses on inference:
// Example using system TTS (platform-specific)
fn speak_response(text: &str) {
    // spawn() returns immediately, so speech plays in the background
    // while the assistant goes back to listening.
    #[cfg(target_os = "macos")]
    {
        std::process::Command::new("say")
            .arg(text)
            .spawn()
            .expect("Failed to invoke macOS TTS");
    }

    #[cfg(target_os = "linux")]
    {
        std::process::Command::new("espeak-ng")
            .arg(text)
            .spawn()
            .expect("Failed to invoke espeak-ng. Install with: sudo apt install espeak-ng");
    }
}
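Since responses arrive as a token stream, speech can start before generation finishes if complete sentences are handed to the TTS command as they appear. The sketch below splits streamed text at '.', '!' or '?' followed by whitespace. SentenceChunker is a hypothetical helper, and the boundary rule is deliberately simple (abbreviations such as "e.g. " would be split).

```rust
/// Accumulates streamed text and yields complete sentences as they end.
/// (Hypothetical helper with a deliberately simple boundary rule.)
struct SentenceChunker {
    buf: String,
}

impl SentenceChunker {
    fn new() -> Self {
        Self { buf: String::new() }
    }

    /// Feed one token; returns any sentences completed by it.
    fn push(&mut self, token: &str) -> Vec<String> {
        self.buf.push_str(token);
        let mut sentences = Vec::new();
        while let Some(end) = sentence_end(&self.buf) {
            // Split off the unfinished remainder and take the finished sentence.
            let rest = self.buf.split_off(end);
            let sentence = std::mem::replace(&mut self.buf, rest);
            let sentence = sentence.trim();
            if !sentence.is_empty() {
                sentences.push(sentence.to_string());
            }
        }
        sentences
    }

    /// Return whatever text is left once the stream has ended.
    fn finish(&mut self) -> Option<String> {
        let rest = std::mem::take(&mut self.buf);
        let rest = rest.trim();
        if rest.is_empty() { None } else { Some(rest.to_string()) }
    }
}

/// Byte index just past the first '.', '!' or '?' that is followed by whitespace.
fn sentence_end(text: &str) -> Option<usize> {
    let mut chars = text.char_indices().peekable();
    while let Some((i, c)) = chars.next() {
        if c == '.' || c == '!' || c == '?' {
            if let Some(&(_, next)) = chars.peek() {
                if next.is_whitespace() {
                    return Some(i + c.len_utf8());
                }
            }
        }
    }
    None
}
```

Each sentence returned by push could be passed to speak_response, with finish called after the last token. Because speak_response spawns without waiting, consecutive sentences would need to be queued so that they do not overlap.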
What's Next¶
- Multimodal Processing -- Understand the multimodal pipeline in depth
- Streaming Generation -- Deep dive into token streaming patterns
- API Server -- Expose voice capabilities over HTTP/WebSocket
- Advanced: Streaming Audio -- Low-level audio architecture