Multimodal (Vision & Audio)¶
Llamafu supports vision-language models (VLMs) and audio models through llama.cpp's multimodal capabilities.
Supported Models¶
Vision Models¶
| Model | Size | Description |
|---|---|---|
| nanoLLaVA | ~2GB | Smallest VLM, runs on mobile |
| LLaVA 1.5/1.6 | 4-14GB | Popular vision model family |
| Qwen2-VL | 2-72GB | State-of-the-art VLM |
| InternVL | 1-14GB | Strong multilingual VLM |
| Moondream | ~2GB | Compact vision model |
Audio Models¶
| Model | Size | Description |
|---|---|---|
| Ultravox | ~2GB | Audio-to-text understanding |
| Qwen2-Audio | ~8GB | Audio comprehension |
Vision Setup¶
Loading a Vision Model¶
Vision models require two files:
- Text model - Main language model (
.gguf) - Vision projector - Image encoder (
mmproj-*.gguf)
final llamafu = await Llamafu.init(
modelPath: 'models/nanollava-text.gguf',
mmprojPath: 'models/nanollava-mmproj.gguf', // Vision projector
contextSize: 2048,
);
print('Is multimodal: ${llamafu.isMultimodal}'); // true
Image Completion¶
import 'dart:io';
import 'dart:convert';
// Load image as base64
final imageBytes = await File('photo.jpg').readAsBytes();
final imageBase64 = base64Encode(imageBytes);
// Create media input
final imageInput = MediaInput(
type: MediaType.image,
data: imageBase64,
sourceType: DataSource.base64,
);
// Generate description
final response = await llamafu.multimodalComplete(
prompt: 'Describe this image in detail:',
mediaInputs: [imageInput],
maxTokens: 256,
);
print(response);
From File Path¶
final imageInput = MediaInput(
type: MediaType.image,
data: '/path/to/image.jpg',
sourceType: DataSource.filePath,
);
final response = await llamafu.multimodalComplete(
prompt: 'What objects are in this image?',
mediaInputs: [imageInput],
);
From URL¶
Streaming with Images¶
await for (final token in llamafu.multimodalCompleteStream(
prompt: 'Describe this image:',
mediaInputs: [imageInput],
maxTokens: 200,
)) {
stdout.write(token);
}
Audio Setup¶
Loading an Audio Model¶
final llamafu = await Llamafu.init(
modelPath: 'models/ultravox-1b.gguf',
mmprojPath: 'models/ultravox-mmproj.gguf',
contextSize: 4096,
);
Audio Completion¶
final audioBytes = await File('recording.wav').readAsBytes();
final audioBase64 = base64Encode(audioBytes);
final audioInput = MediaInput(
type: MediaType.audio,
data: audioBase64,
sourceType: DataSource.base64,
audioFormat: AudioFormat.wav,
sampleRate: 16000,
channels: 1,
);
final response = await llamafu.multimodalComplete(
prompt: 'Transcribe this audio:',
mediaInputs: [audioInput],
maxTokens: 500,
);
From Raw Samples¶
// Float32 audio samples at 16kHz
final Float32List samples = getAudioSamples();
final audioInput = MediaInput.fromAudioSamples(
samples,
sampleRate: 16000,
channels: 1,
);
Multiple Inputs¶
Process multiple images or mixed media:
final inputs = [
MediaInput(type: MediaType.image, data: image1Base64),
MediaInput(type: MediaType.image, data: image2Base64),
];
final response = await llamafu.multimodalComplete(
prompt: 'Compare these two images:',
mediaInputs: inputs,
maxTokens: 300,
);
Image Processing Options¶
Custom Processing¶
final result = await llamafu.processImage(imageInput);
print('Image tokens: ${result.nTokens}');
print('Processing time: ${result.processingTimeMs}ms');
Vision Parameters¶
final response = await llamafu.multimodalComplete(
prompt: 'Analyze this image:',
mediaInputs: [imageInput],
visionThreads: 4, // Threads for image processing
useVisionCache: true, // Cache processed images
includeImageTokens: true,
);
Chat with Vision¶
Use chat templates with images:
final session = llamafu.createChatSession();
// Add image context
session.addImage(imageInput);
// Chat about the image
session.addMessage('user', 'What do you see in this image?');
final response1 = await session.generate(maxTokens: 200);
session.addMessage('user', 'What colors are prominent?');
final response2 = await session.generate(maxTokens: 100);
Best Practices¶
1. Image Size¶
Resize large images before processing:
// Images are internally resized, but pre-resizing saves memory
import 'package:image/image.dart' as img;
final image = img.decodeImage(bytes);
final resized = img.copyResize(image!, width: 512);
final optimized = img.encodeJpg(resized, quality: 85);
2. Memory Management¶
3. Mobile Considerations¶
// Use smaller models on mobile
if (Platform.isAndroid || Platform.isIOS) {
modelPath = 'models/nanollava.gguf'; // ~2GB
} else {
modelPath = 'models/llava-7b.gguf'; // ~4GB
}
4. Supported Formats¶
| Format | Extension | Notes |
|---|---|---|
| JPEG | .jpg, .jpeg | Recommended |
| PNG | .png | Supports transparency |
| WebP | .webp | Good compression |
| BMP | .bmp | Uncompressed |
For audio:
| Format | Extension | Notes |
|---|---|---|
| WAV | .wav | Recommended, uncompressed |
| MP3 | .mp3 | Requires decoding |
| FLAC | .flac | Lossless |
| PCM | raw | Float32, 16kHz mono preferred |
Error Handling¶
try {
final response = await llamafu.multimodalComplete(
prompt: 'Describe:',
mediaInputs: [imageInput],
);
} on LlamafuMultimodalError catch (e) {
if (e.code == ErrorCode.visionInitFailed) {
print('Vision model not loaded. Check mmproj path.');
} else if (e.code == ErrorCode.imageProcessFailed) {
print('Failed to process image: ${e.message}');
}
}
Troubleshooting¶
"Vision not initialized"¶
Ensure you provided mmprojPath during initialization:
final llamafu = await Llamafu.init(
modelPath: 'model.gguf',
mmprojPath: 'mmproj.gguf', // Required for vision
);
"Unsupported image format"¶
Convert to JPEG or PNG:
import 'package:image/image.dart' as img;
final image = img.decodeImage(bytes);
final jpeg = img.encodeJpg(image!);
Poor image understanding¶
Try: 1. Using a larger model 2. Providing more specific prompts 3. Ensuring image is clear and well-lit