Basic Usage¶
This guide covers the fundamental concepts and patterns for using Llamafu.
Model Lifecycle¶
Initialization¶
Always initialize the model before use:
final llamafu = await Llamafu.init(
  modelPath: 'path/to/model.gguf',
  contextSize: 2048, // Context window size
  threads: 4,        // CPU threads (0 = auto)
  gpuLayers: 0,      // Layers to offload to GPU
  useMmap: true,     // Memory-map model file
  useMlock: false,   // Lock model in RAM
);
Disposal¶
Always dispose of the instance when you are done with it, so that the native resources it holds are freed.
Memory Leaks
Failing to call dispose() will leak native memory. Use try/finally or Flutter's dispose pattern.
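The try/finally approach can be sketched as a small helper that guarantees cleanup even when the body throws. This is a minimal sketch rather than part of the Llamafu API: the `using` helper and the `FakeModel` stand-in are hypothetical names introduced here so the example runs without a model file.

```dart
/// Opens a resource, runs [body], and always calls [close],
/// even if [body] throws. (Hypothetical helper, not part of Llamafu.)
Future<T> using<R, T>({
  required Future<R> Function() open,
  required Future<T> Function(R resource) body,
  required Future<void> Function(R resource) close,
}) async {
  final resource = await open();
  try {
    return await body(resource);
  } finally {
    await close(resource);
  }
}

/// Stand-in for a Llamafu instance so the sketch is self-contained.
class FakeModel {
  bool disposed = false;

  Future<String> complete(String prompt) async => 'echo: $prompt';

  void dispose() {
    disposed = true;
  }
}

Future<void> main() async {
  final model = FakeModel();
  final reply = await using<FakeModel, String>(
    open: () async => model,
    body: (m) => m.complete('Hello'),
    close: (m) async => m.dispose(),
  );
  print(reply); // echo: Hello
  print(model.disposed); // true
}
```

With a real instance, the same shape would presumably be `open: () => Llamafu.init(modelPath: 'model.gguf')` and `close: (m) async => m.dispose()`. The `async` wrapper works whether `dispose()` turns out to be synchronous or asynchronous.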
Model Information¶
Query model properties after initialization:
print('Model: ${llamafu.modelName}');
print('Vocab size: ${llamafu.vocabSize}');
print('Context size: ${llamafu.contextSize}');
print('Embedding size: ${llamafu.embeddingSize}');
print('Is multimodal: ${llamafu.isMultimodal}');
Text Completion¶
Basic Completion¶
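The simplest call passes only the prompt and relies on the default sampling settings. A minimal sketch, assuming `llamafu` is an initialized instance and that `complete` resolves to the generated text as a `String`, as the `ModelService` example in this guide suggests:

```dart
// Assumes `llamafu` was created with Llamafu.init(...) as shown under Initialization.
final response = await llamafu.complete('Hello');
print(response);
```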
With Parameters¶
final response = await llamafu.complete(
  'Write a poem about nature:',
  maxTokens: 200,
  temperature: 0.8,   // Creativity (0.0 - 2.0)
  topK: 40,           // Top-K sampling
  topP: 0.9,          // Nucleus sampling
  repeatPenalty: 1.1, // Repetition penalty
  seed: 42,           // Reproducible output
);
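To build intuition for how `temperature`, `topK`, and `topP` interact, the toy function below applies them to a handful of made-up logits. It illustrates the general sampling technique, not Llamafu's internal implementation: real samplers may apply the filters in a different order or renormalize at different points, and `filterCandidates` is a hypothetical name. It assumes `temperature` is greater than zero.

```dart
import 'dart:math';

/// Returns the renormalized candidate distribution left after temperature
/// scaling, top-K truncation, and top-P (nucleus) truncation.
Map<String, double> filterCandidates(
  Map<String, double> logits, {
  double temperature = 0.8,
  int topK = 40,
  double topP = 0.9,
}) {
  // Temperature: divide logits before the softmax; lower values sharpen the peak.
  final maxLogit = logits.values.reduce(max);
  final weights = logits.map(
    (token, logit) => MapEntry(token, exp((logit - maxLogit) / temperature)),
  );
  final total = weights.values.reduce((a, b) => a + b);

  // Convert to probabilities and sort, most likely first.
  final ranked = weights.entries
      .map((e) => MapEntry(e.key, e.value / total))
      .toList()
    ..sort((a, b) => b.value.compareTo(a.value));

  // Top-K keeps the K most likely tokens; top-P then keeps the shortest
  // prefix whose cumulative probability reaches P.
  final kept = <MapEntry<String, double>>[];
  var cumulative = 0.0;
  for (final entry in ranked.take(topK)) {
    kept.add(entry);
    cumulative += entry.value;
    if (cumulative >= topP) break;
  }

  // Renormalize so the surviving probabilities sum to 1.
  return {for (final e in kept) e.key: e.value / cumulative};
}

void main() {
  final logits = {'the': 2.0, 'a': 1.0, 'cat': 0.0, 'zebra': -3.0};
  final result =
      filterCandidates(logits, temperature: 1.0, topK: 3, topP: 0.9);
  print(result.keys.toList()); // [the, a]
}
```

In this run, top-K drops `zebra`, and top-P then drops `cat` because `the` and `a` together already cover more than 90% of the probability mass.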
Streaming¶
For real-time output:
final stream = llamafu.completeStream(
  'Once upon a time',
  maxTokens: 100,
);
await for (final token in stream) {
  print(token); // Each token as generated
}
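Because `print` appends a newline after every token, streamed output is often written with `stdout.write` and accumulated in a buffer instead. The helper below is a sketch under the assumption that the stream yields `String` tokens; `collectTokens` is a hypothetical name, and the demo uses a stand-in stream so it runs without a model.

```dart
import 'dart:io';

/// Writes each token as it arrives and returns the full text at the end.
Future<String> collectTokens(Stream<String> tokens) async {
  final buffer = StringBuffer();
  await for (final token in tokens) {
    stdout.write(token); // No newline, so tokens join into continuous text.
    buffer.write(token);
  }
  stdout.writeln();
  return buffer.toString();
}

Future<void> main() async {
  // Stand-in stream; with Llamafu, pass llamafu.completeStream(...) instead.
  final demo = Stream.fromIterable(['Once', ' upon', ' a', ' time']);
  final text = await collectTokens(demo);
  print(text.length); // 16
}
```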
Tokenization¶
Encode Text to Tokens¶
final tokens = llamafu.tokenize('Hello, world!');
print('Token count: ${tokens.length}');
print('Tokens: $tokens');
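One practical use of the token count is checking that a prompt leaves room for the completion. The sketch below assumes that prompt tokens and generated tokens share the same context window, which is typical for this kind of runtime but not stated in this guide. `fitsContext` is a hypothetical helper.

```dart
/// Returns true when the prompt plus the requested completion fits in the
/// context window.
bool fitsContext({
  required int promptTokens,
  required int maxTokens,
  required int contextSize,
}) =>
    promptTokens + maxTokens <= contextSize;

void main() {
  // 1900 + 200 exceeds a 2048-token window.
  print(fitsContext(promptTokens: 1900, maxTokens: 200, contextSize: 2048)); // false
  // 300 + 200 fits comfortably.
  print(fitsContext(promptTokens: 300, maxTokens: 200, contextSize: 2048)); // true
}
```

With a live instance, the inputs would come from `llamafu.tokenize(prompt).length` and `llamafu.contextSize`.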
Decode Tokens to Text¶
Token Information¶
// Get text representation of a single token
final piece = llamafu.tokenToPiece(1234);
// Special tokens
final bosToken = llamafu.bosToken; // Beginning of sequence
final eosToken = llamafu.eosToken; // End of sequence
Memory Management¶
Check Memory Usage¶
final memoryInfo = llamafu.getMemoryUsage();
print('Model size: ${memoryInfo.modelSize} bytes');
print('Context size: ${memoryInfo.contextSize} bytes');
print('Total: ${memoryInfo.totalSize} bytes');
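Raw byte counts for a model are hard to read at a glance. A small formatter such as the one below can make them friendlier. It is a sketch that assumes the memory fields are integers, and `formatBytes` is a hypothetical helper rather than a Llamafu function.

```dart
/// Formats a byte count using binary units (1 KiB = 1024 bytes).
String formatBytes(int bytes) {
  const units = ['B', 'KiB', 'MiB', 'GiB', 'TiB'];
  var value = bytes.toDouble();
  var unit = 0;
  while (value >= 1024 && unit < units.length - 1) {
    value /= 1024;
    unit++;
  }
  return '${value.toStringAsFixed(1)} ${units[unit]}';
}

void main() {
  print(formatBytes(1536)); // 1.5 KiB
  print(formatBytes(4294967296)); // 4.0 GiB
}
```

For example, `formatBytes(memoryInfo.totalSize)` could replace the raw value in the last `print` call.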
KV Cache Management¶
// Clear the key-value cache
llamafu.clearKvCache();
// Defragment cache for better performance
llamafu.defragmentKvCache();
Error Handling¶
Llamafu throws typed exceptions for different error conditions:
try {
  final llamafu = await Llamafu.init(modelPath: 'model.gguf');
  final response = await llamafu.complete('Hello');
} on LlamafuModelLoadError catch (e) {
  print('Failed to load model: $e');
} on LlamafuInferenceError catch (e) {
  print('Inference failed: $e');
} on LlamafuError catch (e) {
  print('General error: $e');
}
Thread Safety¶
Single-Threaded Design
Llamafu instances are not thread-safe. Confine each instance to a single isolate; for parallel inference, spawn additional Dart isolates, each with its own instance.
Using Isolates¶
import 'dart:isolate';

// Entry point that runs in a separate isolate
void inferenceIsolate(SendPort sendPort) async {
  final llamafu = await Llamafu.init(modelPath: 'model.gguf');
  final receivePort = ReceivePort();
  // Hand the main isolate a port it can send prompts to
  sendPort.send(receivePort.sendPort);
  await for (final prompt in receivePort) {
    final response = await llamafu.complete(prompt as String);
    sendPort.send(response);
  }
}
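The worker above covers only one side of the conversation. The sketch below shows what the spawning side might look like, using the same handshake: the worker first sends back its own `SendPort`, then one reply per prompt. To stay runnable without a model, the worker here is a stand-in that echoes the prompt. The `null` shutdown message and the names `worker` and `askWorker` are assumptions introduced for this example.

```dart
import 'dart:async';
import 'dart:isolate';

// Stand-in worker: echoes prompts instead of running a model.
void worker(SendPort toMain) {
  final inbox = ReceivePort();
  toMain.send(inbox.sendPort); // Handshake: give main a port to reach us.
  inbox.listen((message) {
    if (message == null) {
      inbox.close(); // Shutdown signal (a convention assumed for this sketch).
      return;
    }
    toMain.send('echo: $message');
  });
}

/// Spawns the worker, sends one prompt, and returns the reply.
/// A real app would keep the isolate alive across prompts so the model
/// is loaded only once.
Future<String> askWorker(String prompt) async {
  final fromWorker = ReceivePort();
  await Isolate.spawn(worker, fromWorker.sendPort);
  final events = StreamIterator<dynamic>(fromWorker);

  // The first message is the worker's SendPort.
  await events.moveNext();
  final toWorker = events.current as SendPort;

  // Every later message is a reply to a prompt.
  toWorker.send(prompt);
  await events.moveNext();
  final reply = events.current as String;

  toWorker.send(null); // Ask the worker to shut down.
  await events.cancel(); // Stop listening so the program can exit.
  fromWorker.close();
  return reply;
}

Future<void> main() async {
  print(await askWorker('Hello')); // echo: Hello
}
```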
Best Practices¶
1. Reuse Instances¶
// Good: Reuse the same instance
class ModelService {
  late final Llamafu _llamafu;

  Future<void> init() async {
    _llamafu = await Llamafu.init(modelPath: 'model.gguf');
  }

  Future<String> generate(String prompt) {
    return _llamafu.complete(prompt);
  }
}
2. Handle Cancellation¶
// _shouldCancel is a bool field owned by the caller, initially false
// Set up abort callback
llamafu.setAbortCallback(() {
  return _shouldCancel; // Return true to abort
});
// Trigger cancellation
_shouldCancel = true;
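A flag is not the only option: the callback can also enforce a time budget. The sketch below builds a callback that starts returning `true` once a deadline has passed. It assumes the abort callback has the shape `bool Function()`, as the closure above suggests, and `abortAfter` is a hypothetical helper. Because it reads a `Stopwatch` each time it is called, it does not depend on timers firing while inference is running.

```dart
/// Builds a callback that returns true once [limit] has elapsed.
bool Function() abortAfter(Duration limit) {
  final stopwatch = Stopwatch()..start();
  return () => stopwatch.elapsed >= limit;
}

void main() {
  final expired = abortAfter(Duration.zero);
  final pending = abortAfter(const Duration(hours: 1));
  print(expired()); // true
  print(pending()); // false
}
```

Under that assumption, usage would look like `llamafu.setAbortCallback(abortAfter(const Duration(seconds: 30)));`.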
3. Warm Up the Model¶
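Warming up usually means running one short, throwaway completion right after initialization, so that first-use costs (for example, paging in a memory-mapped model file) are paid before the first real request. The helper below times any asynchronous action so the effect can be measured. `timed` is a hypothetical name, and the demo uses a stand-in delay instead of a model call.

```dart
/// Runs [action] and returns how long it took.
Future<Duration> timed(Future<void> Function() action) async {
  final stopwatch = Stopwatch()..start();
  await action();
  return stopwatch.elapsed;
}

Future<void> main() async {
  // Stand-in for a short completion; it simply waits 50 ms.
  final elapsed = await timed(
    () => Future<void>.delayed(const Duration(milliseconds: 50)),
  );
  print('Warm-up took ${elapsed.inMilliseconds} ms');
}
```

With a real instance, the warm-up call itself could be as small as `await timed(() => llamafu.complete('Hello', maxTokens: 1));`, using only parameters shown earlier in this guide.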
4. Use Appropriate Context Size¶
// Smaller context = less memory, faster inference
final llamafu = await Llamafu.init(
  modelPath: 'model.gguf',
  contextSize: 512, // Use smallest size that fits your use case
);
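To see why context size matters for memory, it helps to estimate the key-value cache, which grows linearly with the context window. The sketch below uses the standard transformer estimate: one key and one value vector per layer and per position. It assumes 16-bit cache entries. The model shape in `main` is illustrative, and `kvCacheBytes` is a hypothetical helper, not a value reported by Llamafu.

```dart
/// Estimates KV-cache size in bytes for a standard transformer.
int kvCacheBytes({
  required int layers,
  required int contextSize,
  required int kvHeads,
  required int headDim,
  int bytesPerElement = 2, // 16-bit cache entries (assumption)
}) {
  // Factor 2: one key vector and one value vector per layer, per position.
  return 2 * layers * contextSize * kvHeads * headDim * bytesPerElement;
}

void main() {
  // Illustrative shape: 32 layers, 32 KV heads, head dimension 128.
  for (final ctx in [512, 2048, 4096]) {
    final bytes = kvCacheBytes(
      layers: 32,
      contextSize: ctx,
      kvHeads: 32,
      headDim: 128,
    );
    print('contextSize $ctx -> ${bytes ~/ (1024 * 1024)} MiB');
  }
  // contextSize 512 -> 256 MiB
  // contextSize 2048 -> 1024 MiB
  // contextSize 4096 -> 2048 MiB
}
```

Under these assumptions, dropping from 4096 to 512 tokens cuts the cache from about 2 GiB to 256 MiB.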
Next Steps¶
- Text Generation - Advanced generation options
- Chat Sessions - Conversational interfaces
- Performance Tuning - Optimization strategies