Basic Usage

This guide covers the fundamental concepts and patterns for using Llamafu.

Model Lifecycle

Initialization

Always initialize the model before use:

final llamafu = await Llamafu.init(
  modelPath: 'path/to/model.gguf',
  contextSize: 2048,        // Context window size
  threads: 4,               // CPU threads (0 = auto)
  gpuLayers: 0,             // Layers to offload to GPU
  useMmap: true,            // Memory-map model file
  useMlock: false,          // Lock model in RAM
);

Disposal

Always dispose when done to free resources:

llamafu.dispose();

Memory Leaks

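Failing to call dispose() leaks native memory, which Dart's garbage collector does not reclaim. Wrap short-lived usage in try/finally, or call dispose() from your widget's State.dispose() in Flutter.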
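The try/finally pattern can be sketched as a small generic helper. This is a minimal sketch, not part of Llamafu: `withResource` and its parameter names are hypothetical, chosen only to show that the cleanup step runs whether the body succeeds or throws.

```dart
/// Opens a resource, runs [body] with it, and always calls [close],
/// even when [body] throws. Hypothetical helper, not a Llamafu API.
Future<T> withResource<R, T>(
  Future<R> Function() open,
  Future<T> Function(R resource) body,
  void Function(R resource) close,
) async {
  final resource = await open();
  try {
    return await body(resource);
  } finally {
    close(resource); // Runs on success and on error
  }
}
```

With Llamafu this might be called as `withResource(() => Llamafu.init(modelPath: 'model.gguf'), (m) => m.complete('Hello'), (m) => m.dispose())`, assuming `complete` returns a `Future<String>` as the examples in this guide suggest.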

Model Information

Query model properties after initialization:

print('Model: ${llamafu.modelName}');
print('Vocab size: ${llamafu.vocabSize}');
print('Context size: ${llamafu.contextSize}');
print('Embedding size: ${llamafu.embeddingSize}');
print('Is multimodal: ${llamafu.isMultimodal}');

Text Completion

Basic Completion

final response = await llamafu.complete(
  'The quick brown fox',
  maxTokens: 50,
);

With Parameters

final response = await llamafu.complete(
  'Write a poem about nature:',
  maxTokens: 200,
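  temperature: 0.8,      // Sampling temperature (0.0 - 2.0); higher = more random
  topK: 40,              // Top-K sampling: keep the 40 most likely tokens
  topP: 0.9,             // Nucleus sampling: keep tokens within 0.9 cumulative probability
  repeatPenalty: 1.1,    // Repetition penalty (1.0 = disabled)
  seed: 42,              // Fixed seed for reproducible output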
);

Streaming

For real-time output:

final stream = llamafu.completeStream(
  'Once upon a time',
  maxTokens: 100,
);

await for (final token in stream) {
  print(token); // Called once per token; note that print adds a newline each time
}
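Because `print` ends every call with a newline, a UI usually accumulates the streamed pieces instead. The following is a minimal sketch under the assumption that `completeStream` yields `String` pieces that already carry their own spacing; `collectTokens` is a hypothetical helper name, not a Llamafu API.

```dart
/// Concatenates streamed text pieces into the full response.
Future<String> collectTokens(Stream<String> tokens) async {
  final buffer = StringBuffer();
  await for (final token in tokens) {
    buffer.write(token); // No separator is added between pieces
    // A UI could refresh its text here with buffer.toString().
  }
  return buffer.toString();
}
```

If no per-token work is needed, `Stream<String>.join()` from `dart:async` gives the same result in a single call.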

Tokenization

Encode Text to Tokens

final tokens = llamafu.tokenize('Hello, world!');
print('Token count: ${tokens.length}');
print('Tokens: $tokens');

Decode Tokens to Text

final text = llamafu.detokenize([1, 2, 3, 4]);
print('Text: $text');

Token Information

// Get text representation of a single token
final piece = llamafu.tokenToPiece(1234);

// Special tokens
final bosToken = llamafu.bosToken;  // Beginning of sequence
final eosToken = llamafu.eosToken;  // End of sequence

Memory Management

Check Memory Usage

final memoryInfo = llamafu.getMemoryUsage();
print('Model size: ${memoryInfo.modelSize} bytes');
print('Context memory: ${memoryInfo.contextSize} bytes'); // Bytes, unlike llamafu.contextSize
print('Total: ${memoryInfo.totalSize} bytes');
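Raw byte counts for multi-gigabyte models are hard to read at a glance. A small formatter such as the sketch below can help; it assumes the `memoryInfo` fields are plain `int` byte counts, as the labels above suggest, and `formatBytes` is a hypothetical helper rather than part of Llamafu.

```dart
/// Formats a byte count using binary units (1 KiB = 1024 bytes).
String formatBytes(int bytes) {
  const units = ['B', 'KiB', 'MiB', 'GiB', 'TiB'];
  var value = bytes.toDouble();
  var unit = 0;
  while (value >= 1024 && unit < units.length - 1) {
    value /= 1024;
    unit++;
  }
  // Whole bytes need no decimals; larger units get one decimal place.
  return unit == 0
      ? '$bytes B'
      : '${value.toStringAsFixed(1)} ${units[unit]}';
}
```

For example, `print('Total: ${formatBytes(memoryInfo.totalSize)}')` would replace the last line of the snippet above.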

KV Cache Management

// Clear the key-value cache
llamafu.clearKvCache();

// Defragment cache for better performance
llamafu.defragmentKvCache();

Error Handling

Llamafu throws typed exceptions for different error conditions:

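Llamafu? llamafu;
try {
  llamafu = await Llamafu.init(modelPath: 'model.gguf');
  final response = await llamafu.complete('Hello');
  print(response);
} on LlamafuModelLoadError catch (e) {
  print('Failed to load model: $e');
} on LlamafuInferenceError catch (e) {
  print('Inference failed: $e');
} on LlamafuError catch (e) {
  // Keep the general handler last so the specific ones can match first
  print('General error: $e');
} finally {
  llamafu?.dispose(); // Release native memory even when an error occurs
}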

Thread Safety

Single-Threaded Design

Llamafu instances are not thread-safe. Keep each instance inside the isolate that created it. For parallel inference, spawn additional Dart isolates and give each one its own instance.

Using Isolates

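import 'dart:isolate';

// Entry point for a separate isolate (pass it to Isolate.spawn)
void inferenceIsolate(SendPort sendPort) async {
  final llamafu = await Llamafu.init(modelPath: 'model.gguf');

  // First message: hand the spawning isolate a port for sending prompts
  final receivePort = ReceivePort();
  sendPort.send(receivePort.sendPort);

  try {
    // Every later message on sendPort is a completion result
    await for (final prompt in receivePort) {
      final response = await llamafu.complete(prompt as String);
      sendPort.send(response);
    }
  } finally {
    llamafu.dispose(); // Runs if the loop ends, e.g. after receivePort is closed
  }
}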
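The spawning side has to treat the first message from the worker as the handshake (the worker's `SendPort`) and every later message as a result. The sketch below shows that handshake end to end. To stay runnable without a model it uses a stand-in worker, `echoWorker`, that upper-cases the prompt instead of calling `complete`; both `echoWorker` and `askWorker` are hypothetical names, and the `null` shutdown message is an illustrative convention rather than anything Llamafu defines.

```dart
import 'dart:async';
import 'dart:isolate';

// Stand-in worker: replies with the upper-cased prompt instead of
// running a model. A real worker would call llamafu.complete here.
void echoWorker(SendPort toMain) {
  final inbox = ReceivePort();
  toMain.send(inbox.sendPort); // Handshake: first message is our SendPort
  inbox.listen((message) {
    if (message == null) {
      inbox.close(); // Shutdown request from the spawning isolate
      return;
    }
    toMain.send((message as String).toUpperCase());
  });
}

// Spawns the worker, sends one prompt, and returns the reply.
Future<String> askWorker(String prompt) async {
  final fromWorker = ReceivePort();
  final isolate = await Isolate.spawn(echoWorker, fromWorker.sendPort);
  final events = StreamIterator<dynamic>(fromWorker);
  try {
    await events.moveNext();
    final toWorker = events.current as SendPort; // Handshake message
    toWorker.send(prompt);
    await events.moveNext();
    final reply = events.current as String; // First real result
    toWorker.send(null); // Ask the worker to close its port
    return reply;
  } finally {
    await events.cancel();
    fromWorker.close();
    isolate.kill();
  }
}
```

A long-lived service would keep the isolate and ports open across many prompts instead of spawning one per request, since model loading is likely to dominate the cost.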

Best Practices

1. Reuse Instances

// Good: Reuse the same instance
class ModelService {
  late final Llamafu _llamafu;

  Future<void> init() async {
    _llamafu = await Llamafu.init(modelPath: 'model.gguf');
  }

  Future<String> generate(String prompt) {
    return _llamafu.complete(prompt);
  }

  // Free native memory when the service is no longer needed
  void dispose() => _llamafu.dispose();
}

2. Handle Cancellation

// Flag read by the abort callback
bool _shouldCancel = false;

// Set up abort callback
llamafu.setAbortCallback(() {
  return _shouldCancel; // Return true to abort
});

// Trigger cancellation, e.g. from a "Stop" button handler
_shouldCancel = true;
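The same callback mechanism can also enforce a time limit. The sketch below is a minimal example, assuming `setAbortCallback` accepts a `bool Function()` as the snippet above suggests; `abortAfter` is a hypothetical helper, not a Llamafu API.

```dart
/// Returns a callback that starts reporting `true` once [limit] has
/// elapsed since the callback was created.
bool Function() abortAfter(Duration limit) {
  final stopwatch = Stopwatch()..start();
  return () => stopwatch.elapsed >= limit;
}
```

It might be wired up as `llamafu.setAbortCallback(abortAfter(const Duration(seconds: 10)));` just before starting a completion, so the clock starts when generation does.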

3. Warm Up the Model

// Run a small completion to warm up caches
await llamafu.warmup();

4. Use Appropriate Context Size

// Smaller context = less memory, faster inference
final llamafu = await Llamafu.init(
  modelPath: 'model.gguf',
  contextSize: 512,  // Use smallest size that fits your use case
);
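One way to pick the smallest workable size is to count the prompt's tokens and add the generation budget. The sketch below assumes the context window must hold both the prompt and the generated tokens, and rounds up to a power of two purely by convention; `smallestContextSize` is a hypothetical helper, and `minimum` must be positive.

```dart
/// Smallest power-of-two multiple of [minimum] that can hold the prompt
/// plus the requested number of generated tokens.
int smallestContextSize(
  int promptTokens,
  int maxTokens, {
  int minimum = 512,
}) {
  final needed = promptTokens + maxTokens;
  var size = minimum;
  while (size < needed) {
    size *= 2; // 512 -> 1024 -> 2048 -> ...
  }
  return size;
}
```

Here `promptTokens` could come from `llamafu.tokenize(prompt).length` on an already loaded model, or from an estimate of your longest expected prompt.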

Next Steps

Text Generation - Advanced generation options
Chat Sessions - Conversational interfaces
Performance Tuning - Optimization strategies