Inference Options¶
Parameters for text generation and inference operations.
InferParams¶
Parameters for complete() and completeStream().
class InferParams {
  final String prompt;
  final int maxTokens;
  final double temperature;
  final int topK;
  final double topP;
  final double repeatPenalty;
  final int seed;
  final List<String>? stopSequences;
}
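The class listing above shows only the fields. As a minimal sketch of how the documented defaults and ranges fit together, the following hypothetical `InferParamsSketch` class (not part of the library; its constructor and `validate()` method are assumptions made for illustration) mirrors the fields and checks each value against the ranges given in the parameter reference on this page.

```dart
/// Hypothetical stand-in for InferParams. The real constructor is not shown
/// on this page, so the named parameters and validate() are assumptions.
class InferParamsSketch {
  final String prompt;
  final int maxTokens;
  final double temperature;
  final int topK;
  final double topP;
  final double repeatPenalty;
  final int seed;
  final List<String>? stopSequences;

  // Defaults are the ones listed in the parameter reference.
  const InferParamsSketch({
    required this.prompt,
    this.maxTokens = 256,
    this.temperature = 0.7,
    this.topK = 40,
    this.topP = 0.9,
    this.repeatPenalty = 1.1,
    this.seed = 0,
    this.stopSequences,
  });

  /// Returns a list of problems; an empty list means every value is inside
  /// its documented range.
  List<String> validate() {
    final problems = <String>[];
    if (maxTokens < 1) {
      problems.add('maxTokens must be at least 1');
    }
    if (temperature < 0.0 || temperature > 2.0) {
      problems.add('temperature must be between 0.0 and 2.0');
    }
    if (topK < 0) {
      problems.add('topK must be 0 (disabled) or a positive integer');
    }
    if (topP < 0.0 || topP > 1.0) {
      problems.add('topP must be between 0.0 and 1.0');
    }
    if (repeatPenalty < 1.0 || repeatPenalty > 2.0) {
      problems.add('repeatPenalty must be between 1.0 and 2.0');
    }
    return problems;
  }
}
```

Calling `InferParamsSketch(prompt: 'Hello').validate()` returns an empty list, since every default falls inside its documented range.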
Parameter Reference¶
prompt¶
The input text to complete.
- Type: String
- Required: Yes
// Simple prompt
'What is the capital of France?'
// With context
'''Context: Paris is a city in France.
Question: What is the capital of France?
Answer:'''
// With chat format
'<|user|>Hello<|assistant|>'
maxTokens¶
Maximum number of tokens to generate.
- Type: int
- Default: 256
- Range: 1 to context size
| Use Case | Recommended |
|---|---|
| Short answer | 50-100 |
| Paragraph | 150-300 |
| Long response | 500-1000 |
| Maximum | Context size - prompt tokens |
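The "Maximum" row is simple arithmetic, sketched below. The helper names are hypothetical, and the sketch assumes you already know the prompt's token count (how to obtain it is not covered on this page).

```dart
/// Largest number of tokens that can still be generated once the prompt
/// occupies part of the context. Returns 0 if the prompt already fills it.
int maxGeneratableTokens({required int contextSize, required int promptTokens}) {
  final remaining = contextSize - promptTokens;
  return remaining > 0 ? remaining : 0;
}

/// Clamps a requested maxTokens value to the remaining context budget.
int clampMaxTokens(
  int requested, {
  required int contextSize,
  required int promptTokens,
}) {
  final budget = maxGeneratableTokens(
    contextSize: contextSize,
    promptTokens: promptTokens,
  );
  return requested < budget ? requested : budget;
}
```

For example, with a 2048-token context and a 48-token prompt, at most 2000 tokens can be generated, so a request for 500 passes through unchanged.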
temperature¶
Controls randomness of output.
- Type: double
- Default: 0.7
- Range: 0.0 to 2.0
// Deterministic (greedy)
temperature: 0.0
// Low creativity
temperature: 0.3
// Balanced
temperature: 0.7
// High creativity
temperature: 1.0
// Very random
temperature: 1.5
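To see why lower values are more deterministic, here is a sketch of the usual temperature-scaled softmax. It illustrates the general technique, not this library's internals; the function name and the sample logits are made up for the example.

```dart
import 'dart:math' as math;

/// Converts raw logits into probabilities after dividing by [temperature].
/// A temperature of 0.0 is treated as greedy: all mass goes to the top logit.
/// Expects a non-empty list.
List<double> softmaxWithTemperature(List<double> logits, double temperature) {
  if (temperature <= 0.0) {
    var best = 0;
    for (var i = 1; i < logits.length; i++) {
      if (logits[i] > logits[best]) best = i;
    }
    return [for (var i = 0; i < logits.length; i++) i == best ? 1.0 : 0.0];
  }
  final scaled = [for (final l in logits) l / temperature];
  // Subtract the maximum before exponentiating for numerical stability.
  final maxLogit = scaled.reduce(math.max);
  final exps = [for (final s in scaled) math.exp(s - maxLogit)];
  final sum = exps.reduce((a, b) => a + b);
  return [for (final e in exps) e / sum];
}
```

With logits such as `[2.0, 1.0, 0.0]`, a temperature of 0.3 puts far more probability on the first token than a temperature of 1.5 does, which is why low values read as "focused" and high values as "random".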
topK¶
Limits token selection to top K candidates.
- Type: int
- Default: 40
- Range: 1 to vocabulary size (0 disables top-K filtering)
| Value | Effect |
|---|---|
| 1 | Greedy (only top token) |
| 10-20 | Focused |
| 40-50 | Balanced |
| 100+ | Wide selection |
| 0 | Disabled |
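The table can be read as a filter over candidate tokens. The sketch below shows the general top-K idea under the assumption that 0 means "keep everything", as in the table; `topKIndices` is a hypothetical helper, not a library function.

```dart
/// Returns the indices of the [k] highest-scoring tokens, best first.
/// A value of k <= 0 disables the filter and returns every index.
List<int> topKIndices(List<double> logits, int k) {
  final indices = List<int>.generate(logits.length, (i) => i);
  // Sort token indices by descending logit.
  indices.sort((a, b) => logits[b].compareTo(logits[a]));
  if (k <= 0 || k >= indices.length) return indices;
  return indices.sublist(0, k);
}
```

With `k` set to 1 only the single best token survives, which is why that row is labelled greedy.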
topP¶
Nucleus sampling: tokens are drawn only from the smallest set of most probable candidates whose cumulative probability reaches topP.
- Type: double
- Default: 0.9
- Range: 0.0 to 1.0
// Very focused
topP: 0.5
// Balanced
topP: 0.9
// Wide (almost all tokens)
topP: 0.99
// Disabled
topP: 1.0
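As a rough illustration of the cumulative threshold, the following sketch selects the nucleus from a list of probabilities. It shows the standard technique rather than this library's implementation, and `nucleusIndices` is a hypothetical name.

```dart
/// Returns the indices of the smallest set of tokens whose cumulative
/// probability reaches [topP], most probable first.
List<int> nucleusIndices(List<double> probs, double topP) {
  final indices = List<int>.generate(probs.length, (i) => i);
  // Visit tokens from most to least probable.
  indices.sort((a, b) => probs[b].compareTo(probs[a]));
  final kept = <int>[];
  var cumulative = 0.0;
  for (final i in indices) {
    kept.add(i);
    cumulative += probs[i];
    if (cumulative >= topP) break;
  }
  return kept;
}
```

For probabilities `[0.5, 0.3, 0.15, 0.05]`, a threshold of 0.5 keeps only the first token, 0.9 keeps the first three, and 1.0 keeps all four, which matches the "disabled" case.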
repeatPenalty¶
Penalizes tokens that have already appeared, which discourages loops and repeated phrases.
- Type: double
- Default: 1.1
- Range: 1.0 to 2.0
| Value | Effect |
|---|---|
| 1.0 | No penalty |
| 1.1 | Light penalty |
| 1.2 | Moderate penalty |
| 1.5+ | Strong penalty |
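One common way to apply such a penalty, used by llama.cpp-style samplers, is to divide positive logits by the penalty and multiply negative ones by it. Whether this library does exactly that is an assumption; the sketch below only shows why 1.0 means "no penalty". `applyRepeatPenalty` is a hypothetical helper.

```dart
/// Returns a copy of [logits] with already-seen tokens made less likely.
/// Positive logits are divided by [penalty] and non-positive ones are
/// multiplied by it, so both move toward lower probability.
List<double> applyRepeatPenalty(
  List<double> logits,
  Set<int> seenTokenIds,
  double penalty,
) {
  final adjusted = List<double>.of(logits);
  for (final id in seenTokenIds) {
    if (id < 0 || id >= adjusted.length) continue; // ignore unknown ids
    adjusted[id] =
        adjusted[id] > 0 ? adjusted[id] / penalty : adjusted[id] * penalty;
  }
  return adjusted;
}
```

With a penalty of 1.0 every logit is unchanged; larger values push previously generated tokens further down the ranking.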
seed¶
Random seed for reproducibility.
- Type: int
- Default: 0 (random)
// Reproducible output
final result1 = await llamafu.complete(prompt, seed: 42);
final result2 = await llamafu.complete(prompt, seed: 42);
// result1 == result2 (same seed)
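The reason a fixed seed reproduces output is that sampling draws from a pseudo-random generator. The sketch below shows the idea with Dart's own `Random` from `dart:math`; `sampleIndices` and the probabilities are illustrative and say nothing about which generator the library uses internally.

```dart
import 'dart:math';

/// Draws [count] token indices from a categorical distribution using a
/// generator seeded with [seed]. The same seed yields the same sequence.
List<int> sampleIndices(List<double> probs, int seed, int count) {
  final rng = Random(seed);
  final drawn = <int>[];
  for (var n = 0; n < count; n++) {
    final r = rng.nextDouble();
    var cumulative = 0.0;
    // Fall back to the last index in case rounding leaves the sum below 1.
    var chosen = probs.length - 1;
    for (var i = 0; i < probs.length; i++) {
      cumulative += probs[i];
      if (r < cumulative) {
        chosen = i;
        break;
      }
    }
    drawn.add(chosen);
  }
  return drawn;
}
```

Two calls with the same probabilities, seed, and count return identical sequences. Reproducibility in practice also depends on the prompt, model, and other sampling parameters being the same.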
stopSequences¶
Stop generation when encountering these strings.
- Type: List<String>?
- Default: null
await llamafu.complete(
  'List three colors:\n1.',
  stopSequences: ['\n4.', '\n\n'], // Stop after 3 items
);
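The effect of stop sequences can be pictured as cutting the text at the earliest match. The sketch below assumes the matched sequence itself is excluded from the result, which this page does not confirm; `truncateAtStop` is a hypothetical helper for illustration.

```dart
/// Cuts [text] at the earliest occurrence of any stop sequence.
/// Returns [text] unchanged when no sequence matches or the list is null.
String truncateAtStop(String text, List<String>? stopSequences) {
  if (stopSequences == null) return text;
  var cut = text.length;
  for (final stop in stopSequences) {
    if (stop.isEmpty) continue; // an empty string would match at index 0
    final at = text.indexOf(stop);
    if (at != -1 && at < cut) cut = at;
  }
  return text.substring(0, cut);
}
```

Applied to a raw continuation such as `' Red\n2. Green\n3. Blue\n4. Yellow'` with the stop sequences from the example above, the result ends after the third item.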
Preset Configurations¶
Factual Q&A¶
Creative Writing¶
await llamafu.complete(
  prompt,
  temperature: 0.9,
  topK: 50,
  topP: 0.95,
  repeatPenalty: 1.2,
  maxTokens: 500,
);
Code Generation¶
await llamafu.complete(
  prompt,
  temperature: 0.2,
  topK: 20,
  topP: 0.9,
  repeatPenalty: 1.0,
  maxTokens: 300,
);
Chat/Dialogue¶
await llamafu.complete(
  prompt,
  temperature: 0.7,
  topK: 40,
  topP: 0.9,
  repeatPenalty: 1.1,
  maxTokens: 200,
);
Streaming Options¶
completeStream() accepts the same parameters:
// stdout requires: import 'dart:io';
final stream = llamafu.completeStream(
  prompt,
  maxTokens: 200,
  temperature: 0.8,
);
// Print each token as soon as it arrives.
await for (final token in stream) {
  stdout.write(token);
}
Grammar Constraints¶
For structured output:
await llamafu.completeWithGrammar(
  'Generate a number:',
  grammar: 'root ::= [0-9]+',
  maxTokens: 10,
  temperature: 0.5,
);
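If you want a defensive check on the result, the same constraint can be expressed as a regular expression on the caller's side. This is a sketch using Dart's built-in `RegExp`; the helper name is hypothetical and the check is independent of how the library enforces the grammar.

```dart
/// Returns true when [output] consists of one or more ASCII digits,
/// mirroring the grammar rule `root ::= [0-9]+`.
bool matchesDigitsGrammar(String output) {
  return RegExp(r'^[0-9]+$').hasMatch(output);
}
```

A check like this is useful mainly in tests or when the output is passed on to code that parses it with `int.parse`.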
Performance Considerations¶
Token Generation Speed¶
Factors affecting speed:
1. Model size (smaller = faster)
2. Quantization (Q4 faster than Q8)
3. Context length (longer = slower)
4. Temperature (0.0 slightly faster)
Memory Usage¶
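Each generated token adds an entry to the KV cache, so memory use grows with the total number of tokens held in context (prompt plus output).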
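As a back-of-the-envelope estimate, a standard transformer stores one key vector and one value vector per layer for every token. The sketch below assumes an unquantized 16-bit cache and uses hypothetical model dimensions; the function names are illustrative and the actual figures depend on the model and on how the runtime configures its cache.

```dart
/// Approximate KV cache bytes added per token.
/// The factor of 2 accounts for storing both a key and a value vector.
int kvCacheBytesPerToken({
  required int layers,
  required int kvHeads,
  required int headDim,
  int bytesPerElement = 2, // assumes 16-bit cache entries
}) {
  return 2 * layers * kvHeads * headDim * bytesPerElement;
}

/// Approximate KV cache bytes for [tokens] tokens held in context.
int kvCacheBytes({
  required int tokens,
  required int layers,
  required int kvHeads,
  required int headDim,
  int bytesPerElement = 2,
}) {
  return tokens *
      kvCacheBytesPerToken(
        layers: layers,
        kvHeads: kvHeads,
        headDim: headDim,
        bytesPerElement: bytesPerElement,
      );
}
```

For a hypothetical model with 32 layers, 32 KV heads, and a head dimension of 128, this works out to 524,288 bytes (0.5 MiB) per token, or 1 GiB for 2048 tokens.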
Best Practices¶
- Set maxTokens appropriately; don't over-allocate
- Use lower temperature for factual content
- Enable stopSequences when format is known
- Clear KV cache between unrelated prompts