Inference Options¶

Parameters for text generation and inference operations.

InferParams¶

Parameters for complete() and completeStream().

class InferParams {
  final String prompt;
  final int maxTokens;
  final double temperature;
  final int topK;
  final double topP;
  final double repeatPenalty;
  final int seed;
  final List<String>? stopSequences;
}

Parameter Reference¶

`prompt`¶

The input text to complete.

Type: String
Required: Yes

// Simple prompt
'What is the capital of France?'

// With context
'''Context: Paris is a city in France.
Question: What is the capital of France?
Answer:'''

// With chat format
'<|user|>Hello<|assistant|>'

`maxTokens`¶

Maximum number of tokens to generate.

Type: int
Default: 256
Range: 1 to context size

Use Case	Recommended
Short answer	50-100
Paragraph	150-300
Long response	500-1000
Maximum	Context size - prompt tokens

`temperature`¶

Controls randomness of output.

Type: double
Default: 0.7
Range: 0.0 to 2.0

// Deterministic (greedy)
temperature: 0.0

// Low creativity
temperature: 0.3

// Balanced
temperature: 0.7

// High creativity
temperature: 1.0

// Very random
temperature: 1.5

`topK`¶

Limits token selection to top K candidates.

Type: int
Default: 40
Range: 1 to vocabulary size

Value	Effect
1	Greedy (only top token)
10-20	Focused
40-50	Balanced
100+	Wide selection
0	Disabled

`topP`¶

Nucleus sampling - cumulative probability threshold.

Type: double
Default: 0.9
Range: 0.0 to 1.0

// Very focused
topP: 0.5

// Balanced
topP: 0.9

// Wide (almost all tokens)
topP: 0.99

// Disabled
topP: 1.0

`repeatPenalty`¶

Penalizes repeated tokens.

Type: double
Default: 1.1
Range: 1.0 to 2.0

Value	Effect
1.0	No penalty
1.1	Light penalty
1.2	Moderate penalty
1.5+	Strong penalty

`seed`¶

Random seed for reproducibility.

Type: int
Default: 0 (random)

// Reproducible output
final result1 = await llamafu.complete(prompt, seed: 42);
final result2 = await llamafu.complete(prompt, seed: 42);
// result1 == result2 (same seed)

`stopSequences`¶

Stop generation when encountering these strings.

Type: List<String>?
Default: null

await llamafu.complete(
  'List three colors:\n1.',
  stopSequences: ['\n4.', '\n\n'],  // Stop after 3 items
);

Preset Configurations¶

Factual Q&A¶

await llamafu.complete(
  prompt,
  temperature: 0.1,
  topK: 10,
  topP: 0.9,
  maxTokens: 100,
);

Creative Writing¶

await llamafu.complete(
  prompt,
  temperature: 0.9,
  topK: 50,
  topP: 0.95,
  repeatPenalty: 1.2,
  maxTokens: 500,
);

Code Generation¶

await llamafu.complete(
  prompt,
  temperature: 0.2,
  topK: 20,
  topP: 0.9,
  repeatPenalty: 1.0,
  maxTokens: 300,
);

Chat/Dialogue¶

await llamafu.complete(
  prompt,
  temperature: 0.7,
  topK: 40,
  topP: 0.9,
  repeatPenalty: 1.1,
  maxTokens: 200,
);

Streaming Options¶

completeStream() accepts the same parameters:

final stream = llamafu.completeStream(
  prompt,
  maxTokens: 200,
  temperature: 0.8,
);

await for (final token in stream) {
  stdout.write(token);
}

Grammar Constraints¶

For structured output:

await llamafu.completeWithGrammar(
  'Generate a number:',
  grammar: 'root ::= [0-9]+',
  maxTokens: 10,
  temperature: 0.5,
);

Performance Considerations¶

Token Generation Speed¶

Factors affecting speed: 1. Model size (smaller = faster) 2. Quantization (Q4 faster than Q8) 3. Context length (longer = slower) 4. Temperature (0.0 slightly faster)

Memory Usage¶

Each generated token adds to KV cache:

Memory per token ≈ 2 × layers × embedding_size × sizeof(float16)

Best Practices¶

Set maxTokens appropriately - don't over-allocate
Use lower temperature for factual content
Enable stopSequences when format is known
Clear KV cache between unrelated prompts