API Surface
Public types in Llmdot.Core, grouped by namespace.
Llmdot.Inference
LoadedModel
A GGUF model loaded into memory, ready for inference.
public sealed class LoadedModel : IDisposable
{
    public TransformerConfig Config { get; }
    public BpeTokenizer Tokenizer { get; }
    public ChatTemplate? ChatTemplate { get; }
    public ModelCapabilities Capabilities { get; }

    public static LoadedModel Load(Stream stream);
    public ModelValidationResult Validate();
    public void Dispose();
}
The stream is owned by the returned instance and disposed with it.
InferenceEngine
Runs autoregressive inference on a LoadedModel using an IComputeBackend for tensor operations.
public sealed class InferenceEngine
{
    public InferenceEngine(LoadedModel model);                          // CPU backend
    public InferenceEngine(LoadedModel model, IComputeBackend backend); // custom backend

    public IAsyncEnumerable<int> Generate(
        int[] promptTokens,
        GenerationOptions options,
        CancellationToken cancellationToken = default);
}
Generate yields token IDs as they are produced. Convert them to text with model.Tokenizer.Decode(...).
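As a rough end-to-end sketch of the two types above, the following loads a model, generates a few tokens, and decodes them. It uses only the members listed on this page; the path "model.gguf" and the prompt text are placeholders, and whether the prompt needs special tokens (such as a beginning-of-sequence token) added by the caller is not stated here, so treat that as an open assumption.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Llmdot.Inference;

// "model.gguf" is a placeholder path, not a file shipped with the library.
// The FileStream is not wrapped in its own using: the returned LoadedModel
// owns the stream and disposes it when the model is disposed.
using LoadedModel model = LoadedModel.Load(File.OpenRead("model.gguf"));

var engine = new InferenceEngine(model); // CPU backend
int[] promptTokens = model.Tokenizer.Encode("The capital of France is");
var options = new GenerationOptions { MaxTokens = 32 };

// Collect the generated token IDs, then decode them in one call.
var generated = new List<int>();
await foreach (int token in engine.Generate(promptTokens, options))
{
    generated.Add(token);
}

Console.WriteLine(model.Tokenizer.Decode(generated));
```

Decoding the collected IDs in one call, rather than token by token, avoids assuming that every single token maps to a complete piece of text.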
ChatSession
Multi-turn chat over a LoadedModel. Handles prompt formatting via the model's ChatTemplate (or a <role>content fallback), encoding, decoding, and stop-sequence detection.
public sealed class ChatSession
{
    public ChatSession(LoadedModel model);

    public IAsyncEnumerable<string> GenerateAsync(
        string prompt,
        GenerationOptions? options = null,
        CancellationToken cancellationToken = default);

    public void Reset();
}
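A minimal two-turn conversation might look like the sketch below. It assumes that GenerateAsync yields the reply as successive text fragments and that Reset() clears the accumulated conversation; neither behavior is spelled out on this page. The model path and prompts are placeholders.

```csharp
using System;
using System.IO;
using Llmdot.Inference;

// Placeholder path; the model owns and disposes the stream.
using LoadedModel model = LoadedModel.Load(File.OpenRead("model.gguf"));
var chat = new ChatSession(model);

// First turn: print each fragment as it arrives (default GenerationOptions).
await foreach (string piece in chat.GenerateAsync("Name three prime numbers."))
{
    Console.Write(piece);
}
Console.WriteLine();

// Second turn on the same session, which is what makes the chat multi-turn.
await foreach (string piece in chat.GenerateAsync("Now double each of them."))
{
    Console.Write(piece);
}
Console.WriteLine();

// Assumption: Reset() starts a fresh conversation on the same model.
chat.Reset();
```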
GenerationOptions
public sealed class GenerationOptions
{
    public int MaxTokens { get; init; } = 256;
    public SamplingOptions Sampling { get; init; } = new();
    public string? StopSequence { get; init; }
    public string[] StopSequences { get; init; } = [];
}
A GenerationOptionsBuilder is also provided for fluent construction.
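Because every property is init-only, options can also be written with an object initializer. The sketch below uses only the properties listed on this page; the stop strings and numeric values are arbitrary illustrations, and how StopSequence and StopSequences combine when both are set is not documented here.

```csharp
using System;
using Llmdot.Inference;
using Llmdot.Sampling;

var options = new GenerationOptions
{
    MaxTokens = 64,
    StopSequences = ["</s>", "\n\n"], // illustrative stop strings
    Sampling = new SamplingOptions
    {
        Temperature = 0.2f,
        TopK = 20,
        Seed = 42, // >= 0 requests deterministic sampling
    },
};

// Properties that are not set keep their declared defaults (TopP stays 0.95f).
Console.WriteLine(options.Sampling.TopP);
```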
IComputeBackend
Tensor compute abstraction so CPU and GPU backends are interchangeable. All operations work on contiguous Span<float> (or quantized ReadOnlySpan<byte> for weights).
public interface IComputeBackend : IDisposable
{
    void MatMul(ReadOnlySpan<float> a, ReadOnlySpan<byte> b, Span<float> result,
        GgmlType bType, int aCols, int bCols, string? weightKey = null);
    void MatMulF32(ReadOnlySpan<float> a, ReadOnlySpan<float> b, Span<float> result,
        int aCols, int bCols, string? weightKey = null);

    void RmsNorm(ReadOnlySpan<float> input, ReadOnlySpan<float> weights, Span<float> output, float epsilon);
    void LayerNorm(ReadOnlySpan<float> input, ReadOnlySpan<float> weights, ReadOnlySpan<float> bias,
        Span<float> output, float epsilon);

    void ApplyRoPE(Span<float> query, Span<float> key, int headDim, int position, float freqBase, int rotaryDim);
    void ApplyRoPE(Span<float> query, Span<float> key, int headDim, int position, ReadOnlySpan<float> freqTable);

    void Softmax(Span<float> input, float? softcap = null);
    void Silu(ReadOnlySpan<float> input, Span<float> result);
    void SiluInPlace(Span<float> input);
    void Gelu(ReadOnlySpan<float> input, Span<float> result);
    float GeluScalar(float x);

    void Add(Span<float> a, ReadOnlySpan<float> b);
    void Add(ReadOnlySpan<float> a, ReadOnlySpan<float> b, Span<float> result);
    void Scale(ReadOnlySpan<float> input, float scale, Span<float> result);
    void Scale(Span<float> input, float scale);
    void Mul(ReadOnlySpan<float> a, ReadOnlySpan<float> b, Span<float> result);
    void Softcap(Span<float> input, float cap);

    void Conv1D(ReadOnlySpan<float> input, ReadOnlySpan<float> weights, Span<float> output,
        int kernelSize, int inputDim);
    void DequantizeToFloat(ReadOnlySpan<byte> src, Span<float> dst, GgmlType type, int numRows, int rowElements);
    int ArgMax(ReadOnlySpan<float> input);
}
CpuBackend is the default managed implementation. MetalBackend and VulkanBackend (in their own assemblies) implement the same interface.
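To give a feel for what a backend is expected to compute, here is an unoptimized reference sketch of two of the operations, written against the same span-based signatures. It is not the CpuBackend source. ReferenceOps is a hypothetical name, RmsNorm follows the common definition (divide by the root mean square, then multiply by the weights), and Softmax leaves out the optional softcap parameter.

```csharp
using System;

// Hypothetical helper class for illustration; not part of Llmdot.Core.
static class ReferenceOps
{
    // output[i] = input[i] / sqrt(mean(input^2) + epsilon) * weights[i]
    public static void RmsNorm(ReadOnlySpan<float> input, ReadOnlySpan<float> weights,
                               Span<float> output, float epsilon)
    {
        float sumSquares = 0f;
        for (int i = 0; i < input.Length; i++)
        {
            sumSquares += input[i] * input[i];
        }

        float scale = 1f / MathF.Sqrt(sumSquares / input.Length + epsilon);
        for (int i = 0; i < input.Length; i++)
        {
            output[i] = input[i] * scale * weights[i];
        }
    }

    // In-place softmax; the maximum is subtracted first for numerical stability.
    public static void Softmax(Span<float> input)
    {
        float max = float.NegativeInfinity;
        for (int i = 0; i < input.Length; i++)
        {
            max = MathF.Max(max, input[i]);
        }

        float sum = 0f;
        for (int i = 0; i < input.Length; i++)
        {
            input[i] = MathF.Exp(input[i] - max);
            sum += input[i];
        }

        for (int i = 0; i < input.Length; i++)
        {
            input[i] /= sum;
        }
    }
}
```

A scalar reference like this is mainly useful as a baseline when checking a new backend: run both on the same input and compare the results within a small tolerance.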
ModelCapabilities, ModelInfo
Derived descriptors of a loaded model (architecture name, execution template, etc.). See ModelCapabilities.FromConfig(config).
Llmdot.Sampling
SamplingOptions
public sealed class SamplingOptions
{
    public int TopK { get; init; } = 40;
    public float TopP { get; init; } = 0.95f;
    public float Temperature { get; init; } = 0.8f;
    public float RepeatPenalty { get; init; } = 1.1f;
    public int RepeatPenaltyWindowSize { get; init; } = 64;
    public int Seed { get; init; } = -1; // -1 = random, >=0 = deterministic
}
Sampler consumes these options and runs over logits using the chosen IComputeBackend.
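For readers unfamiliar with these knobs, the sketch below shows one conventional way TopK, TopP, and Temperature can interact when picking a token from non-empty logits. It is an illustration under stated assumptions, not the Sampler's actual code: SamplingSketch is a hypothetical name, the order of the steps is one common choice, the repeat penalty is left out, and a temperature of zero or below is treated as greedy selection.

```csharp
using System;
using System.Linq;

// Hypothetical illustration; not part of Llmdot.Core.
static class SamplingSketch
{
    public static int Sample(ReadOnlySpan<float> logits, int topK, float topP,
                             float temperature, Random rng)
    {
        // Copy to an array so the sort comparison can capture it.
        float[] values = logits.ToArray();
        int[] order = Enumerable.Range(0, values.Length).ToArray();
        Array.Sort(order, (a, b) => values[b].CompareTo(values[a])); // descending

        // Assumption: non-positive temperature means "pick the best token".
        if (temperature <= 0f) return order[0];

        // Top-k: keep only the k highest-scoring candidates.
        int k = topK > 0 ? Math.Min(topK, order.Length) : order.Length;

        // Temperature-scaled softmax over the survivors.
        var probs = new double[k];
        double max = values[order[0]] / temperature;
        double sum = 0.0;
        for (int i = 0; i < k; i++)
        {
            probs[i] = Math.Exp(values[order[i]] / temperature - max);
            sum += probs[i];
        }
        for (int i = 0; i < k; i++) probs[i] /= sum;

        // Top-p: smallest prefix whose cumulative probability reaches topP.
        int n = 0;
        double cumulative = 0.0;
        while (n < k)
        {
            cumulative += probs[n];
            n++;
            if (cumulative >= topP) break;
        }

        // Draw from the prefix, scaled by its total mass.
        double r = rng.NextDouble() * cumulative;
        double acc = 0.0;
        for (int i = 0; i < n; i++)
        {
            acc += probs[i];
            if (r < acc) return order[i];
        }
        return order[n - 1];
    }
}
```

Under the Seed convention above, a caller of this sketch would pass new Random(seed) for a seed of zero or more and new Random() for -1.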
Llmdot.Loading
GGUF parsing primitives: GgufReader, GgufModel, GgufMetadata, GgufTensorInfo, GgufConstants, GgufValueType, GgmlType, ArchitectureResolver, TensorNameResolver, ModelValidator, ModelValidationResult.
ArchitectureResolver.Resolve(ggufModel) is what produces the TransformerConfig that every execution template reads from.
Llmdot.Tokenization
BpeTokenizer with Encode(string) -> int[] and Decode(IEnumerable<int>) -> string. Constructed via BpeTokenizer.FromGguf(metadata).
ChatTemplate parses the GGUF chat template (Jinja-style) via ChatTemplate.FromGguf(metadata, tokenizer) and renders a list of ChatMessageEntry(role, content) to a prompt string.
Llmdot.Tensors
Tensor, TensorOps, and TensorSize describe and manipulate tensor data. Dequantize and Numeric subnamespaces hold the dequantization and numeric helpers.