API Surface¶

Public types in Llmdot.Core, grouped by namespace.

`Llmdot.Inference`¶

`LoadedModel`¶

A GGUF model loaded into memory, ready for inference.

public sealed class LoadedModel : IDisposable
{
    public TransformerConfig Config { get; }
    public BpeTokenizer Tokenizer { get; }
    public ChatTemplate? ChatTemplate { get; }
    public ModelCapabilities Capabilities { get; }

    public static LoadedModel Load(Stream stream);
    public ModelValidationResult Validate();
    public void Dispose();
}

Load takes ownership of the stream: the returned LoadedModel holds it and disposes it in Dispose(), so the caller should not dispose the stream separately.
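The ownership rule can be illustrated with a minimal sketch. It assumes a local file named model.gguf (a placeholder path) and uses only the members listed above. The members of ModelValidationResult are not shown in this section, so the result is only captured, not inspected.

```csharp
using System.IO;
using Llmdot.Inference;

// "model.gguf" is a placeholder path used for illustration only.
// The stream is deliberately NOT wrapped in its own `using`,
// because the LoadedModel takes ownership of it.
Stream stream = File.OpenRead("model.gguf");

// Disposing the model at the end of the scope also disposes the stream.
using LoadedModel model = LoadedModel.Load(stream);

// Capture the validation result; how to inspect it is not covered here.
var validation = model.Validate();
```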

`InferenceEngine`¶

Runs autoregressive inference on a LoadedModel using an IComputeBackend for tensor operations.

public sealed class InferenceEngine
{
    public InferenceEngine(LoadedModel model);                          // CPU backend
    public InferenceEngine(LoadedModel model, IComputeBackend backend); // custom backend

    public IAsyncEnumerable<int> Generate(
        int[] promptTokens,
        GenerationOptions options,
        CancellationToken cancellationToken = default);
}

Generate yields token IDs, not text. Convert them to a string with model.Tokenizer.Decode(...).
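A sketch of the encode, generate, decode round trip, using only the signatures listed on this page. The helper name CompleteAsync and the MaxTokens value are hypothetical and chosen for illustration.

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Llmdot.Inference;

static class GenerateExample
{
    // Hypothetical helper: runs one completion and returns the decoded text.
    public static async Task<string> CompleteAsync(
        LoadedModel model, string prompt, CancellationToken cancellationToken = default)
    {
        var engine = new InferenceEngine(model);               // CPU backend overload
        int[] promptTokens = model.Tokenizer.Encode(prompt);
        var options = new GenerationOptions { MaxTokens = 64 };

        // Collect every token ID that Generate yields.
        var generated = new List<int>();
        await foreach (int token in engine.Generate(promptTokens, options, cancellationToken))
        {
            generated.Add(token);
        }

        // Decode the whole sequence in one call.
        return model.Tokenizer.Decode(generated);
    }
}
```

Decoding the collected list once keeps the sketch simple. A streaming UI would need to decode incrementally, which is what ChatSession does on the caller's behalf.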

`ChatSession`¶

Multi-turn chat over a LoadedModel. Handles prompt formatting via the model's ChatTemplate (or a <role>content fallback), encoding, decoding, and stop-sequence detection.

public sealed class ChatSession
{
    public ChatSession(LoadedModel model);

    public IAsyncEnumerable<string> GenerateAsync(
        string prompt,
        GenerationOptions? options = null,
        CancellationToken cancellationToken = default);

    public void Reset();
}
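A minimal usage sketch for a two-turn conversation, based on the signatures above. The prompts are illustrative. The comment on Reset() states an assumption: this section does not spell out what Reset() clears.

```csharp
using System;
using System.Threading.Tasks;
using Llmdot.Inference;

static class ChatExample
{
    public static async Task RunAsync(LoadedModel model)
    {
        var chat = new ChatSession(model);

        // First turn: print text pieces as they arrive.
        await foreach (string piece in chat.GenerateAsync("What is a GGUF file?"))
        {
            Console.Write(piece);
        }
        Console.WriteLine();

        // Second turn on the same session (multi-turn chat).
        await foreach (string piece in chat.GenerateAsync("Summarize that in one sentence."))
        {
            Console.Write(piece);
        }
        Console.WriteLine();

        // Assumption: Reset() discards the accumulated conversation state.
        chat.Reset();
    }
}
```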

`GenerationOptions`¶

public sealed class GenerationOptions
{
    public int MaxTokens { get; init; } = 256;
    public SamplingOptions Sampling { get; init; } = new();
    public string? StopSequence { get; init; }
    public string[] StopSequences { get; init; } = [];
}

A GenerationOptionsBuilder is also provided for fluent construction.
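Because every property is init-only, the options can also be set with an object initializer. The sketch below uses illustrative values and stop strings. Properties left out keep the defaults shown above.

```csharp
using Llmdot.Inference;

var options = new GenerationOptions
{
    MaxTokens = 128,
    // Example stop strings only; pick ones that match the model's prompt format.
    StopSequences = ["</s>", "\nUser:"],
};
```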

`IComputeBackend`¶

Abstraction over tensor compute that makes CPU and GPU backends interchangeable. Operations take contiguous Span<float> buffers. Quantized weights are passed as ReadOnlySpan<byte> together with their GgmlType.

public interface IComputeBackend : IDisposable
{
    void MatMul(ReadOnlySpan<float> a, ReadOnlySpan<byte> b, Span<float> result,
                GgmlType bType, int aCols, int bCols, string? weightKey = null);
    void MatMulF32(ReadOnlySpan<float> a, ReadOnlySpan<float> b, Span<float> result,
                   int aCols, int bCols, string? weightKey = null);
    void RmsNorm(ReadOnlySpan<float> input, ReadOnlySpan<float> weights, Span<float> output, float epsilon);
    void LayerNorm(ReadOnlySpan<float> input, ReadOnlySpan<float> weights, ReadOnlySpan<float> bias,
                   Span<float> output, float epsilon);
    void ApplyRoPE(Span<float> query, Span<float> key, int headDim, int position, float freqBase, int rotaryDim);
    void ApplyRoPE(Span<float> query, Span<float> key, int headDim, int position, ReadOnlySpan<float> freqTable);
    void Softmax(Span<float> input, float? softcap = null);
    void Silu(ReadOnlySpan<float> input, Span<float> result);
    void SiluInPlace(Span<float> input);
    void Gelu(ReadOnlySpan<float> input, Span<float> result);
    float GeluScalar(float x);
    void Add(Span<float> a, ReadOnlySpan<float> b);
    void Add(ReadOnlySpan<float> a, ReadOnlySpan<float> b, Span<float> result);
    void Scale(ReadOnlySpan<float> input, float scale, Span<float> result);
    void Scale(Span<float> input, float scale);
    void Mul(ReadOnlySpan<float> a, ReadOnlySpan<float> b, Span<float> result);
    void Softcap(Span<float> input, float cap);
    void Conv1D(ReadOnlySpan<float> input, ReadOnlySpan<float> weights, Span<float> output,
                int kernelSize, int inputDim);
    void DequantizeToFloat(ReadOnlySpan<byte> src, Span<float> dst, GgmlType type, int numRows, int rowElements);
    int ArgMax(ReadOnlySpan<float> input);
}

CpuBackend is the default managed implementation. MetalBackend and VulkanBackend (in their own assemblies) implement the same interface.
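To make the buffer conventions concrete, here is a standalone reference sketch of one operation with the same parameter shape as RmsNorm. It uses the standard RMSNorm formula, which is an assumption here and not taken from the library source. The actual backends may differ in accumulation order or vectorization.

```csharp
using System;

static class RmsNormSketch
{
    // output[i] = input[i] / sqrt(mean(input^2) + epsilon) * weights[i]
    public static void RmsNorm(ReadOnlySpan<float> input, ReadOnlySpan<float> weights,
                               Span<float> output, float epsilon)
    {
        if (weights.Length != input.Length || output.Length != input.Length)
            throw new ArgumentException("input, weights and output must have equal length");

        // Sum of squares over the whole vector.
        float sumSquares = 0f;
        foreach (float x in input)
            sumSquares += x * x;

        // Reciprocal of the root mean square, stabilized by epsilon.
        float scale = 1f / MathF.Sqrt(sumSquares / input.Length + epsilon);

        // Normalize and apply the learned per-element weights.
        for (int i = 0; i < input.Length; i++)
            output[i] = input[i] * scale * weights[i];
    }
}
```

The caller allocates output, and the method writes into it without allocating, matching the span-in, span-out pattern of the interface.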

`ModelCapabilities`, `ModelInfo`¶

Descriptors derived from a loaded model's configuration (architecture name, execution template, etc.). See ModelCapabilities.FromConfig(config).

`Llmdot.Sampling`¶

`SamplingOptions`¶

public sealed class SamplingOptions
{
    public int TopK { get; init; } = 40;
    public float TopP { get; init; } = 0.95f;
    public float Temperature { get; init; } = 0.8f;
    public float RepeatPenalty { get; init; } = 1.1f;
    public int RepeatPenaltyWindowSize { get; init; } = 64;
    public int Seed { get; init; } = -1;       // -1 = random, >=0 = deterministic
}

Sampler consumes these options and runs over logits using the chosen IComputeBackend.
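A sketch of configuring reproducible sampling and attaching it to GenerationOptions. The specific values are illustrative, not recommendations. Per the comment on Seed above, any value >= 0 selects deterministic sampling.

```csharp
using Llmdot.Inference;
using Llmdot.Sampling;

var sampling = new SamplingOptions
{
    Temperature = 0.2f, // lower than the 0.8 default
    TopK = 20,
    Seed = 1234,        // >= 0 = deterministic
};

// TopP, RepeatPenalty and RepeatPenaltyWindowSize keep their defaults.
var options = new GenerationOptions { MaxTokens = 128, Sampling = sampling };
```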

`Llmdot.Loading`¶

GGUF parsing primitives: GgufReader, GgufModel, GgufMetadata, GgufTensorInfo, GgufConstants, GgufValueType, GgmlType, ArchitectureResolver, TensorNameResolver, ModelValidator, ModelValidationResult.

ArchitectureResolver.Resolve(ggufModel) is what produces the TransformerConfig that every execution template reads from.

`Llmdot.Tokenization`¶

BpeTokenizer exposes Encode(string) -> int[] and Decode(IEnumerable<int>) -> string. Instances are constructed via BpeTokenizer.FromGguf(metadata).

ChatTemplate parses the GGUF chat template (Jinja-style) via ChatTemplate.FromGguf(metadata, tokenizer) and renders a list of ChatMessageEntry(role, content) to a prompt string.

`Llmdot.Tensors`¶

Tensor, TensorOps, and TensorSize describe and manipulate tensor data. The Dequantize and Numeric sub-namespaces hold the dequantization and numeric helpers.
