Roadmap & Status

Pre-alpha. The specification, architecture, and execution template design are stable. Implementation is in active development. Do not use in production yet.

Initial release targets

[x] Architecture and execution template design
[ ] GGUF loader (header, metadata, tensors, tokenizer assets)
[ ] TransformerConfig resolver across all four templates
[ ] CPU reference backend with quantized kernels
[ ] LLaMA-like template end-to-end
[ ] Token streaming via IAsyncEnumerable<T>
[ ] IChatClient integration
[ ] Remaining three execution templates
[ ] Optional GPU backends

Roadmap principles

The roadmap favors a usable vertical slice over broad early ambition. llmdot becomes credible when it can load a real GGUF model, run a complete generation loop, and integrate cleanly into ordinary .NET applications.

Phase 0 — Foundation (complete)

Repository structure and package boundaries
Target frameworks: net8.0, net9.0, net10.0 (net8.0 LTS is the compatibility floor)
Architecture and template design frozen (four execution templates covering all model families in the 1B–8B parameter range)
Multimodal scope: small multimodal models via pluggable modality encoders
Quantization scope: block quants (Q4_0 / Q4_1 / Q5_0 / Q5_1 / Q8_0), K-quants (Q2_K–Q6_K), F16 / F32 / BF16
Acceleration strategy: GPU (Vulkan, Metal) is in scope; NPU is explicitly out of scope (reasons: graph-compiler dependence, shared RAM bandwidth, ONNX conversion overhead)
Coding conventions: C# 13, nullable enabled, warnings as errors, .editorconfig enforced
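
To make the quantization scope above concrete, here is a minimal sketch of dequantizing Q8_0, the simplest block format in that list. It assumes the publicly documented ggml layout: each block holds one little-endian fp16 scale followed by 32 signed 8-bit values, and each weight is recovered as scale × value. The sketch is written in Python purely to keep it short; llmdot itself is C#, and the name dequantize_q8_0 is hypothetical, not part of its API.

```python
import struct

Q8_0_BLOCK_VALUES = 32                    # weights stored per block
Q8_0_BLOCK_BYTES = 2 + Q8_0_BLOCK_VALUES  # fp16 scale + 32 int8 quants


def dequantize_q8_0(raw: bytes) -> list[float]:
    """Expand Q8_0 blocks into floats (hypothetical helper, not llmdot API)."""
    if len(raw) % Q8_0_BLOCK_BYTES != 0:
        raise ValueError("buffer is not a whole number of Q8_0 blocks")
    out: list[float] = []
    for offset in range(0, len(raw), Q8_0_BLOCK_BYTES):
        # "<e" is a little-endian IEEE half; "32b" is 32 signed bytes.
        scale, *quants = struct.unpack_from("<e32b", raw, offset)
        out.extend(scale * q for q in quants)
    return out


# One synthetic block: scale 0.5, quantized values -16..15.
block = struct.pack("<e32b", 0.5, *range(-16, 16))
values = dequantize_q8_0(block)
print(values[:4])  # [-8.0, -7.5, -7.0, -6.5]
```

The other block quants and the K-quants follow the same per-block pattern but pack sub-byte values and, in some formats, extra offsets or nested scales, so they need more unpacking logic than this sketch shows.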

Phase 1 — Managed core runtime

GGUF parsing for all supported architectures
TransformerConfig resolution from GGUF metadata
TensorNameResolver for architecture-tolerant tensor access
Managed tensor and buffer abstractions
Four execution templates (LLaMA-like, GPT-NeoX-like, Gemma-like, LFM2-like)
Token generation on a CPU-only backend
Basic sampling and streaming APIs
BPE tokenizer decoding from GGUF metadata
1D causal convolution primitive for LFM2 conv blocks

Exit criteria: a supported GGUF model from each template can be loaded directly; the runtime can generate text end-to-end on CPU for at least one model per template; TransformerConfig is correctly resolved from real GGUF files; and token streaming works through an idiomatic public API.
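
As an illustration of the first step of the GGUF parsing work listed in this phase, the sketch below reads the fixed-size file preamble. It assumes the layout in the public GGUF specification for version 2 and later: the 4-byte magic "GGUF", a uint32 version, a uint64 tensor count, and a uint64 metadata key/value count, all little-endian. It is written in Python for brevity rather than in the project's C#, and GgufHeader and read_gguf_header are hypothetical names, not llmdot's API.

```python
import io
import struct
from dataclasses import dataclass
from typing import BinaryIO

GGUF_MAGIC = b"GGUF"


@dataclass(frozen=True)
class GgufHeader:
    """Fixed preamble fields (hypothetical type, not llmdot API)."""
    version: int
    tensor_count: int
    metadata_kv_count: int


def _read_exact(stream: BinaryIO, size: int) -> bytes:
    data = stream.read(size)
    if len(data) != size:
        raise ValueError("truncated GGUF header")
    return data


def read_gguf_header(stream: BinaryIO) -> GgufHeader:
    """Read magic, version, and the two counts from the start of a file."""
    magic = _read_exact(stream, 4)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    (version,) = struct.unpack("<I", _read_exact(stream, 4))
    if version < 2:
        # Version 1 stored the counts as 32-bit integers; not handled here.
        raise ValueError(f"unsupported GGUF version {version}")
    tensor_count, kv_count = struct.unpack("<QQ", _read_exact(stream, 16))
    return GgufHeader(version, tensor_count, kv_count)


# Synthetic header: version 3, 291 tensors, 24 metadata entries.
fake = GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24)
header = read_gguf_header(io.BytesIO(fake))
print(header)
```

A real loader would continue from here into the metadata key/value section, the tensor descriptors, and the aligned tensor data, which is where fields such as those feeding TransformerConfig would be read.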

Beyond Phase 1

Subsequent phases cover broader template coverage, hardening of the Microsoft.Extensions.AI integration surface, optional GPU backends (Metal, Vulkan), and the multimodal connectors. See the repository's doc/roadmap.md for the authoritative long-form plan.

Contributing

Design feedback is welcome. Please read doc/vision.md and doc/architecture.md before opening an issue — most "why not X?" questions have explicit answers there (especially around ONNX, NPU, and native wrapping).

Areas where contributions are most valuable right now:

GGUF quantization format coverage
Managed kernel optimization (intrinsics, vectorization)
Tokenizer correctness across BPE variants
Test fixtures for additional model families