Supported Models
All supported architectures collapse into four execution templates. Within each template, all variation is expressed through TransformerConfig — no per-model code paths.
Architecture map
| general.architecture | Model families | Template |
|---|---|---|
| llama | LLaMA-1/2/3, Mistral-7B, TinyLlama-1.1B, Mixtral | LLaMA-like |
| phi3 | Phi-3 Mini / Medium | LLaMA-like |
| qwen2 | Qwen-2 1.5B / 7B | LLaMA-like |
| stablelm | StableLM-2 1.6B | LLaMA-like (with flags) |
| phi2 | Phi-2 2.7B | GPT-NeoX-like |
| gptneox | Pythia 1.4B / 6.9B | GPT-NeoX-like |
| gemma | Gemma 2B | Gemma-like |
| gemma2 | Gemma-2 2B / 9B | Gemma-like |
| lfm2 | LFM2 350M / 700M / 1.2B / 2.6B, LFM2-VL | LFM2-like |
| lfm2_moe | LFM2-8B-A1B, LFM2-24B-A2B | LFM2-like (MoE) |
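
The mapping in the table can be read as a plain lookup from the general.architecture metadata value to one of four templates. The following is a minimal sketch of that idea in Python; the names Template, ARCH_TO_TEMPLATE and template_for are hypothetical and are not taken from the runtime's own code.

```python
from enum import Enum


class Template(Enum):
    """The four execution templates named in this document."""
    LLAMA_LIKE = "LLaMA-like"
    GPT_NEOX_LIKE = "GPT-NeoX-like"
    GEMMA_LIKE = "Gemma-like"
    LFM2_LIKE = "LFM2-like"


# Hypothetical lookup table mirroring the architecture map above.
ARCH_TO_TEMPLATE = {
    "llama": Template.LLAMA_LIKE,
    "phi3": Template.LLAMA_LIKE,
    "qwen2": Template.LLAMA_LIKE,
    "stablelm": Template.LLAMA_LIKE,   # "with flags" per the table
    "phi2": Template.GPT_NEOX_LIKE,
    "gptneox": Template.GPT_NEOX_LIKE,
    "gemma": Template.GEMMA_LIKE,
    "gemma2": Template.GEMMA_LIKE,
    "lfm2": Template.LFM2_LIKE,
    "lfm2_moe": Template.LFM2_LIKE,    # MoE variant of the same template
}


def template_for(architecture: str) -> Template:
    """Return the execution template for a general.architecture value."""
    try:
        return ARCH_TO_TEMPLATE[architecture]
    except KeyError:
        raise ValueError(
            f"unsupported general.architecture: {architecture!r}"
        ) from None
```

Under this reading, adding a model family that fits an existing template is one new dictionary entry plus configuration, which matches the "no per-model code paths" claim.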
Execution templates
LLaMA-like (sequential pre-norm)
- Pre-norm placement (norm before attention and FFN)
- Separate Q/K/V tensors (or fused, handled by TensorNameResolver)
- SwiGLU or GeGLU FFN (detected by presence of gate tensor)
- No post-norm by default
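
As an illustration of detecting a gated FFN by the presence of a gate tensor, here is a small Python sketch. It assumes the blk.N.ffn_gate.weight naming convention commonly used in GGUF files; how TensorNameResolver actually performs this check is not shown in this document, and has_gate_tensor and swiglu_hidden are hypothetical names.

```python
import math
from typing import Iterable, List


def has_gate_tensor(tensor_names: Iterable[str], layer: int = 0) -> bool:
    """True if the given layer ships a separate FFN gate projection.

    Assumes GGUF-style names such as "blk.0.ffn_gate.weight".
    """
    return f"blk.{layer}.ffn_gate.weight" in set(tensor_names)


def silu(v: float) -> float:
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu_hidden(gate_out: List[float], up_out: List[float]) -> List[float]:
    """Element-wise SwiGLU hidden activation: silu(gate) * up.

    gate_out and up_out are the outputs of the gate and up projections;
    the down projection that follows is omitted from this sketch.
    """
    return [silu(g) * u for g, u in zip(gate_out, up_out)]
```

A GeGLU variant would differ only in the activation applied to the gate output (GELU instead of SiLU).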
GPT-NeoX-like (parallel residual)
Parallel attention and FFN, fused QKV, partial rotary.
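
The difference between the sequential pre-norm layout and the parallel residual layout comes down to which input the FFN branch sees. The sketch below contrasts the two with toy vectors; the sublayers are passed in as plain callables, and every name here is hypothetical rather than part of the runtime.

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def sequential_pre_norm(x: Vector, norm1: Layer, attn: Layer,
                        norm2: Layer, ffn: Layer) -> Vector:
    """LLaMA-like block: the FFN sees the attention-updated residual."""
    h = add(x, attn(norm1(x)))
    return add(h, ffn(norm2(h)))


def parallel_residual(x: Vector, norm1: Layer, attn: Layer,
                      norm2: Layer, ffn: Layer) -> Vector:
    """GPT-NeoX-like block: attention and FFN both read the same input x."""
    return add(add(x, attn(norm1(x))), ffn(norm2(x)))


# Toy sublayers so the data flow is easy to follow by hand.
identity: Layer = lambda v: list(v)
double: Layer = lambda v: [2.0 * e for e in v]     # stands in for attention
plus_one: Layer = lambda v: [e + 1.0 for e in v]   # stands in for the FFN

x = [1.0, 2.0]
seq_out = sequential_pre_norm(x, identity, double, identity, plus_one)
par_out = parallel_residual(x, identity, double, identity, plus_one)
```

With these toy sublayers the two layouts give different results for the same input, because in the parallel form the FFN branch never sees the attention output.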
Gemma-like
Embedding scaling, post-norm, softcapping.
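
Two of these features are simple enough to show directly. The sketch below uses the formulas published for the Gemma family: token embeddings are multiplied by the square root of the hidden size, and softcapping maps a value v to cap * tanh(v / cap). The function names are hypothetical, and which features are enabled for a given model is assumed to come from configuration.

```python
import math
from typing import List


def scale_embedding(embedding: List[float], hidden_size: int) -> List[float]:
    """Multiply a token embedding by sqrt(hidden_size)."""
    scale = math.sqrt(hidden_size)
    return [v * scale for v in embedding]


def softcap(values: List[float], cap: float) -> List[float]:
    """Smoothly bound values to [-cap, cap] using cap * tanh(v / cap).

    Small values pass through almost unchanged; large ones saturate.
    """
    return [cap * math.tanh(v / cap) for v in values]
```

Softcapping is a smooth alternative to hard clipping, so values near zero keep their relative ordering and magnitude.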
LFM2-like (hybrid convolution-attention)
LFM2 (Liquid Foundation Model 2) replaces ~62% of attention layers with double-gated LIV (Liquid Input-Varying) causal convolutions, which makes it significantly faster on-device than a pure transformer of comparable size.
Key properties:
- Hybrid layer layout: most layers are gated convolutions (kernel=3); only ~38% are GQA attention layers
- GQA: 16 or 32 Q heads sharing 8 KV heads (2:1 or 4:1 ratio), RoPE with θ = 1,000,000
- SwiGLU FFN with auto-adjusted intermediate dimension
- Vocab: 65,536 (custom BPE with ChatML-style special tokens)
- MoE variants: 32 experts, top-4 routing (8B-A1B, 24B-A2B)
- Context: 32K
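
The GQA and RoPE numbers above translate into two small calculations, sketched here in Python. The sketch assumes the usual contiguous grouping of query heads onto KV heads and the standard RoPE inverse-frequency formula; kv_head_for and rope_inv_freq are hypothetical names.

```python
from typing import List


def kv_head_for(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Map a query head index to the KV head it shares (contiguous groups)."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    if not 0 <= q_head < n_q_heads:
        raise IndexError("q_head out of range")
    group_size = n_q_heads // n_kv_heads  # 2 for 16/8, 4 for 32/8
    return q_head // group_size


def rope_inv_freq(head_dim: int, theta: float = 1_000_000.0) -> List[float]:
    """Standard RoPE inverse frequencies: theta ** (-2i / head_dim)."""
    return [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]
```

With 16 query heads and 8 KV heads, query heads 4 and 5 both read KV head 2; with 32 query heads, four consecutive query heads share each KV head.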
Multimodal LFM2 variants:
- LFM2-VL — vision-language (SigLIP2 vision encoder + LFM2 backbone + 2-layer MLP connector)
- LFM2-Audio — speech-to-speech (FastConformer encoder + Mimi codec decoder)
The LFM2 template introduces a new structural primitive — the gated convolution block — that does not exist in the other three templates, which is why it gets its own execution template rather than being expressed through the LLaMA-like path.
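
To make the gated convolution primitive concrete, here is a single-channel Python sketch. It assumes the double-gated structure described publicly for LFM2: an input gate b multiplies the signal before a short causal convolution, and an output gate c multiplies the result. The input and output projections that produce b, c and x are omitted, and the function names are hypothetical.

```python
from typing import List, Sequence


def causal_conv1d(u: Sequence[float], weights: Sequence[float]) -> List[float]:
    """Causal 1-D convolution: output[t] depends only on u[t-k+1 .. t]."""
    k = len(weights)
    padded = [0.0] * (k - 1) + list(u)  # left-pad so no future input leaks in
    return [
        sum(w * padded[t + i] for i, w in enumerate(weights))
        for t in range(len(u))
    ]


def gated_conv(b: Sequence[float], c: Sequence[float], x: Sequence[float],
               weights: Sequence[float]) -> List[float]:
    """Double-gated short convolution for one channel over time.

    b gates the input, c gates the output; weights has length 3 to match
    the kernel size quoted above.
    """
    gated_in = [bi * xi for bi, xi in zip(b, x)]
    conv_out = causal_conv1d(gated_in, weights)
    return [ci * oi for ci, oi in zip(c, conv_out)]
```

Because the kernel only spans three positions, the per-token state is a fixed-size window rather than a KV cache that grows with context length.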
Multimodal
Multimodal variants (vision-language via SigLIP2, speech via FastConformer / Mimi) plug in as modality encoders on top of the base LLM backbone. The core runtime is unchanged. Sentinel tokens in the tokenizer trigger modality-specific processing.
Quantization formats
The core covers all common GGUF types:
- Block quantization: Q4_0, Q4_1, Q5_0, Q5_1, Q8_0
- K-quants: Q2_K through Q6_K
- Float: F16, F32, BF16
Dequantization is dispatched on Loading.GgmlType through IComputeBackend.DequantizeToFloat (or TensorOps.DequantizeToFloat for the loader-side caching path in LoadedModel).
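
The dispatch-on-type idea can be sketched in a few lines of Python. The block layouts below follow the ggml definitions for Q8_0 (an fp16 scale followed by 32 signed bytes) and Q4_0 (an fp16 scale followed by 16 bytes of packed 4-bit values, offset by 8). The names GgmlType, DEQUANTIZERS and dequantize_to_float are illustrative stand-ins, not the runtime's IComputeBackend or TensorOps API, and only three of the listed types are covered.

```python
import struct
from enum import Enum
from typing import Callable, Dict, List

QK = 32  # weights per block for Q4_0 and Q8_0


class GgmlType(Enum):
    F32 = 0
    Q4_0 = 2
    Q8_0 = 8


def dequant_f32(data: bytes) -> List[float]:
    """F32 is stored as-is: little-endian 32-bit floats."""
    return list(struct.unpack(f"<{len(data) // 4}f", data))


def dequant_q8_0(data: bytes) -> List[float]:
    """Each 34-byte block: fp16 scale d, then 32 int8 values q; w = d * q."""
    out: List[float] = []
    for off in range(0, len(data), 2 + QK):
        (d,) = struct.unpack_from("<e", data, off)
        qs = struct.unpack_from(f"<{QK}b", data, off + 2)
        out.extend(d * q for q in qs)
    return out


def dequant_q4_0(data: bytes) -> List[float]:
    """Each 18-byte block: fp16 scale d, then 16 bytes of packed nibbles.

    Low nibbles hold weights 0..15 and high nibbles hold weights 16..31;
    each weight is (nibble - 8) * d.
    """
    out: List[float] = []
    for off in range(0, len(data), 2 + QK // 2):
        (d,) = struct.unpack_from("<e", data, off)
        qs = data[off + 2: off + 2 + QK // 2]
        low = [((byte & 0x0F) - 8) * d for byte in qs]
        high = [((byte >> 4) - 8) * d for byte in qs]
        out.extend(low + high)
    return out


# Hypothetical dispatch table keyed on the tensor's type tag.
DEQUANTIZERS: Dict[GgmlType, Callable[[bytes], List[float]]] = {
    GgmlType.F32: dequant_f32,
    GgmlType.Q4_0: dequant_q4_0,
    GgmlType.Q8_0: dequant_q8_0,
}


def dequantize_to_float(ggml_type: GgmlType, data: bytes) -> List[float]:
    """Look up the dequantizer for a type and apply it to raw tensor bytes."""
    try:
        fn = DEQUANTIZERS[ggml_type]
    except KeyError:
        raise NotImplementedError(f"no dequantizer for {ggml_type}") from None
    return fn(data)
```

A table-driven dispatch like this keeps each format's decoding isolated, so supporting another type means adding one function and one table entry.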
Reference
The repository's doc/model-architectures.md is the long-form architecture reference and supersedes anything that drifts between docs.