Model Loading
All loading goes through LoadedModel.Load(Stream) in Llmdot.Core.
The stream is owned by the returned LoadedModel and is disposed when the model is disposed.
What loading does
LoadedModel.Load runs the following steps:
- GgufReader.Read(stream) parses the GGUF header, metadata, and tensor descriptors.
- ArchitectureResolver.Resolve(ggufModel) produces a TransformerConfig from GGUF metadata keys. This is the unified config every execution template reads from, never raw GGUF keys.
- TensorNameResolver indexes the tensor descriptors so the graph can look up tensors by logical name across architectures.
- Tensor data is read from the file at TensorDataOffset + info.Offset for each tensor and stored as a Tensor (name, dimensions, element type, raw bytes, element count).
- BpeTokenizer.FromGguf(metadata) builds the BPE tokenizer from vocab/merges in the GGUF metadata.
- ChatTemplate.FromGguf(metadata, tokenizer) parses an optional chat template if present.
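To make the first and fourth steps concrete, here is a minimal sketch of reading the fixed-size GGUF header and computing a tensor's absolute file position. It is not the GgufReader implementation: GgufHeader, GgufHeaderSketch, ReadHeader, and TensorPosition are hypothetical names, and the sketch assumes the 64-bit counts used by GGUF version 2 and later.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical type for illustration; not part of Llmdot.Core.
public readonly record struct GgufHeader(uint Version, ulong TensorCount, ulong MetadataKvCount);

public static class GgufHeaderSketch
{
    // The ASCII bytes "GGUF" read as a little-endian uint32.
    private const uint Magic = 0x46554747;

    public static GgufHeader ReadHeader(Stream stream)
    {
        // leaveOpen: true so the caller keeps ownership of the stream.
        using var reader = new BinaryReader(stream, Encoding.UTF8, leaveOpen: true);

        uint magic = reader.ReadUInt32();
        if (magic != Magic)
            throw new InvalidDataException("Not a GGUF file (bad magic).");

        uint version = reader.ReadUInt32();
        ulong tensorCount = reader.ReadUInt64();      // assumes GGUF v2+ (64-bit counts)
        ulong metadataKvCount = reader.ReadUInt64();
        return new GgufHeader(version, tensorCount, metadataKvCount);
    }

    // Absolute position of a tensor's bytes: start of the tensor data section
    // plus the tensor's offset relative to that section.
    public static long TensorPosition(long tensorDataOffset, ulong tensorOffset)
        => checked(tensorDataOffset + (long)tensorOffset);
}
```

The metadata key/value pairs and tensor descriptors follow this header in the file; parsing them is omitted here.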
Inspecting a loaded model
LoadedModel exposes:
- Config: the resolved TransformerConfig (hidden size, layer count, context length, vocab, heads, rope params, BOS/EOS, FFN type, embedding scale, conv params for hybrid models, etc.)
- Tokenizer: the BPE tokenizer
- ChatTemplate: nullable; present only if the GGUF metadata included one
- Capabilities: derived from Config, including the architecture name and execution template
- Validate(): runs ModelValidator over the loaded GGUF and config
The CLI's info command and the sample app both print a summary of these properties.
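As a rough sketch of reading these members from code, the snippet below checks the nullable ChatTemplate before relying on it. It assumes LoadedModel is in the Llmdot.Core namespace and implements IDisposable; model.gguf is a placeholder path, and the output strings are illustrative only.

```csharp
using System;
using System.IO;
using Llmdot.Core; // assumed namespace

// "model.gguf" is a placeholder path.
using var model = LoadedModel.Load(File.OpenRead("model.gguf"));

// Printed through ToString(), since the individual members of these
// types are not listed on this page.
Console.WriteLine(model.Config);
Console.WriteLine(model.Capabilities);

// ChatTemplate is nullable, so check it before use.
Console.WriteLine(model.ChatTemplate is null
    ? "No chat template in the GGUF metadata"
    : "Chat template available");
```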
Validation
ModelValidator.Validate checks the GGUF model and the resolved config to surface issues before inference.
Disposal
LoadedModel owns the underlying stream. Always dispose it:
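A minimal sketch of explicit disposal, assuming LoadedModel implements IDisposable and lives in the Llmdot.Core namespace; model.gguf is a placeholder path:

```csharp
using System.IO;
using Llmdot.Core; // assumed namespace

// The model takes ownership of the stream, so only the model is disposed here.
var model = LoadedModel.Load(File.OpenRead("model.gguf"));
try
{
    // ... run inference or inspect the model ...
}
finally
{
    model.Dispose(); // also disposes the underlying stream
}
```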
Or with a using declaration:
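A sketch of the using-declaration form, again assuming LoadedModel implements IDisposable and lives in the Llmdot.Core namespace, with model.gguf as a placeholder path:

```csharp
using System.IO;
using Llmdot.Core; // assumed namespace

// Disposed automatically, together with its stream, when the enclosing scope ends.
using var model = LoadedModel.Load(File.OpenRead("model.gguf"));

// ... run inference or inspect the model ...
```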
Loaded layout
Internally, LoadedModel:
- Reads each tensor's bytes from the file into memory
- Caches per-tensor dequantized float[] buffers on first access via GetDequantizedWeights(name) so repeated reads of the same weight don't repeat the dequantization
- Uses TensorOps.DequantizeToFloat driven by Loading.GgmlType (Q4_0/Q4_1/Q5_0/Q5_1/Q8_0, Q2_K–Q6_K, F16/F32/BF16)
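The caching behaviour described above can be sketched as follows. This is an illustration, not the LoadedModel implementation: DequantCacheSketch, DequantizeQ8_0, and DequantizeCalls are hypothetical names, only Q8_0 is handled, and the sketch is not thread-safe. It relies on the standard Q8_0 block layout, in which each block holds a little-endian fp16 scale followed by 32 signed 8-bit values.

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;

// Hypothetical sketch; only the method name GetDequantizedWeights comes from this page.
public sealed class DequantCacheSketch
{
    private const int BlockSize = 32;              // weights per Q8_0 block
    private const int BlockBytes = 2 + BlockSize;  // fp16 scale + 32 int8 values

    private readonly Dictionary<string, byte[]> _raw;
    private readonly Dictionary<string, float[]> _cache = new();

    // Counts real dequantization passes, to make the caching observable.
    public int DequantizeCalls { get; private set; }

    public DequantCacheSketch(Dictionary<string, byte[]> rawQ8Tensors) => _raw = rawQ8Tensors;

    public float[] GetDequantizedWeights(string name)
    {
        if (_cache.TryGetValue(name, out var cached))
            return cached;                         // dequantized on an earlier access

        var weights = DequantizeQ8_0(_raw[name]);
        DequantizeCalls++;
        _cache[name] = weights;
        return weights;
    }

    // Q8_0: weight = scale * quantized value, block by block.
    public static float[] DequantizeQ8_0(ReadOnlySpan<byte> data)
    {
        if (data.Length % BlockBytes != 0)
            throw new ArgumentException("Length must be a multiple of 34 bytes.", nameof(data));

        var result = new float[data.Length / BlockBytes * BlockSize];
        for (int b = 0, o = 0; b < data.Length; b += BlockBytes)
        {
            float scale = (float)BinaryPrimitives.ReadHalfLittleEndian(data.Slice(b, 2));
            for (int i = 0; i < BlockSize; i++)
                result[o++] = scale * unchecked((sbyte)data[b + 2 + i]);
        }
        return result;
    }
}
```

The trade-off in this design is memory for time: each cached float[] uses four bytes per weight, which is considerably more than the quantized bytes it was derived from.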
See Architecture for the broader execution-template story this loading step feeds into.