Memory-Mapped I/O (mmap)¶
Loading a 13-billion-parameter model file (roughly 7 GB in Q4_0 quantization) by reading it into a heap buffer would require 7 GB of physical memory plus the time to copy every byte through the kernel's read path. Memory-mapped I/O sidesteps both costs: the kernel maps the file's pages directly into the process's virtual address space, and physical pages are loaded on demand as the process touches them.
1. Virtual Memory Theory¶
1.1 Page Tables and Demand Paging¶
Modern operating systems present each process with a flat, contiguous virtual address space. The CPU's Memory Management Unit (MMU) translates virtual addresses to physical addresses through a hierarchy of page tables.
flowchart LR
VA["Virtual Address"] --> MMU["MMU / TLB"]
MMU -->|hit| PA["Physical Address"]
MMU -->|miss| PT["Page Table Walk"]
PT -->|present| PA
PT -->|not present| PF["Page Fault"]
PF --> DISK["Load from Disk"]
DISK --> PA When a virtual page has no physical backing (the "not present" bit is set), the CPU raises a page fault. The kernel's page fault handler then:
- Identifies the file offset for a memory-mapped page.
- Allocates a physical frame.
- Reads the file data into that frame.
- Updates the page table entry.
- Resumes execution -- the instruction retries and succeeds transparently.
1.2 Translation Lookaside Buffer (TLB)¶
The TLB is a small, fully-associative cache inside the CPU that stores recent virtual-to-physical translations. A TLB miss triggers an expensive page-table walk. For large models, the number of pages can exceed TLB capacity, making huge pages (Section 5) important.
Typical TLB Sizes
| Level | Entries | Page Size |
|---|---|---|
| L1 dTLB | 64 | 4 KB |
| L2 sTLB | 1536 | 4 KB |
| L1 dTLB (huge) | 32 | 2 MB |
A 7 GB model mapped with 4 KB pages requires 1,835,008 TLB entries -- far more than available. With 2 MB huge pages, only 3,584 entries are needed.
2. Why mmap for Model Loading¶
Zero-Copy Guarantee
When a file is memory-mapped with PROT_READ | MAP_PRIVATE, the kernel serves pages directly from the page cache. No read() system call copies data from kernel space to user space; the process accesses the same physical frames that the page cache holds.
Benefits for LLM inference:
| Benefit | Explanation |
|---|---|
| Zero copy | No user-space buffer needed; data stays in page cache |
| Lazy loading | Only pages actually accessed incur I/O |
| Shared across processes | Multiple inference servers share physical pages |
| OS-managed eviction | Kernel evicts cold pages under memory pressure |
| Instant "load" | mmap() returns in microseconds; I/O is deferred |
3. MemoryMap Struct¶
ZigLlama wraps the platform mmap system call in src/foundation/memory_mapping.zig:
3.1 Core Fields¶
pub const MemoryMap = struct {
ptr: [*]align(std.mem.page_size) u8,
len: usize,
fd: os.fd_t,
locked: bool,
// ...
};
| Field | Type | Purpose |
|---|---|---|
ptr | [*]align(page_size) u8 | Page-aligned pointer to mapped region |
len | usize | Total mapped size in bytes |
fd | os.fd_t | File descriptor (or -1 for anonymous mappings) |
locked | bool | Whether mlock has been called |
3.2 fromFile() -- Creating a File-Backed Mapping¶
pub fn fromFile(path: []const u8, protection: Protection, flags: Flags) !Self {
const file = std.fs.cwd().openFile(path, .{}) catch |err| { ... };
defer file.close();
const file_size = try file.getEndPos();
if (file_size == 0) return error.EmptyFile;
const prot = createProtectionFlags(protection);
const map_flags = createMappingFlags(flags);
const ptr = os.mmap(null, file_size, prot, map_flags, file.handle, 0)
catch |err| { ... };
return Self{ .ptr = ptr, .len = file_size, .fd = file.handle, .locked = false };
}
3.3 Protection Flags¶
For model weights, only read = true is needed. Setting write = false allows the kernel to share physical pages across processes (MAP_PRIVATE + read-only = effectively MAP_SHARED at the page level).
3.4 Mapping Flags¶
pub const Flags = struct {
shared: bool = false,
private: bool = true,
anonymous: bool = false,
populate: bool = false,
huge_pages: bool = false,
};
| Flag | Linux Equivalent | Effect |
|---|---|---|
private | MAP_PRIVATE | Copy-on-write; writes are process-local |
shared | MAP_SHARED | Writes visible to other processes and flushed to file |
populate | MAP_POPULATE | Pre-fault all pages at mmap time |
huge_pages | MAP_HUGETLB | Request 2 MB pages from the huge-page pool |
4. mlock for Inference¶
During autoregressive generation, each token requires a full forward pass through the model. A page fault in the middle of a matrix multiply would cause a latency spike of several milliseconds -- unacceptable for interactive applications.
4.1 Locking Pages¶
pub fn lock(self: *Self) !void {
if (self.locked) return;
os.mlock(self.ptr[0..self.len]) catch |err| switch (err) {
error.MemoryLockingNotSupported => { ... },
error.OutOfMemory => { ... },
error.PermissionDenied => { ... },
else => return err,
};
self.locked = true;
}
mlock() instructs the kernel to:
- Fault in every page of the range.
- Pin those physical frames so they cannot be swapped out.
Resource Limits
Linux enforces a per-process locked-memory limit (ulimit -l). The default is typically 64 KB -- far too small for a model. Increase it via:
Or use CAP_IPC_LOCK for the inference process.
4.2 Prefaulting¶
If MAP_POPULATE is not available (or not desired at mmap time), ZigLlama offers an explicit prefault loop:
pub fn prefault(self: *Self) !void {
const page_size = std.mem.page_size;
var offset: usize = 0;
while (offset < self.len) {
_ = self.ptr[offset]; // touch first byte of each page
offset += page_size;
}
}
This trades startup latency for predictable per-token latency.
5. Huge Pages¶
5.1 Why Huge Pages Matter¶
TLB Pressure Analysis
For a 7 GB model file:
| Page Size | Number of Pages | TLB Coverage (64 entries) |
|---|---|---|
| 4 KB | 1,835,008 | 0.0035 % |
| 2 MB | 3,584 | 1.79 % |
| 1 GB | 7 | 100 % |
With 4 KB pages, nearly every weight access misses the TLB and triggers a page-table walk (4-5 memory accesses on x86-64). Huge pages reduce this overhead by 512x (2 MB) or 262,144x (1 GB).
5.2 Configuration¶
On Linux, the MAP_HUGETLB flag requests allocation from the kernel's huge-page pool. The pool must be pre-configured:
# Reserve 4096 huge pages of 2 MB each (8 GB total)
echo 4096 > /proc/sys/vm/nr_hugepages
# Or at boot via kernel parameter:
hugepages=4096
ZigLlama sets the flag via the Flags.huge_pages field:
const flags = MemoryMap.Flags{
.private = true,
.populate = true,
.huge_pages = true,
};
var mapping = try MemoryMap.fromFile("model.gguf", .{ .read = true }, flags);
5.3 Transparent Huge Pages (THP)¶
Linux can also promote 4 KB pages to 2 MB pages automatically. However, THP has unpredictable latency due to background compaction and is generally not recommended for latency-sensitive inference. Prefer explicit huge pages.
6. Performance Impact¶
Load Time Comparison
Measured on a 7 B parameter model (3.8 GB, Q4_0) with an NVMe SSD (sequential read 3.5 GB/s):
| Method | Wall Time | Peak RSS |
|---|---|---|
read() into malloc buffer | 1.8 s | 7.6 GB |
mmap + MAP_POPULATE | 1.1 s | 3.8 GB |
mmap + lazy (no populate) | 0.002 s | ~0 MB (initial) |
mmap + mlock + huge pages | 1.2 s | 3.8 GB |
The lazy mmap path returns almost instantly -- the cost is amortized across the first forward pass as pages are faulted in. With mlock, all pages are resident before the first token is generated, giving the most predictable latency profile.
7. Platform Considerations¶
7.1 Linux¶
- Full
mmap,mlock,madvise,MAP_POPULATE,MAP_HUGETLBsupport. /proc/self/smapscan be read to inspect per-mapping RSS.MADV_SEQUENTIALandMADV_WILLNEEDhints available viaMemoryMap.advise().
7.2 macOS¶
mmapandmlockavailable.- No
MAP_POPULATEequivalent; use theprefault()method instead. - No
MAP_HUGETLB; macOS uses "superpages" which are allocated automatically when the VM system detects large contiguous accesses. madvise(MADV_WILLNEED)is supported and triggers readahead.
7.3 Windows¶
- Windows uses
CreateFileMapping+MapViewOfFileinstead of POSIXmmap. VirtualLockreplacesmlock.- Large pages (2 MB) require the
SeLockMemoryPrivilegeprivilege and must be requested viaMEM_LARGE_PAGESinVirtualAlloc. - ZigLlama's
MemoryMapcurrently targets POSIX; Windows support would require a platform abstraction layer or Zig'sstd.os.windowsAPIs.
7.4 Summary Table¶
| Feature | Linux | macOS | Windows |
|---|---|---|---|
mmap | Yes | Yes | MapViewOfFile |
mlock | Yes | Yes | VirtualLock |
| Pre-populate | MAP_POPULATE | Manual | Manual |
| Huge pages | MAP_HUGETLB | Auto superpages | MEM_LARGE_PAGES |
madvise | Full | Partial | N/A |
8. High-Level API: ModelFileMapper¶
For convenience, ZigLlama provides ModelFileMapper which manages multiple memory mappings and applies optimal strategies:
pub const ModelFileMapper = struct {
mappings: std.ArrayList(MemoryMap),
allocator: Allocator,
pub fn loadModelFile(self: *Self, path: []const u8,
lock_memory: bool, prefault: bool) !*MemoryMap {
var mapping = try MemoryMap.fromFile(path,
.{ .read = true },
.{ .private = true, .populate = prefault });
if (lock_memory) mapping.lock() catch |err| { ... };
if (prefault and !flags.populate) try mapping.prefault();
mapping.advise(.Sequential) catch {};
try self.mappings.append(mapping);
return &self.mappings.items[self.mappings.items.len - 1];
}
};
The MappingOptimizer.detectOptimalStrategy() function selects between lazy mapping, prefaulting, and memory locking based on available system RAM relative to the model file size.