Memory-Mapped I/O (mmap)

Loading a 13-billion-parameter model file (roughly 7 GB in Q4_0 quantization) by reading it into a heap buffer requires 7 GB of private memory, on top of any copy the kernel keeps in its page cache, plus the time to copy every byte from kernel space to user space. Memory-mapped I/O sidesteps both costs: the kernel maps the file's page-cache pages directly into the process's virtual address space and loads physical pages on demand as the process touches them.

1. Virtual Memory Theory

1.1 Page Tables and Demand Paging

Modern operating systems present each process with a flat, contiguous virtual address space. The CPU's Memory Management Unit (MMU) translates virtual addresses to physical addresses through a hierarchy of page tables.

flowchart LR
    VA["Virtual Address"] --> MMU["MMU / TLB"]
    MMU -->|hit| PA["Physical Address"]
    MMU -->|miss| PT["Page Table Walk"]
    PT -->|present| PA
    PT -->|not present| PF["Page Fault"]
    PF --> DISK["Load from Disk"]
    DISK --> PA

When a virtual page has no physical backing (the present bit in its page table entry is clear), the CPU raises a page fault. The kernel's page fault handler then:

Identifies the file offset for a memory-mapped page.
Allocates a physical frame.
Reads the file data into that frame.
Updates the page table entry.
Resumes execution -- the instruction retries and succeeds transparently.
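The fault-on-first-touch behavior in the steps above can be imitated with a toy page table. This is only a sketch: `ToyAddressSpace`, its fields, and the 8-page address space are hypothetical names and sizes chosen for illustration, and no real file I/O or kernel mechanism is involved.

```zig
const std = @import("std");

const page_size: usize = 4096;
const num_pages: usize = 8;

/// Toy page table: one entry per virtual page, null meaning "not present".
const ToyAddressSpace = struct {
    frames: [num_pages]?usize = [_]?usize{null} ** num_pages,
    next_free_frame: usize = 0,
    faults: usize = 0,

    /// Translate a virtual address, faulting the page in on first touch.
    fn translate(self: *ToyAddressSpace, vaddr: usize) usize {
        const vpn = vaddr / page_size; // virtual page number
        if (self.frames[vpn] == null) {
            // "Page fault": allocate a frame and record it in the entry.
            // A real kernel would also read the file data into the frame.
            self.frames[vpn] = self.next_free_frame;
            self.next_free_frame += 1;
            self.faults += 1;
        }
        return self.frames[vpn].? * page_size + vaddr % page_size;
    }
};

pub fn main() void {
    var space = ToyAddressSpace{};
    _ = space.translate(0); // first touch of page 0: fault
    _ = space.translate(100); // same page: no fault
    _ = space.translate(2 * page_size + 5); // first touch of page 2: fault
    std.debug.print("faults: {d}\n", .{space.faults});
}
```

Three accesses to two distinct pages produce two faults; every later access to a resident page is translated without kernel involvement.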

1.2 Translation Lookaside Buffer (TLB)

The TLB is a small, highly associative cache inside the CPU that stores recent virtual-to-physical translations. A TLB miss triggers a page-table walk, which costs several memory accesses. For large models, the number of pages far exceeds TLB capacity, which makes huge pages (Section 5) important.

Typical TLB Sizes

Level	Entries	Page Size
L1 dTLB	64	4 KB
L2 sTLB	1536	4 KB
L1 dTLB (huge)	32	2 MB

A 7 GB model mapped with 4 KB pages spans 1,835,008 pages, so covering it entirely would take that many TLB entries, far more than any TLB provides. With 2 MB huge pages, only 3,584 entries are needed.
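The page counts quoted above are a ceiling division of the mapping size by the page size, with "GB", "MB" and "KB" read as binary units. A minimal sketch of that arithmetic follows; `pagesNeeded` is a hypothetical helper, not part of ZigLlama.

```zig
const std = @import("std");

const KiB: u64 = 1024;
const MiB: u64 = 1024 * KiB;
const GiB: u64 = 1024 * MiB;

/// Number of pages needed to cover `bytes`, rounding up to a whole page.
fn pagesNeeded(bytes: u64, page_size: u64) u64 {
    return (bytes + page_size - 1) / page_size;
}

pub fn main() void {
    const model_bytes = 7 * GiB;
    std.debug.print("4 KiB pages: {d}\n", .{pagesNeeded(model_bytes, 4 * KiB)});
    std.debug.print("2 MiB pages: {d}\n", .{pagesNeeded(model_bytes, 2 * MiB)});
    std.debug.print("1 GiB pages: {d}\n", .{pagesNeeded(model_bytes, 1 * GiB)});
}
```

For a 7 GiB mapping this yields 1835008, 3584 and 7 pages respectively.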

2. Why mmap for Model Loading

Zero-Copy Guarantee

When a file is mapped read-only (PROT_READ) with the MAP_PRIVATE flag, the kernel serves pages directly from the page cache. No read() system call copies data from kernel space to user space; the process accesses the same physical frames that the page cache holds.

Benefits for LLM inference:

Benefit	Explanation
Zero copy	No user-space buffer needed; data stays in page cache
Lazy loading	Only pages actually accessed incur I/O
Shared across processes	Multiple inference servers share physical pages
OS-managed eviction	Kernel evicts cold pages under memory pressure
Instant "load"	`mmap()` returns in microseconds; I/O is deferred
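To make the zero-copy and lazy-loading rows concrete, here is a minimal sketch that maps a file read-only and sums its bytes straight out of the mapping, with no intermediate buffer. It assumes a POSIX system and Zig 0.12 to 0.14, where the wrappers live under `std.posix` and mapping flags are a packed struct (older releases expose them as `std.os` with integer flags, as in the ZigLlama excerpts in this document). `sumFileViaMmap` and the scratch file name are hypothetical.

```zig
const std = @import("std");

/// Map `path` read-only and private, then sum every byte through the mapping.
fn sumFileViaMmap(path: []const u8) !u64 {
    const file = try std.fs.cwd().openFile(path, .{});
    defer file.close();

    const len: usize = @intCast(try file.getEndPos());
    if (len == 0) return error.EmptyFile; // a zero-length mapping is invalid

    // Returns immediately; pages are read only when first touched below.
    const mapped = try std.posix.mmap(
        null,
        len,
        std.posix.PROT.READ,
        .{ .TYPE = .PRIVATE },
        file.handle,
        0,
    );
    defer std.posix.munmap(mapped);

    var sum: u64 = 0;
    for (mapped) |byte| sum += byte; // each new page faults in on demand
    return sum;
}

pub fn main() !void {
    const path = "mmap_demo.bin"; // hypothetical scratch file
    {
        const out = try std.fs.cwd().createFile(path, .{});
        defer out.close();
        try out.writeAll("abc");
    }
    defer std.fs.cwd().deleteFile(path) catch {};

    std.debug.print("byte sum: {d}\n", .{try sumFileViaMmap(path)});
}
```

The loop reads the file's contents without a single read() call; the only copies are the ones the kernel makes from disk into the page cache.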

3. MemoryMap Struct

ZigLlama wraps the platform mmap system call in src/foundation/memory_mapping.zig:

3.1 Core Fields

pub const MemoryMap = struct {
    ptr: [*]align(std.mem.page_size) u8,
    len: usize,
    fd: os.fd_t,
    locked: bool,
    // ...
};

Field	Type	Purpose
`ptr`	`[*]align(page_size) u8`	Page-aligned pointer to mapped region
`len`	`usize`	Total mapped size in bytes
`fd`	`os.fd_t`	File descriptor (or -1 for anonymous mappings)
`locked`	`bool`	Whether `mlock` has been called

3.2 `fromFile()` -- Creating a File-Backed Mapping

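pub fn fromFile(path: []const u8, protection: Protection, flags: Flags) !Self {
    const file = std.fs.cwd().openFile(path, .{}) catch |err| { ... };
    // The mapping outlives the descriptor, so the file can be closed on return.
    defer file.close();

    const file_size = try file.getEndPos();
    if (file_size == 0) return error.EmptyFile; // zero-length mappings are invalid

    const prot = createProtectionFlags(protection);
    const map_flags = createMappingFlags(flags);

    // os.mmap returns a page-aligned slice, not a bare pointer.
    const mapped = os.mmap(null, file_size, prot, map_flags, file.handle, 0)
        catch |err| { ... };

    // Note: file.handle is closed by the defer above, so the stored fd is
    // informational only and must not be used or closed again.
    return Self{ .ptr = mapped.ptr, .len = mapped.len, .fd = file.handle, .locked = false };
}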

3.3 Protection Flags

pub const Protection = struct {
    read: bool = true,
    write: bool = false,
    exec: bool = false,
};

For model weights, only read = true is needed. Leaving write = false means the MAP_PRIVATE mapping never triggers copy-on-write, so every process that maps the file keeps sharing the same page-cache frames, just as it would with MAP_SHARED.

3.4 Mapping Flags

pub const Flags = struct {
    shared: bool = false,
    private: bool = true,
    anonymous: bool = false,
    populate: bool = false,
    huge_pages: bool = false,
};

Flag	Linux Equivalent	Effect
`private`	`MAP_PRIVATE`	Copy-on-write; writes are process-local
`shared`	`MAP_SHARED`	Writes visible to other processes and flushed to file
`anonymous`	`MAP_ANONYMOUS`	Mapping is not backed by a file; memory starts zero-filled
`populate`	`MAP_POPULATE`	Pre-fault all pages at `mmap` time
`huge_pages`	`MAP_HUGETLB`	Request huge pages (2 MB by default on x86-64) from the huge-page pool
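The body of `createMappingFlags` is not shown in this document. As a rough illustration of what such a translation involves, the sketch below turns a `Flags` value into a Linux flag word. The numeric constants are the Linux x86-64 values and differ on other platforms, `toLinuxFlags` is a hypothetical stand-in rather than ZigLlama's implementation, and the mutual-exclusion check is an assumption about how the `shared`/`private` pair ought to be validated.

```zig
const std = @import("std");

// Linux x86-64 values from <sys/mman.h>; other platforms use different bits.
const MAP_SHARED: u32 = 0x01;
const MAP_PRIVATE: u32 = 0x02;
const MAP_ANONYMOUS: u32 = 0x20;
const MAP_POPULATE: u32 = 0x8000;
const MAP_HUGETLB: u32 = 0x40000;

const Flags = struct {
    shared: bool = false,
    private: bool = true,
    anonymous: bool = false,
    populate: bool = false,
    huge_pages: bool = false,
};

/// Hypothetical stand-in for createMappingFlags().
fn toLinuxFlags(flags: Flags) error{InvalidFlags}!u32 {
    // mmap requires exactly one of MAP_SHARED and MAP_PRIVATE.
    if (flags.shared == flags.private) return error.InvalidFlags;

    var bits: u32 = if (flags.shared) MAP_SHARED else MAP_PRIVATE;
    if (flags.anonymous) bits |= MAP_ANONYMOUS;
    if (flags.populate) bits |= MAP_POPULATE;
    if (flags.huge_pages) bits |= MAP_HUGETLB;
    return bits;
}

pub fn main() !void {
    const bits = try toLinuxFlags(.{ .populate = true });
    std.debug.print("flags: 0x{x}\n", .{bits});
}
```

Because `private` defaults to true, a caller who wants a shared mapping has to write `.{ .shared = true, .private = false }`; setting only `.shared = true` is rejected here instead of silently producing an invalid flag combination.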

4. mlock for Inference

During autoregressive generation, each token requires a full forward pass through the model. A page fault in the middle of a matrix multiply would cause a latency spike of several milliseconds -- unacceptable for interactive applications.

4.1 Locking Pages

pub fn lock(self: *Self) !void {
    if (self.locked) return;

    os.mlock(self.ptr[0..self.len]) catch |err| switch (err) {
        error.MemoryLockingNotSupported => { ... },
        error.OutOfMemory => { ... },
        error.PermissionDenied => { ... },
        else => return err,
    };

    self.locked = true;
}

mlock() instructs the kernel to:

Fault in every page of the range.
Pin those physical frames so they cannot be swapped out.

Resource Limits

Linux enforces a per-process locked-memory limit (RLIMIT_MEMLOCK, shown by ulimit -l). The default has historically been 64 KB and is 8 MB on recent kernels; either is far too small for a model. Increase it via:

# /etc/security/limits.conf
*  soft  memlock  unlimited
*  hard  memlock  unlimited

Alternatively, grant the inference process the CAP_IPC_LOCK capability, which exempts it from the limit.

4.2 Prefaulting

If MAP_POPULATE is not available (or not desired at mmap time), ZigLlama offers an explicit prefault loop:

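pub fn prefault(self: *Self) !void {
    const page_size = std.mem.page_size;
    var offset: usize = 0;
    while (offset < self.len) : (offset += page_size) {
        // Touch the first byte of each page. The read is volatile so the
        // compiler cannot discard it as dead code.
        const byte: *const volatile u8 = &self.ptr[offset];
        _ = byte.*;
    }
}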

This trades startup latency for predictable per-token latency.

5. Huge Pages

5.1 Why Huge Pages Matter

TLB Pressure Analysis

For a 7 GB model file:

Page Size	Number of Pages	TLB Coverage (64 entries)
4 KB	1,835,008	0.0035 %
2 MB	3,584	1.79 %
1 GB	7	100 %

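With 4 KB pages, streaming through the weights misses the TLB on nearly every new page, and each miss triggers a page-table walk (up to 4 memory accesses with 4-level paging on x86-64, 5 with 5-level paging). Huge pages cut the number of pages to translate by 512x (2 MB) or 262,144x (1 GB).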
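The coverage column in the table is the TLB's reach (entries times page size) divided by the size of the mapping. The sketch below reproduces it for the 64-entry case; `tlbReach` and `coveragePercent` are hypothetical helpers, and the code assumes Zig 0.11 or later for `@floatFromInt`.

```zig
const std = @import("std");

const KiB: u64 = 1024;
const MiB: u64 = 1024 * KiB;
const GiB: u64 = 1024 * MiB;

/// Bytes of address space that `entries` TLB entries can translate.
fn tlbReach(entries: u64, page_size: u64) u64 {
    return entries * page_size;
}

/// Share of `working_set` covered by the TLB, in percent, capped at 100.
fn coveragePercent(entries: u64, page_size: u64, working_set: u64) f64 {
    const reach = tlbReach(entries, page_size);
    if (reach >= working_set) return 100.0;
    const reach_f: f64 = @floatFromInt(reach);
    const set_f: f64 = @floatFromInt(working_set);
    return 100.0 * reach_f / set_f;
}

pub fn main() void {
    const model = 7 * GiB;
    std.debug.print("4 KiB: {d:.4} %\n", .{coveragePercent(64, 4 * KiB, model)});
    std.debug.print("2 MiB: {d:.4} %\n", .{coveragePercent(64, 2 * MiB, model)});
    std.debug.print("1 GiB: {d:.4} %\n", .{coveragePercent(64, 1 * GiB, model)});
}
```

With 4 KiB pages, 64 entries reach only 256 KiB of a 7 GiB mapping, which is why the first row of the table rounds to a few thousandths of a percent.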

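5.2 Configuration

On Linux, the MAP_HUGETLB flag requests allocation from the kernel's huge-page pool. It is accepted only for anonymous mappings and for files on a hugetlbfs mount; mapping a file on a regular filesystem with MAP_HUGETLB fails with EINVAL. The pool must also be reserved in advance: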

# Reserve 4096 huge pages of 2 MB each (8 GB total)
echo 4096 > /proc/sys/vm/nr_hugepages

# Or at boot via kernel parameter:
hugepages=4096

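ZigLlama sets the flag via the Flags.huge_pages field. In the example below, model.gguf is assumed to reside on a hugetlbfs mount, because MAP_HUGETLB is rejected for files on regular filesystems: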

const flags = MemoryMap.Flags{
    .private = true,
    .populate = true,
    .huge_pages = true,
};
var mapping = try MemoryMap.fromFile("model.gguf", .{ .read = true }, flags);

5.3 Transparent Huge Pages (THP)

Linux can also promote 4 KB pages to 2 MB pages automatically. However, THP applies mainly to anonymous memory (support for file-backed pages is limited and depends on kernel configuration), and its background compaction makes latency unpredictable, so it is generally not recommended for latency-sensitive inference. Prefer explicit huge pages.

6. Performance Impact

Load Time Comparison

Measured on a 7 B parameter model (3.8 GB, Q4_0) with an NVMe SSD (sequential read 3.5 GB/s):

Method	Wall Time	Peak RSS
`read()` into `malloc` buffer	1.8 s	7.6 GB
`mmap` + `MAP_POPULATE`	1.1 s	3.8 GB
`mmap` + lazy (no populate)	0.002 s	~0 MB (initial)
`mmap` + `mlock` + huge pages	1.2 s	3.8 GB

The lazy mmap path returns almost instantly -- the cost is amortized across the first forward pass as pages are faulted in. With mlock, all pages are resident before the first token is generated, giving the most predictable latency profile.
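As a rough sanity check on the table, the drive's sequential bandwidth sets a lower bound on any method that brings the whole file into memory up front, assuming the file is not already in the page cache:

```latex
t_{\min} = \frac{\text{file size}}{\text{sequential bandwidth}}
         = \frac{3.8\ \text{GB}}{3.5\ \text{GB/s}}
         \approx 1.09\ \text{s}
```

The 1.1 s measured for mmap with MAP_POPULATE sits close to this bound, which suggests that path is limited by the drive itself rather than by copying.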

7. Platform Considerations

7.1 Linux

Full mmap, mlock, madvise, MAP_POPULATE, MAP_HUGETLB support.
/proc/self/smaps can be read to inspect per-mapping RSS.
MADV_SEQUENTIAL and MADV_WILLNEED hints available via MemoryMap.advise().

7.2 macOS

mmap and mlock available.
No MAP_POPULATE equivalent; use the prefault() method instead.
No MAP_HUGETLB. On Intel Macs, 2 MB superpages can be requested only for anonymous memory (VM_FLAGS_SUPERPAGE_SIZE_2MB), so file-backed model mappings use the base page size, which is 16 KB on Apple Silicon.
madvise(MADV_WILLNEED) is supported and triggers readahead.

7.3 Windows

Windows uses CreateFileMapping + MapViewOfFile instead of POSIX mmap.
VirtualLock replaces mlock.
Large pages (2 MB) require the SeLockMemoryPrivilege privilege and must be requested via MEM_LARGE_PAGES in VirtualAlloc.
ZigLlama's MemoryMap currently targets POSIX; Windows support would require a platform abstraction layer or Zig's std.os.windows APIs.

7.4 Summary Table

Feature	Linux	macOS	Windows
`mmap`	Yes	Yes	`MapViewOfFile`
`mlock`	Yes	Yes	`VirtualLock`
Pre-populate	`MAP_POPULATE`	Manual	Manual
Huge pages	`MAP_HUGETLB`	Anonymous memory only (Intel)	`MEM_LARGE_PAGES`
`madvise`	Full	Partial	N/A

8. High-Level API: ModelFileMapper

For convenience, ZigLlama provides ModelFileMapper which manages multiple memory mappings and applies optimal strategies:

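pub const ModelFileMapper = struct {
    mappings: std.ArrayList(MemoryMap),
    allocator: Allocator,

    pub fn loadModelFile(self: *Self, path: []const u8,
                         lock_memory: bool, prefault: bool) !*MemoryMap {
        // Named so that the populate check below has something to refer to.
        const flags = MemoryMap.Flags{ .private = true, .populate = prefault };
        var mapping = try MemoryMap.fromFile(path, .{ .read = true }, flags);

        if (lock_memory) mapping.lock() catch |err| { ... };
        // Manual fallback. With populate mirroring prefault this branch never
        // runs; it matters once populate is cleared where MAP_POPULATE is missing.
        if (prefault and !flags.populate) try mapping.prefault();
        mapping.advise(.Sequential) catch {};

        try self.mappings.append(mapping);
        // The pointer is valid only until the next append, which may reallocate.
        return &self.mappings.items[self.mappings.items.len - 1];
    }
};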

The MappingOptimizer.detectOptimalStrategy() function selects between lazy mapping, prefaulting, and memory locking based on available system RAM relative to the model file size.
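The selection logic itself is not shown in this document. One plausible shape for it, offered purely as an illustration, compares the file size against available RAM. `Strategy`, `chooseStrategy`, and both thresholds are assumptions, not ZigLlama's actual policy.

```zig
const std = @import("std");

const GiB: u64 = 1024 * 1024 * 1024;

const Strategy = enum { lazy, prefault, lock };

/// Hypothetical policy; the thresholds are illustrative, not ZigLlama's.
fn chooseStrategy(file_bytes: u64, available_ram_bytes: u64) Strategy {
    // Model does not fit in RAM: map lazily and let the kernel evict cold pages.
    if (file_bytes >= available_ram_bytes) return .lazy;
    // Plenty of headroom (file uses at most half of RAM): pin it with mlock.
    if (file_bytes <= available_ram_bytes / 2) return .lock;
    // Fits, but tightly: fault pages in up front without pinning them.
    return .prefault;
}

pub fn main() void {
    const choice = chooseStrategy(4 * GiB, 16 * GiB);
    std.debug.print("strategy: {s}\n", .{@tagName(choice)});
}
```

Keeping the decision in a pure function of two numbers makes it easy to unit-test separately from the code that queries system memory.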

References

Bovet, D. and Cesati, M. Understanding the Linux Kernel. 3rd ed. O'Reilly, 2005.
Gorman, M. Understanding the Linux Virtual Memory Manager. Prentice Hall, 2004.
Gerganov, G. "llama.cpp -- LLM Inference in C/C++." GitHub, 2023.
Intel. "Using Huge Pages on Linux." Intel Developer Zone, 2022.