GPU Acceleration¶

GPU acceleration dramatically improves inference speed, especially for larger models. Mullama supports multiple GPU backends through llama.cpp's build system.

Supported Backends¶

Backend	GPU Hardware	Platform	Environment Variable
CUDA	NVIDIA GeForce, RTX, Tesla, A100, H100	Linux, Windows	`LLAMA_CUDA=1`
Metal	Apple M1/M2/M3/M4 (integrated GPU)	macOS	`LLAMA_METAL=1`
ROCm (HIP)	AMD Radeon RX 6000+, Instinct MI	Linux	`LLAMA_HIPBLAS=1`
OpenCL (CLBlast)	Any OpenCL-capable GPU	Linux, macOS, Windows	`LLAMA_CLBLAST=1`

Which Backend Should I Use?

NVIDIA GPU: Use CUDA (best performance)
Apple Silicon Mac: Use Metal (automatic, no setup needed)
AMD GPU on Linux: Use ROCm
Other GPUs or older hardware: Use OpenCL (broadest compatibility, lower performance)

NVIDIA CUDA¶

CUDA provides the best performance for NVIDIA GPUs. Requires CUDA Toolkit 11.0 or later (12.x recommended).

Driver Requirements¶

GPU Generation	Minimum Driver	Recommended Driver
RTX 40-series (Ada)	525.60+	545.xx+
RTX 30-series (Ampere)	470.42+	545.xx+
RTX 20-series (Turing)	440.33+	545.xx+
GTX 10-series (Pascal)	410.48+	535.xx+
Data Center (A100, H100)	450.80+	545.xx+

Step 1: Install NVIDIA Drivers¶

Ubuntu / DebianFedoraWindows

sudo apt update
sudo apt install -y nvidia-driver-545

# Reboot to load the driver
sudo reboot

sudo dnf install -y akmod-nvidia
sudo reboot

Download the latest driver from nvidia.com/drivers and run the installer.

Step 2: Install CUDA Toolkit¶

Ubuntu / DebianWindows

# Add NVIDIA CUDA repository
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update

# Install CUDA Toolkit
sudo apt install -y cuda-toolkit-12-4

# Configure environment
echo 'export CUDA_PATH="/usr/local/cuda"' >> ~/.bashrc
echo 'export PATH="$CUDA_PATH/bin:$PATH"' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH="$CUDA_PATH/lib64:$LD_LIBRARY_PATH"' >> ~/.bashrc
source ~/.bashrc

Download and install from developer.nvidia.com/cuda-downloads.

# Set environment (usually done by installer)
$env:CUDA_PATH = "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4"
[Environment]::SetEnvironmentVariable("CUDA_PATH", $env:CUDA_PATH, "Machine")

Step 3: Verify Installation¶

# Check driver is loaded
nvidia-smi

# Check CUDA toolkit version
nvcc --version

Expected output:

nvcc: NVIDIA (R) Cuda compiler driver
Cuda compilation tools, release 12.4, V12.4.131

Step 4: Build Mullama with CUDA¶

export LLAMA_CUDA=1
cargo build --release --features full

Persist the Environment Variable

Add to your shell profile so it is set on every build:

echo 'export LLAMA_CUDA=1' >> ~/.bashrc
source ~/.bashrc

Apple Metal¶

Metal provides GPU acceleration on Apple Silicon Macs (M1, M2, M3, M4 series). It leverages the unified memory architecture for efficient CPU-GPU data sharing.

Setup¶

Metal support is automatically available on macOS with Apple Silicon. No additional drivers or toolkits are needed.

export LLAMA_METAL=1
cargo build --release --features full

No Extra Setup Needed

On Apple Silicon, Metal acceleration works out of the box. The environment variable simply enables the Metal compute backend in llama.cpp during compilation.

Verify Metal Support¶

system_profiler SPDisplaysDataType | grep Metal

Expected output:

Metal Family: Supported, Metal GPUFamily Apple 9

Metal-Specific Configuration¶

# Enable Metal (required for build)
export LLAMA_METAL=1

# Disable Metal debug output for production (faster)
export GGML_METAL_NDEBUG=1

Unified Memory Advantage

On Apple Silicon, the GPU shares memory with the CPU. This means you can set n_gpu_layers = -1 to offload ALL layers without worrying about separate VRAM limits. The only constraint is your total system RAM.

Intel Macs¶

Metal is also available on Intel Macs with dedicated AMD GPUs, but performance gains are more modest. Consider using OpenCL (LLAMA_CLBLAST=1) as an alternative on Intel Macs.

AMD ROCm¶

ROCm (Radeon Open Compute) provides GPU acceleration for AMD GPUs on Linux.

Supported GPUs¶

AMD Instinct MI-series (MI100, MI210, MI250X, MI300X)
AMD Radeon RX 7000 series (RDNA 3)
AMD Radeon RX 6000 series (RDNA 2)
AMD Radeon Pro W6000/W7000 series

Step 1: Install ROCm¶

Ubuntu 22.04Ubuntu 24.04Fedora

# Add ROCm repository
wget https://repo.radeon.com/amdgpu-install/6.0/ubuntu/jammy/amdgpu-install_6.0.60000-1_all.deb
sudo dpkg -i amdgpu-install_6.0.60000-1_all.deb

# Install ROCm
sudo amdgpu-install --usecase=rocm

# Add user to required groups
sudo usermod -a -G render,video $USER

wget https://repo.radeon.com/amdgpu-install/6.0/ubuntu/noble/amdgpu-install_6.0.60000-1_all.deb
sudo dpkg -i amdgpu-install_6.0.60000-1_all.deb
sudo amdgpu-install --usecase=rocm
sudo usermod -a -G render,video $USER

sudo dnf install -y rocm-hip-devel rocblas-devel hipblas-devel
sudo usermod -a -G render,video $USER

Step 2: Configure Environment¶

echo 'export ROCM_PATH="/opt/rocm"' >> ~/.bashrc
echo 'export PATH="$ROCM_PATH/bin:$PATH"' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH="$ROCM_PATH/lib:$LD_LIBRARY_PATH"' >> ~/.bashrc
source ~/.bashrc

Step 3: Reboot and Verify¶

sudo reboot

# After reboot
rocm-smi
rocminfo | grep "Name:"

Reboot Required

A reboot is required after installing ROCm for the kernel modules to load properly.

Step 4: Build with ROCm¶

export LLAMA_HIPBLAS=1
cargo build --release --features full

OpenCL (CLBlast)¶

OpenCL provides broad GPU compatibility across vendors. Performance is generally lower than native backends (CUDA, Metal, ROCm) but works on a wider range of hardware, including older GPUs and integrated graphics.

Setup¶

LinuxmacOSWindows

# Install OpenCL development packages
sudo apt install -y \
    opencl-headers \
    ocl-icd-opencl-dev \
    clinfo

# Install CLBlast
sudo apt install -y libclblast-dev

# Verify OpenCL devices
clinfo | head -20

brew install clblast

OpenCL runtime is typically included with GPU drivers. Install CLBlast from GitHub releases.

Build with OpenCL¶

export LLAMA_CLBLAST=1
cargo build --release --features full

When to Use OpenCL

Use OpenCL when:

You have an older GPU not supported by CUDA/ROCm
You have integrated graphics (Intel HD/Iris, AMD APU)
You need cross-vendor GPU support
Native backends are unavailable on your platform

GPU Layer Offloading¶

The n_gpu_layers parameter controls how many transformer layers are offloaded from CPU to GPU. This is the single most important parameter for GPU performance.

How It Works¶

A language model consists of many transformer layers stacked sequentially. Each layer that runs on the GPU avoids a CPU computation and memory transfer. More layers on the GPU means faster inference.

Value	Behavior
`0`	CPU only -- no GPU acceleration
`1` to `N`	Offload the first N layers to GPU
`-1`	Offload ALL layers to GPU (recommended if VRAM allows)

Usage Examples¶

Node.jsPythonRustCLI

const { Model } = require('mullama');

// Offload all layers to GPU
const model = new Model('model.gguf', {
    nGpuLayers: -1
});

// Offload 35 layers (partial, for limited VRAM)
const model2 = new Model('model.gguf', {
    nGpuLayers: 35
});

from mullama import Model

# Offload all layers to GPU
model = Model("model.gguf", n_gpu_layers=-1)

# Offload 35 layers (partial, for limited VRAM)
model2 = Model("model.gguf", n_gpu_layers=35)

use mullama::ModelParams;

// Offload all layers
let params = ModelParams::default()
    .with_n_gpu_layers(-1);

// Partial offload
let params = ModelParams::default()
    .with_n_gpu_layers(35);

# All layers on GPU
mullama run llama3.2:1b --n-gpu-layers -1 "Hello!"

# Partial offload
mullama run llama3.2:1b --n-gpu-layers 20 "Hello!"

Finding the Right Number

Start with -1 (all layers). If you get out-of-memory errors, reduce the value by 5 until it fits. Each layer uses approximately model_size / total_layers of VRAM.

Memory Considerations¶

VRAM Requirements by Model Size¶

Approximate VRAM needed for full GPU offload (n_gpu_layers = -1):

Model Size	Q4_K_M	Q5_K_M	Q6_K	Q8_0	F16
1B	~0.9 GB	~1.0 GB	~1.2 GB	~1.5 GB	~2.5 GB
3B	~2.2 GB	~2.5 GB	~2.9 GB	~3.6 GB	~6.5 GB
7B	~4.5 GB	~5.1 GB	~5.9 GB	~7.5 GB	~14 GB
13B	~8.0 GB	~9.2 GB	~10.5 GB	~13.5 GB	~26 GB
30B	~19 GB	~22 GB	~25 GB	~32 GB	~60 GB
70B	~40 GB	~46 GB	~53 GB	~68 GB	~140 GB

Recommended GPUs by Model Size¶

Model Size	Minimum GPU	Recommended GPU
1B-3B	Any 4 GB GPU	GTX 1650, M1
7B	6 GB VRAM	RTX 3060 12GB, M1 Pro
13B	10 GB VRAM	RTX 3080, RTX 4070, M2 Pro
30B	24 GB VRAM	RTX 3090, RTX 4090, M2 Max
70B	48+ GB VRAM	2x RTX 4090, A100, M2 Ultra

Context Size Affects VRAM

The VRAM figures above are for model weights only. Additional VRAM is needed for the KV cache, which scales with context size:

2048 context: +0.5-1 GB
4096 context: +1-2 GB
8192 context: +2-4 GB
32768 context: +8-16 GB

Multi-GPU Configuration¶

For systems with multiple GPUs, Mullama supports distributing model layers across devices.

Tensor Split¶

Control how model weight tensors are distributed across GPUs:

Node.jsPythonRust

const model = new Model('large-model.gguf', {
    nGpuLayers: -1,
    tensorSplit: [0.6, 0.4]  // 60% GPU 0, 40% GPU 1
});

model = Model("large-model.gguf",
    n_gpu_layers=-1,
    tensor_split=[0.6, 0.4]  # 60% GPU 0, 40% GPU 1
)

use mullama::{ModelParams, SplitMode};

// Layer split: distribute layers across GPUs
let params = ModelParams::default()
    .with_n_gpu_layers(-1)
    .with_split_mode(SplitMode::Layer)
    .with_tensor_split(&[0.6, 0.4]);  // 60% GPU 0, 40% GPU 1

// Row split: split individual tensors (better for very large layers)
let params = ModelParams::default()
    .with_n_gpu_layers(-1)
    .with_split_mode(SplitMode::Row)
    .with_tensor_split(&[0.5, 0.5]);  // Equal split

Split Modes¶

Mode	Description	Best For
Layer	Distributes entire layers across GPUs	Unequal GPU sizes
Row	Splits individual weight matrices across GPUs	Equal GPUs, very large models

Selecting a Primary GPU¶

let params = ModelParams::default()
    .with_n_gpu_layers(-1)
    .with_main_gpu(1);  // Use GPU 1 as primary (0-indexed)

Multi-GPU Examples¶

# Two GPUs: 70/30 split (GPU 0 has more VRAM)
# GPU 0: RTX 4090 (24 GB), GPU 1: RTX 3090 (24 GB)
mullama run llama3.3:70b --n-gpu-layers -1 --tensor-split "0.5,0.5"

# Three GPUs: equal split
mullama run llama3.3:70b --n-gpu-layers -1 --tensor-split "0.33,0.33,0.34"

Verification¶

Check GPU Detection¶

# NVIDIA
nvidia-smi
nvcc --version

# AMD ROCm
rocm-smi
rocminfo

# Apple Metal
system_profiler SPDisplaysDataType | grep Metal

# OpenCL (any vendor)
clinfo

Verify GPU Is Being Used¶

Node.jsPythonRust

const { Model, gpu } = require('mullama');

console.log('CUDA available:', gpu.cudaAvailable());
console.log('Metal available:', gpu.metalAvailable());

const model = new Model('model.gguf', { nGpuLayers: -1 });
console.log('GPU layers loaded:', model.gpuLayers);

from mullama import Model, gpu

print(f"CUDA available: {gpu.cuda_available()}")
print(f"Metal available: {gpu.metal_available()}")

model = Model("model.gguf", n_gpu_layers=-1)
print(f"GPU layers loaded: {model.gpu_layers}")

use mullama::gpu;

fn check_gpu_status() {
    if gpu::cuda_available() {
        println!("CUDA available");
        println!("  Devices: {}", gpu::cuda_device_count());
        for i in 0..gpu::cuda_device_count() {
            let (used, total) = gpu::cuda_memory_info(i);
            println!("  GPU {}: {:.1} GB / {:.1} GB",
                i, used as f64 / 1e9, total as f64 / 1e9);
        }
    }

    if gpu::metal_available() {
        println!("Metal available (Apple Silicon)");
    }

    if gpu::rocm_available() {
        println!("ROCm available");
        println!("  Devices: {}", gpu::rocm_device_count());
    }
}

Monitor GPU During Inference¶

# NVIDIA: watch GPU utilization in real-time
watch -n 1 nvidia-smi

# AMD: monitor ROCm devices
watch -n 1 rocm-smi

# All platforms: check if GPU layers were loaded (in log output)
MULLAMA_LOG=debug mullama run llama3.2:1b "Hello"

What to Look For

When GPU is active, you should see:

nvidia-smi: GPU utilization > 0%, memory usage increases when model loads
Log output: Messages like offloaded 32/32 layers to GPU
Performance: Significantly higher tokens/second compared to CPU-only

Performance Tuning¶

Batch Size¶

Larger batch sizes improve GPU utilization but require more VRAM:

Node.jsPythonRust

const model = new Model('model.gguf', {
    nGpuLayers: -1,
    contextSize: 4096,
    batchSize: 1024  // Higher = better GPU utilization
});

model = Model("model.gguf",
    n_gpu_layers=-1,
    context_size=4096,
    batch_size=1024  # Higher = better GPU utilization
)

let ctx_params = ContextParams::default()
    .with_n_ctx(4096)
    .with_n_batch(1024)    // Higher = better GPU utilization
    .with_n_threads(1);    // Fewer CPU threads when GPU is primary

CPU Thread Count¶

When using GPU acceleration, reduce CPU threads since most compute is on the GPU:

let ctx_params = ContextParams::default()
    .with_n_threads(2)          // Minimal CPU threads for GPU workloads
    .with_n_threads_batch(4);   // Slightly more for batch processing

Troubleshooting¶

CUDA out of memory

Reduce GPU layers or context size:

# Reduce layers
model = Model("model.gguf", n_gpu_layers=20)

# Or use a smaller context
model = Model("model.gguf", n_gpu_layers=-1, context_size=1024)

Or use a more aggressively quantized model (Q4_0 instead of Q5_K_M).

GPU not detected at runtime

Verify the environment variable was set before building:
```
echo $LLAMA_CUDA  # Should print "1"
```

Rebuild from clean:

cargo clean
export LLAMA_CUDA=1
cargo build --release --features full

Check that the GPU driver is loaded:

nvidia-smi  # Should show GPU info, not an error

Metal performance is slow

Disable debug output:

export GGML_METAL_NDEBUG=1

Use -1 for n_gpu_layers to offload all layers. Apple Silicon unified memory can handle this.

ROCm build fails

Ensure HIP development packages are installed:

sudo apt install -y hip-dev rocblas-dev hipblas-dev

Verify your GPU is supported:

rocminfo | grep "Marketing Name"

OpenCL performance is poor

OpenCL/CLBlast is the slowest GPU backend. Consider:

Switching to CUDA (NVIDIA) or ROCm (AMD) if available
Using Metal on macOS
Ensuring CLBlast (not just OpenCL headers) is properly installed
Trying different n_gpu_layers values -- sometimes partial offload is better with OpenCL

Multi-GPU not splitting correctly

Verify all GPUs are detected:

nvidia-smi -L  # Should list all GPUs

Ensure tensor_split values sum to 1.0 and match the number of GPUs.

Next Steps

Your First Project -- Build a chatbot with GPU-accelerated inference
Installation -- Feature flags and build options
Streaming Guide -- Real-time token generation