GPU Acceleration¶
GPU acceleration dramatically improves inference speed, especially for larger models. Mullama supports multiple GPU backends through llama.cpp's build system.
Supported Backends¶
| Backend | GPU Hardware | Platform | Environment Variable |
|---|---|---|---|
| CUDA | NVIDIA GeForce, RTX, Tesla, A100, H100 | Linux, Windows | LLAMA_CUDA=1 |
| Metal | Apple M1/M2/M3/M4 (integrated GPU) | macOS | LLAMA_METAL=1 |
| ROCm (HIP) | AMD Radeon RX 6000+, Instinct MI | Linux | LLAMA_HIPBLAS=1 |
| OpenCL (CLBlast) | Any OpenCL-capable GPU | Linux, macOS, Windows | LLAMA_CLBLAST=1 |
Which Backend Should I Use?
- NVIDIA GPU: Use CUDA (best performance)
- Apple Silicon Mac: Use Metal (automatic, no setup needed)
- AMD GPU on Linux: Use ROCm
- Other GPUs or older hardware: Use OpenCL (broadest compatibility, lower performance)
NVIDIA CUDA¶
CUDA provides the best performance for NVIDIA GPUs. Requires CUDA Toolkit 11.0 or later (12.x recommended).
Driver Requirements¶
| GPU Generation | Minimum Driver | Recommended Driver |
|---|---|---|
| RTX 40-series (Ada) | 525.60+ | 545.xx+ |
| RTX 30-series (Ampere) | 470.42+ | 545.xx+ |
| RTX 20-series (Turing) | 440.33+ | 545.xx+ |
| GTX 10-series (Pascal) | 410.48+ | 535.xx+ |
| Data Center (A100, H100) | 450.80+ | 545.xx+ |
Step 1: Install NVIDIA Drivers¶
Download the latest driver from nvidia.com/drivers and run the installer.
Step 2: Install CUDA Toolkit¶
# Add NVIDIA CUDA repository
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
# Install CUDA Toolkit
sudo apt install -y cuda-toolkit-12-4
# Configure environment
echo 'export CUDA_PATH="/usr/local/cuda"' >> ~/.bashrc
echo 'export PATH="$CUDA_PATH/bin:$PATH"' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH="$CUDA_PATH/lib64:$LD_LIBRARY_PATH"' >> ~/.bashrc
source ~/.bashrc
Download and install from developer.nvidia.com/cuda-downloads.
Step 3: Verify Installation¶
Expected output:
Step 4: Build Mullama with CUDA¶
Persist the Environment Variable
Add to your shell profile so it is set on every build:
Apple Metal¶
Metal provides GPU acceleration on Apple Silicon Macs (M1, M2, M3, M4 series). It leverages the unified memory architecture for efficient CPU-GPU data sharing.
Setup¶
Metal support is automatically available on macOS with Apple Silicon. No additional drivers or toolkits are needed.
No Extra Setup Needed
On Apple Silicon, Metal acceleration works out of the box. The environment variable simply enables the Metal compute backend in llama.cpp during compilation.
Verify Metal Support¶
Expected output:
Metal-Specific Configuration¶
# Enable Metal (required for build)
export LLAMA_METAL=1
# Disable Metal debug output for production (faster)
export GGML_METAL_NDEBUG=1
Unified Memory Advantage
On Apple Silicon, the GPU shares memory with the CPU. This means you can set n_gpu_layers = -1 to offload ALL layers without worrying about separate VRAM limits. The only constraint is your total system RAM.
Intel Macs¶
Metal is also available on Intel Macs with dedicated AMD GPUs, but performance gains are more modest. Consider using OpenCL (LLAMA_CLBLAST=1) as an alternative on Intel Macs.
AMD ROCm¶
ROCm (Radeon Open Compute) provides GPU acceleration for AMD GPUs on Linux.
Supported GPUs¶
- AMD Instinct MI-series (MI100, MI210, MI250X, MI300X)
- AMD Radeon RX 7000 series (RDNA 3)
- AMD Radeon RX 6000 series (RDNA 2)
- AMD Radeon Pro W6000/W7000 series
Step 1: Install ROCm¶
Step 2: Configure Environment¶
echo 'export ROCM_PATH="/opt/rocm"' >> ~/.bashrc
echo 'export PATH="$ROCM_PATH/bin:$PATH"' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH="$ROCM_PATH/lib:$LD_LIBRARY_PATH"' >> ~/.bashrc
source ~/.bashrc
Step 3: Reboot and Verify¶
Reboot Required
A reboot is required after installing ROCm for the kernel modules to load properly.
Step 4: Build with ROCm¶
OpenCL (CLBlast)¶
OpenCL provides broad GPU compatibility across vendors. Performance is generally lower than native backends (CUDA, Metal, ROCm) but works on a wider range of hardware, including older GPUs and integrated graphics.
Setup¶
OpenCL runtime is typically included with GPU drivers. Install CLBlast from GitHub releases.
Build with OpenCL¶
When to Use OpenCL
Use OpenCL when:
- You have an older GPU not supported by CUDA/ROCm
- You have integrated graphics (Intel HD/Iris, AMD APU)
- You need cross-vendor GPU support
- Native backends are unavailable on your platform
GPU Layer Offloading¶
The n_gpu_layers parameter controls how many transformer layers are offloaded from CPU to GPU. This is the single most important parameter for GPU performance.
How It Works¶
A language model consists of many transformer layers stacked sequentially. Each layer that runs on the GPU avoids a CPU computation and memory transfer. More layers on the GPU means faster inference.
| Value | Behavior |
|---|---|
0 |
CPU only -- no GPU acceleration |
1 to N |
Offload the first N layers to GPU |
-1 |
Offload ALL layers to GPU (recommended if VRAM allows) |
Usage Examples¶
Finding the Right Number
Start with -1 (all layers). If you get out-of-memory errors, reduce the value by 5 until it fits. Each layer uses approximately model_size / total_layers of VRAM.
Memory Considerations¶
VRAM Requirements by Model Size¶
Approximate VRAM needed for full GPU offload (n_gpu_layers = -1):
| Model Size | Q4_K_M | Q5_K_M | Q6_K | Q8_0 | F16 |
|---|---|---|---|---|---|
| 1B | ~0.9 GB | ~1.0 GB | ~1.2 GB | ~1.5 GB | ~2.5 GB |
| 3B | ~2.2 GB | ~2.5 GB | ~2.9 GB | ~3.6 GB | ~6.5 GB |
| 7B | ~4.5 GB | ~5.1 GB | ~5.9 GB | ~7.5 GB | ~14 GB |
| 13B | ~8.0 GB | ~9.2 GB | ~10.5 GB | ~13.5 GB | ~26 GB |
| 30B | ~19 GB | ~22 GB | ~25 GB | ~32 GB | ~60 GB |
| 70B | ~40 GB | ~46 GB | ~53 GB | ~68 GB | ~140 GB |
Recommended GPUs by Model Size¶
| Model Size | Minimum GPU | Recommended GPU |
|---|---|---|
| 1B-3B | Any 4 GB GPU | GTX 1650, M1 |
| 7B | 6 GB VRAM | RTX 3060 12GB, M1 Pro |
| 13B | 10 GB VRAM | RTX 3080, RTX 4070, M2 Pro |
| 30B | 24 GB VRAM | RTX 3090, RTX 4090, M2 Max |
| 70B | 48+ GB VRAM | 2x RTX 4090, A100, M2 Ultra |
Context Size Affects VRAM
The VRAM figures above are for model weights only. Additional VRAM is needed for the KV cache, which scales with context size:
- 2048 context: +0.5-1 GB
- 4096 context: +1-2 GB
- 8192 context: +2-4 GB
- 32768 context: +8-16 GB
Multi-GPU Configuration¶
For systems with multiple GPUs, Mullama supports distributing model layers across devices.
Tensor Split¶
Control how model weight tensors are distributed across GPUs:
use mullama::{ModelParams, SplitMode};
// Layer split: distribute layers across GPUs
let params = ModelParams::default()
.with_n_gpu_layers(-1)
.with_split_mode(SplitMode::Layer)
.with_tensor_split(&[0.6, 0.4]); // 60% GPU 0, 40% GPU 1
// Row split: split individual tensors (better for very large layers)
let params = ModelParams::default()
.with_n_gpu_layers(-1)
.with_split_mode(SplitMode::Row)
.with_tensor_split(&[0.5, 0.5]); // Equal split
Split Modes¶
| Mode | Description | Best For |
|---|---|---|
| Layer | Distributes entire layers across GPUs | Unequal GPU sizes |
| Row | Splits individual weight matrices across GPUs | Equal GPUs, very large models |
Selecting a Primary GPU¶
let params = ModelParams::default()
.with_n_gpu_layers(-1)
.with_main_gpu(1); // Use GPU 1 as primary (0-indexed)
Multi-GPU Examples¶
# Two GPUs: 70/30 split (GPU 0 has more VRAM)
# GPU 0: RTX 4090 (24 GB), GPU 1: RTX 3090 (24 GB)
mullama run llama3.3:70b --n-gpu-layers -1 --tensor-split "0.5,0.5"
# Three GPUs: equal split
mullama run llama3.3:70b --n-gpu-layers -1 --tensor-split "0.33,0.33,0.34"
Verification¶
Check GPU Detection¶
# NVIDIA
nvidia-smi
nvcc --version
# AMD ROCm
rocm-smi
rocminfo
# Apple Metal
system_profiler SPDisplaysDataType | grep Metal
# OpenCL (any vendor)
clinfo
Verify GPU Is Being Used¶
use mullama::gpu;
fn check_gpu_status() {
if gpu::cuda_available() {
println!("CUDA available");
println!(" Devices: {}", gpu::cuda_device_count());
for i in 0..gpu::cuda_device_count() {
let (used, total) = gpu::cuda_memory_info(i);
println!(" GPU {}: {:.1} GB / {:.1} GB",
i, used as f64 / 1e9, total as f64 / 1e9);
}
}
if gpu::metal_available() {
println!("Metal available (Apple Silicon)");
}
if gpu::rocm_available() {
println!("ROCm available");
println!(" Devices: {}", gpu::rocm_device_count());
}
}
Monitor GPU During Inference¶
# NVIDIA: watch GPU utilization in real-time
watch -n 1 nvidia-smi
# AMD: monitor ROCm devices
watch -n 1 rocm-smi
# All platforms: check if GPU layers were loaded (in log output)
MULLAMA_LOG=debug mullama run llama3.2:1b "Hello"
What to Look For
When GPU is active, you should see:
- nvidia-smi: GPU utilization > 0%, memory usage increases when model loads
- Log output: Messages like
offloaded 32/32 layers to GPU - Performance: Significantly higher tokens/second compared to CPU-only
Performance Tuning¶
Batch Size¶
Larger batch sizes improve GPU utilization but require more VRAM:
CPU Thread Count¶
When using GPU acceleration, reduce CPU threads since most compute is on the GPU:
let ctx_params = ContextParams::default()
.with_n_threads(2) // Minimal CPU threads for GPU workloads
.with_n_threads_batch(4); // Slightly more for batch processing
Troubleshooting¶
CUDA out of memory
Reduce GPU layers or context size:
# Reduce layers
model = Model("model.gguf", n_gpu_layers=20)
# Or use a smaller context
model = Model("model.gguf", n_gpu_layers=-1, context_size=1024)
Or use a more aggressively quantized model (Q4_0 instead of Q5_K_M).
GPU not detected at runtime
-
Verify the environment variable was set before building:
-
Rebuild from clean:
-
Check that the GPU driver is loaded:
Metal performance is slow
Disable debug output:
Use -1 for n_gpu_layers to offload all layers. Apple Silicon unified memory can handle this.
ROCm build fails
Ensure HIP development packages are installed:
Verify your GPU is supported:
OpenCL performance is poor
OpenCL/CLBlast is the slowest GPU backend. Consider:
- Switching to CUDA (NVIDIA) or ROCm (AMD) if available
- Using Metal on macOS
- Ensuring CLBlast (not just OpenCL headers) is properly installed
- Trying different
n_gpu_layersvalues -- sometimes partial offload is better with OpenCL
Multi-GPU not splitting correctly
Verify all GPUs are detected:
Ensure tensor_split values sum to 1.0 and match the number of GPUs.
Next Steps
- Your First Project -- Build a chatbot with GPU-accelerated inference
- Installation -- Feature flags and build options
- Streaming Guide -- Real-time token generation