LLaMA

LLaMA (Large Language Model Meta AI) is Meta's family of open-weight language models, known for strong performance and extensive community support.

Overview

Property	Value
Architecture	Decoder-only Transformer
Parameters	7B, 13B, 30B, 65B (v1); 7B, 13B, 70B (v2); 8B, 70B (v3)
Context Length	2K (v1), 4K (v2), 8K (v3), up to 128K (v3.1)
Attention	Multi-head (v1, v2 7B/13B), Grouped Query (v2 70B, v3)
Position Encoding	RoPE (Rotary Position Embedding)
Activation	SwiGLU
Normalization	RMSNorm
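
RMSNorm is the one entry in this table that has no example further down, so here is a minimal sketch of it on a plain `f32` slice. It is independent of unillm: the function name `rms_norm` and its signature are assumptions for illustration, not the library's API.

```rust
/// RMSNorm: divide each element by the root-mean-square of the vector,
/// then multiply by a learned per-channel weight. Unlike LayerNorm there
/// is no mean subtraction and no bias.
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    assert_eq!(x.len(), weight.len(), "one weight per channel");
    // Mean of squares over the hidden dimension.
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    // eps keeps the division stable when the input is close to zero.
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}
```

With all weights set to 1.0 the output has a mean square of about 1, which is the property the normalization is meant to enforce; `eps` plays the role of `rms_norm_eps` in the configuration.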

Quick Start

use unillm::models_v2::llama::{LlamaModelV2, LlamaConfig};
use unillm::weight_loader_core::WeightLoader;
use unillm::{Model, GenerationConfig};

// Load model
let weights = WeightLoader::from_gguf("llama-7b.gguf")?;
let config = LlamaConfig::from_gguf_metadata(weights.metadata())?;
let model = LlamaModelV2::from_weights(config, weights)?;

// Generate
let response = model.generate(
    "Explain quantum computing:",
    &GenerationConfig::default(),
)?;
println!("{}", response);

Configuration

model_config!(LlamaConfig {
    vocab_size: usize = 32000,
    hidden_size: usize = 4096,
    intermediate_size: usize = 11008,
    num_hidden_layers: usize = 32,
    num_attention_heads: usize = 32,
    num_key_value_heads: usize = 32,
    max_position_embeddings: usize = 2048,
    rope_theta: f32 = 10000.0,
    rope_scaling: Option<String> = None,
    rms_norm_eps: f32 = 1e-6,
    pad_token_id: i64 = 0,
    bos_token_id: i64 = 1,
    eos_token_id: i64 = 2,
});
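
As a sanity check on these defaults, the sketch below estimates the parameter count from the shape fields alone. `Dims`, `head_dim`, and `approx_param_count` are hypothetical names that mirror a subset of the config; they are not part of unillm. The estimate assumes untied input and output embeddings and no bias terms.

```rust
/// Hypothetical plain struct mirroring the shape fields of `LlamaConfig`.
struct Dims {
    vocab_size: u64,
    hidden_size: u64,
    intermediate_size: u64,
    num_hidden_layers: u64,
    num_attention_heads: u64,
    num_key_value_heads: u64,
}

/// Width of a single attention head.
fn head_dim(d: &Dims) -> u64 {
    d.hidden_size / d.num_attention_heads
}

/// Approximate parameter count, assuming untied embeddings and no biases.
fn approx_param_count(d: &Dims) -> u64 {
    let kv_dim = head_dim(d) * d.num_key_value_heads;
    // q_proj and o_proj are hidden x hidden; k_proj and v_proj are hidden x kv_dim.
    let attn = 2 * d.hidden_size * d.hidden_size + 2 * d.hidden_size * kv_dim;
    // gate_proj, up_proj, and down_proj.
    let mlp = 3 * d.hidden_size * d.intermediate_size;
    // Two RMSNorm weight vectors per layer.
    let norms = 2 * d.hidden_size;
    // Token embedding plus output head.
    let embeddings = 2 * d.vocab_size * d.hidden_size;
    // The trailing hidden_size term is the final RMSNorm.
    embeddings + d.num_hidden_layers * (attn + mlp + norms) + d.hidden_size
}
```

For the default values above this gives a head dimension of 128 and 6,738,415,616 parameters, which is where the "7B" label comes from.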

Model Sizes

Variant	hidden_size	num_layers	num_heads	kv_heads
LLaMA 7B	4096	32	32	32
LLaMA 13B	5120	40	40	40
LLaMA 2 7B	4096	32	32	32
LLaMA 2 13B	5120	40	40	40
LLaMA 2 70B	8192	80	64	8
LLaMA 3 8B	4096	32	32	8
LLaMA 3 70B	8192	80	64	8

Features

Grouped Query Attention (GQA)

LLaMA 2 70B and all LLaMA 3 models use GQA: several query heads share each key/value head, which shrinks the KV cache.

// GQA configuration
let config = LlamaConfig {
    num_attention_heads: 64,    // Query heads
    num_key_value_heads: 8,     // KV heads (8:1 ratio)
    ..Default::default()
};
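
To make the memory effect concrete, here is a small sketch of the two calculations behind GQA: which KV head a query head reads from, and how large the KV cache is for one sequence. Both function names are hypothetical, and the mapping assumes query heads are grouped contiguously, which may differ from how a given implementation lays out its heads.

```rust
/// KV head used by a given query head, assuming contiguous grouping.
fn kv_head_for_query(q_head: usize, num_q_heads: usize, num_kv_heads: usize) -> usize {
    assert!(num_q_heads % num_kv_heads == 0, "query heads must divide evenly");
    let group_size = num_q_heads / num_kv_heads;
    q_head / group_size
}

/// Bytes needed to cache keys and values for one sequence across all layers.
fn kv_cache_bytes(
    layers: u64,
    kv_heads: u64,
    head_dim: u64,
    seq_len: u64,
    bytes_per_elem: u64,
) -> u64 {
    // Factor of 2: one tensor for keys, one for values.
    2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem
}
```

With the LLaMA 2 70B shape from the table (80 layers, 8 KV heads, head dimension 8192 / 64 = 128), a 4096-token F16 cache comes to 1,342,177,280 bytes (1.25 GiB). With 64 KV heads it would be eight times that, 10 GiB.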

RoPE (Rotary Position Embedding)

Position information is encoded using rotary embeddings, which rotate each query and key vector by an angle proportional to its token position:

// Standard RoPE
let config = LlamaConfig {
    rope_theta: 10000.0,
    rope_scaling: None,
    ..Default::default()
};

// Extended context (LLaMA 3)
let config = LlamaConfig {
    rope_theta: 500000.0,  // LLaMA 3 raises theta from 10000 to 500000
    // Optional: the stock 8K LLaMA 3 checkpoints use no scaling.
    rope_scaling: Some("linear".to_string()),
    ..Default::default()
};
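
The role of `rope_theta` is easier to see in a standalone sketch of the rotation itself. The code below is an illustration on a single head vector, not unillm's implementation: `rope_inv_freq` and `apply_rope` are hypothetical names. It pairs adjacent elements, whereas some implementations pair element `i` with element `i + d/2`.

```rust
/// Inverse frequencies theta^(-2i/d) for i in 0..d/2.
fn rope_inv_freq(head_dim: usize, theta: f32) -> Vec<f32> {
    (0..head_dim / 2)
        .map(|i| theta.powf(-2.0 * i as f32 / head_dim as f32))
        .collect()
}

/// Rotate one head vector in place for the token at position `pos`.
/// Each adjacent pair (x[2i], x[2i+1]) is rotated by pos * inv_freq[i].
fn apply_rope(x: &mut [f32], pos: usize, theta: f32) {
    let inv_freq = rope_inv_freq(x.len(), theta);
    for (i, f) in inv_freq.iter().enumerate() {
        let angle = pos as f32 * f;
        let (sin, cos) = angle.sin_cos();
        let (a, b) = (x[2 * i], x[2 * i + 1]);
        x[2 * i] = a * cos - b * sin;
        x[2 * i + 1] = a * sin + b * cos;
    }
}
```

Because every pair is only rotated, vector norms are unchanged, and the dot product of a rotated query and key depends only on the distance between their positions. A larger `theta` lowers the frequencies, so the slowest-rotating pairs take longer to wrap around, which is why it is raised for longer contexts.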

SwiGLU Activation

The FFN uses SwiGLU (Swish-Gated Linear Unit):

// In MLP forward pass
let gate = ops_fn::linear(&hidden, &gate_proj, None)?;
let up = ops_fn::linear(&hidden, &up_proj, None)?;
let gate = ops_fn::silu(&gate)?;  // Swish activation
let hidden = ops_fn::mul(&gate, &up)?;
let output = ops_fn::linear(&hidden, &down_proj, None)?;
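
The same computation can be written without any tensor library, which makes the arithmetic easy to check by hand. This is a sketch on `Vec<f32>` with hypothetical helper names (`silu`, `matvec`, `swiglu_mlp`). It assumes each projection is stored as a list of rows and has no bias.

```rust
/// SiLU (Swish with beta = 1): x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// y = W x, where `w` holds one row per output element.
fn matvec(w: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    w.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}

/// SwiGLU feed-forward: down_proj( silu(gate_proj x) * (up_proj x) ).
fn swiglu_mlp(
    x: &[f32],
    gate_proj: &[Vec<f32>],
    up_proj: &[Vec<f32>],
    down_proj: &[Vec<f32>],
) -> Vec<f32> {
    let gate = matvec(gate_proj, x);
    let up = matvec(up_proj, x);
    // Element-wise gating: the activated gate scales the up projection.
    let hidden: Vec<f32> = gate.iter().zip(&up).map(|(g, u)| silu(*g) * u).collect();
    matvec(down_proj, &hidden)
}
```

The gate and up projections both map from `hidden_size` to `intermediate_size`, and the down projection maps back, which is why the MLP carries three weight matrices rather than the two of a classic feed-forward block.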

Loading from Ollama

use unillm::ollama::OllamaRegistry;

// LLaMA 2
let path = OllamaRegistry::pull("llama2:7b")?;

// LLaMA 3
let path = OllamaRegistry::pull("llama3:8b")?;

// Quantized versions
let path = OllamaRegistry::pull("llama3:8b-q4_0")?;

Generation Examples

Chat Completion

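// LLaMA 3 chat format: each turn is header, blank line, content, <|eot_id|>,
// with no newline between <|eot_id|> and the next header.
let prompt = "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n\
You are a helpful assistant.<|eot_id|>\
<|start_header_id|>user<|end_header_id|>\n\n\
Hello!<|eot_id|>\
<|start_header_id|>assistant<|end_header_id|>\n\n";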

let config = GenerationConfig {
    max_new_tokens: 256,
    temperature: 0.7,
    top_p: 0.9,
    stop_sequences: vec!["<|eot_id|>".to_string()],
    ..Default::default()
};

let response = model.generate(prompt, &config)?;
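
Hand-writing the special tokens gets error-prone once a conversation has more than a couple of turns. A small helper can assemble the prompt instead. The sketch below follows Meta's published LLaMA 3 template, in which each turn is a role header, a blank line, the content, and `<|eot_id|>`, with nothing between turns. `format_llama3_chat` is a hypothetical name, not a unillm function.

```rust
/// Build a LLaMA 3 style chat prompt from (role, content) pairs.
/// Ends with an open assistant header so the model writes the next turn.
fn format_llama3_chat(messages: &[(&str, &str)]) -> String {
    let mut prompt = String::from("<|begin_of_text|>");
    for (role, content) in messages {
        prompt.push_str(&format!(
            "<|start_header_id|>{}<|end_header_id|>\n\n{}<|eot_id|>",
            role, content
        ));
    }
    prompt.push_str("<|start_header_id|>assistant<|end_header_id|>\n\n");
    prompt
}
```

Calling it with `[("system", "You are a helpful assistant."), ("user", "Hello!")]` yields a single string that can be passed to `model.generate`. If the tokenizer already prepends a beginning-of-text token, the leading `<|begin_of_text|>` should be dropped to avoid doubling it.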

Code Generation

let config = GenerationConfig {
    temperature: 0.2,  // Lower for code
    max_new_tokens: 512,
    ..Default::default()
};

let response = model.generate(
    "Write a Python function to sort a list:",
    &config,
)?;
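
To see why a lower temperature suits code, it helps to look at what `temperature` and `top_p` do to the next-token distribution. The following is a generic sketch of temperature scaling and nucleus (top-p) filtering, not necessarily how unillm's sampler is written. Both function names are hypothetical, and the code assumes `temperature > 0` and finite logits.

```rust
/// Softmax over logits divided by `temperature` (must be > 0).
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    let scaled: Vec<f32> = logits.iter().map(|l| l / temperature).collect();
    // Subtract the maximum before exponentiating for numerical stability.
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Token ids in the smallest set whose cumulative probability reaches
/// `top_p`, ordered from most to least likely.
fn top_p_candidates(probs: &[f32], top_p: f32) -> Vec<usize> {
    let mut order: Vec<usize> = (0..probs.len()).collect();
    // Sort ids by descending probability.
    order.sort_by(|&a, &b| probs[b].partial_cmp(&probs[a]).unwrap());
    let mut kept = Vec::new();
    let mut cumulative = 0.0;
    for id in order {
        kept.push(id);
        cumulative += probs[id];
        if cumulative >= top_p {
            break;
        }
    }
    kept
}
```

For logits `[2.0, 1.0, 0.0]`, a temperature of 1.0 gives the top token about 67% of the mass, while 0.2 gives it over 99%. With `top_p = 0.9`, the candidate set shrinks from two tokens to one. A sampler would then renormalize over the kept ids and draw from them.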

Memory Requirements

Model	F32	F16	Q8_0	Q4_K_M
7B	28 GB	14 GB	7 GB	4 GB
13B	52 GB	26 GB	13 GB	7 GB
70B	280 GB	140 GB	70 GB	40 GB
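
The F32, F16, and Q8_0 columns appear to follow a simple rule: parameter count times bits per weight, divided by eight. The sketch below reproduces that arithmetic. `weight_gb` is a hypothetical helper, and the result covers weights only, so the KV cache and activations need additional memory on top.

```rust
/// Approximate weight storage in gigabytes (10^9 bytes).
/// `params_billion` is the parameter count in billions.
fn weight_gb(params_billion: f64, bits_per_weight: f64) -> f64 {
    params_billion * bits_per_weight / 8.0
}
```

For example, `weight_gb(7.0, 16.0)` gives 14.0 and `weight_gb(70.0, 32.0)` gives 280.0, matching the table. Quantized formats also store per-block scales, so their effective bits per weight sit slightly above the nominal width (8.5 rather than 8 for Q8_0), and real files are a little larger than the rounded figures shown.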

Variants

LLaMA 1

Original release (February 2023)
Sizes: 7B, 13B, 30B, 65B
Context: 2048 tokens

LLaMA 2

Released July 2023; trained on 2 trillion tokens, about 40% more data than LLaMA 1
Sizes: 7B, 13B, 70B
Context: 4096 tokens
70B uses GQA

LLaMA 3

Released April 2024
Sizes: 8B, 70B
Context: 8K at release, extended to 128K in LLaMA 3.1
All sizes use GQA

CodeLlama

Code-specialized LLaMA 2
See CodeLlama documentation

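Best Practices

Use quantized models on consumer hardware; per the table above, Q4_K_M brings a 7B model from 14 GB (F16) down to about 4 GB
Match the prompt format to the one the model was trained on, especially for chat models
Prefer GQA models (LLaMA 2 70B, LLaMA 3) when KV-cache memory is the constraint, since fewer KV heads mean a smaller cache
Set the context length no higher than you need, because KV-cache memory grows linearly with it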