LLaMA¶
LLaMA (Large Language Model Meta AI) is Meta's family of open-weight language models, known for strong performance and extensive community support.
Overview¶
| Property | Value |
|---|---|
| Architecture | Decoder-only Transformer |
| Parameters | 7B, 13B, 30B, 65B (v1), 7B-70B (v2/v3) |
| Context Length | 2K-4K (v1/v2), up to 128K (v3) |
| Attention | Multi-head (v1), Grouped Query (v2+) |
| Position Encoding | RoPE (Rotary Position Embedding) |
| Activation | SwiGLU |
| Normalization | RMSNorm |
Quick Start¶
use unillm::models_v2::llama::{LlamaModelV2, LlamaConfig};
use unillm::weight_loader_core::WeightLoader;
use unillm::{Model, GenerationConfig};
// Load model
let weights = WeightLoader::from_gguf("llama-7b.gguf")?;
let config = LlamaConfig::from_gguf_metadata(weights.metadata())?;
let model = LlamaModelV2::from_weights(config, weights)?;
// Generate
let response = model.generate(
"Explain quantum computing:",
&GenerationConfig::default(),
)?;
println!("{}", response);
Configuration¶
model_config!(LlamaConfig {
vocab_size: usize = 32000,
hidden_size: usize = 4096,
intermediate_size: usize = 11008,
num_hidden_layers: usize = 32,
num_attention_heads: usize = 32,
num_key_value_heads: usize = 32,
max_position_embeddings: usize = 2048,
rope_theta: f32 = 10000.0,
rope_scaling: Option<String> = None,
rms_norm_eps: f32 = 1e-6,
pad_token_id: i64 = 0,
bos_token_id: i64 = 1,
eos_token_id: i64 = 2,
});
Model Sizes¶
| Variant | hidden_size | num_layers | num_heads | kv_heads |
|---|---|---|---|---|
| LLaMA 7B | 4096 | 32 | 32 | 32 |
| LLaMA 13B | 5120 | 40 | 40 | 40 |
| LLaMA 2 7B | 4096 | 32 | 32 | 32 |
| LLaMA 2 13B | 5120 | 40 | 40 | 40 |
| LLaMA 2 70B | 8192 | 80 | 64 | 8 |
| LLaMA 3 8B | 4096 | 32 | 32 | 8 |
| LLaMA 3 70B | 8192 | 80 | 64 | 8 |
Features¶
Grouped Query Attention (GQA)¶
LLaMA 2 70B and LLaMA 3 use GQA for memory efficiency:
// GQA configuration
let config = LlamaConfig {
num_attention_heads: 64, // Query heads
num_key_value_heads: 8, // KV heads (8:1 ratio)
..Default::default()
};
RoPE (Rotary Position Embedding)¶
Position information is encoded using rotary embeddings:
// Standard RoPE
let config = LlamaConfig {
rope_theta: 10000.0,
rope_scaling: None,
..Default::default()
};
// Extended context (LLaMA 3)
let config = LlamaConfig {
rope_theta: 500000.0, // Higher theta for longer context
rope_scaling: Some("linear".to_string()),
..Default::default()
};
SwiGLU Activation¶
The FFN uses SwiGLU (Swish-Gated Linear Unit):
// In MLP forward pass
let gate = ops_fn::linear(&hidden, &gate_proj, None)?;
let up = ops_fn::linear(&hidden, &up_proj, None)?;
let gate = ops_fn::silu(&gate)?; // Swish activation
let hidden = ops_fn::mul(&gate, &up)?;
let output = ops_fn::linear(&hidden, &down_proj, None)?;
Loading from Ollama¶
use unillm::ollama::OllamaRegistry;
// LLaMA 2
let path = OllamaRegistry::pull("llama2:7b")?;
// LLaMA 3
let path = OllamaRegistry::pull("llama3:8b")?;
// Quantized versions
let path = OllamaRegistry::pull("llama3:8b-q4_0")?;
Generation Examples¶
Chat Completion¶
let prompt = r#"<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
Hello!<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
"#;
let config = GenerationConfig {
max_new_tokens: 256,
temperature: 0.7,
top_p: 0.9,
stop_sequences: vec!["<|eot_id|>".to_string()],
..Default::default()
};
let response = model.generate(prompt, &config)?;
Code Generation¶
let config = GenerationConfig {
temperature: 0.2, // Lower for code
max_new_tokens: 512,
..Default::default()
};
let response = model.generate(
"Write a Python function to sort a list:",
&config,
)?;
Memory Requirements¶
| Model | F32 | F16 | Q8_0 | Q4_K_M |
|---|---|---|---|---|
| 7B | 28 GB | 14 GB | 7 GB | 4 GB |
| 13B | 52 GB | 26 GB | 13 GB | 7 GB |
| 70B | 280 GB | 140 GB | 70 GB | 40 GB |
Variants¶
LLaMA 1¶
- Original release (2023)
- Sizes: 7B, 13B, 30B, 65B
- Context: 2048 tokens
LLaMA 2¶
- Improved training
- Sizes: 7B, 13B, 70B
- Context: 4096 tokens
- 70B uses GQA
LLaMA 3¶
- Latest release (2024)
- Sizes: 8B, 70B
- Context: 8K standard, up to 128K
- All sizes use GQA
CodeLlama¶
- Code-specialized LLaMA 2
- See CodeLlama documentation
Best Practices¶
- Use quantized models for consumer hardware
- Match prompt format to training (especially chat models)
- Use GQA models (70B, LLaMA 3) for memory efficiency
- Set appropriate context length to avoid memory issues