Gemma

Gemma is Google's family of lightweight, open-weight language models, built from the same research and technology as the Gemini models.

Overview

Property	Value
Architecture	Decoder-only Transformer
Parameters	2B, 7B (Gemma 1); 2B, 9B, 27B (Gemma 2)
Context Length	8192 tokens
Attention	Multi-Query (Gemma 2B), Multi-Head (Gemma 7B), Grouped-Query (Gemma 2)
Position Encoding	RoPE
Activation	GeGLU (GELU-gated linear unit)
Normalization	RMSNorm
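
The normalization row can be made concrete with a small sketch. This is a standalone illustration rather than unillm's implementation; the function name `rms_norm` is hypothetical. It assumes the Gemma-style convention in which the learned scale is stored as an offset from 1. Some converted checkpoints may already bake the offset into the weights, in which case the `1.0 +` term would be dropped.

```rust
/// Root-mean-square normalization over one hidden vector.
/// `rms_norm` is an illustrative name, not a unillm API.
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    assert_eq!(x.len(), weight.len(), "weight must match the hidden size");
    // Mean of squares across the hidden dimension.
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    // Assumption: scale is (1 + weight), so all-zero weights leave the
    // normalized vector unchanged.
    x.iter()
        .zip(weight)
        .map(|(v, w)| *v * inv_rms * (1.0 + *w))
        .collect()
}
```

With zero weights and a small `eps` such as the `1e-6` used in the configuration, the output has a mean square close to 1.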

Quick Start

use unillm::models_v2::gemma::{GemmaModelV2, GemmaConfig};
use unillm::weight_loader_core::WeightLoader;
use unillm::{Model, GenerationConfig};

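// Run this inside a function that returns a Result, since `?` propagates errors.
// Load the model weights and read the configuration from the GGUF metadata.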
let weights = WeightLoader::from_gguf("gemma-2b.gguf")?;
let config = GemmaConfig::from_gguf_metadata(weights.metadata())?;
let model = GemmaModelV2::from_weights(config, weights)?;

// Generate
let response = model.generate(
    "Write a haiku about programming:",
    &GenerationConfig::default(),
)?;

Configuration

// Defaults match Gemma 2B (see Model Sizes below).
model_config!(GemmaConfig {
    vocab_size: usize = 256000,
    hidden_size: usize = 2048,
    intermediate_size: usize = 16384,
    num_hidden_layers: usize = 18,
    num_attention_heads: usize = 8,
    num_key_value_heads: usize = 1,
    head_dim: usize = 256,
    max_position_embeddings: usize = 8192,
    rope_theta: f32 = 10000.0,
    rms_norm_eps: f32 = 1e-6,
});
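
To show how `rope_theta` and `head_dim` are used together, here is a minimal sketch of the RoPE rotation frequencies. It is independent of unillm; `rope_inv_freq` and `rotate_pair` are hypothetical helper names. Which elements of a head are paired (interleaved or split halves) varies between implementations, so the pairing is left to the caller.

```rust
/// Per-pair rotation frequencies: theta^(-2i / head_dim) for i in 0..head_dim/2.
/// Illustrative helper, not a unillm API.
fn rope_inv_freq(head_dim: usize, theta: f32) -> Vec<f32> {
    (0..head_dim / 2)
        .map(|i| theta.powf(-((2 * i) as f32) / head_dim as f32))
        .collect()
}

/// Rotate one pair of head elements by the angle for position `pos`.
fn rotate_pair(x0: f32, x1: f32, pos: usize, inv_freq: f32) -> (f32, f32) {
    let angle = pos as f32 * inv_freq;
    let (sin, cos) = angle.sin_cos();
    (x0 * cos - x1 * sin, x0 * sin + x1 * cos)
}
```

For the defaults above (`head_dim = 256`, `rope_theta = 10000.0`) this yields 128 frequencies, starting at 1.0 and decaying towards roughly 1e-4. The rotation preserves the length of each pair, so only relative position is encoded.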

Model Sizes

Variant	hidden_size	num_layers	num_heads	kv_heads
Gemma 2B	2048	18	8	1
Gemma 7B	3072	28	16	16
Gemma 2 2B	2304	26	8	4
Gemma 2 9B	3584	42	16	8
Gemma 2 27B	4608	46	32	16
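
The ratio of `num_heads` to `kv_heads` in the table determines the attention variant. The sketch below classifies a row; `AttentionKind` and `attention_kind` are hypothetical names used only for illustration.

```rust
#[derive(Debug, PartialEq)]
enum AttentionKind {
    MultiHead,    // one KV head per query head
    GroupedQuery, // several query heads share each KV head
    MultiQuery,   // all query heads share a single KV head
}

/// Classify an attention layout from its head counts.
fn attention_kind(num_heads: usize, kv_heads: usize) -> AttentionKind {
    assert!(
        kv_heads > 0 && num_heads % kv_heads == 0,
        "query heads must divide evenly across KV heads"
    );
    if kv_heads == num_heads {
        AttentionKind::MultiHead
    } else if kv_heads == 1 {
        AttentionKind::MultiQuery
    } else {
        AttentionKind::GroupedQuery
    }
}
```

Applied to the table, Gemma 2B (8, 1) is multi-query, Gemma 7B (16, 16) is multi-head, and every Gemma 2 size is grouped-query.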

Features

Multi-Query Attention

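Gemma 2B uses multi-query attention (MQA): all 8 query heads share a single key/value head, which shrinks the KV cache. The other variants use more KV heads (see Model Sizes).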

let config = GemmaConfig {
    num_attention_heads: 8,
    num_key_value_heads: 1,  // Single KV head
    ..Default::default()
};
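
The efficiency gain shows up mainly in the KV cache, whose size scales with the number of KV heads. A rough estimate, assuming one key and one value vector per token, per layer, per KV head (`kv_cache_bytes` is a hypothetical helper, not a unillm API):

```rust
/// Approximate KV cache size in bytes for a full context.
/// The factor 2 covers keys and values.
fn kv_cache_bytes(
    num_layers: u64,
    kv_heads: u64,
    head_dim: u64,
    seq_len: u64,
    bytes_per_elem: u64,
) -> u64 {
    2 * num_layers * kv_heads * head_dim * seq_len * bytes_per_elem
}
```

For Gemma 2B at 8192 tokens with 16-bit cache entries, `kv_cache_bytes(18, 1, 256, 8192, 2)` gives 150,994,944 bytes (144 MiB). The same model with 8 KV heads would need eight times as much.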

Large Head Dimension

Gemma uses a head dimension of 256, which is larger than the 64 or 128 common in comparable models:

let config = GemmaConfig {
    head_dim: 256,  // Larger than typical 64 or 128
    ..Default::default()
};
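
Because `head_dim` is configured separately, the attention width (`num_attention_heads * head_dim`) does not have to equal `hidden_size`. The sketch below computes the output widths of the query and key/value projections; `projection_widths` is a hypothetical helper, and applying `head_dim = 256` to the 7B variant is an assumption, since the configuration above only states it for the 2B defaults.

```rust
/// Output widths of the query and key/value projections,
/// returned as (query_width, kv_width).
/// Illustrative helper, not part of unillm.
fn projection_widths(num_heads: usize, kv_heads: usize, head_dim: usize) -> (usize, usize) {
    (num_heads * head_dim, kv_heads * head_dim)
}
```

For Gemma 2B this gives `(2048, 256)`, so the query width matches `hidden_size`. For Gemma 7B under the stated assumption it gives `(4096, 4096)`, wider than the 3072-dimensional hidden state, so the output projection has to map back down to `hidden_size`.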

Extended Vocabulary

Gemma uses a 256,000-entry vocabulary, which improves coverage of non-English text at the cost of a larger embedding table:

let config = GemmaConfig {
    vocab_size: 256000,  // Very large vocabulary
    ..Default::default()
};
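
The cost of the vocabulary is easy to quantify, since the embedding table holds one `hidden_size` vector per token. A small sketch with hypothetical helper names:

```rust
/// Number of parameters in the token embedding table.
fn embedding_params(vocab_size: u64, hidden_size: u64) -> u64 {
    vocab_size * hidden_size
}

/// Size of the embedding table in bytes at a given weight width.
fn embedding_bytes(vocab_size: u64, hidden_size: u64, bytes_per_weight: u64) -> u64 {
    embedding_params(vocab_size, hidden_size) * bytes_per_weight
}
```

With the default configuration (`vocab_size = 256000`, `hidden_size = 2048`), the table has 524,288,000 parameters, or about 1.05 GB at 16 bits per weight. That is a substantial share of a 2B-parameter model.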

Gemma Versions

Gemma 1

Initial release (2024)
Sizes: 2B, 7B
8K context
Strong quality for its parameter count

Gemma 2

Revised architecture with grouped-query attention in every size
Sizes: 2B, 9B, 27B
Better instruction following
Sliding-window (local) attention interleaved with global attention layers
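
Sliding-window attention restricts each token to a fixed number of recent positions, on top of the usual causal mask. The sketch below is a minimal illustration, not unillm's implementation. The boundary convention (whether the window counts the query token) and the layer schedule (which layers are local) are both assumptions, and the function names are hypothetical.

```rust
/// Causal attention check with an optional sliding window.
/// `window = None` means global attention over every earlier position.
fn can_attend(query_pos: usize, key_pos: usize, window: Option<usize>) -> bool {
    if key_pos > query_pos {
        return false; // causal: never look at future tokens
    }
    match window {
        // Assumed convention: the window includes the query token itself.
        Some(w) => query_pos - key_pos < w,
        None => true,
    }
}

/// Hypothetical layer schedule: even layers local, odd layers global.
fn layer_window(layer_idx: usize, window: usize) -> Option<usize> {
    if layer_idx % 2 == 0 {
        Some(window)
    } else {
        None
    }
}
```

With a window of 4, the token at position 10 can attend to positions 7 through 10 on a local layer and to positions 0 through 10 on a global layer.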

Loading from Ollama

use unillm::ollama::OllamaRegistry;

// Gemma 1, 2B
let gemma_path = OllamaRegistry::pull("gemma:2b")?;

// Gemma 2, 9B
let gemma2_path = OllamaRegistry::pull("gemma2:9b")?;

// Gemma 2, 9B, 4-bit quantized
let quantized_path = OllamaRegistry::pull("gemma2:9b-q4_0")?;

Generation Examples

Basic Generation

let config = GenerationConfig {
    max_new_tokens: 256,
    temperature: 0.7,
    top_p: 0.95,
    ..Default::default()
};

let response = model.generate("Explain photosynthesis:", &config)?;
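
For intuition about `temperature` and `top_p`, here is a standalone sketch of how these two settings are commonly applied to next-token logits. It is not taken from unillm; the function names are hypothetical, and a real sampler would also renormalize the kept probabilities and draw from them at random.

```rust
/// Softmax over logits divided by `temperature`.
/// Lower temperatures sharpen the distribution.
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    // Subtract the maximum for numerical stability.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|l| ((l - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Nucleus filtering: indices of the most likely tokens whose cumulative
/// probability first reaches `top_p`, in descending probability order.
fn top_p_indices(probs: &[f32], top_p: f32) -> Vec<usize> {
    let mut order: Vec<usize> = (0..probs.len()).collect();
    order.sort_by(|&a, &b| probs[b].total_cmp(&probs[a]));
    let mut kept = Vec::new();
    let mut cumulative = 0.0_f32;
    for i in order {
        kept.push(i);
        cumulative += probs[i];
        if cumulative >= top_p {
            break;
        }
    }
    kept
}
```

With probabilities `[0.5, 0.3, 0.15, 0.05]` and `top_p = 0.9`, the first three tokens are kept and the least likely one is dropped.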

Instruction Format

let prompt = "<start_of_turn>user
What is the capital of France?<end_of_turn>
<start_of_turn>model
";

let config = GenerationConfig {
    stop_sequences: vec!["<end_of_turn>".to_string()],
    ..Default::default()
};

let response = model.generate(prompt, &config)?;
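
For multi-turn conversations, the same turn markers repeat for each message. The helper below is a hypothetical sketch that builds such a prompt from a list of turns and trims a raw completion at the stop sequence. `Role`, `format_chat`, and `truncate_at_stop` are illustrative names, not part of unillm.

```rust
enum Role {
    User,
    Model,
}

/// Build a Gemma-style chat prompt that ends with an open model turn.
fn format_chat(turns: &[(Role, &str)]) -> String {
    let mut prompt = String::new();
    for (role, text) in turns {
        let name = match role {
            Role::User => "user",
            Role::Model => "model",
        };
        prompt.push_str(&format!("<start_of_turn>{name}\n{text}<end_of_turn>\n"));
    }
    // Leave the final turn open so the model writes the reply.
    prompt.push_str("<start_of_turn>model\n");
    prompt
}

/// Cut generated text at the first occurrence of `stop`, if any.
fn truncate_at_stop<'a>(text: &'a str, stop: &str) -> &'a str {
    match text.find(stop) {
        Some(i) => &text[..i],
        None => text,
    }
}
```

Calling `format_chat(&[(Role::User, "What is the capital of France?")])` reproduces the prompt string shown above. The result can be passed to `model.generate` in the same way.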

Memory Requirements

Model	F32	F16	Q8_0	Q4_K_M
2B	8 GB	4 GB	2 GB	1.2 GB
7B	28 GB	14 GB	7 GB	4 GB
9B	36 GB	18 GB	9 GB	5 GB
27B	108 GB	54 GB	27 GB	15 GB
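
The F32 and F16 columns match the nominal parameter count multiplied by the bytes per weight, so the table appears to cover weights only, without the KV cache or activations. A rough estimator under that assumption (`weight_gb` is a hypothetical helper, and gigabytes are decimal):

```rust
/// Approximate weight memory in decimal gigabytes.
/// `bits_per_weight` is 32.0 for F32, 16.0 for F16, and about 8.0 for Q8_0.
fn weight_gb(params: u64, bits_per_weight: f64) -> f64 {
    params as f64 * bits_per_weight / 8.0 / 1e9
}
```

For example, `weight_gb(7_000_000_000, 16.0)` gives 14 GB, matching the 7B row. Block-quantized formats also store per-block scales, so their effective width is slightly above the nominal value. Q8_0 is assumed here to cost about 8.5 bits per weight, which would put the 7B model nearer 7.4 GB than 7 GB.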

Use Cases

Ideal For

Edge deployment - Small model size
Quick inference - Fast on consumer hardware
Mobile applications - 2B fits in memory when quantized (about 1.2 GB at Q4_K_M)
Learning/experimentation - Good starter model

Comparison

Use Case	Recommended
Smallest footprint	Gemma 2B
Best quality/size	Gemma 2 9B
Multi-turn chat	Gemma 2 27B

Best Practices

Use Gemma 2 for better instruction following
Use a 2B variant for resource-constrained environments
Apply the <start_of_turn>/<end_of_turn> chat format for conversational use
Consider quantization (for example Q8_0 or Q4_K_M) to reduce memory use in deployment