Mistral¶

Mistral 7B is a high-performance 7B-parameter language model that combines sliding window attention with grouped query attention for efficient long-context processing.

Overview¶

Property	Value
Architecture	Decoder-only Transformer
Parameters	7B
Context Length	8K (standard), 32K (extended)
Attention	Sliding Window + Grouped Query
Position Encoding	RoPE
Activation	SwiGLU
Normalization	RMSNorm

Quick Start¶

use unillm::models_v2::mistral::{MistralModelV2, MistralConfig};
use unillm::weight_loader_core::WeightLoader;
use unillm::{Model, GenerationConfig};

// Load model
let weights = WeightLoader::from_gguf("mistral-7b.gguf")?;
let config = MistralConfig::from_gguf_metadata(weights.metadata())?;
let model = MistralModelV2::from_weights(config, weights)?;

// Generate
let response = model.generate(
    "Explain the theory of relativity:",
    &GenerationConfig::default(),
)?;

Configuration¶

model_config!(MistralConfig {
    vocab_size: usize = 32000,
    hidden_size: usize = 4096,
    intermediate_size: usize = 14336,
    num_hidden_layers: usize = 32,
    num_attention_heads: usize = 32,
    num_key_value_heads: usize = 8,
    max_position_embeddings: usize = 32768,
    sliding_window: usize = 4096,
    rope_theta: f32 = 10000.0,
    rms_norm_eps: f32 = 1e-5,
});
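
To make the last two fields concrete, here is a minimal sketch of how `rope_theta` and `rms_norm_eps` typically enter the computation. The function names `rope_inv_freq` and `rms_norm` are illustrative assumptions, not part of the unillm API; the formulas are the standard RoPE frequency schedule and RMSNorm.

```rust
/// Inverse rotation frequencies used by RoPE: theta^(-2i/d) for i in 0..d/2.
/// (Hypothetical helper, not a unillm function.)
fn rope_inv_freq(head_dim: usize, rope_theta: f32) -> Vec<f32> {
    (0..head_dim / 2)
        .map(|i| rope_theta.powf(-2.0 * i as f32 / head_dim as f32))
        .collect()
}

/// RMSNorm: divide by sqrt(mean(x^2) + eps), then scale by a learned weight.
/// (Hypothetical helper, not a unillm function.)
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}
```

With `head_dim = 128` and `rope_theta = 10000.0`, `rope_inv_freq` returns 64 frequencies starting at 1.0. The `eps` term only matters when the input vector is close to zero, where it keeps the division finite.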

Model Specifications¶

Property	Value
hidden_size	4096
num_layers	32
num_attention_heads	32
num_key_value_heads	8
head_dim	128
intermediate_size	14336
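
As a sanity check on these numbers, the sketch below derives `head_dim` and an approximate parameter count from the dimensions above. The `Dims` struct is a hypothetical stand-in for `MistralConfig`, and the count assumes no bias terms and separate (untied) input and output embeddings.

```rust
/// Hypothetical stand-in for the `MistralConfig` fields used in this estimate.
struct Dims {
    vocab_size: u64,
    hidden_size: u64,
    intermediate_size: u64,
    num_hidden_layers: u64,
    num_attention_heads: u64,
    num_key_value_heads: u64,
}

const MISTRAL_7B: Dims = Dims {
    vocab_size: 32000,
    hidden_size: 4096,
    intermediate_size: 14336,
    num_hidden_layers: 32,
    num_attention_heads: 32,
    num_key_value_heads: 8,
};

/// Per-head dimension: hidden_size split evenly across query heads.
fn head_dim(d: &Dims) -> u64 {
    d.hidden_size / d.num_attention_heads
}

/// Weight count, assuming no biases and untied embeddings.
fn parameter_count(d: &Dims) -> u64 {
    let kv_dim = d.num_key_value_heads * head_dim(d);
    let attention = 2 * d.hidden_size * d.hidden_size // Q and O projections
        + 2 * d.hidden_size * kv_dim; // K and V projections (smaller under GQA)
    let ffn = 3 * d.hidden_size * d.intermediate_size; // gate, up, down
    let norms = 2 * d.hidden_size; // two RMSNorm weights per layer
    let embeddings = 2 * d.vocab_size * d.hidden_size; // input embedding + output head
    d.num_hidden_layers * (attention + ffn + norms) + embeddings + d.hidden_size // + final norm
}
```

Under these assumptions `head_dim(&MISTRAL_7B)` is 128 and `parameter_count(&MISTRAL_7B)` is 7,241,732,096, so "7B" is a rounded label for roughly 7.24 billion weights.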

Features¶

Sliding Window Attention¶

Each token attends only to a fixed window of preceding tokens, which keeps attention cost bounded on long sequences:

let config = MistralConfig {
    sliding_window: 4096,  // Each token attends to 4K previous tokens
    max_position_embeddings: 32768,  // But supports 32K total
    ..Default::default()
};

The sliding window allows:

Memory efficiency - Fixed attention window size
Long context - Process sequences beyond window with rolling cache
Speed - Less computation per attention layer
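
The window and the rolling cache can be sketched in a few lines. This is an illustration of the idea rather than unillm's implementation: the function names are hypothetical, and the sketch assumes the window counts the current token.

```rust
/// True if the query at `q_pos` may attend to the key at `k_pos` under a
/// causal sliding window of `window` tokens (current token included).
fn can_attend(q_pos: usize, k_pos: usize, window: usize) -> bool {
    // The causal check comes first, so the subtraction cannot underflow.
    k_pos <= q_pos && q_pos - k_pos < window
}

/// Slot in a rolling KV cache of `window` entries where position `pos` is stored.
fn cache_slot(pos: usize, window: usize) -> usize {
    pos % window
}

/// Upper bound on how far back information can travel through stacked layers,
/// since each layer can pass information one window further.
fn theoretical_span(window: usize, num_layers: usize) -> usize {
    window * num_layers
}
```

With a 4096-token window, position 5000 is written to slot 904 and overwrites position 904, which is exactly the first position that has fallen out of its window. Stacking 32 such layers gives a theoretical span of 131,072 tokens, even though each layer only looks back 4096.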

Grouped Query Attention¶

Mistral uses 8 KV heads for 32 query heads (4:1 ratio):

let config = MistralConfig {
    num_attention_heads: 32,
    num_key_value_heads: 8,  // 4:1 GQA ratio
    ..Default::default()
};
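
The practical effect of the 4:1 ratio is a smaller KV cache. The sketch below shows the head mapping and the per-token cache size; both function names are hypothetical, and the mapping assumes query heads are grouped contiguously.

```rust
/// KV head shared by a given query head, assuming contiguous grouping.
fn kv_head_for(query_head: usize, num_attention_heads: usize, num_key_value_heads: usize) -> usize {
    let group_size = num_attention_heads / num_key_value_heads;
    query_head / group_size
}

/// Bytes of K and V cached per token, summed over all layers.
fn kv_cache_bytes_per_token(
    num_layers: usize,
    num_kv_heads: usize,
    head_dim: usize,
    bytes_per_value: usize,
) -> usize {
    2 * num_layers * num_kv_heads * head_dim * bytes_per_value // 2 = one K plus one V
}
```

For the configuration above at F16 (2 bytes per value), the cache costs 131,072 bytes (128 KiB) per token, compared with 524,288 bytes if all 32 heads kept their own keys and values. A full 4096-token window therefore needs about 512 MiB of cache.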

Larger Intermediate Size¶

The feed-forward network (FFN) is wider relative to the hidden size than in LLaMA, which adds capacity:

let config = MistralConfig {
    hidden_size: 4096,
    intermediate_size: 14336,  // 3.5x hidden (vs about 2.69x in LLaMA 7B)
    ..Default::default()
};
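
The intermediate size is the width at which the SwiGLU gate operates. The sketch below shows the element-wise part of that computation and the expansion ratio. The names are hypothetical, and the matrix projections are omitted: in a full FFN, `gate` and `up` would be two separate projections of the hidden state to `intermediate_size`, and the result would be projected back down to `hidden_size`.

```rust
/// SiLU (swish) activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Element-wise SwiGLU gate: silu(gate) * up.
fn swiglu(gate: &[f32], up: &[f32]) -> Vec<f32> {
    gate.iter().zip(up).map(|(g, u)| silu(*g) * u).collect()
}

/// Ratio of FFN width to hidden width.
fn ffn_expansion(hidden_size: usize, intermediate_size: usize) -> f64 {
    intermediate_size as f64 / hidden_size as f64
}
```

`ffn_expansion(4096, 14336)` evaluates to 3.5. For comparison, LLaMA 7B's published sizes (4096 and 11008) give 2.6875.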

Mistral Variants¶

Mistral 7B¶

Base model (2023)
4K sliding window (8K standard context)
Strong general performance

Mistral 7B Instruct¶

Instruction-tuned
Chat-optimized
Same architecture

Mixtral 8x7B¶

Mixture-of-Experts (MoE) version. See the Mixtral documentation.

Loading from Ollama¶

use unillm::ollama::OllamaRegistry;

// Base model
let path = OllamaRegistry::pull("mistral:7b")?;

// Instruct
let path = OllamaRegistry::pull("mistral:7b-instruct")?;

// Quantized
let path = OllamaRegistry::pull("mistral:7b-q4_0")?;

Generation Examples¶

Chat Format¶

let prompt = "<s>[INST] What is machine learning? [/INST]";

let config = GenerationConfig {
    max_new_tokens: 256,
    temperature: 0.7,
    stop_sequences: vec!["</s>".to_string()],
    ..Default::default()
};

let response = model.generate(prompt, &config)?;

Multi-turn Conversation¶

// Completed turns end with </s>; the next [INST] block follows directly.
let prompt = "<s>[INST] Hi! [/INST] Hello! How can I help?</s>[INST] What's 2+2? [/INST]";

let response = model.generate(prompt, &config)?;
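
Writing these strings by hand gets error-prone as a conversation grows. A small helper can assemble them from a list of turns. This is a sketch: `Turn` and `build_prompt` are hypothetical names rather than unillm types, and it assumes the `<s>` marker is written into the prompt text as in the examples above.

```rust
/// One exchange: a user message and, if already answered, the assistant reply.
struct Turn<'a> {
    user: &'a str,
    assistant: Option<&'a str>,
}

/// Builds a Mistral Instruct prompt from a conversation history.
/// The last turn normally has `assistant: None`, leaving the prompt open
/// for the model to complete.
fn build_prompt(turns: &[Turn<'_>]) -> String {
    let mut prompt = String::from("<s>");
    for turn in turns {
        prompt.push_str(&format!("[INST] {} [/INST]", turn.user));
        if let Some(reply) = turn.assistant {
            // Completed assistant replies are closed with the end-of-sequence marker.
            prompt.push_str(&format!(" {}</s>", reply));
        }
    }
    prompt
}
```

The resulting `String` can be passed to `model.generate` as `&prompt`, with the same `GenerationConfig` as above.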

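System Prompts¶

Mistral 7B Instruct has no dedicated system-prompt markers; the <<SYS>> tags belong to the LLaMA 2 chat format. A common workaround is to place the instructions at the start of the first user message:

let prompt = r#"<s>[INST] You are a helpful coding assistant.

Write a Python hello world [/INST]"#;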

let response = model.generate(prompt, &config)?;

Memory Requirements¶

Format	Memory (approx., weights only)
F32	28 GB
F16	14 GB
Q8_0	7 GB
Q4_K_M	4 GB
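
These figures follow from parameter count times storage width. The sketch below is a back-of-the-envelope estimate for weights only, using a hypothetical helper name. It excludes the KV cache and activations, and it assumes the caller supplies the average bits per weight for the chosen format.

```rust
/// Approximate weight memory in GB (10^9 bytes) for a given
/// average number of bits stored per weight.
fn weight_memory_gb(num_params: u64, bits_per_weight: f64) -> f64 {
    num_params as f64 * bits_per_weight / 8.0 / 1e9
}
```

With a nominal 7,000,000,000 parameters, 32 and 16 bits per weight give 28 GB and 14 GB. GGML's Q8_0 stores one 16-bit scale per block of 32 8-bit weights, so its effective width is about 8.5 bits and the estimate comes out near 7.4 GB. K-quant formats such as Q4_K_M mix block types, so their average width depends on the tensor layout.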

Performance¶

Mistral 7B outperforms LLaMA 2 13B on each of the benchmarks listed here:

Benchmark	Mistral 7B	LLaMA 2 7B	LLaMA 2 13B
MMLU	60.1	45.3	54.8
HellaSwag	81.3	77.2	80.7
Arc-C	55.5	45.9	49.4
HumanEval	30.5	12.8	18.3

Use Cases¶

Ideal For¶

General chat - Strong instruction following
Code assistance - Good at coding tasks
Long documents - Sliding window handles them well
Production - Good quality/size ratio

Comparison with Alternatives¶

Use Case	Best Choice
Smallest size	Phi-3 Mini
Best 7B quality	Mistral 7B
Longer context	Mistral 7B (32K)
MoE efficiency	Mixtral 8x7B

Best Practices¶

Use the Instruct version for chat/assistant use
Use the proper prompt format - [INST] tags matter
Leverage the sliding window for long documents
Consider Mixtral for higher quality needs