Qwen2-VL¶
Qwen2-VL is Alibaba's vision-language model with native multimodal architecture, supporting images, videos, and text in a unified framework.
Overview¶
| Property | Value |
|---|---|
| Architecture | Native Multimodal Transformer |
| Parameters | 2B, 7B, 72B |
| Context Length | 32K tokens |
| Image Resolution | Dynamic (up to 16K pixels) |
| Video Support | Yes (frame extraction) |
| Position Encoding | M-RoPE (Multimodal RoPE) |
Quick Start¶
use unillm::models_v2::qwen2_vl::{Qwen2VLModelV2, Qwen2VLConfig};
use unillm::weight_loader_core::WeightLoader;
use unillm::{Model, ModelInputs, GenerationConfig};
// Load model
let weights = WeightLoader::from_gguf("qwen2-vl-7b.gguf")?;
let config = Qwen2VLConfig::from_gguf_metadata(weights.metadata())?;
let model = Qwen2VLModelV2::from_weights(config, weights)?;
// Generate response
let inputs = ModelInputs::Multimodal {
input_ids: tokens,
pixel_values: Some(image),
attention_mask: None,
image_mask: None,
};
let response = model.generate_from_inputs(&inputs, &GenerationConfig::default())?;
Configuration¶
model_config!(Qwen2VLConfig {
// Vision encoder
vision_hidden_size: usize = 1280,
vision_intermediate_size: usize = 5120,
vision_num_layers: usize = 32,
vision_num_heads: usize = 16,
vision_patch_size: usize = 14,
temporal_patch_size: usize = 2,
spatial_merge_size: usize = 2,
// Language model
vocab_size: usize = 151936,
hidden_size: usize = 3584,
intermediate_size: usize = 18944,
num_hidden_layers: usize = 28,
num_attention_heads: usize = 28,
num_key_value_heads: usize = 4,
max_position_embeddings: usize = 32768,
rope_theta: f32 = 1000000.0,
rms_norm_eps: f32 = 1e-6,
// Multimodal
image_token_id: usize = 151655,
video_token_id: usize = 151656,
});
Model Sizes¶
| Variant | Vision Layers | LLM Layers | Total Params |
|---|---|---|---|
| Qwen2-VL 2B | 32 | 28 | 2.2B |
| Qwen2-VL 7B | 32 | 28 | 8.3B |
| Qwen2-VL 72B | 32 | 80 | 73B |
Key Features¶
Dynamic Resolution¶
Unlike fixed-resolution models, Qwen2-VL handles any image size:
// Process images at native resolution
let small_image = load_image("small.jpg")?; // 224x224
let large_image = load_image("large.jpg")?; // 1920x1080
// Both work without resizing
let response1 = model.process(&small_image, "Describe this")?;
let response2 = model.process(&large_image, "Describe this")?;
M-RoPE (Multimodal RoPE)¶
Unified position encoding for text, images, and video:
struct MRoPE {
temporal_rope: RoPE, // For video frames
height_rope: RoPE, // For image height
width_rope: RoPE, // For image width
}
fn compute_mrope(&self, positions: &Positions3D) -> Tensor {
let temporal = self.temporal_rope.forward(&positions.t)?;
let height = self.height_rope.forward(&positions.h)?;
let width = self.width_rope.forward(&positions.w)?;
concat(&[temporal, height, width])
}
Native Video Understanding¶
Process videos directly:
// Extract frames at specified FPS
let video_frames = extract_frames("video.mp4", fps: 2.0)?;
let inputs = ModelInputs::Multimodal {
input_ids: tokens,
pixel_values: Some(video_frames),
..Default::default()
};
let response = model.generate(&"Describe what happens in this video", &config)?;
Architecture¶
Vision Encoder¶
Image/Video
│
▼
┌─────────────────┐
│ Patch Embedding │ 14x14 patches + temporal
└─────────────────┘
│
▼
┌─────────────────┐
│ ViT Layers (32) │ With 3D position encoding
└─────────────────┘
│
▼
┌─────────────────┐
│ Spatial Merge │ 2x2 spatial downsampling
└─────────────────┘
│
▼
Vision Tokens
Multimodal Fusion¶
fn forward(&self, inputs: &ModelInputs) -> Result<ModelOutputs> {
// Encode vision
let vision_features = self.vision_encoder.forward(&inputs.pixel_values)?;
// Merge spatially
let merged = self.spatial_merge(&vision_features)?;
// Get text embeddings
let text_embeds = self.get_text_embeddings(&inputs.input_ids)?;
// Insert vision tokens at image positions
let hidden = self.merge_multimodal(&text_embeds, &merged, &inputs.image_mask)?;
// Forward through LLM
self.language_model.forward_from_embeddings(&hidden)
}
Generation Examples¶
Image Understanding¶
let prompt = "<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|>
What is in this image?<|im_end|>
<|im_start|>assistant
";
let config = GenerationConfig {
max_new_tokens: 256,
temperature: 0.7,
stop_sequences: vec!["<|im_end|>".to_string()],
..Default::default()
};
let response = model.generate_multimodal(&prompt, &image, &config)?;
Video Analysis¶
let prompt = "<|im_start|>user
<|vision_start|><|video_pad|><|vision_end|>
Summarize what happens in this video.<|im_end|>
<|im_start|>assistant
";
let frames = extract_frames("video.mp4", 2.0)?; // 2 FPS
let response = model.generate_multimodal(&prompt, &frames, &config)?;
Document OCR¶
let prompt = "<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|>
Extract all text from this document.<|im_end|>
<|im_start|>assistant
";
let response = model.generate_multimodal(&prompt, &document_image, &config)?;
Multi-Image Comparison¶
let prompt = "<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|>
<|vision_start|><|image_pad|><|vision_end|>
Compare these two images. What are the differences?<|im_end|>
<|im_start|>assistant
";
let images = concat_images(&[image1, image2])?;
let response = model.generate_multimodal(&prompt, &images, &config)?;
Memory Requirements¶
| Variant | F16 | Q8_0 | Q4_K_M |
|---|---|---|---|
| 2B | 5 GB | 3 GB | 2 GB |
| 7B | 17 GB | 9 GB | 5 GB |
| 72B | 145 GB | 75 GB | 42 GB |
Note: Vision encoder adds ~1-2 GB depending on image resolution.
Performance¶
Benchmarks¶
| Benchmark | Qwen2-VL 7B | LLaVA 1.6 | GPT-4V |
|---|---|---|---|
| VQAv2 | 83.0 | 79.8 | N/A |
| DocVQA | 94.5 | 78.2 | 87.2 |
| ChartQA | 83.0 | 67.3 | 78.1 |
| TextVQA | 84.3 | 72.1 | N/A |
Strengths¶
- OCR: Excellent text recognition
- Document: Strong document understanding
- Charts: Good chart/graph reading
- Video: Native video support
Loading from Ollama¶
use unillm::ollama::OllamaRegistry;
// Qwen2-VL
let path = OllamaRegistry::pull("qwen2-vl:7b")?;
// Quantized
let path = OllamaRegistry::pull("qwen2-vl:7b-q4_0")?;
Best Practices¶
- Use native resolution - Don't resize images unnecessarily
- Proper formatting - Use vision tokens correctly
- Video FPS - 1-4 FPS usually sufficient
- Batch frames - Process video frames together
Use Cases¶
Ideal For¶
- Document OCR - Best-in-class text extraction
- Chart analysis - Strong graph understanding
- Video QA - Native video support
- Multilingual - Good non-English support
Comparison¶
| Task | Best Choice |
|---|---|
| Fastest | Qwen2-VL 2B |
| Best quality | Qwen2-VL 72B |
| Document/OCR | Qwen2-VL (any) |
| General VQA | Qwen2-VL 7B |