LLaVA
LLaVA (Large Language-and-Vision Assistant) is a multimodal model that combines a vision encoder with a large language model for visual understanding and conversation.
Overview
| Property | Value |
|----------|-------|
| Architecture | Vision Encoder + Projector + LLM |
| Vision Encoder | CLIP ViT-L/14 |
| Language Model | LLaMA / Vicuna |
| Parameters | 7B, 13B |
| Image Size | 336x336 |
| Context Length | 4096 tokens |
Quick Start
use unillm::models_v2::llava::{LlavaModelV2, LlavaConfig};
use unillm::weight_loader_core::WeightLoader;
use unillm::{Model, ModelInputs, GenerationConfig};
// Load model
let weights = WeightLoader::from_gguf("llava-v1.5-7b.gguf")?;
let config = LlavaConfig::from_gguf_metadata(weights.metadata())?;
let model = LlavaModelV2::from_weights(config, weights)?;
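// Prepare multimodal input (`text_tokens` and `image_tensor` come from your
// tokenizer and image preprocessing; see Image Processing)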
let inputs = ModelInputs::Multimodal {
input_ids: text_tokens,
pixel_values: Some(image_tensor),
attention_mask: None,
image_mask: None,
};
// Generate response
let response = model.generate_from_inputs(&inputs, &GenerationConfig::default())?;
Configuration
model_config!(LlavaConfig {
// Vision encoder
vision_hidden_size: usize = 1024,
vision_intermediate_size: usize = 4096,
vision_num_layers: usize = 24,
vision_num_heads: usize = 16,
vision_patch_size: usize = 14,
image_size: usize = 336,
// Projector
projector_hidden_size: usize = 4096,
projector_type: String = "mlp2x_gelu".to_string(),
// Language model
vocab_size: usize = 32000,
hidden_size: usize = 4096,
intermediate_size: usize = 11008,
num_hidden_layers: usize = 32,
num_attention_heads: usize = 32,
num_key_value_heads: usize = 32,
max_position_embeddings: usize = 4096,
rms_norm_eps: f32 = 1e-5,
});
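As a worked example of how the vision fields interact: the ViT splits the image into non-overlapping square patches, so the number of visual tokens follows from image_size and vision_patch_size. The sketch below is illustrative only. The function name `num_image_tokens` is hypothetical and not part of unillm, and it assumes one token per patch with the class token dropped.

```rust
/// Number of patch tokens a ViT produces for a square image.
/// Hypothetical helper, not part of unillm.
/// Returns None if the image size is not a multiple of the patch size.
fn num_image_tokens(image_size: usize, patch_size: usize) -> Option<usize> {
    if patch_size == 0 || image_size % patch_size != 0 {
        return None;
    }
    let patches_per_side = image_size / patch_size;
    Some(patches_per_side * patches_per_side)
}
```

With the defaults above, 336 / 14 = 24 patches per side, so 24 x 24 = 576 image tokens take up part of the 4096-token context.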
Architecture
Components
Image ────────────────────────────────────────────────┐
         │                                            │
         ▼                                            │
┌─────────────────┐                                   │
│ Vision Encoder  │  CLIP ViT-L/14                    │
│ (frozen/tuned)  │  224x224 or 336x336               │
└─────────────────┘                                   │
         │                                            │
         ▼                                            │
┌─────────────────┐                                   │
│    Projector    │  MLP: vision_dim → llm_dim        │
│    (trained)    │                                   │
└─────────────────┘                                   │
         │                                            │
         ▼                                            │
┌─────────────────────────────────────────────────────┐
│                     LLM Decoder                     │
│      [image tokens] + [text tokens] → response      │
│              LLaMA / Vicuna / Mistral               │
└─────────────────────────────────────────────────────┘
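The last stage of the diagram, where image tokens and text tokens form one sequence, can be sketched as follows. This is a minimal illustration using plain vectors, not the unillm implementation. `merge_embeddings` and `image_token_id` are hypothetical names, and the sketch assumes the tokenizer maps the `<image>` placeholder to a single reserved id.

```rust
/// Replace each `<image>` placeholder with the full run of projected
/// image embeddings, leaving text embeddings in their original order.
/// `image_token_id` is a hypothetical id for the placeholder token.
fn merge_embeddings(
    token_ids: &[u32],
    text_embeds: &[Vec<f32>],
    image_embeds: &[Vec<f32>],
    image_token_id: u32,
) -> Vec<Vec<f32>> {
    assert_eq!(token_ids.len(), text_embeds.len());
    let mut merged = Vec::with_capacity(token_ids.len() + image_embeds.len());
    for (id, embed) in token_ids.iter().zip(text_embeds) {
        if *id == image_token_id {
            // One placeholder expands into all image tokens.
            merged.extend(image_embeds.iter().cloned());
        } else {
            merged.push(embed.clone());
        }
    }
    merged
}
```

The merged sequence is longer than the tokenized prompt, which is why image tokens count against the context length.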
Projector Types
// MLP with GELU (LLaVA 1.5)
let projector = MLP {
layers: vec![
Linear(vision_dim, llm_dim),
GELU,
Linear(llm_dim, llm_dim),
]
};
// Simple linear (LLaVA 1.0)
let projector = Linear(vision_dim, llm_dim);
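The snippets above are schematic. A self-contained sketch of the two-layer projector on plain `Vec<f32>` data might look like the following. The types `Linear` and `MlpProjector` are illustrative stand-ins, not unillm types, and GELU uses the tanh approximation here because the Rust standard library has no `erf`.

```rust
/// Dense layer with row-major weights of shape [out_dim][in_dim].
/// Illustrative stand-in, not a unillm type.
struct Linear {
    weight: Vec<Vec<f32>>,
    bias: Vec<f32>,
}

impl Linear {
    fn forward(&self, x: &[f32]) -> Vec<f32> {
        self.weight
            .iter()
            .zip(&self.bias)
            .map(|(row, b)| row.iter().zip(x).map(|(w, v)| w * v).sum::<f32>() + b)
            .collect()
    }
}

/// GELU, tanh approximation (an assumption; exact GELU uses erf).
fn gelu(x: f32) -> f32 {
    let c = (2.0 / std::f32::consts::PI).sqrt();
    0.5 * x * (1.0 + (c * (x + 0.044715 * x * x * x)).tanh())
}

/// Two-layer projector: Linear -> GELU -> Linear.
struct MlpProjector {
    fc1: Linear, // vision_dim -> llm_dim
    fc2: Linear, // llm_dim -> llm_dim
}

impl MlpProjector {
    /// Project one patch embedding into the LLM embedding space.
    fn forward(&self, patch: &[f32]) -> Vec<f32> {
        let hidden: Vec<f32> = self.fc1.forward(patch).into_iter().map(gelu).collect();
        self.fc2.forward(&hidden)
    }
}
```

The projector is applied independently to each patch embedding, so the number of image tokens is unchanged; only the width of each embedding changes.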
LLaVA Versions
LLaVA 1.0
- Initial release
- Linear projector
- CLIP ViT-L/14
- LLaMA 7B/13B
LLaVA 1.5
- Two-layer MLP projector with GELU (replaces the linear projector)
- Higher resolution (336x336)
- Improved instruction following
- More training data
LLaVA 1.6 (LLaVA-NeXT)
- Multiple image resolutions
- Dynamic aspect ratio
- Improved OCR
- Mistral/Qwen backbones
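To illustrate what "multiple image resolutions" and "dynamic aspect ratio" can mean in practice, here is a sketch of choosing a target resolution from a fixed candidate list. It is loosely modeled on the selection heuristic in the reference LLaVA-NeXT code, as an assumption: prefer the candidate that preserves the most image detail after an aspect-preserving resize, then the one with the least padding. The function name and the candidate list in the usage note are illustrative, not unillm API.

```rust
/// Pick the candidate (width, height) that keeps the most image detail after
/// an aspect-preserving resize, breaking ties by least padded area.
/// Hypothetical helper; returns None for empty input or zero-sized images.
fn select_best_resolution(original: (u32, u32), candidates: &[(u32, u32)]) -> Option<(u32, u32)> {
    if original.0 == 0 || original.1 == 0 {
        return None;
    }
    let (ow, oh) = (original.0 as f64, original.1 as f64);
    let original_pixels = original.0 as u64 * original.1 as u64;
    let mut best: Option<((u32, u32), u64, u64)> = None;
    for &(w, h) in candidates {
        // Largest scale at which the image still fits inside the candidate.
        let scale = (w as f64 / ow).min(h as f64 / oh);
        let (sw, sh) = ((ow * scale) as u64, (oh * scale) as u64);
        // Upscaling adds no detail, so cap at the original pixel count.
        let effective = (sw * sh).min(original_pixels);
        let wasted = (w as u64 * h as u64).saturating_sub(effective);
        let better = match best {
            None => true,
            Some((_, e, wst)) => effective > e || (effective == e && wasted < wst),
        };
        if better {
            best = Some(((w, h), effective, wasted));
        }
    }
    best.map(|(res, _, _)| res)
}
```

For example, with candidates built from multiples of 336 such as (336, 672), (672, 336), and (672, 672), a wide 1024x512 image maps to (672, 336), while a tall 512x1024 image maps to (336, 672).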
Generation Examples
Visual Q&A
let prompt = "<image>\nWhat is shown in this image?";
let inputs = ModelInputs::Multimodal {
input_ids: tokenizer.encode(&prompt)?,
pixel_values: Some(image),
attention_mask: None,
image_mask: None,
};
let config = GenerationConfig {
max_new_tokens: 256,
temperature: 0.7,
..Default::default()
};
let answer = model.generate_from_inputs(&inputs, &config)?;
Detailed Description
let prompt = "<image>\nDescribe this image in detail, including colors, objects, and their positions.";
// Rebuild `inputs` from this prompt as in the Visual Q&A example before generating.
let response = model.generate_from_inputs(&inputs, &config)?;
OCR and Document Understanding
let prompt = "<image>\nRead and transcribe all text visible in this image.";
// Rebuild `inputs` from this prompt as in the Visual Q&A example before generating.
let response = model.generate_from_inputs(&inputs, &config)?;
Conversation
let conversation = vec![
"<image>",
"USER: What do you see in this image?",
"ASSISTANT: I see a cat sitting on a windowsill.",
"USER: What color is the cat?",
"ASSISTANT:",
];
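let prompt = conversation.join("\n");
// The prompt contains an <image> placeholder, so the image must be passed too.
let inputs = ModelInputs::Multimodal {
input_ids: tokenizer.encode(&prompt)?,
pixel_values: Some(image),
attention_mask: None,
image_mask: None,
};
let response = model.generate_from_inputs(&inputs, &config)?;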
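For longer dialogues it can help to build the prompt from structured turns rather than hand-written strings. The sketch below reproduces the USER/ASSISTANT layout used above. `Turn` and `build_prompt` are hypothetical helpers, not unillm API.

```rust
/// One completed turn of the dialogue (hypothetical helper type).
enum Turn<'a> {
    User(&'a str),
    Assistant(&'a str),
}

/// Build a prompt with the image placeholder first, one line per turn,
/// and a trailing "ASSISTANT:" for the model to complete.
fn build_prompt(history: &[Turn<'_>]) -> String {
    let mut lines = vec!["<image>".to_string()];
    for turn in history {
        lines.push(match turn {
            Turn::User(text) => format!("USER: {text}"),
            Turn::Assistant(text) => format!("ASSISTANT: {text}"),
        });
    }
    lines.push("ASSISTANT:".to_string());
    lines.join("\n")
}
```

After each response, append it as a `Turn::Assistant` entry, add the next `Turn::User` question, and rebuild the prompt.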
Image Processing
Preprocessing
fn preprocess_image(image: &Image, target_size: usize) -> Result<Tensor> {
// Resize to target size
let resized = image.resize(target_size, target_size)?;
// Normalize with CLIP stats
let mean = [0.48145466, 0.4578275, 0.40821073];
let std = [0.26862954, 0.26130258, 0.27577711];
let tensor = resized.to_tensor()?;
let normalized = normalize(&tensor, &mean, &std)?;
Ok(normalized)
}
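The function above relies on `Image`, `Tensor`, and `normalize`, which are not defined here. As a self-contained illustration of the normalization step alone, the sketch below works on a raw interleaved RGB byte buffer and emits channel-first floats. `normalize_rgb` is a hypothetical name, resizing is assumed to have happened already, and the channel-first output layout is an assumption about what the vision encoder expects.

```rust
const CLIP_MEAN: [f32; 3] = [0.48145466, 0.4578275, 0.40821073];
const CLIP_STD: [f32; 3] = [0.26862954, 0.26130258, 0.27577711];

/// Convert interleaved RGB bytes (HWC) into a normalized CHW float buffer.
/// Hypothetical helper; assumes the image is already resized.
/// Returns None if the buffer length does not match width * height * 3.
fn normalize_rgb(pixels: &[u8], width: usize, height: usize) -> Option<Vec<f32>> {
    if pixels.len() != width * height * 3 {
        return None;
    }
    let plane = width * height;
    let mut out = vec![0.0f32; plane * 3];
    for (i, rgb) in pixels.chunks_exact(3).enumerate() {
        for c in 0..3 {
            let scaled = rgb[c] as f32 / 255.0; // [0, 255] -> [0, 1]
            out[c * plane + i] = (scaled - CLIP_MEAN[c]) / CLIP_STD[c];
        }
    }
    Some(out)
}
```

Each channel is scaled to [0, 1] first and then shifted and divided by the per-channel CLIP statistics, so a mid-gray image ends up close to zero in every channel.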
Multiple Images (LLaVA-NeXT)
let prompt = "<image>\n<image>\nCompare these two images.";
let inputs = ModelInputs::Multimodal {
input_ids: tokenizer.encode(&prompt)?,
pixel_values: Some(concat_images(&[image1, image2])?),
attention_mask: None,
image_mask: None,
};
Memory Requirements
| Variant | F16 | Q8_0 | Q4_K_M |
|---------|-----|------|--------|
| LLaVA 7B | 15 GB | 8 GB | 5 GB |
| LLaVA 13B | 27 GB | 14 GB | 8 GB |
Note: Vision encoder adds ~1 GB on top.
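A rough way to sanity-check figures like these is to multiply the parameter count by the bits stored per weight. The sketch below does only that: it covers weights, not the KV cache or activations, so real usage will be higher. `weight_gb` is a hypothetical helper, and the parameter counts are nominal (7e9, 13e9) rather than exact.

```rust
/// Approximate weight storage in decimal gigabytes (weights only).
/// Hypothetical helper for back-of-envelope estimates.
fn weight_gb(params: f64, bits_per_weight: f64) -> f64 {
    params * bits_per_weight / 8.0 / 1e9
}
```

F16 stores 16 bits per weight, and Q8_0 stores 8.5 bits per weight (32 one-byte values plus a 2-byte scale per block). For a nominal 7B model that gives `weight_gb(7e9, 16.0)` = 14 GB and `weight_gb(7e9, 8.5)` of about 7.4 GB. Q4_K_M mixes several block types, so its effective bits per weight varies by model; a value slightly under 5 is a reasonable assumption for estimates.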
Performance Tips
- Batch images - Process multiple images together
- Cache vision features - Reuse for same image, different prompts
- Quantize LLM - Vision encoder can stay F16
- Resolution trade-off - 224px faster, 336px better quality
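The "cache vision features" tip can be sketched with a small map keyed by a hash of the image bytes. This is an illustration, not unillm API: `VisionCache` and `get_or_encode` are hypothetical, the `encode` closure stands in for the vision encoder plus projector, and a 64-bit hash is assumed to be collision-free enough for this purpose.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Caches encoder output keyed by a hash of the raw image bytes.
/// Hypothetical helper; hash collisions are not handled.
struct VisionCache {
    entries: HashMap<u64, Vec<f32>>,
}

impl VisionCache {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Returns cached features, running `encode` only on a cache miss.
    fn get_or_encode<F>(&mut self, image_bytes: &[u8], encode: F) -> &[f32]
    where
        F: FnOnce(&[u8]) -> Vec<f32>,
    {
        let mut hasher = DefaultHasher::new();
        image_bytes.hash(&mut hasher);
        let key = hasher.finish();
        self.entries.entry(key).or_insert_with(|| encode(image_bytes))
    }
}
```

Asking several questions about the same image then pays the vision encoder cost once; only the LLM decoding is repeated per prompt.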
Use Cases
Ideal For
- Visual Q&A - Answer questions about images
- Image description - Generate detailed captions
- Document OCR - Read text from images
- Visual reasoning - Compare, analyze, explain
Comparison
| Task | LLaVA 1.5 | GPT-4V | Qwen2-VL |
|------|-----------|--------|----------|
| General VQA | Good | Excellent | Very Good |
| OCR | Good | Excellent | Excellent |
| Reasoning | Good | Excellent | Very Good |
| Speed | Fast | Slow (API) | Medium |
Best Practices
- Use appropriate resolution - Match training (336px for 1.5)
- Clear prompts - Be specific about what you want
- Image placeholder - Use <image> where the image goes
- Chat format - Use proper USER/ASSISTANT format