LLaVA¶

LLaVA (Large Language-and-Vision Assistant) is a multimodal model that combines a vision encoder with a large language model for visual understanding and conversation.

Overview¶

Property	Value
Architecture	Vision Encoder + Projector + LLM
Vision Encoder	CLIP ViT-L/14
Language Model	LLaMA / Vicuna
Parameters	7B, 13B
Image Size	336x336
Context Length	4096 tokens

Quick Start¶

use unillm::models_v2::llava::{LlavaModelV2, LlavaConfig};
use unillm::weight_loader_core::WeightLoader;
use unillm::{Model, ModelInputs, GenerationConfig};

// Load model
let weights = WeightLoader::from_gguf("llava-v1.5-7b.gguf")?;
let config = LlavaConfig::from_gguf_metadata(weights.metadata())?;
let model = LlavaModelV2::from_weights(config, weights)?;

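// Prepare multimodal input.
// `text_tokens` is the tokenized prompt (including the <image> placeholder);
// `image_tensor` is the preprocessed image (see Image Processing).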
let inputs = ModelInputs::Multimodal {
    input_ids: text_tokens,
    pixel_values: Some(image_tensor),
    attention_mask: None,
    image_mask: None,
};

// Generate response
let response = model.generate_from_inputs(&inputs, &GenerationConfig::default())?;

Configuration¶

model_config!(LlavaConfig {
    // Vision encoder
    vision_hidden_size: usize = 1024,
    vision_intermediate_size: usize = 4096,
    vision_num_layers: usize = 24,
    vision_num_heads: usize = 16,
    vision_patch_size: usize = 14,
    image_size: usize = 336,

    // Projector
    projector_hidden_size: usize = 4096,
    projector_type: String = "mlp2x_gelu".to_string(),

    // Language model
    vocab_size: usize = 32000,
    hidden_size: usize = 4096,
    intermediate_size: usize = 11008,
    num_hidden_layers: usize = 32,
    num_attention_heads: usize = 32,
    num_key_value_heads: usize = 32,
    max_position_embeddings: usize = 4096,
    rms_norm_eps: f32 = 1e-5,
});
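A few derived quantities follow directly from these defaults. The sketch below recomputes them with plain arithmetic. The `Dims` struct and its methods are illustrative names, not part of `unillm`, and the code assumes one image token per vision patch, as in LLaVA 1.5.

```rust
/// Illustrative subset of the defaults above (not the real `LlavaConfig`).
struct Dims {
    image_size: usize,
    vision_patch_size: usize,
    hidden_size: usize,
    num_attention_heads: usize,
    max_position_embeddings: usize,
}

impl Dims {
    /// Values copied from the default configuration shown above.
    fn llava_1_5_7b() -> Self {
        Dims {
            image_size: 336,
            vision_patch_size: 14,
            hidden_size: 4096,
            num_attention_heads: 32,
            max_position_embeddings: 4096,
        }
    }

    /// Patches per image: (image_size / patch_size)^2.
    fn image_tokens(&self) -> usize {
        let per_side = self.image_size / self.vision_patch_size;
        per_side * per_side
    }

    /// Width of each attention head in the language model.
    fn head_dim(&self) -> usize {
        self.hidden_size / self.num_attention_heads
    }

    /// Context positions left for text and generation after one image.
    fn text_budget(&self) -> usize {
        self.max_position_embeddings
            .saturating_sub(self.image_tokens())
    }
}
```

With the defaults, one 336x336 image occupies 576 of the 4096 context positions, leaving 3520 for the prompt and the generated answer.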

Architecture¶

Components¶

Image ─────────────────────────────────────────────────┐
       │                                               │
       ▼                                               │
┌─────────────────┐                                    │
│ Vision Encoder  │  CLIP ViT-L/14                     │
│ (frozen/tuned)  │  224x224 or 336x336               │
└─────────────────┘                                    │
       │                                               │
       ▼                                               │
┌─────────────────┐                                    │
│ Projector       │  MLP: vision_dim → llm_dim        │
│ (trained)       │                                    │
└─────────────────┘                                    │
       │                                               │
       ▼                                               │
┌─────────────────────────────────────────────────────┐
│                    LLM Decoder                       │
│  [image tokens] + [text tokens] → response          │
│  LLaMA / Vicuna / Mistral                           │
└─────────────────────────────────────────────────────┘
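The last box in the diagram, where image tokens and text tokens form one sequence, can be sketched as follows. This is a minimal illustration of the idea, not the `unillm` implementation: the `Slot` enum, the `splice_image` function, and the placeholder token ID are all hypothetical.

```rust
/// One position in the sequence fed to the LLM decoder.
#[derive(Debug, Clone, PartialEq)]
enum Slot {
    /// A regular text token, looked up in the embedding table.
    Text(u32),
    /// The i-th projected image feature.
    Image(usize),
}

/// Replace the first `image_token_id` with `num_image_tokens` image slots.
/// Any later occurrence of the placeholder is kept as a plain text token.
fn splice_image(
    input_ids: &[u32],
    image_token_id: u32,
    num_image_tokens: usize,
) -> Vec<Slot> {
    let mut out = Vec::with_capacity(input_ids.len() + num_image_tokens);
    let mut spliced = false;
    for &id in input_ids {
        if id == image_token_id && !spliced {
            out.extend((0..num_image_tokens).map(Slot::Image));
            spliced = true;
        } else {
            out.push(Slot::Text(id));
        }
    }
    out
}
```

A prompt of N tokens containing one placeholder grows to N - 1 + 576 positions at 336x336, which is why the image takes a fixed share of the context window regardless of prompt length.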

Projector Types¶

// MLP with GELU (LLaVA 1.5); pseudocode for projector_type = "mlp2x_gelu"
let projector = MLP {
    layers: vec![
        Linear(vision_dim, llm_dim),
        GELU,
        Linear(llm_dim, llm_dim),
    ]
};

// Simple linear (LLaVA 1.0)
let projector = Linear(vision_dim, llm_dim);
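For readers who want to see the arithmetic behind the two projector shapes, here is a dependency-free sketch of the MLP variant applied to a single patch vector. The `Linear` struct and `mlp_project` function are illustrative, not `unillm` types. Because the Rust standard library has no `erf`, the sketch uses the tanh approximation of GELU, which may differ slightly from the activation used in training.

```rust
/// Dense layer: y = W x + b, with `weight` stored as [out][in].
struct Linear {
    weight: Vec<Vec<f32>>,
    bias: Vec<f32>,
}

impl Linear {
    fn forward(&self, x: &[f32]) -> Vec<f32> {
        self.weight
            .iter()
            .zip(&self.bias)
            .map(|(row, b)| {
                // Dot product of one weight row with the input, plus bias.
                row.iter().zip(x).map(|(w, xi)| w * xi).sum::<f32>() + b
            })
            .collect()
    }
}

/// GELU, tanh approximation (an assumption; see the note above).
fn gelu(x: f32) -> f32 {
    let c = (2.0 / std::f32::consts::PI).sqrt();
    0.5 * x * (1.0 + (c * (x + 0.044715 * x * x * x)).tanh())
}

/// Two-layer projector: Linear(vision_dim, llm_dim) -> GELU -> Linear(llm_dim, llm_dim).
fn mlp_project(fc1: &Linear, fc2: &Linear, patch: &[f32]) -> Vec<f32> {
    let hidden: Vec<f32> = fc1.forward(patch).into_iter().map(gelu).collect();
    fc2.forward(&hidden)
}
```

The linear projector of LLaVA 1.0 corresponds to calling `fc1.forward(patch)` alone. In both cases the projector runs once per patch, so a 336x336 image yields 576 vectors of size `llm_dim`.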

LLaVA Versions¶

LLaVA 1.0¶

Initial release
Linear projector
CLIP ViT-L/14
LLaMA 7B/13B

LLaVA 1.5¶

Two-layer MLP projector with GELU (replaces the linear projector)
Higher resolution (336x336, up from 224x224)
Improved instruction following
More training data

LLaVA 1.6 (LLaVA-NeXT)¶

Multiple image resolutions
Dynamic aspect ratio
Improved OCR
Mistral/Qwen backbones
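Supporting multiple resolutions and aspect ratios means choosing a target canvas for each input image. One plausible selection rule is sketched below: keep as much of the original detail as possible, then prefer the candidate with the least unused area. The function name and the candidate list in the usage note are assumptions for illustration, not values taken from `unillm` or confirmed against a specific checkpoint.

```rust
/// Pick the candidate (width, height) that preserves the most image detail,
/// breaking ties by the least unused canvas area.
fn select_resolution(
    orig: (u64, u64),
    candidates: &[(u64, u64)],
) -> Option<(u64, u64)> {
    let (ow, oh) = orig;
    if ow == 0 || oh == 0 {
        return None;
    }
    // (candidate, effective pixels, wasted pixels) of the best option so far.
    let mut best: Option<((u64, u64), u64, u64)> = None;
    for &(w, h) in candidates {
        // Fit the image inside the candidate while keeping its aspect ratio.
        let (dw, dh) = if w * oh <= h * ow {
            (w, oh * w / ow) // width is the limiting side
        } else {
            (ow * h / oh, h) // height is the limiting side
        };
        // Upscaling adds no detail, so cap at the original pixel count.
        let effective = (dw * dh).min(ow * oh);
        let wasted = w * h - effective;
        let better = match best {
            None => true,
            Some((_, e, wst)) => {
                effective > e || (effective == e && wasted < wst)
            }
        };
        if better {
            best = Some(((w, h), effective, wasted));
        }
    }
    best.map(|(c, _, _)| c)
}
```

With candidates built from 336px tiles, such as 336x672, 672x336, 672x672, 1008x336, and 336x1008, a wide 1000x500 image maps to 672x336 and a square 500x500 image maps to 672x672.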

Generation Examples¶

Visual Q&A¶

let prompt = "<image>\nWhat is shown in this image?";

let inputs = ModelInputs::Multimodal {
    input_ids: tokenizer.encode(&prompt)?,
    pixel_values: Some(image),
    attention_mask: None,
    image_mask: None,
};

let config = GenerationConfig {
    max_new_tokens: 256,
    temperature: 0.7,
    ..Default::default()
};

let answer = model.generate_from_inputs(&inputs, &config)?;

Detailed Description¶

let prompt = "<image>\nDescribe this image in detail, including colors, objects, and their positions.";

// Rebuild `inputs` from this prompt as in the Visual Q&A example, then generate
let response = model.generate_from_inputs(&inputs, &config)?;

OCR and Document Understanding¶

let prompt = "<image>\nRead and transcribe all text visible in this image.";

// Rebuild `inputs` from this prompt as in the Visual Q&A example, then generate
let response = model.generate_from_inputs(&inputs, &config)?;

Conversation¶

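let conversation = vec![
    "<image>",
    "USER: What do you see in this image?",
    "ASSISTANT: I see a cat sitting on a windowsill.",
    "USER: What color is the cat?",
    "ASSISTANT:",
];

let prompt = conversation.join("\n");

// The prompt contains <image>, so the image must be passed with the tokens
let inputs = ModelInputs::Multimodal {
    input_ids: tokenizer.encode(&prompt)?,
    pixel_values: Some(image),
    attention_mask: None,
    image_mask: None,
};

let response = model.generate_from_inputs(&inputs, &config)?;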
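Writing the turn markers by hand is easy to get wrong as a conversation grows. A small helper can build the same prompt from structured turns. The `Turn` struct and `build_prompt` function below are hypothetical and only reproduce the layout used in the example above.

```rust
/// One exchange; `assistant` is `None` for the turn the model should answer.
struct Turn<'a> {
    user: &'a str,
    assistant: Option<&'a str>,
}

/// Build a prompt with a leading <image> placeholder and USER/ASSISTANT lines.
fn build_prompt(turns: &[Turn<'_>]) -> String {
    let mut lines = vec!["<image>".to_string()];
    for turn in turns {
        lines.push(format!("USER: {}", turn.user));
        match turn.assistant {
            Some(reply) => lines.push(format!("ASSISTANT: {}", reply)),
            // An open "ASSISTANT:" line cues the model to respond.
            None => lines.push("ASSISTANT:".to_string()),
        }
    }
    lines.join("\n")
}
```

After each response, fill in the `assistant` field of the last turn and append a new `Turn` with the next question before calling `build_prompt` again.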

Image Processing¶

Preprocessing¶

fn preprocess_image(image: &Image, target_size: usize) -> Result<Tensor> {
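    // Resize to a square target_size x target_size input
    // (if resize does not preserve aspect ratio, non-square images are stretched)
    let resized = image.resize(target_size, target_size)?;

    // Normalize with the CLIP per-channel mean and standard deviation (RGB order)
    let mean = [0.48145466, 0.4578275, 0.40821073];
    let std = [0.26862954, 0.26130258, 0.27577711];

    // Assumes to_tensor() yields values scaled to [0, 1], which the CLIP stats expect
    let tensor = resized.to_tensor()?;
    let normalized = normalize(&tensor, &mean, &std)?;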

    Ok(normalized)
}
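The normalization step can also be written without any tensor library, which makes the per-channel arithmetic explicit. This sketch assumes the input is an already-resized buffer of interleaved 8-bit RGB pixels and produces a channel-first float buffer. `normalize_rgb` is an illustrative name, not a `unillm` function.

```rust
const CLIP_MEAN: [f32; 3] = [0.48145466, 0.4578275, 0.40821073];
const CLIP_STD: [f32; 3] = [0.26862954, 0.26130258, 0.27577711];

/// Convert interleaved RGB bytes (HWC) into a normalized CHW float buffer.
/// Returns `None` if the buffer length does not match width * height * 3.
fn normalize_rgb(pixels: &[u8], width: usize, height: usize) -> Option<Vec<f32>> {
    if pixels.len() != width * height * 3 {
        return None;
    }
    let plane = width * height;
    let mut out = vec![0.0f32; plane * 3];
    for (i, px) in pixels.chunks_exact(3).enumerate() {
        for c in 0..3 {
            // Scale [0, 255] to [0, 1], then standardize per channel.
            let scaled = px[c] as f32 / 255.0;
            out[c * plane + i] = (scaled - CLIP_MEAN[c]) / CLIP_STD[c];
        }
    }
    Some(out)
}
```

For a 336x336 input the result has 3 x 336 x 336 values, with all red values first, then green, then blue.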

Multiple Images (LLaVA-NeXT)¶

let prompt = "<image>\n<image>\nCompare these two images.";

let inputs = ModelInputs::Multimodal {
    input_ids: tokenizer.encode(&prompt)?,
    pixel_values: Some(concat_images(&[image1, image2])?),
    attention_mask: None,
    image_mask: None,
};
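With several images, a mismatch between the number of `<image>` placeholders and the number of images supplied is an easy mistake. A pre-flight check along these lines can catch it before inference. `check_image_count` is a hypothetical helper, not part of `unillm`.

```rust
/// Verify that the prompt has exactly one <image> placeholder per image.
fn check_image_count(prompt: &str, num_images: usize) -> Result<(), String> {
    let placeholders = prompt.matches("<image>").count();
    if placeholders == num_images {
        Ok(())
    } else {
        Err(format!(
            "prompt has {} <image> placeholder(s) but {} image(s) were supplied",
            placeholders, num_images
        ))
    }
}
```

Calling it with the prompt above and `2` returns `Ok(())`, while passing a single image returns an error describing the mismatch.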

Memory Requirements¶

Variant	F16	Q8_0	Q4_K_M
LLaVA 7B	15 GB	8 GB	5 GB
LLaVA 13B	27 GB	14 GB	8 GB

Note: Vision encoder adds ~1 GB on top.
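The table values are close to parameter count times bits per weight, with headroom for runtime buffers. The helper below reproduces the weight-only part of that estimate. The bits-per-weight figures in the usage note are approximations (Q8_0 stores 32 weights in 34 bytes, so 8.5 bits; the Q4_K_M figure of about 4.85 bits is an assumed average), and `weight_gb` is an illustrative name.

```rust
/// Approximate size of the weights alone, in GB (10^9 bytes).
/// Excludes the KV cache, activations, and the vision encoder.
fn weight_gb(params: f64, bits_per_weight: f64) -> f64 {
    params * bits_per_weight / 8.0 / 1e9
}
```

For a nominal 7B model this gives 14 GB at F16, about 7.4 GB at Q8_0, and about 4.2 GB at Q4_K_M. The gap to the table comes from runtime overhead, which depends on context length and batch size.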

Performance Tips¶

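Batch images - Encode several images in one vision-encoder pass
Cache vision features - Reuse the encoder output when asking several prompts about the same image
Quantize the LLM - It holds most of the weights; the vision encoder can stay in F16
Resolution trade-off - 224px is faster (256 image tokens), 336px gives better quality (576 image tokens)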
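Caching vision features can be as simple as a map from an image hash to the encoder output. The sketch below uses only the standard library. `FeatureCache` is a hypothetical type, the `encode` closure stands in for the real vision encoder, and keying by a 64-bit hash alone ignores the small chance of collisions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Caches encoder output keyed by a hash of the raw image bytes.
struct FeatureCache {
    entries: HashMap<u64, Vec<f32>>,
}

impl FeatureCache {
    fn new() -> Self {
        FeatureCache {
            entries: HashMap::new(),
        }
    }

    /// Return cached features for `image_bytes`, running `encode` only on a miss.
    fn get_or_encode<F>(&mut self, image_bytes: &[u8], encode: F) -> &[f32]
    where
        F: FnOnce(&[u8]) -> Vec<f32>,
    {
        let mut hasher = DefaultHasher::new();
        image_bytes.hash(&mut hasher);
        let key = hasher.finish();
        self.entries
            .entry(key)
            .or_insert_with(|| encode(image_bytes))
    }
}
```

Asking a second question about the same image then skips the vision encoder and projector entirely; only the language model runs again.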

Use Cases¶

Ideal For¶

Visual Q&A - Answer questions about images
Image description - Generate detailed captions
Document OCR - Read text from images
Visual reasoning - Compare, analyze, explain

Comparison¶

Task	LLaVA 1.5	GPT-4V	Qwen2-VL
General VQA	Good	Excellent	Very Good
OCR	Good	Excellent	Excellent
Reasoning	Good	Excellent	Very Good
Speed	Fast	Slow (API)	Medium

Best Practices¶

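Use appropriate resolution - Match the resolution the checkpoint was trained at (336px for 1.5)
Clear prompts - Be specific about what you want
Image placeholder - Put <image> in the prompt at the position where the image belongs
Chat format - Use the USER:/ASSISTANT: turn format shown under Conversation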