Multi-Modal Vision-Language Models
Multi-modal models extend language models to process both text and images within a single architecture. Rather than building entirely new models, the dominant approach connects a pre-trained vision encoder (typically a Vision Transformer, ViT) to a pre-trained language model through a learned projection layer. This modular design allows each component to be trained separately and then aligned, reducing the enormous cost of training a multi-modal system from scratch [1].
1. Architecture Overview
The Vision-Language Paradigm
The modern vision-language model architecture was popularized by LLaVA (Liu et al., 2023) [1] and subsequently adopted by models including LLaVA-1.5, Qwen-VL, InternVL, and Gemma's multi-modal variants. The core insight is simple: if images can be represented as sequences of embeddings, they can be processed by the same transformer that handles text tokens.
The high-level pipeline is:
- Split the image into fixed-size patches
- Encode patches with a Vision Transformer to get visual features
- Project visual features into the language model's embedding space
- Concatenate visual tokens with text tokens
- Process the combined sequence with the language model
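The five steps above can be traced as tensor shapes. The following is a minimal Python sketch, not part of the Zig implementation: `trace_shapes` is a hypothetical helper, the dimensions are the CLIP ViT-L/14 and Vicuna-7B values from the configuration tables, and the 32-token prompt length is an arbitrary assumption. It ignores the [CLS] token.

```python
def trace_shapes(image_size, patch_size, d_vision, d_language, n_text_tokens):
    """Return the (rows, cols) shape of the tensor produced at each pipeline stage."""
    grid = image_size // patch_size   # patches per side
    n_patches = grid * grid           # step 1: split the image into patches
    return {
        "patches": (n_patches, patch_size * patch_size * 3),  # flattened RGB patches
        "visual_features": (n_patches, d_vision),             # step 2: ViT output
        "visual_tokens": (n_patches, d_language),             # step 3: projection
        "text_embeddings": (n_text_tokens, d_language),
        "combined": (n_patches + n_text_tokens, d_language),  # step 4: concatenation
    }


# Assumed example: 336x336 image, 14x14 patches, 32 prompt tokens.
shapes = trace_shapes(336, 14, 1024, 4096, 32)
for stage, shape in shapes.items():
    print(f"{stage:16s} {shape}")
```

The language model (step 5) then sees a single sequence whose length is the sum of the visual and text token counts.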
2. Key Innovations
2.1 Vision Transformer (ViT)
The Vision Transformer (Dosovitskiy et al., 2021) [2] treats an image as a sequence of patches, analogous to how a language model treats text as a sequence of tokens:
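Image Patching
For an image \( I \in \mathbb{R}^{H \times W \times C} \) and patch size \( P \), the number of patches is:
\[ N = \frac{H \cdot W}{P^2} \]
Each patch \( p_i \in \mathbb{R}^{P \times P \times C} \) is flattened and linearly projected:
\[ z_i = \text{flatten}(p_i)\, E + b, \qquad E \in \mathbb{R}^{(P^2 C) \times d_v} \]
A learnable [CLS] token is prepended and positional embeddings are added:
\[ Z^{(0)} = [z_{\text{cls}};\, z_1;\, \dots;\, z_N] + E_{\text{pos}}, \qquad E_{\text{pos}} \in \mathbb{R}^{(N+1) \times d_v} \]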
The patch embeddings are then processed by a standard transformer encoder (bidirectional attention, like BERT).
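The patching step can be illustrated on a tiny image. This is a Python sketch using nested lists rather than a tensor library; `extract_patches` is a hypothetical helper, and row-major patch order is an assumption (any fixed order works as long as the positional embeddings match it).

```python
def extract_patches(image, patch_size):
    """Split an H x W x C nested-list image into flattened, non-overlapping patches.

    Patches are returned in row-major order; each has length patch_size**2 * C.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    flat.extend(image[row][col])  # append the C channel values
            patches.append(flat)
    return patches


# 4 x 4 single-channel image whose pixel value is its row-major index.
image = [[[r * 4 + c] for c in range(4)] for r in range(4)]
patches = extract_patches(image, patch_size=2)
print(len(patches), patches[0])
```

Each flattened patch would then be multiplied by the patch projection matrix to give one \( d_v \)-dimensional embedding.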
2.2 Cross-Modal Projection
The vision encoder produces features in its own embedding space (\( d_v \)-dimensional). The language model operates in a different space (\( d_l \)-dimensional). A projection layer bridges the two:
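Vision-Language Projection
\[ h = f_{\text{proj}}(Z), \qquad Z \in \mathbb{R}^{N \times d_v}, \quad h \in \mathbb{R}^{N \times d_l} \]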
Common projection types:
| Type | Formula | Parameters |
|---|---|---|
| Linear | \( h = ZW + b \) | \( d_v \times d_l \) |
| MLP (2-layer) | \( h = \text{GELU}(ZW_1 + b_1)W_2 + b_2 \) | \( d_v \times d_h + d_h \times d_l \) |
| Cross-attention | \( h = \text{CrossAttn}(Q_{\text{lang}}, K_{\text{vis}}, V_{\text{vis}}) \) | Attention params |
The original LLaVA uses a single linear layer [1]; LLaVA-1.5 replaces it with a simple 2-layer MLP, which was found to be sufficient for strong performance.
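To make the parameter column of the table concrete, the sketch below counts weights for the linear and MLP projections. It is a Python illustration with hypothetical helper names; unlike the table, it also counts bias terms, and it assumes \( d_h = d_l \) when no hidden size is given.

```python
def linear_params(d_in, d_out, bias=True):
    """Weights (plus optional bias) of one fully connected layer."""
    return d_in * d_out + (d_out if bias else 0)


def projection_params(kind, d_v, d_l, d_h=None):
    """Parameter count of a 'linear' or 2-layer 'mlp' projection from d_v to d_l."""
    if kind == "linear":
        return linear_params(d_v, d_l)
    if kind == "mlp":
        d_h = d_l if d_h is None else d_h  # assumption: hidden width defaults to d_l
        return linear_params(d_v, d_h) + linear_params(d_h, d_l)
    raise ValueError(f"unsupported projection type: {kind}")


# d_v = 1024 (CLIP ViT-L/14); d_l = 4096 or 5120 (7B / 13B language models).
for d_l in (4096, 5120):
    print(d_l, projection_params("linear", 1024, d_l), projection_params("mlp", 1024, d_l))
```

Even the MLP variant is tiny next to a multi-billion-parameter language model, which is part of why the projection is cheap to train.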
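2.3 Sequence Construction
After projection, the visual tokens are treated identically to text tokens:
\[ X = [h_1, \dots, h_N, e_1, \dots, e_T] \in \mathbb{R}^{(N+T) \times d_l} \]
where \( h_i \) are the projected visual tokens and \( e_j \) are the text token embeddings.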
The language model processes this combined sequence autoregressively. The visual tokens provide context; the model generates text tokens conditioned on both the image and the text prompt.
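A small Python sketch of the sequence construction, using nested lists. The helper names are hypothetical, and it assumes a plain causal mask over the whole combined sequence, so every text position can attend to all visual tokens that precede it.

```python
def build_sequence(visual_tokens, text_embeddings):
    """Concatenate [visual ; text] rows; both must share the width d_language."""
    widths = {len(row) for row in visual_tokens + text_embeddings}
    if len(widths) > 1:
        raise ValueError("visual and text embeddings must share d_language")
    return visual_tokens + text_embeddings


def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


visual = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # N = 3 visual tokens, d_language = 2
text = [[1.0, 1.1], [1.2, 1.3]]                # T = 2 text tokens
sequence = build_sequence(visual, text)
mask = causal_mask(len(sequence))
print(len(sequence), mask[3])
```

Row 3 of the mask belongs to the first text token: it sees all three visual tokens and itself, but not the text token after it.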
3. Architecture Diagram
```mermaid
flowchart LR
    subgraph VISION["Vision Encoder (ViT)"]
        IMG["Input Image\n(H x W x C)"] --> PATCH["Patch Extraction\n(P x P patches)"]
        PATCH --> PROJ_V["Patch Projection\n+ Position Embedding"]
        PROJ_V --> VIT_ENC["ViT Transformer\nEncoder (L layers)"]
        VIT_ENC --> VIS_FEAT["Visual Features\n[N, d_vision]"]
    end
    subgraph BRIDGE["Cross-Modal Projection"]
        VIS_FEAT --> MLP_PROJ["MLP Projection\nd_vision -> d_language"]
        MLP_PROJ --> VIS_TOKENS["Visual Tokens\n[N, d_language]"]
    end
    subgraph LANGUAGE["Language Model (Decoder)"]
        TEXT["Text Tokens"] --> TEXT_EMB["Text Embedding"]
        VIS_TOKENS --> CONCAT["Concatenate\n[visual ; text]"]
        TEXT_EMB --> CONCAT
        CONCAT --> LLM["LLM Decoder\n(LLaMA / Gemma / etc.)"]
        LLM --> OUTPUT["Generated Text"]
    end
```
4. Configuration Parameters
4.1 Vision Encoder
| Parameter | CLIP ViT-L/14 | SigLIP-400M |
|---|---|---|
| `image_size` | 336 | 384 |
| `patch_size` | 14 | 14 |
| `n_patches` | 576 | 729 |
| `d_vision` | 1024 | 1152 |
| `n_layers` | 24 | 27 |
| `n_heads` | 16 | 16 |
| `activation` | QuickGELU | GELU |
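The two activations in the last row are close but not identical. The Python sketch below compares exact GELU with QuickGELU, assuming the common definition of QuickGELU as \( x \cdot \sigma(1.702x) \). It is only meant to show why weights trained with one activation should be run with the same one.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def quick_gelu(x):
    """Sigmoid approximation of GELU: x * sigmoid(1.702 * x)."""
    return x / (1.0 + math.exp(-1.702 * x))


for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.1f}  gelu={gelu(x):+.4f}  quick_gelu={quick_gelu(x):+.4f}")
```

The per-activation gap is small, but it is applied in every feed-forward layer of the encoder, so using the wrong variant can shift the visual features.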
4.2 Multi-Modal Configuration
| Parameter | LLaVA-1.5-7B | LLaVA-1.5-13B | Qwen-VL |
|---|---|---|---|
| Vision encoder | CLIP ViT-L/14 | CLIP ViT-L/14 | ViT-bigG |
| Language model | Vicuna-7B | Vicuna-13B | Qwen-7B |
| `d_vision` | 1024 | 1024 | 1664 |
| `d_language` | 4096 | 5120 | 4096 |
| Projection type | 2-layer MLP | 2-layer MLP | Cross-attention |
| `n_visual_tokens` | 576 | 576 | 256 |
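The Qwen-VL column has fewer visual tokens than the MLP-based models. A cross-attention projection can produce a fixed number of output tokens because the output length follows the number of query vectors, not the number of patch features. The Python sketch below shows that property with a single attention head. It is a simplification under stated assumptions, not Qwen-VL's actual adapter: the learned query/key/value projections are omitted, and `cross_attention` is a hypothetical helper.

```python
import math


def cross_attention(queries, features):
    """Single-head attention: each query attends over all visual features.

    Learned Q/K/V projection matrices are omitted, so queries and features
    must share one dimension d. Returns one output row per query.
    """
    d = len(features[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in features]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * v[j] for w, v in zip(weights, features))
            for j in range(d)
        ])
    return outputs


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]  # N = 4 visual features
queries = [[1.0, 0.0], [0.0, 1.0]]                            # M = 2 query vectors
tokens = cross_attention(queries, features)
print(len(tokens), len(tokens[0]))
```

With M queries the projection always emits M tokens, so the visual part of the sequence stays the same length however many patches the encoder produces.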
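5. Mathematical Formulation
5.1 Patch Embedding
For image \( I \in \mathbb{R}^{H \times W \times 3} \) with patch size \( P \):
\[ N = \frac{HW}{P^2}, \qquad z_i = \text{flatten}(p_i)\, E + b, \qquad Z^{(0)} = [z_{\text{cls}};\, z_1;\, \dots;\, z_N] + E_{\text{pos}} \]
where \( E \in \mathbb{R}^{3P^2 \times d_v} \) is the patch projection and \( E_{\text{pos}} \in \mathbb{R}^{(N+1) \times d_v} \) holds the positional embeddings.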
5.2 Vision Transformer Forward Pass
For each layer \( l \):
\[ Z'^{(l)} = Z^{(l)} + \text{MHA}(\text{LN}(Z^{(l)})) \]
\[ Z^{(l+1)} = Z'^{(l)} + \text{FFN}(\text{LN}(Z'^{(l)})) \]
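The two residual updates can be sketched in Python with the sublayers passed in as functions. This shows only the pre-LayerNorm wiring. The `layer_norm` here has no learned scale or shift, and `zeros` and `halve` are toy stand-ins (assumptions for illustration) for real attention and feed-forward modules.

```python
def add(a, b):
    """Element-wise sum of two [N, d] nested lists."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def layer_norm(z, eps=1e-6):
    """Normalize each row to zero mean and unit variance (no learned scale/shift)."""
    out = []
    for row in z:
        mean = sum(row) / len(row)
        var = sum((x - mean) ** 2 for x in row) / len(row)
        out.append([(x - mean) / (var + eps) ** 0.5 for x in row])
    return out


def pre_ln_block(z, mha, ffn):
    """One encoder layer: Z' = Z + MHA(LN(Z)); Z_next = Z' + FFN(LN(Z'))."""
    z_mid = add(z, mha(layer_norm(z)))
    return add(z_mid, ffn(layer_norm(z_mid)))


def zeros(z):
    """Toy sublayer that contributes nothing."""
    return [[0.0] * len(row) for row in z]


def halve(z):
    """Toy sublayer that scales its input by 0.5."""
    return [[0.5 * x for x in row] for row in z]


z = [[1.0, 3.0], [2.0, 6.0]]
print(pre_ln_block(z, mha=zeros, ffn=zeros))  # residual path only: input unchanged
print(pre_ln_block(z, mha=halve, ffn=zeros))
```

Because normalization sits inside each residual branch, the identity path from \( Z^{(l)} \) to \( Z^{(l+1)} \) is never rescaled.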
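5.3 MLP Projection (LLaVA-style)
\[ h = \text{GELU}(Z W_1 + b_1)\, W_2 + b_2 \]
where \( Z \in \mathbb{R}^{N \times d_v} \) is the vision encoder output, \( W_1 \in \mathbb{R}^{d_v \times d_h} \), \( W_2 \in \mathbb{R}^{d_h \times d_l} \), and typically \( d_h = d_l \).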
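5.4 Combined Sequence Processing
\[ X = [h_1, \dots, h_N, e_1, \dots, e_T], \qquad P(y_t \mid y_{<t}, X) = \text{softmax}\big(\text{LM}(X, y_{<t})\big) \]
The language model attends over both visual and text tokens when generating.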
5.5 Image Preprocessing
Standard Image Preprocessing
Input: Raw image \( I \)
Output: Normalized tensor \( \hat{I} \in \mathbb{R}^{H' \times W' \times 3} \)
- Resize to target resolution (e.g., 336x336) with bicubic interpolation
- Convert to float32, scale to \([0, 1]\)
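- Normalize per channel: \( \hat{I}_c = \frac{I_c - \mu_c}{\sigma_c} \), using the statistics the vision encoder was trained with. The ImageNet values are \( \mu = [0.485, 0.456, 0.406] \), \( \sigma = [0.229, 0.224, 0.225] \); CLIP checkpoints ship their own, slightly different constants.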
- Extract \( N = (H'/P)^2 \) non-overlapping patches
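The scaling, normalization, and patch-count steps can be sketched in a few lines of Python. The resize step is left out because it needs an image library. The helper names are hypothetical, and the constants are the ImageNet values listed above, so swap in the encoder's own statistics where they differ.

```python
MEAN = (0.485, 0.456, 0.406)  # ImageNet per-channel mean
STD = (0.229, 0.224, 0.225)   # ImageNet per-channel standard deviation


def normalize_pixel(rgb_uint8, mean=MEAN, std=STD):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple((value / 255.0 - m) / s for value, m, s in zip(rgb_uint8, mean, std))


def num_patches(size, patch_size):
    """Patches in a square size x size image; any remainder pixels are dropped."""
    return (size // patch_size) ** 2


print(normalize_pixel((255, 255, 255)))  # brightest pixel
print(normalize_pixel((0, 0, 0)))        # darkest pixel
print(num_patches(336, 14), num_patches(384, 14))
```

Note that 384 is not a multiple of 14, so the 729-patch figure for the 384-pixel encoder corresponds to a 27 x 27 grid after integer division.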
6. Zig Implementation
6.1 Configuration Structs
```zig
pub const VisionConfig = struct {
    image_size: u32 = 336,
    patch_size: u32 = 14,
    d_vision: u32 = 1024,
    n_layers: u32 = 24,
    n_heads: u32 = 16,
    d_ff: u32 = 4096,
    norm_eps: f32 = 1e-6,
    activation: ActivationType = .quick_gelu,

    /// Number of patches in the square grid (integer division drops any remainder).
    pub fn numPatches(self: VisionConfig) u32 {
        const grid = self.image_size / self.patch_size;
        return grid * grid;
    }
};

pub const ProjectionConfig = struct {
    d_vision: u32,
    d_language: u32,
    d_hidden: ?u32 = null, // null = use d_language
    projection_type: enum { linear, mlp, cross_attention } = .mlp,
};

pub const MultiModalConfig = struct {
    vision: VisionConfig,
    projection: ProjectionConfig,
    // Struct fields cannot be `anytype`. `LanguageModelConfig` is a hypothetical
    // name for the concrete language-model config type (LLaMA, Gemma, etc.).
    language_model: LanguageModelConfig,
    max_visual_tokens: u32 = 576,
};
```
6.2 Patch Embedding
```zig
pub const PatchEmbedding = struct {
    patch_size: u32,        // side length P of each square patch
    proj: Linear,           // [patch_size^2 * 3, d_vision]
    cls_token: Tensor(f32), // [1, d_vision]
    pos_embed: Tensor(f32), // [1 + n_patches, d_vision]

    pub fn forward(self: *PatchEmbedding, image: Tensor(f32)) !Tensor(f32) {
        // Split the image into flattened patches: [n_patches, patch_size^2 * 3]
        const patches = extractPatches(image, self.patch_size);
        // Project each patch to d_vision: [n_patches, d_vision]
        const embeddings = try self.proj.forward(patches);
        // Prepend [CLS] token: [1 + n_patches, d_vision]
        const with_cls = try prependToken(self.cls_token, embeddings);
        // Add positional embeddings
        return elementWiseAdd(with_cls, self.pos_embed);
    }
};
```
6.3 Vision-Language Projection
```zig
pub const VisionLanguageProjection = struct {
    w1: Linear, // [d_vision, d_hidden]
    w2: Linear, // [d_hidden, d_language]

    pub fn forward(self: *VisionLanguageProjection, vis_features: Tensor(f32)) !Tensor(f32) {
        // 2-layer MLP: GELU(x @ W1 + b1) @ W2 + b2
        const hidden = gelu(try self.w1.forward(vis_features));
        return self.w2.forward(hidden);
    }
};
```
6.4 Multi-Modal Forward Pass
```zig
pub const MultiModalModel = struct {
    vision_encoder: VisionTransformer,
    projection: VisionLanguageProjection,
    language_model: LLaMAModel,
    config: MultiModalConfig,

    pub fn forward(
        self: *MultiModalModel,
        image: ?Tensor(f32), // null for text-only
        text_tokens: []const u32,
    ) ![]f32 {
        const combined_embeds = if (image) |img| blk: {
            // 1. Encode image through ViT.
            //    Assumes the encoder returns only the N patch features; if its
            //    output still includes the [CLS] row, that row must be dropped
            //    for N to match max_visual_tokens.
            const vis_features = try self.vision_encoder.forward(img);
            // 2. Project to language embedding space
            const vis_tokens = try self.projection.forward(vis_features);
            // 3. Get text embeddings
            const text_embeds = self.language_model.embedding.lookup(text_tokens);
            // 4. Concatenate: [visual_tokens, text_tokens]
            break :blk try concatenate(vis_tokens, text_embeds);
        } else self.language_model.embedding.lookup(text_tokens);

        // 5. Process through language model decoder
        return self.language_model.forwardEmbeddings(combined_embeds);
    }
};
```
7. Variants
| Model | Vision Encoder | LLM | Projection | Year |
|---|---|---|---|---|
| LLaVA | CLIP ViT-L/14 | Vicuna-13B | Linear | 2023 [1] |
| LLaVA-1.5 | CLIP ViT-L/14 | Vicuna-7B/13B | 2-layer MLP | 2023 |
| LLaVA-NeXT | CLIP ViT-L/14 | Various | MLP | 2024 |
| Qwen-VL | ViT-bigG | Qwen-7B | Cross-Attention | 2023 |
| InternVL | InternViT-6B | InternLM | QLLaMA | 2024 |
| PaliGemma | SigLIP | Gemma-2B | Linear | 2024 |
| Llama 3.2 Vision | ViT | Llama 3.2 | Cross-Attention | 2024 |
8. Educational Value
What Multi-Modal Models Teach
- Modality as token type: The fundamental insight is that images, when patched and projected, are just another type of token. The language model does not know whether a token came from text or an image -- it processes the same embedding vectors. This unification demystifies multi-modal AI.
- Vision Transformers: Understanding ViT requires the same transformer knowledge used for text models, applied to a different domain. The patch extraction step makes the connection between convolutional receptive fields and attention explicit.
- Cross-modal alignment: The projection layer is a lesson in representation alignment -- mapping two independently learned embedding spaces into a shared space where visual and textual concepts can interact. This is a core concept in multi-modal learning.
- Modular architecture design: The vision encoder, projection, and language model are independent components. Students learn how to compose pre-trained modules, a practical skill in modern ML engineering.
- Training stages: Multi-modal models are typically trained in stages. LLaVA, for example, first trains only the projection while the vision encoder and language model stay frozen, then fine-tunes the projection and language model together [1]. This progressive training strategy teaches efficient use of compute and the importance of curriculum design.
9. References
1. Liu, H. et al. "Visual Instruction Tuning." NeurIPS, 2023.
2. Dosovitskiy, A. et al. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." ICLR, 2021.
3. Radford, A. et al. "Learning Transferable Visual Models From Natural Language Supervision." ICML, 2021.
4. Zhai, X. et al. "Sigmoid Loss for Language Image Pre-Training." ICCV, 2023.