Vision-Language Models: Architecture and Implementation

Vision-Language Models (VLMs) combine computer vision and natural language processing to create systems that understand and reason across modalities. These models process visual information alongside text, enabling tasks such as medical image analysis, visual question answering, and the multimodal reasoning needed for clinical decision support.

Core Architecture

┌─────────────────────────────────────────────────────────┐
│                     Input Layer                         │
├──────────────────────┬──────────────────────────────────┤
│                      │                                  │
│    Image Input       │         Text Input               │
│        ↓             │              ↓                   │
│ ┌──────────────┐     │     ┌──────────────┐             │
│ │Image Encoder │     │     │Text Tokenizer│             │
│ └──────┬───────┘     │     └──────┬───────┘             │
│        ↓             │              ↓                   │
│ ┌──────────────┐     │     ┌──────────────┐             │
│ │Vision Trans- │     │     │Text Embedding│             │
│ │former (ViT)  │     │     │    Layer     │             │
│ └──────┬───────┘     │     └──────┬───────┘             │
│        ↓             │              ↓                   │
│ ┌──────────────┐     │     ┌──────────────┐             │
│ │Image Feature │     │     │Text Feature  │             │
│ │  Extraction  │     │     │  Extraction  │             │
│ └──────┬───────┘     │     └──────┬───────┘             │
│        └─────────────┴─────────────┘                    │
│                      ↓                                  │
│         ┌───────────────────────┐                       │
│         │   Cross-Modal Fusion  │                       │
│         │      (Attention)      │                       │
│         └───────────┬───────────┘                       │
│                     ↓                                   │
│         ┌───────────────────────┐                       │
│         │  Multimodal Encoder   │                       │
│         │   (Transformer)       │                       │
│         └───────────┬───────────┘                       │
│                     ↓                                   │
│         ┌───────────────────────┐                       │
│         │    Output Head(s)     │                       │
│         └───────────┬───────────┘                       │
│                     ↓                                   │
│              Task Output                                │
│    (Caption/Answer/Classification)                      │
└─────────────────────────────────────────────────────────┘

Key Components

Vision Encoder

Processes raw images into feature representations. Most modern VLMs use Vision Transformers (ViT) that:

  • Divide images into patches (typically 16x16 or 32x32 pixels)
  • Treat each patch as a token
  • Apply self-attention across patches
  • Output visual embeddings
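
A minimal PyTorch sketch of the patch-embedding step described above; the patch size, channel count, embedding dimension, and module name are illustrative assumptions rather than any particular model's values.

import torch.nn as nn

class PatchEmbedding(nn.Module):          # hypothetical module name
    def __init__(self, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        # A strided convolution splits the image into non-overlapping patches
        # and projects each patch to an embedding vector in one step.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, images):                  # (B, 3, H, W)
        x = self.proj(images)                   # (B, D, H/16, W/16)
        return x.flatten(2).transpose(1, 2)     # (B, num_patches, D)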

Text Encoder

Converts text into embeddings using:

  • Tokenization (splitting text into subwords or words)
  • Embedding layers that map tokens to vectors
  • Positional encodings for sequence order
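
A hedged sketch of the embedding stage: token embeddings plus learned positional embeddings. The vocabulary size, maximum length, and dimensions are placeholders, and tokenization is assumed to happen upstream.

import torch
import torch.nn as nn

class TextEmbedding(nn.Module):           # hypothetical module name
    def __init__(self, vocab_size=32000, max_len=512, embed_dim=768):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, embed_dim)   # token id -> vector
        self.pos_emb = nn.Embedding(max_len, embed_dim)        # position -> vector

    def forward(self, token_ids):               # (B, L) integer token ids
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.token_emb(token_ids) + self.pos_emb(positions)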

Cross-Modal Fusion

Combines visual and text features through:

  • Cross-attention mechanisms where text queries attend to image features
  • Concatenation or projection of features into shared space
  • Learnable alignment between modalities
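
The cross-attention pattern above can be sketched as follows, assuming text features act as queries over image features; the dimensions, head count, and class name are illustrative.

import torch.nn as nn

class CrossModalFusion(nn.Module):        # hypothetical module name
    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats, image_feats):
        # Text tokens are the queries; image patches supply keys and values.
        fused, _ = self.cross_attn(query=text_feats,
                                   key=image_feats, value=image_feats)
        return self.norm(text_feats + fused)    # residual connection + layer norm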

Multimodal Encoder

Processes combined features using:

  • Stacked transformer layers
  • Self-attention and feed-forward networks
  • Layer normalization and residual connections
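
As a rough sketch, the fused features could be run through a stack of standard transformer encoder layers; the depth and width below are assumptions, not a specific model's configuration.

import torch.nn as nn

# 12 pre-norm transformer layers with self-attention, feed-forward blocks,
# and residual connections, applied to the fused token sequence.
multimodal_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, dim_feedforward=3072,
                               batch_first=True, norm_first=True),
    num_layers=12,
)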

Common Architectures

CLIP-based Models

  • Dual-encoder design with separate image and text encoders
  • Contrastive training aligns both modalities in a shared embedding space (see the scoring sketch at the end of this section)
  • Examples: CLIP, ALIGN

Unified Transformer Models

  • A single transformer processes image patches and text tokens jointly
  • Early fusion through shared self-attention layers
  • Examples: ViLT, BEiT-3

Encoder-Decoder Models

  • Encoder processes image and text
  • Decoder generates output sequences
  • Examples: SimVLM, CoCa, GIT
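
To make the CLIP-style dual-encoder design concrete, here is a hedged scoring sketch: both encoders are assumed to return fixed-size embeddings, which are L2-normalized and compared by cosine similarity.

import torch.nn.functional as F

def clip_similarity(image_encoder, text_encoder, images, token_ids):
    img = F.normalize(image_encoder(images), dim=-1)     # (B, D) image embeddings
    txt = F.normalize(text_encoder(token_ids), dim=-1)   # (B, D) text embeddings
    return img @ txt.T                                   # (B, B) cosine similarities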

Training Approaches

Pre-training Tasks

  1. Image-Text Matching (ITM)

    • Binary classification: does image match text?
    • Hard negative mining for challenging examples
    • Cross-entropy loss with balanced sampling
  2. Masked Language Modeling (MLM)

    • Mask 15% of text tokens
    • Predict masked tokens using image context
    • Enables bidirectional language understanding
  3. Image-Text Contrastive Learning (ITC)

    • InfoNCE loss for representation alignment
    • Temperature-scaled similarity matching
    • Large batch sizes (32K+) for diverse negatives
    • Mathematical formulation: L_ITC = -log( exp(sim(I, T)/τ) / Σ_T' exp(sim(I, T')/τ) ), where the sum runs over all texts T' in the batch and τ is the temperature (a loss sketch follows this list)
  4. Image Captioning

    • Autoregressive text generation
    • Teacher forcing during training
    • Beam search or nucleus sampling for inference
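
The ITC objective above can be written as a symmetric InfoNCE loss over a batch; the sketch below assumes L2-normalized image and text embeddings, and the temperature value is illustrative.

import torch
import torch.nn.functional as F

def itc_loss(image_emb, text_emb, temperature=0.07):
    logits = image_emb @ text_emb.T / temperature    # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal; contrast each image against every
    # text in the batch and vice versa, then average the two directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2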

Advanced Training Objectives

  1. Masked Image Modeling (MIM)

    • Reconstruct masked image patches
    • Enhances visual representation learning
    • Often uses discrete visual tokens (VQ-VAE)
  2. Visual Grounding

    • Predict bounding boxes for text mentions
    • Requires region-level annotations
    • IoU-based loss functions
  3. Next Token Prediction

    • Unified generative objective
    • Enables zero-shot task transfer
    • Used in models like Flamingo, BLIP-2
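
A hedged sketch of the next-token-prediction objective as used for caption generation: decoder logits are shifted against the target token ids and scored with cross-entropy. Shapes and the padding id are assumptions.

import torch.nn.functional as F

def caption_lm_loss(logits, token_ids, pad_id=0):
    # logits: (B, L, vocab) from the decoder; token_ids: (B, L) ground truth.
    # Teacher forcing: predict token t+1 from the prefix ending at token t.
    shift_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
    shift_labels = token_ids[:, 1:].reshape(-1)
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=pad_id)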

Technical Implementation Details

Model Scaling

Parameter Distribution

  • Vision encoder: 300M-5B parameters (ViT-L to ViT-G)
  • Language model: 1B-70B parameters
  • Cross-modal layers: 100M-1B parameters
  • Typical total: 2B-80B parameters

Computational Requirements

  • Training: 1000-10000 GPU hours (A100/H100)
  • Inference: 10-100ms per image-text pair
  • Memory: 8-80GB depending on model size
  • Quantization: INT8/INT4 reduces memory 2-4x
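
A back-of-envelope check on the quantization figure, using a hypothetical 7B-parameter model: weight memory scales with bytes per parameter, so INT8 roughly halves and INT4 roughly quarters the FP16 footprint (weights only, decimal GB).

params = 7e9                                    # hypothetical 7B-parameter VLM
for dtype, bytes_per_param in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{dtype}: {params * bytes_per_param / 1e9:.1f} GB")  # 14.0 / 7.0 / 3.5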

Optimization Techniques

Efficient Attention

# Flash-attention-style tiled attention (simplified, runnable PyTorch sketch);
# queries are processed in blocks so the full attention matrix never
# materializes. The real FlashAttention kernel also tiles K/V with an online softmax.
import math
import torch

def flash_attention(Q, K, V, block_size=128):
    d = Q.shape[-1]
    out = torch.empty_like(Q)
    for start in range(0, Q.shape[0], block_size):
        q = Q[start:start + block_size]                       # query tile
        attn = torch.softmax(q @ K.T / math.sqrt(d), dim=-1)  # scores vs. all keys
        out[start:start + block_size] = attn @ V
    return out

Gradient Checkpointing

  • Trade compute for memory
  • Re-compute activations during backward pass
  • Enables larger batch sizes
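
A minimal sketch of this idea with torch.utils.checkpoint, assuming `blocks` is a list of transformer layers; each layer's activations are recomputed during the backward pass instead of being stored.

from torch.utils.checkpoint import checkpoint

def forward_with_checkpointing(blocks, x):
    # `blocks` is assumed to be an iterable of transformer layers (nn.Module).
    for block in blocks:
        x = checkpoint(block, x, use_reentrant=False)  # recompute in backward
    return x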

Mixed Precision Training

  • FP16/BF16 for forward pass
  • FP32 for optimizer states
  • Dynamic loss scaling
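
A minimal mixed-precision training step using torch.cuda.amp; the model, optimizer, loss function, and batch are assumed to exist elsewhere, and the dtype choice is illustrative.

import torch

scaler = torch.cuda.amp.GradScaler()            # dynamic loss scaling

def train_step(model, optimizer, images, texts, loss_fn):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(dtype=torch.float16):   # FP16 forward pass
        loss = loss_fn(model(images, texts))
    scaler.scale(loss).backward()               # scale to avoid FP16 underflow
    scaler.step(optimizer)                      # unscales grads, skips step on overflow
    scaler.update()
    return loss.item()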

Medical Domain Adaptations

Domain-Specific Pre-training

  1. Medical Image Augmentations

    • Intensity variations for X-rays
    • Anatomical structure preservation
    • Realistic noise injection (a pipeline sketch follows this list)
  2. Clinical Text Processing

    • Medical terminology tokenization
    • Abbreviation expansion
    • UMLS concept linking
  3. Specialized Architectures

    • Region-of-interest attention
    • Multi-scale feature fusion
    • Uncertainty quantification heads
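
A hedged sketch of an X-ray augmentation pipeline with torchvision; the specific transforms and ranges are illustrative assumptions chosen to vary intensity and inject mild noise while preserving anatomical structure.

import torch
from torchvision import transforms

xray_augment = transforms.Compose([
    transforms.RandomAffine(degrees=5, translate=(0.02, 0.02)),   # mild geometry only
    transforms.ColorJitter(brightness=0.2, contrast=0.2),         # intensity variation
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x + 0.01 * torch.randn_like(x)),  # noise injection
])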

Applications

Core Tasks

  • Visual Question Answering: Answer questions about images
  • Image Captioning: Generate text descriptions
  • Visual Reasoning: Solve visual puzzles and problems
  • Image-Text Retrieval: Find matching images or text

Advanced Applications

  • Visual Dialog: Multi-turn conversations about images
  • Visual Grounding: Locate objects mentioned in text
  • Scene Understanding: Comprehensive analysis of visual scenes
  • Document Analysis: Extract information from documents and charts

Security and Robustness

VLMs face significant security challenges from adversarial attacks: