Large Language Models: From Architecture to Evaluation
Large Language Models (LLMs) have transformed how machines process text—and now they’re learning to see. In this guide, we:
- Show how we build chat-capable LLMs.
- Walk through instruction finetuning.
- Use Meta’s LLaMA as a concrete example of a multimodal LLM.
- Cover key evaluation metrics and best practices.
1. From Text to Conversation
We start by teaching a model to predict words, then turn it into a helpful assistant.
1.1 Pretraining on Text
- Data: Hundreds of billions of tokens from books, articles, code.
- Task: Predict the next token in each sequence.
- Model: A stack of transformer decoder blocks with self-attention and feed-forward layers.
- Scale:
  - Embedding size (d_model), number of layers (L), and attention heads (H) set the model's capacity.
  - Doubling d_model roughly quadruples total parameters (see the rough estimate after this list).
  - Leading models range from 10 B to over 1 T parameters.
- Context window:
  - Early: 2 K tokens (e.g., GPT-3)
  - Now: 100 K+ tokens for long documents
  - Attention cost grows as $O(n^2)$, so long contexts need memory tricks.
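To make the scaling claim concrete, here is a back-of-envelope estimate. This is a rough sketch: the 12 × d_model² weights-per-layer rule (attention plus a 4×-wide feed-forward block) and the 50 K vocabulary are illustrative assumptions, not figures from any particular model.

```python
# Back-of-envelope parameter count for a decoder-only transformer.
# Assumes ~12 * d_model^2 weights per layer (attention: 4 * d_model^2,
# feed-forward with 4x expansion: 8 * d_model^2); ignores biases and norms.

def approx_params(d_model: int, n_layers: int, vocab_size: int = 50_000) -> int:
    per_layer = 12 * d_model ** 2        # attention + feed-forward weights
    embeddings = vocab_size * d_model    # token embedding matrix
    return n_layers * per_layer + embeddings

print(f"{approx_params(4096, 32):,}")    # ~6.6 B parameters
print(f"{approx_params(8192, 32):,}")    # ~26.2 B: doubling d_model ~quadruples the total
```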
Tip: Use mixed precision (FP16 or BF16) to save GPU memory and speed up training.
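The next-token objective and the mixed-precision tip fit together in a few lines. Below is a minimal sketch in PyTorch, assuming a hypothetical causal LM `model` that returns per-position logits; the optimizer setup and data pipeline are elided.

```python
# Minimal next-token pretraining step with BF16 mixed precision (PyTorch).
import torch
import torch.nn.functional as F

def train_step(model, optimizer, tokens):            # tokens: (batch, seq_len) int64
    inputs, targets = tokens[:, :-1], tokens[:, 1:]  # each position predicts the next token
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits = model(inputs)                       # (batch, seq_len - 1, vocab)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1))
    loss.backward()                                  # BF16 needs no GradScaler, unlike FP16
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```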
1.2 Instruction Finetuning
After pretraining, the model knows language patterns but not how to follow requests. We fix that by:
- Collecting examples
  - Human-written prompts paired with ideal responses (translate, summarize, answer questions).
- Training
  - Continue cross-entropy training on the response tokens, conditioning on the instruction (a loss-masking sketch follows this list).
  - Optionally apply RLHF (reinforcement learning from human feedback).
- Result
  - The model asks for clarification when a request is unclear.
  - It refuses harmful or off-limits requests.
  - It shifts from “next-word guessing” to “helpful assistant.”
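The loss masking mentioned above is the core trick: compute cross-entropy only on the response tokens, so the instruction serves purely as context. A minimal sketch, assuming a Hugging Face-style `tokenizer` and a hypothetical `model` that returns logits:

```python
# Instruction-finetuning loss masking: train only on response tokens.
import torch
import torch.nn.functional as F

IGNORE = -100  # positions with this label are excluded from the loss

def build_example(tokenizer, instruction: str, response: str):
    prompt_ids = tokenizer(instruction, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]
    input_ids = prompt_ids + response_ids
    labels = [IGNORE] * len(prompt_ids) + response_ids   # mask the instruction part
    return torch.tensor([input_ids]), torch.tensor([labels])

def finetune_loss(model, input_ids, labels):
    logits = model(input_ids)                 # (1, seq_len, vocab)
    shift_logits = logits[:, :-1, :]          # each position predicts the *next* token
    shift_labels = labels[:, 1:]
    return F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                           shift_labels.reshape(-1),
                           ignore_index=IGNORE)
```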
2. Case Study: LLaMA as a Multimodal LLM
Meta’s LLaMA family illustrates how to add vision to a text model. LLaMA 3.2 processes images and text in one network.
2.1 Why Multimodal?
- Real-world tasks often mix text and images: recipes with photos, medical reports with scans, document Q&A with figures.
- A single model that handles both can share knowledge across modes.
2.2 LLaMA 3.2 Architecture
2.2.1 Vision Encoder
- Patch embedding
  - Split each image into a 32×32 grid of patches (see the sketch after this list).
  - Flatten each patch into a 1,280-dim vector.
- Two-stage encoder
  - Local encoder (32 layers) captures textures and edges.
  - Global encoder (8 layers with gated attention) builds a wider context.
- Multi-scale features
  - Extract outputs from layers 3, 7, 15, 23, and 30 for richer signals.
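The patch-embedding step can be sketched in a few lines. This is an illustration, not LLaMA's actual implementation: the 32×32 grid and 1,280-dim width follow the numbers above, while the 448-pixel input size and the strided-convolution trick are assumptions.

```python
# Minimal patch embedding: cut an image into a grid of patches and project
# each flattened patch to a d-dimensional vector.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=448, grid=32, in_ch=3, dim=1280):
        super().__init__()
        patch = img_size // grid                     # e.g. 448 / 32 = 14-pixel patches
        # A strided convolution implements "flatten each patch + linear projection" in one op.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                            # x: (batch, 3, H, W)
        x = self.proj(x)                             # (batch, dim, grid, grid)
        return x.flatten(2).transpose(1, 2)          # (batch, grid * grid, dim)

tokens = PatchEmbed()(torch.randn(1, 3, 448, 448))
print(tokens.shape)                                  # torch.Size([1, 1024, 1280])
```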
2.2.2 Text Backbone
- Based on LLaMA 3.1
- 40 transformer layers, hidden size 4,096
- Alternates self-attention and cross-attention every 5 layers
2.2.3 Cross-Modal Bridge
- Projection layer
  - Map 7,680-dim visual features into the 4,096-dim text space.
- Cross-attention
  - At key layers, text tokens attend to the projected image features.
- Gating
  - Control how much visual information flows in, preventing overload (a sketch of the full bridge follows the note below).
Note: By sharing transformer blocks, LLaMA keeps parameter growth modest while gaining vision skills.
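Here is a minimal sketch of such a gated cross-attention bridge. The 7,680-to-4,096 projection follows the dimensions quoted above; the head count, the tanh gate, and the zero initialization are illustrative choices, not LLaMA's exact design.

```python
# Gated cross-attention bridge: project visual features into the text width,
# let text tokens attend to them, and gate how much of the result is added back.
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, text_dim=4096, vision_dim=7680, n_heads=32):
        super().__init__()
        self.proj = nn.Linear(vision_dim, text_dim)            # 7,680 -> 4,096
        self.attn = nn.MultiheadAttention(text_dim, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))               # gate starts closed

    def forward(self, text_tokens, vision_feats):
        # text_tokens: (batch, T, 4096), vision_feats: (batch, V, 7680)
        kv = self.proj(vision_feats)
        attended, _ = self.attn(text_tokens, kv, kv)           # text queries image features
        return text_tokens + torch.tanh(self.gate) * attended  # gated residual mix-in
```

Starting the gate at zero means the model initially behaves like the text-only backbone and learns during training how much visual signal to admit.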
3. Evaluating LLMs: A Holistic View
We measure models on multiple fronts—accuracy, fairness, cost, and robustness.
3.1 Core Metrics
| Metric | What it measures | When to use |
|---|---|---|
| Perplexity | How well a model predicts text | Language modeling |
| ROUGE | Overlap of generated vs. reference text | Summarization, translation |

- Perplexity

  $$ \text{Perplexity} = \exp\Bigl(-\tfrac{1}{D}\sum_{i=1}^{D} \log P(t_i \mid t_{<i})\Bigr) $$

  Lower is better: fewer surprises.

- ROUGE-1

  $$ \text{ROUGE-1} = \frac{\text{Matched words}}{\text{Words in reference}} $$
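Both metrics can be computed directly from their definitions. A small sketch follows; the ROUGE-1 variant shown is plain unigram recall, with no stemming or stopword handling.

```python
# Perplexity from per-token log-probabilities, and ROUGE-1 as unigram recall.
import math
from collections import Counter

def perplexity(token_log_probs):
    """token_log_probs: list of log P(t_i | t_<i) for each of the D tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def rouge1_recall(candidate: str, reference: str) -> float:
    cand, ref = Counter(candidate.lower().split()), Counter(reference.lower().split())
    matched = sum(min(cand[w], ref[w]) for w in ref)   # clipped unigram matches
    return matched / max(sum(ref.values()), 1)

print(perplexity([-1.2, -0.3, -2.0, -0.7]))                    # ~2.86: lower means fewer surprises
print(rouge1_recall("the cat sat", "the cat sat on the mat"))  # 0.5
```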
3.2 HELM Framework
A broad evaluation groups metrics into three pillars:
- Efficiency
  - Training compute and energy use
  - Inference latency and throughput
- Alignment
  - Fairness across demographics
  - Bias and toxicity checks
- Capability
  - Task accuracy (F1, Exact Match, MRR)
  - Calibration (ECE; a sketch follows this list)
  - Robustness to noise, paraphrases, and typos
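Expected Calibration Error (ECE) is less familiar than accuracy, so it is worth spelling out: bin predictions by confidence and average the gap between each bin's accuracy and its mean confidence, weighted by bin size. A minimal sketch with an assumed 10-bin setup:

```python
# Expected Calibration Error: weighted average of |accuracy - confidence| per bin.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)       # 1.0 if prediction was right, else 0.0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap                  # weight by fraction of samples in the bin
    return ece

print(expected_calibration_error([0.9, 0.8, 0.6, 0.95], [1, 1, 0, 0]))
```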
4. Best Practices and Takeaways
- Pretrain at large scale, using diverse text.
- Finetune on clear, high-quality instruction data.
- Integrate vision via cross-attention—LLaMA shows one path.
- Evaluate broadly: don’t rely on a single number.
By following this roadmap, you build LLMs that understand text, images, and user needs—while staying efficient, fair, and robust.