Large Language Models: From Architecture to Evaluation
Large Language Models (LLMs) have transformed how machines process text—and now they’re learning to see. In this guide, we:
- Show how we build chat-capable LLMs.
- Walk through instruction finetuning.
- Use Meta’s LLaMA as a concrete example of a multimodal LLM.
- Cover key evaluation metrics and best practices.
1. From Text to Conversation
We start by teaching a model to predict words, then turn it into a helpful assistant.
1.1 Pretraining on Text
- Data: Hundreds of billions of tokens from books, articles, code.
- Task: Predict the next token in each sequence.
- Model: A stack of transformer decoder blocks with self-attention and feed-forward layers.
- Scale:
  - Embedding size (d_model), number of layers (L), and attention heads (H) set the model's capacity.
  - Doubling d_model roughly quadruples total parameters (see the rough estimate after this list).
  - Leading models range from 10 B to over 1 T parameters.
- Context window:
  - Early: 2 K tokens (e.g., GPT-3)
  - Now: 100 K+ tokens for long documents
  - Attention cost grows as $O(n^2)$, so long contexts need memory tricks.
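To make the scaling claim concrete, here is a back-of-envelope estimate. This is a rough sketch: the 12 × d_model² weights-per-layer rule (attention plus a 4×-wide feed-forward block) and the 50 K vocabulary are illustrative assumptions, not figures from any particular model.

```python
# Back-of-envelope parameter count for a decoder-only transformer.
# Assumes ~12 * d_model^2 weights per layer (attention: 4 * d_model^2,
# feed-forward with 4x expansion: 8 * d_model^2); ignores biases and norms.

def approx_params(d_model: int, n_layers: int, vocab_size: int = 50_000) -> int:
    per_layer = 12 * d_model ** 2        # attention + feed-forward weights
    embeddings = vocab_size * d_model    # token embedding matrix
    return n_layers * per_layer + embeddings

print(f"{approx_params(4096, 32):,}")    # ~6.6 B parameters
print(f"{approx_params(8192, 32):,}")    # ~26.2 B: doubling d_model ~quadruples the total
```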
Tip: Use mixed precision (FP16 or BF16) to save GPU memory and speed up training.
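The next-token objective and the mixed-precision tip fit together in a few lines. Below is a minimal sketch in PyTorch, assuming a hypothetical causal LM `model` that returns per-position logits; the optimizer setup and data pipeline are elided.

```python
# Minimal next-token pretraining step with BF16 mixed precision (PyTorch).
import torch
import torch.nn.functional as F

def train_step(model, optimizer, tokens):            # tokens: (batch, seq_len) int64
    inputs, targets = tokens[:, :-1], tokens[:, 1:]  # each position predicts the next token
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits = model(inputs)                       # (batch, seq_len - 1, vocab)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1))
    loss.backward()                                  # BF16 needs no GradScaler, unlike FP16
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```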
1.2 Instruction Finetuning
After pretraining, the model knows language patterns but not how to follow requests. We fix that by:
- Collecting examples
  - Human-written prompts paired with ideal responses (translate, summarize, answer questions).
- Training
  - Continue cross-entropy training on the response tokens, conditioning on the instruction (a loss-masking sketch follows this list).
  - Optionally apply RLHF (reinforcement learning from human feedback).
- Result
  - The model asks for clarification when a request is unclear.
  - It refuses harmful or off-limits requests.
  - It shifts from “next-word guessing” to “helpful assistant.”
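The loss masking mentioned above is the core trick: compute cross-entropy only on the response tokens, so the instruction serves purely as context. A minimal sketch, assuming a Hugging Face-style `tokenizer` and a hypothetical `model` that returns logits:

```python
# Instruction-finetuning loss masking: train only on response tokens.
import torch
import torch.nn.functional as F

IGNORE = -100  # positions with this label are excluded from the loss

def build_example(tokenizer, instruction: str, response: str):
    prompt_ids = tokenizer(instruction, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]
    input_ids = prompt_ids + response_ids
    labels = [IGNORE] * len(prompt_ids) + response_ids   # mask the instruction part
    return torch.tensor([input_ids]), torch.tensor([labels])

def finetune_loss(model, input_ids, labels):
    logits = model(input_ids)                 # (1, seq_len, vocab)
    shift_logits = logits[:, :-1, :]          # each position predicts the *next* token
    shift_labels = labels[:, 1:]
    return F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                           shift_labels.reshape(-1),
                           ignore_index=IGNORE)
```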
2. Case Study: LLaMA as a Multimodal LLM
Meta’s LLaMA family illustrates how to add vision to a text model. LLaMA 3.2 processes images and text in one network.
2.1 Why Multimodal?
- Real-world tasks often mix text and images: recipes with photos, medical reports with scans, document Q&A with figures.
- A single model that handles both can share knowledge across modes.
2.2 LLaMA 3.2 Architecture
2.2.1 Vision Encoder
- Patch embedding
  - Split each image into a 32×32 grid of patches (see the sketch after this list).
  - Flatten each patch into a 1,280-dim vector.
- Two-stage encoder
  - Local encoder (32 layers) captures textures and edges.
  - Global encoder (8 layers with gated attention) builds a wider context.
- Multi-scale features
  - Extract outputs from layers 3, 7, 15, 23, and 30 for richer signals.
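The patch-embedding step can be sketched in a few lines. This is an illustration, not LLaMA's actual implementation: the 32×32 grid and 1,280-dim width follow the numbers above, while the 448-pixel input size and the strided-convolution trick are assumptions.

```python
# Minimal patch embedding: cut an image into a grid of patches and project
# each flattened patch to a d-dimensional vector.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=448, grid=32, in_ch=3, dim=1280):
        super().__init__()
        patch = img_size // grid                     # e.g. 448 / 32 = 14-pixel patches
        # A strided convolution implements "flatten each patch + linear projection" in one op.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                            # x: (batch, 3, H, W)
        x = self.proj(x)                             # (batch, dim, grid, grid)
        return x.flatten(2).transpose(1, 2)          # (batch, grid * grid, dim)

tokens = PatchEmbed()(torch.randn(1, 3, 448, 448))
print(tokens.shape)                                  # torch.Size([1, 1024, 1280])
```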
2.2.2 Text Backbone
- Based on LLaMA 3.1
- 40 transformer layers, hidden size 4,096
- Alternates self-attention and cross-attention every 5 layers
2.2.3 Cross-Modal Bridge
- Projection layer
  - Map 7,680-dim visual features into the 4,096-dim text space.
- Cross-attention
  - At key layers, text tokens attend to the projected image features.
- Gating
  - Control how much visual information flows in, preventing overload (a sketch of the full bridge follows the note below).
Note: By sharing transformer blocks, LLaMA keeps parameter growth modest while gaining vision skills.
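Here is a minimal sketch of such a gated cross-attention bridge. The 7,680-to-4,096 projection follows the dimensions quoted above; the head count, the tanh gate, and the zero initialization are illustrative choices, not LLaMA's exact design.

```python
# Gated cross-attention bridge: project visual features into the text width,
# let text tokens attend to them, and gate how much of the result is added back.
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, text_dim=4096, vision_dim=7680, n_heads=32):
        super().__init__()
        self.proj = nn.Linear(vision_dim, text_dim)            # 7,680 -> 4,096
        self.attn = nn.MultiheadAttention(text_dim, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))               # gate starts closed

    def forward(self, text_tokens, vision_feats):
        # text_tokens: (batch, T, 4096), vision_feats: (batch, V, 7680)
        kv = self.proj(vision_feats)
        attended, _ = self.attn(text_tokens, kv, kv)           # text queries image features
        return text_tokens + torch.tanh(self.gate) * attended  # gated residual mix-in
```

Starting the gate at zero means the model initially behaves like the text-only backbone and learns during training how much visual signal to admit.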
3. Evaluating LLMs: A Holistic View
We measure models on multiple fronts—accuracy, fairness, cost, and robustness.
3.1 Core Metrics
| Metric | What it measures | When to use |
|---|---|---|
| Perplexity | How well a model predicts text | Language modeling |
| ROUGE | Overlap of generated vs. reference text | Summarization, translation |

- Perplexity

  $$ \text{Perplexity} = \exp\Bigl(-\tfrac{1}{D}\sum_{i=1}^{D} \log P(t_i \mid t_{<i})\Bigr) $$

  Lower is better: fewer surprises.

- ROUGE-1

  $$ \text{ROUGE-1} = \frac{\text{Matched words}}{\text{Words in reference}} $$
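Both metrics can be computed directly from their definitions. A small sketch follows; the ROUGE-1 variant shown is plain unigram recall, with no stemming or stopword handling.

```python
# Perplexity from per-token log-probabilities, and ROUGE-1 as unigram recall.
import math
from collections import Counter

def perplexity(token_log_probs):
    """token_log_probs: list of log P(t_i | t_<i) for each of the D tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def rouge1_recall(candidate: str, reference: str) -> float:
    cand, ref = Counter(candidate.lower().split()), Counter(reference.lower().split())
    matched = sum(min(cand[w], ref[w]) for w in ref)   # clipped unigram matches
    return matched / max(sum(ref.values()), 1)

print(perplexity([-1.2, -0.3, -2.0, -0.7]))                    # ~2.86: lower means fewer surprises
print(rouge1_recall("the cat sat", "the cat sat on the mat"))  # 0.5
```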
3.2 HELM Framework
A broad evaluation groups metrics into three pillars:
- Efficiency
  - Training compute and energy use
  - Inference latency and throughput
- Alignment
  - Fairness across demographics
  - Bias and toxicity checks
- Capability
  - Task accuracy (F1, Exact Match, MRR)
  - Calibration (ECE; a sketch follows this list)
  - Robustness to noise, paraphrases, and typos
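Expected Calibration Error (ECE) is less familiar than accuracy, so it is worth spelling out: bin predictions by confidence and average the gap between each bin's accuracy and its mean confidence, weighted by bin size. A minimal sketch with an assumed 10-bin setup:

```python
# Expected Calibration Error: weighted average of |accuracy - confidence| per bin.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)       # 1.0 if prediction was right, else 0.0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap                  # weight by fraction of samples in the bin
    return ece

print(expected_calibration_error([0.9, 0.8, 0.6, 0.95], [1, 1, 0, 0]))
```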
4. Best Practices and Takeaways
- Pretrain at large scale, using diverse text.
- Finetune on clear, high-quality instruction data.
- Integrate vision via cross-attention—LLaMA shows one path.
- Evaluate broadly: don’t rely on a single number.
By following this roadmap, you build LLMs that understand text, images, and user needs—while staying efficient, fair, and robust.