Building AI Agents with Multimodal Models: Part 3

2026/01/08

Document Intelligence: Teaching AI to Read, Understand, and Remember PDFs

This is Part 3 of a 4-part series based on learnings from NVIDIA’s “Building AI Agents with Multimodal Models” certification.


The Challenge: Documents Are Messy

Think about a typical business document. It might have paragraphs of text laid out in columns, tables of figures, charts and diagrams, plus headers, footers, and captions tying it all together.

For humans, navigating this is intuitive. But for AI, a PDF is just a jumble of pixels or raw text blobs with no inherent structure. Teaching AI to extract meaningful information from documents is one of the most practical applications of multimodal AI.

This is where Optical Character Recognition (OCR) meets Retrieval Augmented Generation (RAG) to create intelligent document processing systems.


OCR: From Pixels to Text

The Analogy: Imagine you’re teaching a child to read. First, they learn to recognize individual letters. Then words. Then sentences. Eventually, they understand that text flows in certain directions and formats.

Optical Character Recognition follows a similar journey:

  1. Image Processing: Clean up the document image (remove noise, fix rotation)
  2. Layout Detection: Find regions of text, tables, images
  3. Character Recognition: Convert pixel patterns to characters
  4. Post-Processing: Apply language models to fix errors

Modern OCR goes far beyond simple text extraction. It understands document structure.
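
As a rough illustration of those four steps, here is a minimal sketch using Pillow and pytesseract (assuming Tesseract is installed locally). The cleanup and post-processing are deliberately simplistic stand-ins for what production systems do.

from PIL import Image, ImageOps
import pytesseract

# 1. Image processing: convert to grayscale
#    (a stand-in for real denoising and deskewing)
page = Image.open("scanned_page.png")
page = ImageOps.grayscale(page)

# 2 & 3. Layout detection + character recognition: Tesseract's page
#    segmentation finds text regions before recognizing characters
raw_text = pytesseract.image_to_string(page)

# 4. Post-processing: trivially normalize whitespace (real systems
#    apply language models to fix recognition errors)
text = " ".join(raw_text.split())
print(text[:200])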


The Document Processing Pipeline

NVIDIA’s training demonstrates a comprehensive pipeline for extracting multimodal data from PDFs. Let’s break it down.

Step 1: Document Partitioning

Before extracting content, you need to identify what’s in the document.

The Analogy: Before renovating a house, you walk through each room and catalog what’s there. “Living room has a couch, TV, and bookshelf. Kitchen has appliances and a dining table.”

Document partitioning creates an inventory of elements: titles and headers, narrative text blocks, tables, images and charts, each tagged with its type and location on the page.

Tools like the unstructured library do this automatically, identifying element types and their locations.
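
A minimal sketch of that step with the unstructured library might look like the following (the filename is hypothetical, and PDF partitioning needs the library's extra PDF dependencies):

from unstructured.partition.pdf import partition_pdf

# Partition the PDF into typed elements: Title, NarrativeText, Table, Image, ...
elements = partition_pdf(filename="quarterly_report.pdf")

# Build the "inventory": element type, page number, and a text preview
for el in elements:
    page = el.metadata.page_number
    print(f"page {page}: {el.category:15s} {el.text[:60]!r}")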

Step 2: Smart Chunking

Once you have text, you need to break it into digestible pieces for the AI. But how you chunk matters enormously.

Naive Chunking (Bad Approach): Split text every 500 characters regardless of content.

Problem: You might split a sentence mid-thought, separate a header from its content, or break apart related concepts.

Chunk 1: "The quarterly revenue reached $5.2 million, an increase of 23%"
Chunk 2: "compared to the previous quarter. Key drivers included..."

Semantic Chunking (Better Approach): Split at natural boundaries like titles, section breaks, or paragraph endings.

Chunk 1: [Header: Financial Results]
         "The quarterly revenue reached $5.2 million, an increase of 23%
          compared to the previous quarter."

Chunk 2: [Header: Key Drivers]
         "Key drivers included expanded market presence and new product
          launches in the enterprise segment..."

The semantic approach preserves meaning and context. When the AI retrieves this chunk later, it gets complete thoughts.
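
The unstructured library can do this kind of title-aware chunking for you. A minimal sketch, continuing from the partitioning step (the parameter values are illustrative, not tuned):

from unstructured.chunking.title import chunk_by_title

# Group elements into chunks that never cross a section title,
# keeping each chunk under a soft character limit
chunks = chunk_by_title(
    elements,                        # output of partition_pdf above
    max_characters=1000,             # hard cap per chunk
    combine_text_under_n_chars=200,  # merge tiny fragments with neighbors
)

for chunk in chunks[:3]:
    print(chunk.text[:80], "...")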

Step 3: Table Extraction

Tables are notoriously tricky. They encode relationships through spatial position, not linear text.

The Challenge:

| Product | Q1 Sales | Q2 Sales |
|---------|----------|----------|
| Widget  | $50,000  | $65,000  |
| Gadget  | $30,000  | $45,000  |

If you just extract text left-to-right, you get: “Product Q1 Sales Q2 Sales Widget $50,000 $65,000…”

This loses all the relational information. Which number belongs to which product?

The Solution: Use specialized table extraction models that understand grid structure. NVIDIA’s pipeline uses models like Microsoft’s Table Transformer to:

  1. Detect table regions in the document
  2. Identify rows and columns
  3. Extract cell contents with their positions
  4. Convert to structured formats (HTML, JSON)

The extracted HTML preserves structure:

<table>
  <tr><td>Product</td><td>Q1 Sales</td><td>Q2 Sales</td></tr>
  <tr><td>Widget</td><td>$50,000</td><td>$65,000</td></tr>
</table>
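
As a sketch of the detection half of that flow, here is roughly how Microsoft's Table Transformer can be called through Hugging Face transformers (the checkpoint name is the publicly released detection model; structure recognition and HTML conversion would be further steps on top):

import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

# Detect table regions on a page image
processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")

page = Image.open("page_3.png").convert("RGB")
inputs = processor(images=page, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Keep detections above a confidence threshold; boxes are (x0, y0, x1, y1)
target_sizes = torch.tensor([page.size[::-1]])
detections = processor.post_process_object_detection(
    outputs, threshold=0.9, target_sizes=target_sizes
)[0]
for box, score in zip(detections["boxes"], detections["scores"]):
    print(f"table at {box.tolist()} (confidence {score:.2f})")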

Step 4: Image and Chart Extraction

Documents often contain figures that carry critical information.

The Approach:

  1. Object Detection: Use models like YOLOX to find figures, charts, and diagrams
  2. Region Extraction: Crop these regions as separate images
  3. Metadata Preservation: Keep track of page number, position, and nearby text (captions)
  4. Visual Analysis: Optionally use Vision Language Models to describe the content

This enables queries like “Show me all the architecture diagrams in this documentation.”
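
Once a detector returns bounding boxes, the region extraction and metadata steps are straightforward. A sketch (the detector output format and coordinates here are made up for illustration):

from PIL import Image

page_number = 7
page = Image.open("page_7.png")

# Hypothetical detector output: (label, confidence, x0, y0, x1, y1)
detections = [("chart", 0.91, 100, 450, 500, 650)]

figures = []
for label, conf, x0, y0, x1, y1 in detections:
    crop = page.crop((x0, y0, x1, y1))           # region extraction
    crop.save(f"page{page_number}_{label}.png")
    figures.append({                             # metadata preservation
        "label": label,
        "confidence": conf,
        "page": page_number,
        "bbox": (x0, y0, x1, y1),
    })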


RAG: Retrieval Augmented Generation

Now you’ve extracted all this content. How do you make it useful?

The Analogy: Imagine you’re a researcher with a library of 10,000 books. When someone asks you a question, you don’t read all 10,000 books. You:

  1. Search the catalog for relevant books
  2. Pull those specific books off the shelf
  3. Read the relevant sections
  4. Synthesize an answer

RAG does exactly this with AI.

The RAG Pipeline

User Question
     │
     ▼
┌─────────────┐
│  Embedding  │ ← Convert question to vector
└─────────────┘
     │
     ▼
┌─────────────┐
│  Retrieval  │ ← Find similar chunks in vector database
└─────────────┘
     │
     ▼
┌─────────────┐
│   Context   │ ← Combine retrieved chunks
└─────────────┘
     │
     ▼
┌─────────────┐
│     LLM     │ ← Generate answer using context
└─────────────┘
     │
     ▼
   Answer

Step 1: Indexing (One-Time Setup)

Take all your extracted chunks and convert them to embeddings:

Chunk 1 ──> [Encoder] ──> [0.2, 0.8, 0.1, ...]
Chunk 2 ──> [Encoder] ──> [0.5, 0.3, 0.9, ...]
Chunk 3 ──> [Encoder] ──> [0.1, 0.7, 0.4, ...]
...

Store these embeddings in a vector database like Milvus, Pinecone, or FAISS.
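
A minimal sketch of the indexing step, using sentence-transformers as the encoder and FAISS as the vector store (the model name is one common choice, not necessarily the one used in the course):

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

chunks = ["The quarterly revenue reached $5.2 million ...",
          "Key drivers included expanded market presence ..."]

# Encode every chunk into a fixed-size vector
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(chunks, convert_to_numpy=True).astype("float32")

# Normalize so inner product == cosine similarity, then index
faiss.normalize_L2(embeddings)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)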

Step 2: Retrieval (At Query Time)

When a user asks a question:

  1. Convert the question to an embedding
  2. Find the K most similar chunks (using cosine similarity)
  3. Return those chunks as context

# 'encoder' and 'vector_db' stand in for your embedding model and vector store
question = "What was Q2 revenue?"
question_embedding = encoder.encode(question)
similar_chunks = vector_db.search(question_embedding, k=5)
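
In concrete terms, continuing the FAISS sketch from the indexing step, those three calls might look like this:

question = "What was Q2 revenue?"

# Embed the question with the same encoder used for the chunks
q = encoder.encode([question], convert_to_numpy=True).astype("float32")
faiss.normalize_L2(q)

# k nearest chunks by cosine similarity
scores, ids = index.search(q, 5)
similar_chunks = [chunks[i] for i in ids[0] if i != -1]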

Step 3: Generation

Feed the retrieved context plus the question to an LLM:

Context: [Retrieved chunks about Q2 revenue]
Question: What was Q2 revenue?

Answer: Based on the financial report, Q2 revenue was $65,000 for
the Widget product line and $45,000 for Gadgets, totaling $110,000.

The LLM generates an answer grounded in your actual documents, not its training data.
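
A sketch of how the prompt might be assembled before being sent to whatever LLM you use (the template wording is illustrative):

def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    # Ground the model in the retrieved context and ask it to stay there
    context = "\n\n".join(retrieved_chunks)
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt("What was Q2 revenue?", similar_chunks)
# 'prompt' is then sent to the LLM of your choice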


Object Detection with YOLOX

For intelligent document analysis, you need to detect where different elements are located.

The Model: NVIDIA provides specialized models like nv-yolox-page-elements trained specifically for document analysis.

What It Detects: page elements such as tables, charts, and titles, each returned as a labeled bounding box with a confidence score.

How It Works:

  1. Process each page as an image
  2. Model outputs bounding boxes with confidence scores
  3. Use boxes to crop and extract specific regions

Page Image ──> [YOLOX Model] ──> Detected Regions:
  • Table at (100, 200, 500, 400) - Confidence: 0.95
  • Chart at (100, 450, 500, 650) - Confidence: 0.89
  • Title at (50, 50, 400, 80) - Confidence: 0.97

This enables intelligent routing: text goes to OCR, tables go to table extractors, charts go to visual analysis models.
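
A sketch of that routing logic, with hypothetical handler functions standing in for the OCR, table-extraction, and chart-analysis models, and a PIL page image assumed as input:

# Hypothetical handlers; each would wrap the appropriate model
def run_ocr(region): ...
def extract_table(region): ...
def describe_chart(region): ...

ROUTES = {
    "text": run_ocr,
    "title": run_ocr,
    "table": extract_table,
    "chart": describe_chart,
}

def route(detections, page_image):
    results = []
    for label, confidence, box in detections:
        if confidence < 0.5:                   # skip low-confidence detections
            continue
        region = page_image.crop(box)          # box = (x0, y0, x1, y1)
        handler = ROUTES.get(label, run_ocr)   # default to OCR
        results.append((label, handler(region)))
    return results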


Handling Large Documents

Real documents can be hundreds of pages. Processing all at once is impractical.

The Solution: Batch processing with memory management.

# Process in batches of 10 pages; extract_pages, process_batch,
# save_results, and clear_memory are placeholders for your own
# pipeline functions
for start_page in range(0, total_pages, 10):
    end_page = min(start_page + 10, total_pages)
    batch = extract_pages(document, start_page, end_page)
    process_batch(batch)
    save_results(batch)
    clear_memory()  # e.g. drop intermediate references and run gc.collect()

Each batch is processed independently, results are saved, and memory is cleared before the next batch.


Practical Example: Processing a Technical Datasheet

Let’s walk through processing NVIDIA’s Grace-Blackwell datasheet (a real example from the training):

Input: 20-page PDF with specifications, architecture diagrams, and performance tables

Processing Steps:

  1. Partition: Identify 150+ elements across 20 pages
  2. Extract Text: Pull out 45 text blocks with semantic chunking
  3. Extract Tables: Identify 12 specification tables, convert to HTML
  4. Extract Figures: Locate 8 architecture diagrams
  5. Index: Embed all content into vector database
  6. Query: “What are the memory bandwidth specs?”

Result: The system retrieves the relevant table chunks and generates an accurate answer with source citations.


Key Takeaways

  1. Document processing is inherently multimodal: Text, tables, images all carry information

  2. Smart chunking preserves meaning: Semantic boundaries beat arbitrary character limits

  3. Tables need special handling: Spatial structure encodes relationships that linear text loses

  4. Object detection enables routing: YOLOX identifies what’s where so appropriate extractors can be used

  5. RAG grounds AI in your data: Retrieved context prevents hallucination and enables factual answers

  6. Batch processing handles scale: Process large documents in manageable chunks to control memory


When to Use Document RAG

Document RAG fits when answers must be grounded in your own material: internal knowledge bases, technical datasheets, contracts, and reports the model never saw during training. It is also the practical choice when documents are too long to fit in a context window, or when users need answers they can trace back to a cited source.

What’s Next?

In Part 4, we’ll explore the most exciting frontier: Video Understanding and Graph-RAG. You’ll learn how AI can watch, understand, and answer questions about video content, and how knowledge graphs enable complex reasoning that simple vector search cannot achieve.


This content is inspired by NVIDIA’s Deep Learning Institute course: Building AI Agents with Multimodal Models. For hands-on experience, consider enrolling in their official courses.