Why a 0.9B VLM can be a serious OCR engine

2026/01/03

In this post, I will discuss PaddleOCR-VL, focusing on what matters for OCR and document parsing: stable layout, high-resolution text capture, low error rates, and fast deployment. The paper's main claim is simple but important: you can get state-of-the-art document parsing from an ultra-compact vision-language model, if you design the system around the real constraints of OCR.

In a classic OCR stack, you detect regions with text, recognize the text with something like a CTC-based recognizer, and bolt on rules for tables, figures and so on. A VLM changes the contract. Instead of predicting characters from a cropped region, you prompt: "given pixels, generate a sequence that encodes the content I want." I was quick to draw a parallel with image captioning, but the deeper you go, the more the analogy breaks down: a caption can miss a few tick labels on an axis and still say "chart with sales rising" and be accurate. OCR is expected to be lossless: miss one character in a number and you can break retrieval, matching and QA.

The core architectural idea - decouple layout from recognition

The system has three stages:

  1. PP-DocLayoutV2 for layout detection and reading order

  2. PaddleOCR-VL-0.9B for element level recognition

  3. A lightweight post-processing step that builds Markdown and JSON outputs

The paper’s position is: do not ask the VLM to solve layout implicitly through generation. Make layout explicit with a fast detector plus ordering network, then let the VLM do what it is best at: recognition.
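To make the decoupling concrete, here is a minimal sketch of the three-stage contract. The function names (detect_layout, predict_reading_order, recognize_element, assemble_document) are placeholders I made up for illustration, not the library's actual API.

```python
# Hypothetical sketch of the decoupled pipeline; none of these functions
# are the real PaddleOCR-VL API, they just mirror the three-stage design.

def parse_page(page_image):
    # Stage 1: explicit layout -- detect elements and order them,
    # so the VLM never has to infer page structure on its own.
    elements = detect_layout(page_image)               # boxes + classes (text, table, formula, figure)
    ordered = predict_reading_order(elements)          # pointer-network-style ordering

    # Stage 2: element-level recognition -- one VLM call per cropped region.
    for el in ordered:
        crop = page_image.crop(el.bbox)
        el.content = recognize_element(crop, el.category)

    # Stage 3: lightweight post-processing -- stitch elements into a document.
    return assemble_document(ordered)                  # Markdown + JSON
```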

Now let's dig into each of these stages.

Stage 1: PP-DocLayoutV2

PP-DocLayoutV2 combines RT-DETR for detecting and classifying layout elements with a lightweight pointer network of 6 transformer layers for reading order prediction.

The ordering part has details that matter, because it is the backbone of the system: if reading order is wrong, the OCR can be perfect and the parsed document is still unusable.
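The paper only gives a high-level description of the ordering network (a small pointer-style module over the detected boxes), so here is a rough sketch of how such a component could look in PyTorch. The layer sizes, features and decoding heuristic are my assumptions for illustration, not the paper's actual design.

```python
import torch
import torch.nn as nn

class ReadingOrderPointer(nn.Module):
    """Toy pointer-style ordering head: embed detected boxes, run a small
    transformer encoder, and score pairwise 'i comes before j' relations.
    Dimensions are illustrative, not the paper's configuration."""

    def __init__(self, d_model=256, num_layers=6, num_classes=25):
        super().__init__()
        self.box_proj = nn.Linear(4, d_model)           # (x1, y1, x2, y2), normalized to [0, 1]
        self.cls_emb = nn.Embedding(num_classes, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.score = nn.Bilinear(d_model, d_model, 1)   # score(i, j) ~ P(i precedes j)

    def forward(self, boxes, classes):
        # boxes: (B, N, 4), classes: (B, N)
        h = self.encoder(self.box_proj(boxes) + self.cls_emb(classes))
        n = h.size(1)
        hi = h.unsqueeze(2).expand(-1, -1, n, -1)
        hj = h.unsqueeze(1).expand(-1, n, -1, -1)
        return self.score(hi, hj).squeeze(-1)           # pairwise precedence logits (B, N, N)

def decode_order(pairwise_logits):
    # Greedy approximation: sort elements by how many others they are
    # predicted to precede.
    wins = (pairwise_logits > 0).sum(dim=-1)            # (B, N)
    return torch.argsort(wins, dim=-1, descending=True)
```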

Stage 2: PaddleOCR-VL-0.9B

PaddleOCR-VL-0.9B follows a LLaVA-inspired structure: vision encoder, projector, language model. Instead of fixed-resolution resizing or tiling, the paper uses native dynamic high resolution preprocessing and a NaViT-style encoder initialized from Keye-VL, designed to support native-resolution inputs without distortion. The authors claim this yields fewer hallucinations and stronger performance on text-heavy tasks. This is a big deal for dense documents and drawings, where tiny glyph details decide correctness.
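To illustrate what native dynamic high resolution means in practice, here is a rough sketch of NaViT-style preprocessing: the image keeps its own aspect ratio and is only rounded to patch multiples (and uniformly downscaled if it exceeds a token budget), rather than being squashed into a fixed square. The patch size and patch budget below are my assumptions, not the model's real values.

```python
import math
from PIL import Image

PATCH = 14          # assumed ViT patch size
MAX_PATCHES = 4096  # assumed budget on visual tokens per image

def native_resolution_patches(img: Image.Image):
    """Round the image to patch multiples at (close to) native resolution,
    instead of distorting it into a fixed square. Sketch only."""
    w, h = img.size
    # If the image exceeds the patch budget, scale it down uniformly,
    # preserving aspect ratio (no anisotropic distortion).
    scale = min(1.0, math.sqrt(MAX_PATCHES * PATCH * PATCH / (w * h)))
    w, h = int(w * scale), int(h * scale)
    # Snap to patch multiples so the grid tiles exactly.
    w = max(PATCH, round(w / PATCH) * PATCH)
    h = max(PATCH, round(h / PATCH) * PATCH)
    img = img.resize((w, h))
    grid = (h // PATCH, w // PATCH)   # number of patches per axis
    return img, grid
```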

The projector is a randomly initialized 2-layer MLP with GELU, using a merge size of 2 to bridge vision features into the language embedding space efficiently. In plain terms: reduce the token burden before the decoder pays attention to everything. Autoregressive decoding cost is tied to decoder size, and the paper explicitly chooses ERNIE-4.5-0.3B for inference efficiency, adding 3D-RoPE for positional representation. The element recognizer is also built via post adaptation using pretrained weights: Keye-VL for the vision side and ERNIE-4.5-0.3B for the language side.
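Here is a rough sketch of what a 2-layer GELU MLP projector with merge size 2 could look like: every 2x2 block of vision tokens is concatenated into one wider token before projection, cutting the visual token count by 4x before the decoder has to attend over it. The hidden sizes are placeholders, not the model's real dimensions.

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Sketch of a 2-layer MLP projector with a 2x2 token merge.
    Dimensions are illustrative, not PaddleOCR-VL's actual sizes."""

    def __init__(self, vision_dim=1024, text_dim=1024, merge=2):
        super().__init__()
        self.merge = merge
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim * merge * merge, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, vis_tokens, grid_h, grid_w):
        # vis_tokens: (B, grid_h * grid_w, vision_dim)
        b, _, d = vis_tokens.shape
        m = self.merge
        x = vis_tokens.view(b, grid_h, grid_w, d)
        # Group each m x m neighborhood into a single, wider token.
        x = x.view(b, grid_h // m, m, grid_w // m, m, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (grid_h // m) * (grid_w // m), m * m * d)
        return self.mlp(x)   # (B, N/4, text_dim): 4x fewer tokens for the decoder
```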

Stage 3: Post processing

After layout and element recognition, PaddleOCR-VL runs a lightweight post processing module that aggregates outputs from both stages and formats the final result into structured Markdown and JSON.

This is where the system becomes a document parser instead of a bag of OCR strings: based on the paper's description, this stage merges the layout and recognition outputs, applies the predicted reading order, and serializes everything into structured Markdown and JSON.

One way to think about it: Stage 2 gives you "content", Stage 3 gives you "a document". If you care about RAG, this stage is not optional. The paper describes document parsing as a foundation for retrieval and downstream LLM use, especially when combined with RAG systems.
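Here is a minimal sketch of what "aggregate and format" can look like. The element categories, attribute names and exact Markdown mapping are my guesses for illustration, not the paper's specification.

```python
import json

def assemble_document(ordered_elements):
    """Turn ordered, recognized elements into Markdown plus a JSON sidecar.
    Sketch only; the real post-processing is the library's, not this."""
    md_parts, records = [], []
    for el in ordered_elements:
        if el.category == "title":
            md_parts.append(f"# {el.content}")
        elif el.category in ("table", "formula"):
            md_parts.append(el.content)          # already HTML / LaTeX from Stage 2
        elif el.category == "figure":
            md_parts.append(f"![figure]({el.image_path})")
        else:
            md_parts.append(el.content)          # plain text blocks
        records.append({
            "category": el.category,
            "bbox": el.bbox,                     # keeps layout grounding for downstream RAG
            "content": el.content,
        })
    return "\n\n".join(md_parts), json.dumps(records, ensure_ascii=False, indent=2)
```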

Training Approach

The VLM is trained in two stages.

The paper also describes a large scale data construction pipeline: over 30M samples collected via public acquisition and synthesis, refined using prompt driven labeling with larger models, plus cleaning to remove low quality or hallucinated annotations.

Inference

PaddleOCR-VL is also designed to run fast end to end. The paper describes multi-threaded asynchronous execution split into three parallel stages.

Data flows through queues. VLM batching triggers when the queue hits a threshold or when items have waited too long, so blocks from different pages can be aggregated for better parallelism. On their end-to-end benchmark, they report that with FastDeploy the system achieves 53.1 percent higher page throughput and 50.9 percent higher token throughput than MinerU2.5. In my experience, I saw around 45 seconds per page on engineering drawings.
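Below is a small sketch of the queue-plus-threshold batching pattern described above: a worker drains the queue either when enough element crops have accumulated or when the oldest item has waited too long, so crops from different pages can share a batch. The thresholds and function names are arbitrary values I picked for illustration.

```python
import queue
import threading
import time

BATCH_SIZE = 32      # assumed batch threshold
MAX_WAIT_S = 0.05    # assumed max wait before flushing a partial batch

crop_queue: "queue.Queue" = queue.Queue()

def vlm_worker(recognize_batch):
    """Drain the crop queue into batches by size or by age, then run the VLM.
    recognize_batch is a placeholder for the actual batched inference call."""
    while True:
        batch = [crop_queue.get()]          # block until at least one item arrives
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(crop_queue.get(timeout=remaining))
            except queue.Empty:
                break
        recognize_batch(batch)              # crops from different pages can share this batch

# The layout stage would run in other threads, pushing element crops, e.g.:
# threading.Thread(target=vlm_worker, args=(my_batched_infer,), daemon=True).start()
```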

I used PaddleOCR-VL for extracting key manufacturing information from engineering drawings, and in my analysis these same design choices are what make the model a good fit for that use case.

If you want to learn more about using this model on engineering drawings, see my blog post: OCR on Engineering Drawings with a 0.9B Vision-Language Model.