In this post, I will discuss PaddleOCR-VL, focusing on what matters most for OCR and document parsing: stable layout, high-resolution text capture, low error rates, and fast deployment. The paper's main claim is simple but important: you can get state-of-the-art document parsing with an ultra-compact vision language model, if you design the system around the real constraints of OCR.
In a classic OCR stack, the pipeline consists of mechanisms to detect regions with text and recognize that text using algorithms like CTC, with bolt-on rules for tables, figures, and so on. A VLM changes the contract. Instead of predicting characters from a cropped region, you prompt: "Given these pixels, generate a sequence that encodes the content I want." I was quick to draw a parallel with image captioning, but the deeper you go, the more the analogy breaks down: a captioning model can miss a few tick labels on an axis and still say "chart with sales rising" and be considered accurate. For OCR, the expectation is lossless output: miss one character in a number and it can break retrieval, matching, and QA.
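To make the contrast concrete, here is a minimal sketch of classic CTC-style greedy decoding: argmax per timestep, collapse repeats, drop blanks. The alphabet and logits are made up for illustration; a VLM instead generates the target sequence token by token, conditioned on the whole image.

```python
import numpy as np

# Hypothetical per-timestep logits from a classic CTC recognizer:
# shape (T, C), where index 0 is the CTC "blank" symbol.
alphabet = ["-", "4", "5", "7"]          # "-" is blank
logits = np.array([
    [0.1, 0.8, 0.05, 0.05],   # "4"
    [0.7, 0.1, 0.1,  0.1 ],   # blank
    [0.1, 0.1, 0.75, 0.05],   # "5"
    [0.1, 0.1, 0.7,  0.1 ],   # "5" again (repeat, collapsed)
    [0.1, 0.1, 0.05, 0.75],   # "7"
])

def ctc_greedy_decode(logits, alphabet, blank=0):
    """Greedy CTC decoding: argmax per step, collapse repeats, drop blanks."""
    best = logits.argmax(axis=1)
    out, prev = [], None
    for idx in best:
        if idx != blank and idx != prev:
            out.append(alphabet[idx])
        prev = idx
    return "".join(out)

print(ctc_greedy_decode(logits, alphabet))  # -> "457"
```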
The core architectural idea: decouple layout from recognition
The system has three stages:
PP-DocLayoutV2 for layout detection and reading order
PaddleOCR-VL-0.9B for element level recognition
A lightweight post-processing step builds Markdown and JSON outputs
The paper’s position is: do not ask the VLM to solve layout implicitly through generation. Make layout explicit with a fast detector plus ordering network, then let the VLM do what it is best at: recognition.
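To make that division of labor concrete, here is a rough, runnable sketch of how the three stages compose. The Element schema and the stub functions are mine, not the PaddleOCR API.

```python
from dataclasses import dataclass

@dataclass
class Element:
    box: tuple           # (x1, y1, x2, y2) in page coordinates
    cls: str             # "text", "table", "formula", "chart", ...
    order: int = -1      # reading-order index from Stage 1
    content: str = ""    # filled in by Stage 2

def detect_and_order(page_image) -> list:
    """Stage 1 stand-in: RT-DETR detector + pointer network (stubbed here)."""
    return [Element((50, 120, 550, 400), "table", order=1),
            Element((50, 40, 550, 90), "text", order=0)]

def recognize(crop, cls) -> str:
    """Stage 2 stand-in: the 0.9B VLM recognizes one cropped element."""
    return f"<recognized {cls}>"

def parse_page(page_image):
    elements = sorted(detect_and_order(page_image), key=lambda e: e.order)
    for el in elements:
        el.content = recognize(page_image, el.cls)   # real system crops el.box first
    return elements   # Stage 3 turns this ordered list into Markdown / JSON

print([(e.order, e.cls, e.content) for e in parse_page(None)])
```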

Now, if you are interested, let's dig into each of those stages.
Stage 1: PP-DocLayoutV2
PP-DocLayoutV2 combines RT-DETR for detecting and classifying layout elements with a lightweight pointer network (six transformer layers) for reading-order prediction.

The ordering part has details that matter:
it embeds proposals with absolute 2D positional encodings and class label embeddings
it adds a geometric bias mechanism inspired by Relation-DETR to model pairwise geometry
it predicts an N by N pairwise ordering matrix
it recovers a consistent reading order with a deterministic “win accumulation” decoding algorithm
This is the backbone of the system. If the reading order is wrong, the OCR can be perfect and the parsed document is still unusable.
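The paper names the decoding rule but does not spell it out; a Copeland-style reading of "win accumulation" would be: score each element by how many pairwise "comes before" decisions it wins, then sort by that score. Treat the sketch below as my interpretation, not the paper's exact algorithm.

```python
import numpy as np

def win_accumulation_order(pairwise: np.ndarray) -> list:
    """Decode a reading order from an N x N matrix where pairwise[i, j]
    is the predicted probability that element i comes before element j."""
    n = pairwise.shape[0]
    wins = (pairwise > 0.5).sum(axis=1)      # hard "wins" per element
    soft = pairwise.sum(axis=1)              # soft score as a tie-break
    return sorted(range(n), key=lambda i: (-wins[i], -soft[i]))

# Three elements: 0 should come first, then 2, then 1.
p = np.array([[0.5, 0.9, 0.8],
              [0.1, 0.5, 0.3],
              [0.2, 0.7, 0.5]])
print(win_accumulation_order(p))  # -> [0, 2, 1]
```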
Stage 2: PaddleOCR-VL-0.9B
PaddleOCR-VL-0.9B follows a LLaVA-inspired structure: vision encoder, projector, language model. Instead of fixed-resolution resizing or tiling, the paper uses native dynamic high-resolution preprocessing and a NaViT-style encoder initialized from Keye-VL, designed to support native-resolution inputs without distortion. The authors claim this yields fewer hallucinations and stronger performance on text-heavy tasks. This is a big deal for dense documents and drawings, where tiny glyph details decide correctness.
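The exact preprocessing is not detailed, but the usual NaViT-style recipe is to keep the native aspect ratio and snap height and width to patch multiples under a token budget, instead of squashing everything to a fixed square. A rough sketch of that idea; the patch size and token budget here are assumptions, not the paper's values.

```python
import math

def native_resolution_size(h: int, w: int, patch: int = 14,
                           max_tokens: int = 4096) -> tuple:
    """Pick a target (H, W) that keeps the native aspect ratio, is a multiple
    of the patch size, and stays roughly under a patch-count budget."""
    # Scale down only if the image would exceed the token budget.
    scale = min(1.0, math.sqrt(max_tokens * patch * patch / (h * w)))
    # Rounding can overshoot the budget slightly; a real implementation would clamp.
    new_h = max(patch, round(h * scale / patch) * patch)
    new_w = max(patch, round(w * scale / patch) * patch)
    return new_h, new_w

# A tall, dense drawing crop keeps its aspect ratio instead of being squashed.
print(native_resolution_size(2200, 1100))  # -> (1274, 630) with patch 14
```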

The projector is a randomly initialized 2-layer MLP with GELU, using a merge size of 2 to bridge vision features into the language embedding space efficiently. In plain terms: reduce the token burden before the decoder pays attention to everything. Autoregressive decoding cost is tied to decoder size, and the paper explicitly chooses ERNIE-4.5-0.3B for inference efficiency, adding 3D-RoPE for positional representation. The element recognizer is also built via post-adaptation of pretrained weights: Keye-VL for the vision side and ERNIE-4.5-0.3B for the language side.
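My reading of merge size 2 is that each 2x2 block of vision tokens is concatenated before the MLP, so the decoder sees a quarter of the tokens. A hedged PyTorch sketch of that shape arithmetic; the hidden dimensions are placeholders, not the paper's exact sizes.

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """2-layer MLP with GELU; merge_size=2 folds each 2x2 block of patch
    tokens into one token before projecting into the LM embedding space."""
    def __init__(self, vis_dim=1152, lm_dim=1024, merge_size=2):
        super().__init__()
        self.m = merge_size
        in_dim = vis_dim * merge_size * merge_size
        self.mlp = nn.Sequential(nn.Linear(in_dim, lm_dim),
                                 nn.GELU(),
                                 nn.Linear(lm_dim, lm_dim))

    def forward(self, x, grid_h, grid_w):
        # x: (B, grid_h * grid_w, vis_dim) vision tokens
        b, _, d = x.shape
        x = x.view(b, grid_h // self.m, self.m, grid_w // self.m, self.m, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, -1, d * self.m * self.m)
        return self.mlp(x)   # (B, grid_h * grid_w / 4, lm_dim) for merge_size 2

tokens = torch.randn(1, 32 * 32, 1152)
print(Projector()(tokens, 32, 32).shape)   # torch.Size([1, 256, 1024])
```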
Stage 3: Post processing
After layout and element recognition, PaddleOCR-VL runs a lightweight post processing module that aggregates outputs from both stages and formats the final result into structured Markdown and JSON.
This is where the system becomes a document parser instead of a bag of OCR strings. What this stage effectively does, based on the paper's description, is:
follow the reading order predicted by PP-DocLayoutV2
place each recognized element back into a page level representation
serialize the page into Markdown for human readable output
serialize the same content into JSON for programmatic use
One way to think about it is that Stage 2 gives you “content” and Stage 3 gives you “a document”. If you care about RAG, this stage is not optional. The paper describes document parsing as a foundation for retrieval and downstream LLM use, especially when combined with RAG systems.
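The paper does not publish the serialization rules, but the shape of this step is simple: walk the elements in predicted reading order and emit Markdown for humans and JSON for machines. A minimal sketch of what that assembly might look like, with the element schema as my own assumption:

```python
import json

def assemble(elements):
    """elements: dicts with 'order', 'cls', 'box', 'content', already recognized.
    Returns (markdown, json_str) for the whole page."""
    ordered = sorted(elements, key=lambda e: e["order"])
    md_parts = []
    for el in ordered:
        if el["cls"] == "title":
            md_parts.append(f"# {el['content']}")
        elif el["cls"] == "formula":
            md_parts.append(f"$$\n{el['content']}\n$$")
        else:
            md_parts.append(el["content"])   # plain text, or a table the VLM emitted
    return "\n\n".join(md_parts), json.dumps(ordered, ensure_ascii=False, indent=2)

page = [
    {"order": 1, "cls": "text",  "box": [50, 120, 550, 200], "content": "Intro paragraph..."},
    {"order": 0, "cls": "title", "box": [50, 40, 550, 90],   "content": "PaddleOCR-VL"},
]
md, js = assemble(page)
print(md)
```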
Training Approach
The VLM training has two stages:
Stage 1 alignment on 29M image text pairs
Stage 2 instruction fine tuning on 2.7M samples
The paper also describes a large scale data construction pipeline: over 30M samples collected via public acquisition and synthesis, refined using prompt driven labeling with larger models, plus cleaning to remove low quality or hallucinated annotations.
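The paper does not list its cleaning heuristics. One common filter in pipelines like this is cross-model agreement: keep an auto-generated label only if two independent labelers roughly agree on it. The snippet below is purely illustrative of that idea, not the paper's method.

```python
from difflib import SequenceMatcher

def keep_sample(label_a: str, label_b: str, min_agreement: float = 0.9) -> bool:
    """Keep an auto-labeled sample only if two labelers' outputs agree closely;
    low agreement is a cheap proxy for hallucinated or low-quality annotations."""
    if not label_a or not label_b:
        return False
    return SequenceMatcher(None, label_a, label_b).ratio() >= min_agreement

print(keep_sample("Torque: 25 Nm", "Torque: 25 Nm"))   # True
print(keep_sample("Ø 12.5 ±0.1", "See note 4"))        # False
```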
Inference
PaddleOCR-VL is also designed to run fast end to end. The paper describes multi-threaded asynchronous execution split into three parallel stages:
data loading
layout model processing
VLM inference
Data flows through queues. VLM batching triggers when the queue hits a threshold or when items have waited too long, so blocks from different pages can be aggregated for better parallelism. On their end-to-end benchmark, they report that with FastDeploy the system achieves 53.1 percent higher page throughput and 50.9 percent higher token throughput than MinerU2.5. In my own experience, I measured roughly 45 seconds per page on engineering drawings.
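That batching rule, flush when the queue hits a size threshold or when the oldest item has waited too long, is the standard dynamic-batching pattern. Here is a small sketch of the trigger logic on its own, independent of FastDeploy; the queue contents and parameters are assumptions.

```python
import queue
import time

def vlm_batch_worker(in_q: queue.Queue, infer_fn,
                     max_batch: int = 16, max_wait_s: float = 0.05):
    """Collect element crops from the layout stage and flush a batch to the VLM
    when it is full OR when the oldest item has waited longer than max_wait_s."""
    batch, oldest = [], None
    while True:
        timeout = max_wait_s if batch else None   # block only when batch is empty
        try:
            item = in_q.get(timeout=timeout)
            if item is None:                       # sentinel: drain and stop
                if batch:
                    infer_fn(batch)
                return
            batch.append(item)
            oldest = oldest or time.monotonic()
        except queue.Empty:
            pass
        if batch and (len(batch) >= max_batch
                      or time.monotonic() - oldest >= max_wait_s):
            infer_fn(batch)        # crops from different pages mix freely here
            batch, oldest = [], None
```

In the full pipeline, separate threads feed this queue from the layout stage while another thread drains the recognized elements into post-processing, which is what lets blocks from different pages share a batch.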
I used PaddleOCR-VL to extract key manufacturing information from engineering drawings. In my analysis, the advantages of this model are:
Stage 1 can isolate title blocks, revision tables, notes, and callouts so Stage 2 never has to guess what region matters
Stage 2 can run at high resolution on tight crops, which is exactly what tiny labels need
Stage 3 can output clean Markdown for inspection and JSON for downstream matching
If you want to learn more about using this model for engineering drawings, see my blog post: OCR on Engineering Drawings with a 0.9B Vision-Language Model