PaddleOCR-VL-1.6 - A vision-language model for document parsing launched by Baidu - AiBoss

What is PaddleOCR-VL-1.6?

PaddleOCR-VL-1.6 is a vision-language model (VLM) for document parsing developed by the Baidu PaddlePaddle team, and it is the latest upgrade to the PaddleOCR-VL series. With only 0.9B parameters, the model achieved a new state-of-the-art (SOTA) score of 96.33% on the OmniDocBench v1.6 benchmark, while also setting new records on OmniDocBench v1.5 and Real5-OmniDocBench. According to the reported results, its text, formula, and table recognition surpass both open-source and closed-source solutions. The model architecture is identical to that of version 1.5, so migration is plug-and-play at essentially zero cost.

Main functions of PaddleOCR-VL-1.6

Text recognition: general text recognition in 109 languages, with an OmniDocBench v1.6 text score of 96.8.
Formula recognition: recognizes mathematical formulas as LaTeX, scoring 97.5 and surpassing GLM-OCR and MinerU.
Table recognition: parses complex table structures (including merged cells and multi-level headers), with a TEDS score of 94.8.
Ancient book recognition: recognition of ancient Chinese texts and vertically formatted text has been greatly improved.
Rare character recognition: recognition of rare Chinese characters has been significantly enhanced.
Seal recognition: extracts and locates text on official seals and stamps.
Chart recognition: parses eleven types of charts, including pie charts and line charts, into structured data.
Text detection (spotting): detects text in natural scenes.
Structured output: exports results in Markdown, JSON, and DOCX formats.
Cross-page table merging: automatically identifies and merges tables that span multiple pages.

Technical Principles of PaddleOCR-VL-1.6

Two-stage decoupled architecture: the model uses a two-stage design, "layout analysis + VLM recognition." The first stage uses PP-DocLayoutV3 to detect 25 types of document elements and output their reading order and coordinates. The second stage uses a VLM with 0.9B parameters to recognize each element individually. Internally, the VLM uses a NaViT dynamic-resolution visual encoder that adaptively processes images of different sizes, which avoids the loss of small-text detail caused by a fixed input resolution. The encoder works with the ERNIE-4.5-0.3B language model to generate structured output.
Data-driven upgrade with zero architectural changes: version 1.6 has the same model structure as version 1.5, and the performance gain comes entirely from data and training-strategy optimization. The team analyzed where version 1.5 was weak on each OmniDocBench sub-item and applied targeted data augmentation for scenarios such as ancient books, rare characters, seals, and complex tables.
Area-aware data augmentation: to address the weak areas, CV-based distortion simulation is applied to training data such as formulas and text, reproducing real physical distortions such as scanning, tilting, lighting changes, and screen capture. At the same time, the maximum resolution of the text spotting task is raised to 2048×28×28 pixels (about 1.6 million pixels), and large-scale dedicated seal and ancient-book data are added, which significantly improves robustness in real-world scenarios.
Progressive three-stage training: training follows a progressive scheme of "pre-training → SFT → reinforcement learning." The pre-training data is expanded from 29 million to 46 million image-text pairs. The SFT stage adds seal recognition and text spotting tasks on top of the original OCR, table, and formula tasks. Finally, GRPO reinforcement learning is used to further align output quality and unify the tasks in a single model.
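
The article does not describe how GRPO is applied here. As a rough illustration of the core idea behind GRPO in general (each sampled output is scored relative to the other outputs sampled for the same input), here is a minimal sketch. The function name group_relative_advantages and the reward values are hypothetical, and the reward design actually used for PaddleOCR-VL-1.6 is not stated in the source.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of outputs sampled for the same input.

    Outputs that score above the group mean get a positive advantage,
    outputs below it get a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every reward in the group is equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four candidate transcriptions of the same crop
# (for example, a similarity score against the reference text).
rewards = [0.9, 0.5, 0.7, 0.3]
advantages = group_relative_advantages(rewards)
```

In a full training loop these advantages would weight the policy-gradient update for each sampled output; that part is omitted here.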

How to use PaddleOCR-VL-1.6

Local installation (Python): install paddlepaddle-gpu==3.2.1 (CUDA 12.6), then run pip install -U "paddleocr[doc-parser]". Once the environment is configured, the model is ready to use.
Command line usage: after installation, run paddleocr doc_parser -i your_document.png or paddleocr doc_parser -i document.pdf. The command outputs the parsing results directly and supports single images and PDFs as well as batch processing.
Python API: import the PaddleOCRVL class to initialize the pipeline, then call predict() with the image path to get the results. View them with print(), or save them as structured files with save_to_json() and save_to_markdown().
Docker deployment (production environment): pull the official image ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:latest-nvidia-gpu. After startup, the model runs directly inside the container, which makes this option suitable for server deployment.
Inference service deployment: run paddleocr genai_server to start an HTTP service with one command. It supports multiple backends, including vLLM, SGLang, FastDeploy, Transformers, and llama.cpp, and is suited to high-concurrency API workloads.
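
The Python API item above can be sketched roughly as follows. This is a minimal example, not a definitive integration: the wrapper function parse_document and the output directory name are hypothetical, the save_path keyword is an assumption based on the usual PaddleOCR pipeline interface, and whether the default PaddleOCRVL() constructor loads the 1.6 weights depends on the installed paddleocr version.

```python
def parse_document(input_path, output_dir="output"):
    """Parse one image or PDF and save the results as JSON and Markdown.

    Returns the number of result objects produced by the pipeline.
    """
    # Imported lazily so this module can be loaded without paddleocr installed.
    from paddleocr import PaddleOCRVL

    pipeline = PaddleOCRVL()  # default settings; model version depends on the install
    results = pipeline.predict(input_path)

    count = 0
    for res in results:
        res.print()                                  # show the parsed result
        res.save_to_json(save_path=output_dir)       # assumed keyword: save_path
        res.save_to_markdown(save_path=output_dir)   # assumed keyword: save_path
        count += 1
    return count


# Example call (requires the installation steps described above):
# parse_document("your_document.png")
```

The import sits inside the function so that a missing dependency fails at call time rather than at import time, which is convenient when the same module is also loaded on machines without a GPU environment.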

The core advantages of PaddleOCR-VL-1.6

SOTA accuracy: scores 96.33% on OmniDocBench v1.6, ranking first across all dimensions including text, formulas, and tables.
Ultra-lightweight: at 0.9B parameters, the model is far smaller than general-purpose large models such as Qwen3-VL-235B and GPT-5.2.
Zero-cost migration: the architecture is identical to version 1.5, so upgrading only requires replacing the weights.
Real-world robustness: achieves state-of-the-art results in five major scenarios: scanning, distortion, screen capture, lighting changes, and tilt.
Multiple hardware support: NVIDIA GPUs (including Blackwell), Apple Silicon, Kunlun chips, Ascend, AMD, and Intel.

Project address for PaddleOCR-VL-1.6

GitHub repository: https://github.com/PaddlePaddle/PaddleOCR
HuggingFace model library: https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.6

Comparison of PaddleOCR-VL-1.6 with similar competing products

Comparison Dimensions	PaddleOCR-VL-1.6	GLM-OCR	MinerU 2.5
Developer	Baidu PaddlePaddle	Zhipu AI	Shanghai AI Lab / Tsinghua University
Parameter size	0.9B	0.9B	1.2B
OmniDocBench v1.6	96.33%	95.22%	95.75%
Text recognition	96.8	94.0	–
Formula recognition	97.5	96.5	–
Table Recognition (TEDS)	94.8	85.2	88.4
Real-world robustness	SOTA	Basic	Basic
Ancient books/Rare characters	Significantly enhanced	Supported	Average
Seal recognition	Enhanced	Supported	Not mentioned
Deployment costs	Extremely low	Extremely low	Medium
Open source license	Open source and free	Open source and free	Open source and free
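
For readers unfamiliar with the table metric: TEDS (Tree-Edit-Distance-based Similarity) is commonly defined as shown below, where T_a and T_b are the predicted and reference tables represented as trees, EditDist is the tree edit distance between them, and |T| is the number of nodes in a tree. This is the general definition only; the article does not state which TEDS variant the benchmark uses, and the table values above appear to be reported on a 0 to 100 scale.

```latex
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max\left(|T_a|, |T_b|\right)}
```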

Application Scenarios of PaddleOCR-VL-1.6

Document digitization: converts scanned paper archives, books, and papers into structured Markdown or JSON documents, with support for batch processing.
Corporate office: automatically extracts key information from contracts, invoices, reports, and approval forms, and integrates with ERP or OA systems to automate processes.
Educational research: identifies complex formulas (LaTeX output) and tabular data in academic papers to assist with literature organization and knowledge extraction.
Financial services: parses bank drafts, financial statements, and bank statements to enable automatic data entry and compliance auditing.
Healthcare: supports structured entry of medical records, examination reports, and prescriptions, and integration with hospital information systems.