dots.ocr

Python · ★ 8.9k · updated 2mo ago

Multilingual Document Layout Parsing in a Single Vision-Language Model

Open-weight AI model from Xiaohongshu that parses documents, PDFs, and images into structured text, tables, and diagrams locally without sending data to a third-party API.

Python · setup: moderate · complexity: 3/5

dots.ocr is an AI model that reads documents and extracts their content in a structured way. Unlike simpler tools that just pull out raw text, it understands page layout: it can identify headings, tables, columns, and figures, and reproduce the document in a clean, machine-readable format. It was built by the AI research team at Xiaohongshu (the Chinese social media platform known as Little Red Book).

The model handles a wide range of document types and can recognize scripts from many languages, not just Latin or Chinese text. It also goes beyond standard document parsing: it can take a chart or diagram and convert it into SVG code, parse web page screenshots, and spot text that appears in natural scenes rather than printed pages. This makes it more general than tools focused only on PDFs or scanned books.

The project has gone through several versions. The original dots.ocr model was based on a relatively small 1.7-billion-parameter language model; the team later released dots.ocr-1.5 and then rebranded it as dots.mocr. The model weights are hosted on Hugging Face and can be downloaded for local use. A live demo is available on the project's website so you can test it without any setup.

The README includes detailed benchmark comparisons against other document-parsing systems, showing how dots.mocr scores on standardized tests for academic paper parsing, table recognition, multi-column layouts, old scanned documents, and more. The numbers place it among the higher-performing models of its size class on most of these tests, though very large commercial models still score higher on some benchmarks.

If you work with a lot of documents, PDFs, or scanned files and need to extract their content programmatically, this project offers a local, open-weight model you can run without sending data to a third-party API.

Where it fits

Extract tables and headings from scanned PDFs into machine-readable output without a paid OCR API.
Parse web page screenshots or charts into clean text and SVG code for further processing.
Recognize text in natural scene photos where it appears on signs, products, or backgrounds.
Run document parsing locally with the downloaded model weights to keep sensitive files off third-party servers.
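
To make the first use case above concrete, here is a minimal Python sketch of post-processing a parsed page. It assumes the parser's output has already been loaded as a list of layout elements, each a dict with `category`, `bbox`, and `text` keys. That schema, the category names (`Title`, `Section-header`, `Table`), the sample data, and the helper `split_headings_and_tables` are illustrative assumptions, not the project's documented format; check the README for the actual output structure before relying on it.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical category labels; the real model's label set may differ.
HEADING_CATEGORIES = {"Title", "Section-header"}
TABLE_CATEGORY = "Table"


def split_headings_and_tables(
    elements: List[Dict[str, Any]],
) -> Tuple[List[str], List[str]]:
    """Collect heading texts and table contents from parsed layout elements."""
    headings = [
        e["text"] for e in elements if e.get("category") in HEADING_CATEGORIES
    ]
    tables = [e["text"] for e in elements if e.get("category") == TABLE_CATEGORY]
    return headings, tables


# Made-up sample standing in for one parsed page (assumed schema).
sample_page = [
    {"category": "Title", "bbox": [40, 30, 560, 70], "text": "Quarterly Report"},
    {"category": "Text", "bbox": [40, 90, 560, 200], "text": "Revenue grew."},
    {"category": "Section-header", "bbox": [40, 220, 560, 250], "text": "Results"},
    {
        "category": "Table",
        "bbox": [40, 260, 560, 400],
        "text": "<table><tr><td>Q1</td><td>10</td></tr></table>",
    },
]

headings, tables = split_headings_and_tables(sample_page)
print(headings)
print(len(tables))
```

Filtering on a category label keeps the post-processing independent of how the model is served, so the same helper would work whether the elements come from a local run or a saved JSON file.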
