gitmyhub

dots.ocr

Python · ★ 8.9k · updated 2mo ago

Multilingual Document Layout Parsing in a Single Vision-Language Model

Open-weight AI model from Xiaohongshu that parses documents, PDFs, and images into structured text, tables, and diagrams locally without sending data to a third-party API.

Python · setup: moderate · complexity: 3/5

dots.ocr is an AI model that reads documents and extracts their content in a structured way. Unlike simpler tools that just pull out raw text, it understands page layout: it can identify headings, tables, columns, and figures, and reproduce the document in a clean, machine-readable format. It was built by the AI research team at Xiaohongshu (the Chinese social media platform known as Little Red Book).
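To make "structured, machine-readable" concrete, here is a minimal sketch of how layout-aware output could be consumed downstream. It does not call the model. The sample JSON, its field names (category, bbox, text), the category labels, and the helper layout_to_markdown are all assumptions made for illustration, so check the project's README for the format it actually emits.

```python
import json

# Hypothetical layout output: one dict per detected element, assumed to be
# already in reading order. Field names and category labels are illustrative
# assumptions, not the project's documented schema.
SAMPLE = """
[
  {"category": "Title", "bbox": [50, 40, 560, 80], "text": "Quarterly Report"},
  {"category": "Text", "bbox": [50, 100, 560, 160], "text": "Revenue grew in Q3."},
  {"category": "Table", "bbox": [50, 180, 560, 300],
   "text": "<table><tr><td>Q3</td><td>12</td></tr></table>"},
  {"category": "Picture", "bbox": [50, 320, 560, 500]}
]
"""


def layout_to_markdown(elements):
    """Render a list of layout elements as Markdown-style text (hypothetical helper)."""
    parts = []
    for el in elements:
        category = el.get("category", "Text")
        text = el.get("text", "").strip()
        if category == "Title":
            parts.append(f"# {text}")
        elif category == "Section-header":
            parts.append(f"## {text}")
        elif category == "Picture":
            # Figures carry no text here, so keep a placeholder with the box.
            parts.append(f"[figure at {el['bbox']}]")
        elif text:
            # Body text and tables are passed through unchanged.
            parts.append(text)
    return "\n\n".join(parts)


elements = json.loads(SAMPLE)
markdown = layout_to_markdown(elements)
print(markdown)
```

Because each element keeps its category and bounding box, a consumer can treat headings, tables, and figures differently instead of working from one undifferentiated text dump.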

The model handles a wide range of document types and can recognize scripts from many languages, not just Latin or Chinese text. It also goes beyond standard document parsing: it can take a chart or diagram and convert it into SVG code, parse web page screenshots, and spot text that appears in natural scenes rather than printed pages. This makes it more general than tools focused only on PDFs or scanned books.

The project has gone through several versions. The original dots.ocr model was built on a relatively small 1.7-billion-parameter language model. The team later released dots.ocr-1.5, which was then rebranded as dots.mocr. The model weights are hosted on Hugging Face and can be downloaded for local use, and a live demo on the project's website lets you test the model without any setup.

The README includes detailed benchmark comparisons against other document-parsing systems, showing how dots.mocr scores on standardized tests for academic paper parsing, table recognition, multi-column layouts, old scanned documents, and more. The numbers place it among the higher-performing models of its size class on most of these tests, though very large commercial models still score higher on some benchmarks.

If you work with a lot of documents, PDFs, or scanned files and need to extract their content programmatically, this project offers a local, open-weight model you can run without sending data to a third-party API.

Where it fits