gitmyhub

chandra

Python · ★ 11k · updated 1mo ago

OCR model that handles complex tables, forms, and handwriting while preserving the full page layout.

An OCR tool that reads text from images and PDF files and converts it into Markdown, HTML, or JSON, accurately handling tables, handwriting, math formulas, and over 90 languages.

Python · PyTorch · HuggingFace · vLLM · setup: moderate · complexity 3/5

Chandra is an OCR model, meaning it reads text from images and PDF files and converts that content into structured digital formats. OCR stands for optical character recognition, the technology that lets a computer extract the words from a scanned document or photograph. Chandra goes beyond basic text extraction by preserving the layout of the original document and outputting the result as Markdown, HTML, or JSON.

What sets it apart is its handling of difficult content types: complex tables, filled-in forms (including checkboxes), handwritten text, mathematical formulas, charts, and documents in over 90 languages. The README includes side-by-side benchmark comparisons of its accuracy against other publicly available OCR tools on multilingual documents.

To use it, you install the Python package with pip, then run a command-line tool and point it at a file or folder. There are two ways to run the underlying AI model. The first loads the model locally through HuggingFace, a popular AI model platform, and requires the PyTorch library to be installed on your machine. The second sends requests to a vLLM server, a server-based approach that is lighter to set up. A browser-based demo app is also included for trying it out on single pages.
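As a rough illustration of the two modes described above, the sketch below assembles a command line for each one from Python. The executable name `chandra`, the `--method` flag, and the values `hf` and `vllm` are assumptions made for this example and are not confirmed by this summary, so check the tool's own help output before relying on them.

```python
import shutil
import subprocess
from pathlib import Path

# ASSUMPTION: the CLI is called `chandra` and selects its backend with
# `--method hf` or `--method vllm`. These names are illustrative only.
BACKENDS = {"hf", "vllm"}


def build_command(input_path, output_dir, method="vllm", executable="chandra"):
    """Return the argument list for one OCR run without executing it."""
    if method not in BACKENDS:
        raise ValueError(f"unknown backend {method!r}; expected one of {sorted(BACKENDS)}")
    return [executable, str(input_path), str(output_dir), "--method", method]


def run_ocr(input_path, output_dir, method="vllm"):
    """Run the (assumed) CLI on a file or folder and wait for it to finish."""
    cmd = build_command(input_path, output_dir, method)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]!r} is not on PATH; is the package installed?")
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the tool exits with an error.
    return subprocess.run(cmd, check=True)
```

Keeping `build_command` separate from `run_ocr` makes it possible to inspect or log the exact command before anything is executed, which helps when switching between the local and server-based modes.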

For each processed document, Chandra produces a folder of output files: a Markdown version, an HTML version, a JSON metadata file, and any images extracted from the document. You can control which page range to process, how many pages to handle in parallel, and whether to include page headers and footers in the output.
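A minimal sketch of how the per-document output folder could be inspected after a run. It assumes the Markdown, HTML, and JSON files use the `.md`, `.html`, and `.json` extensions and that extracted images are common raster formats. The function name `summarize_output` and the extension list are hypothetical and not part of the project.

```python
from pathlib import Path

# ASSUMPTION: extracted images use one of these extensions.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def summarize_output(folder):
    """Group the files in one document's output folder by kind."""
    folder = Path(folder)
    summary = {"markdown": [], "html": [], "json": [], "images": [], "other": []}
    for path in sorted(folder.rglob("*")):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()
        if suffix == ".md":
            key = "markdown"
        elif suffix == ".html":
            key = "html"
        elif suffix == ".json":
            key = "json"
        elif suffix in IMAGE_SUFFIXES:
            key = "images"
        else:
            key = "other"
        # Store paths relative to the folder, with forward slashes.
        summary[key].append(path.relative_to(folder).as_posix())
    return summary
```

A check like this is a quick way to confirm that a batch run produced all four kinds of output for every document before passing the results to later processing.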

The code is released under the Apache 2.0 license. The underlying model weights use the OpenRAIL-M license. A managed cloud version with higher accuracy, batch processing at scale, and SOC 2 Type 2 compliance is available from Datalab, the company behind the project. Commercial self-hosting requires a separate license.

Where it fits