surya

Python ★ 21k updated 8d ago

OCR, layout analysis, reading order, table recognition in 90+ languages

A Python toolkit that converts scanned documents and images into machine-readable text across 90-plus languages, and also extracts tables, page structure, reading order, and math formulas.

Python · PyTorch · Streamlit · setup: moderate · complexity 3/5

Surya is a Python toolkit for extracting text and understanding the structure of documents. Optical character recognition (OCR) converts images of text (scanned pages, photos of documents, PDFs) into machine-readable text. Surya does this across more than 90 languages and benchmarks competitively against commercial cloud OCR services.

Beyond basic text extraction, Surya offers several complementary capabilities. Layout analysis identifies the structural regions of a page: headers, body text, tables, images, and other zones. Reading order detection determines the logical sequence in which those regions should be read, which matters for multi-column layouts and complex documents such as scientific papers. Table recognition locates the rows and columns within a table so structured data can be extracted accurately. Surya also supports LaTeX OCR for recognizing mathematical formulas and equations.
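To make the reading order problem concrete, here is a minimal sketch of a naive geometric heuristic for a two-column page. It is not Surya's method (Surya uses learned models), and the Box class, the naive_reading_order function, and the sample coordinates are all hypothetical names and values invented for illustration. It only shows why sorting regions top to bottom is not enough once a page has columns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical text region in page coordinates (origin at top-left)."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float


def naive_reading_order(boxes, page_width, n_columns=2):
    """Order boxes column by column, then top to bottom within a column.

    A box belongs to the column that contains its horizontal centre.
    Regions that span several columns (for example a page-wide title)
    are not treated specially, which is one reason real systems need
    more than this heuristic.
    """
    column_width = page_width / n_columns

    def sort_key(box):
        centre_x = (box.x0 + box.x1) / 2
        # Clamp so a box touching the right edge stays in the last column.
        column = min(int(centre_x // column_width), n_columns - 1)
        return (column, box.y0, box.x0)

    return sorted(boxes, key=sort_key)


# Four regions on an illustrative 600-unit-wide, two-column page,
# deliberately listed out of order.
page = [
    Box("right-top", 320, 50, 580, 200),
    Box("left-bottom", 20, 220, 280, 400),
    Box("left-top", 20, 50, 280, 200),
    Box("right-bottom", 320, 220, 580, 400),
]

ordered = [box.label for box in naive_reading_order(page, page_width=600)]
print(ordered)
```

A plain top-to-bottom sort would interleave the two columns here, because the top regions of both columns share the same vertical position. Grouping by column first keeps each column's text together.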

The tool works on a variety of real-world document types including scanned forms, academic papers, newspaper pages, textbooks, and presentations in languages such as Japanese, Chinese, Arabic, and Hindi.

Install it with pip (pip install surya-ocr); the model weights download automatically on first run. Surya includes an interactive Streamlit app for trying it on images or PDFs without writing code. The library is written in Python and uses PyTorch as its deep learning backend. For personal, research, and early-stage startup use, the model weights are free; broader commercial use requires a license from Datalab, the company behind the project.

Where it fits

Convert a folder of scanned PDF invoices into searchable text and extract table data from each page.
Build a document processing pipeline that identifies headers, body text, and tables in academic papers and returns them in correct reading order.
Recognize and extract LaTeX math equations from images of textbook pages or scientific papers.
Process multi-column layouts or documents in Japanese, Chinese, Arabic, or Hindi and get correctly ordered machine-readable text.
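
The first use case above, a folder of scanned PDFs, could be driven by a small batch script. The sketch below only builds the command lines and does not run them. It assumes the package installs a command-line entry point named surya_ocr that takes an input path and an --output_dir option; treat both as assumptions and confirm them against the help output of your installed version. The build_ocr_commands helper and the folder names are hypothetical.

```python
from pathlib import Path


def build_ocr_commands(input_dir, output_dir, cli="surya_ocr"):
    """Return one argument list per PDF in input_dir, sorted by file name.

    ASSUMPTION: `cli` accepts `<input path> --output_dir <dir>`.
    Check the installed tool's --help before relying on this.
    """
    input_dir = Path(input_dir)
    output_dir = Path(output_dir)
    commands = []
    for pdf in sorted(input_dir.glob("*.pdf")):
        # Give every document its own output folder, named after the file.
        target = output_dir / pdf.stem
        commands.append([cli, str(pdf), "--output_dir", str(target)])
    return commands
```

Each returned list can be passed to subprocess.run(cmd, check=True) once the command-line tool is confirmed to be available. Building the commands separately from running them makes it easy to print and inspect them first.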
