gitmyhub

ABot-OCR

Python ★ 37 updated 18d ago

High-precision document OCR with structured Markdown output

An AI model from a computer vision lab that converts scanned document images into structured Markdown, preserving tables as HTML and math formulas in LaTeX notation, running via a GPU inference script with weights downloaded from Hugging Face.

Python · vLLM · setup: hard · complexity 3/5

ABot-OCR is an AI model that reads images of document pages and converts them into structured Markdown text. OCR stands for optical character recognition, which is the technology that turns images of text into actual readable text. This particular model goes further than basic OCR by also recognizing mathematical formulas, tables, and the overall layout of the document, then outputting everything in a format that preserves that structure.

The practical use case is converting scanned PDFs or photographs of documents, such as academic papers or forms, into text that can be edited, searched, or processed further. Instead of plain unformatted text, the model produces Markdown in which tables are encoded as HTML, math formulas are written in LaTeX notation, and the document structure is retained as far as possible.

To use it, you download the model weights from Hugging Face (the files are not included in this repository due to their size) and run a Python inference script. The script uses a library called vLLM to load and run the model efficiently on a GPU. You point it at a folder of images, and it writes one Markdown file per image to an output directory. Images that already have a corresponding output file are skipped, so interrupted runs can be resumed. A GPU with around 4 GB of video memory is needed, though actual requirements depend on image size and how many images you process at once.
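The skip-and-resume behaviour described above can be sketched as follows. This is a minimal illustration, not the repository's actual script: the function names `pending_images` and `run_batch`, the set of image extensions, and the `<image name>.md` output naming are all assumptions, and `convert` is a placeholder for the real model call made through vLLM.

```python
from pathlib import Path
from typing import Callable

# Assumed set of image types; the real script may accept others.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def pending_images(input_dir: Path, output_dir: Path) -> list[Path]:
    """Return images in input_dir that have no matching .md file in output_dir."""
    images = sorted(
        p for p in input_dir.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    # Assumed naming scheme: page1.png -> page1.md
    return [p for p in images if not (output_dir / f"{p.stem}.md").exists()]


def run_batch(
    input_dir: Path,
    output_dir: Path,
    convert: Callable[[Path], str],
) -> int:
    """Convert every pending image, writing one Markdown file per image.

    `convert` is a hypothetical stand-in for model inference
    (image path in, Markdown text out). Returns the number of files written.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    written = 0
    for image in pending_images(input_dir, output_dir):
        markdown = convert(image)
        (output_dir / f"{image.stem}.md").write_text(markdown, encoding="utf-8")
        written += 1
    return written
```

Because the check is made against the files already in the output directory, calling `run_batch` a second time on the same folders writes nothing new, which is what lets an interrupted run pick up where it stopped.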

The README is relatively sparse and still contains placeholder notes where benchmark details and training background are intended to be filled in. The benchmark figure references a dataset called OmniDocBench. The project is from a computer vision lab and cites several earlier open-source OCR projects as influences.

Where it fits