olmocr
Toolkit for linearizing PDFs for LLM datasets/training
olmOCR converts scanned PDFs and image documents into clean plain text or Markdown using a 7-billion-parameter vision model, handling tables, equations, multi-column layouts, and handwriting that classical OCR misses.
olmOCR is a toolkit from Allen AI for converting PDFs and other image-based document formats — PNG and JPEG scans included — into clean, readable plain text or Markdown. Its stated purpose is to linearize PDFs so the text inside them can be used as training data for large language models, but it works just as well as a general-purpose OCR system for anyone who needs a high-quality text version of a document.
The way it works is that the toolkit drives a 7-billion-parameter vision-language model — a neural network that looks at the rendered image of a page and writes out the text it sees. Because it is a vision model rather than a traditional OCR engine, it handles things classical OCR struggles with: equations, tables, handwriting, multi-column layouts, figures with captions, and insets. It detects and strips out repeating headers and footers and tries to emit text in a natural reading order. The project ships its own benchmark, olmOCR-Bench, with over 7,000 test cases across 1,400 documents so you can compare its accuracy against alternatives. Because the model is large, it needs a recent NVIDIA GPU with at least 12 GB of memory; smaller setups can call out to a remote vLLM server instead.
Someone would use this if they have a pile of scanned reports, research papers, or old documents and want machine-readable text out of them — either to feed into an LLM pipeline, to make them searchable, or to extract tables and equations cleanly. It is written in Python, depends on poppler-utils for PDF rendering, and the team quotes a cost of under $200 per million pages converted. An online demo lives at olmocr.allenai.org. The full README is longer than what was provided.
Where it fits
- Convert a folder of scanned research papers into Markdown so they can be searched or fed into an LLM pipeline
- Extract tables and equations from PDF reports without losing their structure
- Pre-process millions of documents as LLM training data at under $200 per million pages
- Make historical or handwritten scanned records machine-readable for downstream analysis