marker

Python · ★ 36k · updated 14d ago

Convert PDF to markdown + JSON quickly with high accuracy

Marker converts PDFs, Word docs, PowerPoints, spreadsheets, and EPUBs into clean Markdown, JSON, or HTML using ML models that understand document layout, so tables, equations, and multi-column text come out correctly instead of scrambled.

Python · Machine Learning · OCR · Gemini API · LaTeX · GPU/CUDA · Apple MPS · setup: moderate · complexity 3/5

Marker is a Python library that converts documents (primarily PDFs, but also PowerPoint files, Word documents, spreadsheets, HTML pages, and EPUBs) into structured text formats such as Markdown, JSON, and HTML. The core problem it addresses is that useful text is hard to extract from PDFs: they encode content as positioned drawing instructions rather than semantic text, so tables get scrambled, equations become gibberish, multi-column layouts get merged incorrectly, and headers and footers pollute the content. Marker handles these cases with machine learning models trained specifically for document layout understanding.

Under the hood, Marker runs a pipeline of processors. A layout detection model identifies what kind of block each region of a page is: body text, table, figure, equation, code block, or heading. An OCR model converts scanned or image-based content to text. Specialized models then format tables into Markdown table syntax, convert mathematical equations to LaTeX notation, and extract image files. The output preserves the document's logical structure rather than just dumping raw text.
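The block-by-block idea above can be illustrated with a toy sketch. This is not Marker's code or API: the `Block` class, the block kinds, and the renderer functions are hypothetical names chosen for illustration, and the layout labels are hard-coded where a real layout model would predict them.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One region of a page (hypothetical structure, not Marker's)."""
    kind: str        # label a layout model would assign: "heading", "text", "table", "equation"
    content: object  # a string for text-like blocks, a list of rows for tables


def render_heading(block):
    return f"## {block.content}"


def render_text(block):
    return str(block.content)


def render_equation(block):
    # Display math is wrapped in $$ ... $$ so Markdown renderers treat it as LaTeX.
    return f"$$\n{block.content}\n$$"


def render_table(block):
    # First row is the header; the rest are data rows.
    header, *rows = block.content
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


# Each block kind is dispatched to its own renderer.
RENDERERS = {
    "heading": render_heading,
    "text": render_text,
    "equation": render_equation,
    "table": render_table,
}


def to_markdown(blocks):
    """Render blocks in reading order, separated by blank lines."""
    return "\n\n".join(RENDERERS.get(b.kind, render_text)(b) for b in blocks)


page = [
    Block("heading", "Results"),
    Block("text", "Accuracy by model."),
    Block("table", [["Model", "Score"], ["A", "0.91"]]),
    Block("equation", r"E = mc^2"),
]
markdown = to_markdown(page)
print(markdown)
```

Dispatching on block type is what lets the logical structure survive: a table region never passes through the plain-text path, so its rows and columns are not flattened into a single run of words.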

When the default pipeline is not accurate enough, Marker offers a hybrid mode that passes the structured output through a large language model such as Gemini, which can merge tables that span pages, improve equation handling, and extract structured values from forms.
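A rough sketch of the hybrid idea, under loud assumptions: the `refine_blocks` function, the `LLM_KINDS` set, and the `llm(kind, text)` callable are all hypothetical and do not reflect Marker's actual interface. The point is only that the first-pass output stays intact and a model is consulted for the block types where it tends to help.

```python
from typing import Callable, Dict, List, Optional

# Block kinds sent to the language model (an assumption made for this sketch).
LLM_KINDS = {"table", "equation", "form"}


def refine_blocks(
    blocks: List[Dict[str, str]],
    llm: Optional[Callable[[str, str], str]] = None,
) -> List[Dict[str, str]]:
    """Return copies of blocks, with selected kinds rewritten by llm(kind, text).

    With llm=None the first-pass output is returned unchanged, so the
    language model remains an optional second pass.
    """
    refined = []
    for block in blocks:
        new_block = dict(block)  # copy so the first-pass result is never mutated
        if llm is not None and block["kind"] in LLM_KINDS:
            new_block["text"] = llm(block["kind"], block["text"])
        refined.append(new_block)
    return refined


def fake_llm(kind: str, text: str) -> str:
    """Stand-in for a real model call; it only tidies spacing."""
    return text.strip().replace("x^ 2", "x^2")


blocks = [
    {"kind": "text", "text": " intro "},
    {"kind": "equation", "text": " x^ 2 + 1 "},
]
refined = refine_blocks(blocks, llm=fake_llm)
print(refined)
```

Passing the model in as a plain callable keeps the sketch testable with a stub and makes the cost trade-off visible: only tables, equations, and forms incur a model call.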

You would use Marker when building a document ingestion pipeline for a RAG (retrieval-augmented generation) system, when digitizing research papers or technical manuals, or when you need to extract structured data from legacy PDF-based reports. It runs on GPU, CPU, or Apple's MPS accelerator. The code is licensed under the GPL, and the model weights are released under a modified open license that freely permits research and personal use; commercial use above certain revenue thresholds requires a separate license.

Where it fits

Build a document ingestion pipeline that feeds clean text from PDFs into a RAG or AI chatbot system.
Digitize scanned research papers or technical manuals, preserving tables and equations in readable form.
Extract structured data from legacy PDF-based reports or government forms.
Convert PowerPoint or Word files to Markdown for use in a knowledge base or documentation site.
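For the RAG ingestion use case, converted Markdown is usually split into chunks before indexing. The following is a minimal sketch of one way to do that, splitting on headings; `chunk_by_heading` is a hypothetical helper, not part of Marker, and the sample string stands in for real converter output.

```python
import re

# Matches ATX headings such as "# Title" or "### Section".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown):
    """Split Markdown into {"title", "text"} chunks, one per heading.

    Simplification: a line starting with "#" inside a fenced code block
    would also be treated as a heading.
    """
    chunks = []
    title, lines = "", []

    def flush():
        # Emit the current section unless it is completely empty.
        if title or any(line.strip() for line in lines):
            chunks.append({"title": title, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks


sample = "# Intro\nMarker output.\n\n## Tables\n| a | b |\n| --- | --- |\n| 1 | 2 |\n"
chunks = chunk_by_heading(sample)
for chunk in chunks:
    print(chunk["title"], "->", len(chunk["text"]), "chars")
```

Splitting on headings keeps each table or equation together with the section it belongs to, which is only possible because the conversion preserved the document structure in the first place.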
