Datalab ORG

@datalab-to · United States of America · www.datalab.to

Developing state-of-the-art document intelligence models.

12 repos
733 followers
0 following

Python 82%
Shell 9%
HTML 9%

All public repos (12)


marker

Convert PDF to markdown + JSON quickly with high accuracy

Marker converts PDFs, Word docs, PowerPoints, spreadsheets, and EPUBs into clean Markdown, JSON, or HTML using ML models that understand document layout, so tables, equations, and multi-column text come out correctly instead of scrambled.

Python ★ 36k 14d ago
surya

OCR, layout analysis, reading order, table recognition in 90+ languages

A Python toolkit that converts scanned documents and images into machine-readable text across 90-plus languages, and also extracts tables, page structure, reading order, and math formulas.

Python ★ 21k 8d ago
chandra

OCR model that handles complex tables, forms, and handwriting with full layout.

An OCR tool that reads text from images and PDF files and converts it into Markdown, HTML, or JSON, accurately handling tables, handwriting, math formulas, and over 90 languages.

Python ★ 11k 1mo ago
pdftext

Extract structured text from PDFs quickly

Python ★ 700 10d ago
lift

Extract structured data from documents quickly and accurately.

Python ★ 294 1d ago
sdk

No description.

Python ★ 11 5d ago
docext

An on-premises, OCR-free unstructured data extraction, markdown conversion and benchmarking toolkit. (https://idp-leaderboard.org/)

★ 11 1y ago
datalab-on-prem

Scripts to run Datalab's self-service on-prem container

Shell ★ 9 9d ago
inference-mirror

No description.

Python ★ 4 10mo ago
pykatex

No description.

Python ★ 3 4mo ago
results

No description.

HTML ★ 2 2mo ago
oss_container

No description.

Python ★ 1 8mo ago