gitmyhub

zerox

TypeScript ★ 12k updated 1y ago

OCR & Document Extraction using vision models

Zerox converts PDFs and documents into clean Markdown or structured data by rendering each page as an image and sending it to an AI vision model, handling complex layouts, tables, and charts that standard text extraction misses.

TypeScript · Python · Node.js · setup: moderate · complexity 3/5

Zerox is a library for turning documents into text that AI systems can easily read and work with. It addresses a common problem: PDFs and other document formats often have complex layouts, tables, charts, and mixed content that traditional text extraction tools handle poorly. Zerox gets around this by converting each page of a document into an image and then sending those images to an AI vision model, which reads the visual content and returns it as Markdown, a simple text format that preserves headings, tables, and lists.

The workflow is straightforward: you point the library at a file (PDF, Word document, or image), it converts the file into a sequence of page images, sends each image to an AI model with a request to describe the content as Markdown, and then combines all the responses into a single output. You can also use it to extract structured data by providing a schema, which tells the AI exactly which fields to pull from the document and what format they should be in.
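The workflow above can be sketched in a few lines of TypeScript. This is an illustrative outline only and does not call Zerox itself: `PageImage`, `VisionModel`, `renderPages`, `stubModel`, and `documentToMarkdown` are hypothetical names, the "images" are plain strings standing in for rendered pages, and the model is a local stub rather than a real vision API.

```typescript
// Hypothetical types for illustration; none of these names come from Zerox.
type PageImage = { pageNumber: number; pixels: string };
type VisionModel = (image: PageImage, prompt: string) => Promise<string>;

// Stand-in for the rasterization step. A real pipeline would render each
// page to an image file; here a string plays the role of the image data.
function renderPages(pageContents: string[]): PageImage[] {
  return pageContents.map((pixels, i) => ({ pageNumber: i + 1, pixels }));
}

// Stub model: pretends to "read" the image and return Markdown for it.
const stubModel: VisionModel = async (image) =>
  `## Page ${image.pageNumber}\n\n${image.pixels}`;

// Render pages, ask the model for Markdown page by page, then join the results.
async function documentToMarkdown(
  pageContents: string[],
  model: VisionModel
): Promise<string> {
  const images = renderPages(pageContents);
  const markdownPages: string[] = [];
  for (const image of images) {
    markdownPages.push(await model(image, "Convert this page to Markdown."));
  }
  return markdownPages.join("\n\n");
}
```

Swapping `stubModel` for a function that calls a real vision provider would leave the rest of the pipeline unchanged, which is the main appeal of the page-as-image design.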

Zerox is available as both a Node.js package (installed via npm) and a Python package (installed via pip). Both versions support several AI providers including OpenAI, Azure OpenAI, AWS Bedrock, and Google Gemini. The Node.js version has a few additional features not yet in the Python version, such as structured per-page extraction, automatic orientation correction, and edge trimming. Processing multiple pages in parallel is supported in both versions via a concurrency setting.
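To illustrate what a concurrency setting typically controls, here is a minimal sketch of bounded parallel page processing. It is an assumption about the general technique, not Zerox's actual implementation: `mapWithConcurrency` is a hypothetical helper that keeps at most `concurrency` requests in flight and returns results in page order.

```typescript
// Hypothetical helper, not part of Zerox: run `worker` over `items` with at
// most `concurrency` calls in flight, returning results in input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  concurrency: number,
  worker: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0; // index of the next unclaimed item

  // Each runner repeatedly claims the next item until none are left.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T, index);
    }
  }

  // Start at least one runner, and never more runners than items.
  const runnerCount = Math.max(1, Math.min(concurrency, items.length));
  const runners: Promise<void>[] = [];
  for (let i = 0; i < runnerCount; i++) {
    runners.push(runner());
  }
  await Promise.all(runners);
  return results;
}
```

Because each result is written to the slot matching its page index, the combined output stays in document order even when later pages finish before earlier ones.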

Installation requires a few supporting tools for the PDF conversion step. The Node.js version needs GraphicsMagick and Ghostscript; the Python version needs Poppler. All three are standard open-source utilities available through package managers on most operating systems.

The README includes example code showing how to call the library with a file URL or a local path, a full list of configuration options, and a sample of the structured output the library returns for each page. A hosted demo is available on the Omni AI website for trying out the OCR without installing anything.

Where it fits