
kreuzberg

Rust ★ 8.5k updated 16h ago

A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 97+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or use it via CLI, REST API, or MCP server.

A fast Rust-powered library that extracts text and data from 91+ file formats, including PDFs, Office docs, images, and emails. It ships bindings for Python, Node.js, Go, Java, and 10+ other languages, plus OCR and AI pipeline support.

Rust · Python · TypeScript · WebAssembly · tree-sitter · Go · Java
setup: easy · complexity 3/5

Kreuzberg is a document processing library built in Rust that pulls text, metadata, and structured information out of over 91 file formats. PDFs, Office documents, images, HTML files, emails, archives, and academic formats are all supported. The library does this work at high speed without requiring a GPU.

The Rust core is wrapped with native language bindings for over a dozen programming languages: Python, Ruby, PHP, Elixir, R, Dart, Go, Java, Kotlin, C#, TypeScript for Node.js, WebAssembly for browsers and Cloudflare Workers, and C through an FFI interface. Each language gets its own package published to the relevant package repository (PyPI, npm, Maven Central, and so on), so installation follows the usual conventions for your language.

Beyond plain text extraction, Kreuzberg can parse code files and pull out functions, classes, imports, and docstrings from 306 programming languages using tree-sitter, a widely used parsing library. It also supports OCR (reading text from scanned documents or images) through several backends including Tesseract, PaddleOCR, EasyOCR, and vision-capable AI models from providers like OpenAI, Anthropic, and Google. For AI pipelines, it includes a wire format called TOON that produces 30-50% fewer tokens than JSON when passing extracted content to language models.
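To see why a compact wire format can be smaller than JSON, here is a minimal Python sketch. It is not the TOON specification and not Kreuzberg's encoder. The helper `encode_tabular` and the sample `pages` data are hypothetical, and character counts stand in for token counts. It assumes every row has the same keys and that no value contains a comma.

```python
import json


def encode_tabular(name: str, rows: list[dict]) -> str:
    """Encode uniform dicts as one header line plus one comma-separated line per row.

    Hypothetical helper for illustration only; values containing commas or
    newlines are not escaped.
    """
    if not rows:
        return f"{name}[0]:"
    fields = list(rows[0])
    if any(list(row) != fields for row in rows):
        raise ValueError("all rows must share the same keys in the same order")
    # Field names are written once in the header instead of once per row.
    header = f"{name}[{len(rows)}]{{{','.join(fields)}}}:"
    body = ["  " + ",".join(str(row[f]) for f in fields) for row in rows]
    return "\n".join([header, *body])


# Made-up sample data standing in for extracted per-page results.
pages = [
    {"page": 1, "words": 312, "lang": "en"},
    {"page": 2, "words": 287, "lang": "en"},
    {"page": 3, "words": 154, "lang": "de"},
]

as_json = json.dumps({"pages": pages})
as_tabular = encode_tabular("pages", pages)
print(len(as_json), len(as_tabular))
```

The saving comes from stating the keys once rather than repeating them in every object, so it grows with the number of uniform rows. Actual token counts depend on the tokenizer in use.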

The library can be used in four ways: as a code library you call directly, as a command-line tool, as a REST API server, or as an MCP server (a protocol for connecting tools to AI assistants). It uses streaming parsers to handle very large files without loading them entirely into memory.
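The streaming idea can be illustrated without Kreuzberg itself. The following is a minimal Python sketch of chunked processing in general, not Kreuzberg's parser. The helper `count_words_streaming` is hypothetical. It reads input in fixed-size chunks, so memory use stays bounded by the chunk size rather than the file size.

```python
import io
from typing import BinaryIO

WHITESPACE = b" \t\r\n\f\v"


def count_words_streaming(stream: BinaryIO, chunk_size: int = 64 * 1024) -> int:
    """Count whitespace-separated words while holding one chunk in memory at a time."""
    words = 0
    in_word = False  # carried across chunks so a word split by a boundary counts once
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        for byte in chunk:
            if byte in WHITESPACE:
                in_word = False
            elif not in_word:
                in_word = True
                words += 1
    return words


# A tiny chunk size forces words to straddle chunk boundaries.
sample = io.BytesIO(b"streaming parsers keep memory flat")
print(count_words_streaming(sample, chunk_size=4))
```

The same function works on a file opened with `open(path, "rb")`. The only state kept between chunks is a counter and one flag, which is what lets the approach scale to inputs larger than available memory.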

The project is licensed under the Elastic License 2.0. Documentation is at kreuzberg.dev and a live demo is available online.

Where it fits