clawpdf

TypeScript ★ 92 updated 11d ago

Zero-dependency PDFium WebAssembly bindings for Node and browsers.

clawpdf is a TypeScript package that lets JavaScript code work with PDF files in Node.js or the browser, without installing any extra dependencies. It bundles Google's PDFium PDF engine compiled to WebAssembly, so there is no native addon to compile and no canvas library to install.

The main things it can do are extract text from a PDF, render individual pages as PNG images, and handle password-protected files. There is an "auto" extraction mode that pulls text first and only falls back to rendering PNG images when the extracted text is too short to be useful. This is aimed at use cases where PDFs are being fed into an AI model: readable PDFs go in as text, scanned or image-heavy PDFs go in as images. An adapter function called toMessageContent can shape the output into blocks suitable for multimodal model input.
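The auto fallback described above can be sketched roughly as follows. This is a minimal illustration of the idea, not clawpdf's actual implementation: the `PageSource` interface, the `extractAuto` and `toBlocks` functions, the block shapes, and the 200-character threshold are all assumptions made for the example. The real package exposes its own types and its own toMessageContent adapter, whose signatures are not shown here.

```typescript
// Hypothetical stand-in for an opened PDF; clawpdf's real document type may differ.
type PageSource = {
  pageCount: number;
  getText(page: number): string;
  renderPng(page: number): Uint8Array;
};

// Result of the sketch: either the extracted text or one PNG per page.
type AutoResult =
  | { kind: "text"; text: string }
  | { kind: "images"; pages: Uint8Array[] };

// Hypothetical block shape for multimodal model input.
type Block =
  | { type: "text"; text: string }
  | { type: "image"; mediaType: "image/png"; data: Uint8Array };

// Assumed threshold: the description only says "too short to be useful".
const MIN_USEFUL_CHARS = 200;

function extractAuto(
  doc: PageSource,
  minChars: number = MIN_USEFUL_CHARS,
): AutoResult {
  // Try text first, since it is cheaper than rendering.
  const parts: string[] = [];
  for (let i = 0; i < doc.pageCount; i++) {
    parts.push(doc.getText(i));
  }
  const text = parts.join("\n").trim();
  if (text.length >= minChars) {
    return { kind: "text", text };
  }

  // Too little text (likely a scanned or image-heavy PDF): render every page.
  const pages: Uint8Array[] = [];
  for (let i = 0; i < doc.pageCount; i++) {
    pages.push(doc.renderPng(i));
  }
  return { kind: "images", pages };
}

// Shape the result into a flat list of blocks, one per text body or page image.
function toBlocks(result: AutoResult): Block[] {
  if (result.kind === "text") {
    return [{ type: "text", text: result.text }];
  }
  return result.pages.map(
    (data): Block => ({ type: "image", mediaType: "image/png", data }),
  );
}
```

Keeping the text-length check separate from rendering means the more expensive PNG path runs only when the text path has already failed the threshold.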

The API centers on three main functions. openPdf opens a single document and gives you access to page count, text, and per-page PNG rendering. extractPdf is a one-shot function that applies the auto fallback logic. createEngine creates a reusable PDFium instance, which the README recommends for server code so you are not spinning up a new WASM engine for each request.
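The reuse pattern recommended for server code can be illustrated with a lazily created, shared instance. This is only a sketch of the general technique: `createFakeEngine` is a hypothetical stand-in for an expensive WASM startup, and `once` is a helper invented for this example. Neither is part of clawpdf, and the real createEngine may take options or return a different shape.

```typescript
// Hypothetical engine type; the real engine object is not modeled here.
type Engine = { id: number };

// Counts how many times the stand-in engine has been constructed.
let created = 0;

// Stand-in for an expensive engine startup (for example, loading a WASM module).
async function createFakeEngine(): Promise<Engine> {
  created += 1;
  return { id: created };
}

// Wrap an async factory so it runs at most once; concurrent callers share
// the same in-flight promise. A failed startup clears the cache so a later
// call can retry.
function once<T>(factory: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (cached === undefined) {
      cached = factory().catch((err) => {
        cached = undefined;
        throw err;
      });
    }
    return cached;
  };
}

// Module-level accessor: every request handler awaits the same engine.
const getEngine = once(createFakeEngine);

// Hypothetical request handler showing the intended usage.
async function handleRequest(): Promise<number> {
  const engine = await getEngine();
  return engine.id;
}
```

Caching the promise rather than the resolved value means two requests that arrive during startup do not each trigger their own initialization.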

A CLI is included so you can extract text or render pages directly from the terminal without writing any code. Both the Node.js and browser paths ship in the same package; the browser version pre-configures the WASM URL for bundlers, with an option to host the WASM file yourself.

The README's benchmarks report roughly half the processing time and significantly lower memory use than an earlier approach run on the same sample PDFs. Node.js 20 or later is required. The package is released under the MIT license, with upstream BSD-style and Apache 2.0 notices for the PDFium binary.