gitmyhub

tabula

CSS ★ 7.4k updated 1y ago

Tabula is a tool for liberating data tables trapped inside PDF files

A desktop application that extracts data tables from text-based PDF files and saves them as CSV or spreadsheet-ready data, running locally in your browser so your files never leave your machine.

Java · CSS · Docker · setup: easy · complexity 2/5

Tabula is a desktop application that extracts data tables from PDF files and converts them into spreadsheet-friendly formats such as CSV. It addresses a common problem: a PDF contains a table of numbers or a data report, and copying it into a spreadsheet is either impossible or produces garbled results. You upload the PDF, draw a selection box around the table you want, and Tabula pulls out the rows and columns as structured data that you can open in Excel or import into a database.
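As an illustration of the last step, importing extracted data into a database, here is a minimal sketch that loads CSV text into an in-memory SQLite table using only the Python standard library. The sample data, the table name `extracted`, and the helper `load_csv_into_sqlite` are hypothetical and not part of Tabula. A real export would be read from the CSV file you saved.

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for a CSV file exported from Tabula.
SAMPLE_CSV = """region,year,units
North,2021,1200
South,2021,950
"""


def load_csv_into_sqlite(csv_text, conn, table="extracted"):
    """Create `table` from the CSV header row and insert each data row as text.

    Rows whose cell count differs from the header are skipped.
    Returns the number of rows inserted.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Quote identifiers so headers containing spaces still work.
    columns = ", ".join('"{}" TEXT'.format(name.replace('"', '""')) for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute('CREATE TABLE "{}" ({})'.format(table, columns))
    rows = [row for row in reader if len(row) == len(header)]
    conn.executemany('INSERT INTO "{}" VALUES ({})'.format(table, placeholders), rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
inserted = load_csv_into_sqlite(SAMPLE_CSV, conn)
# Every column is stored as text, so cast before doing arithmetic.
total = conn.execute("SELECT SUM(CAST(units AS INTEGER)) FROM extracted").fetchone()[0]
```

Storing every column as text keeps the sketch simple and avoids guessing types from extracted cells. Values are cast at query time instead.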

The application runs locally on your machine and works through a browser interface. After you launch it, a web page opens at a local address (127.0.0.1:8080), where you do all the work. Your files never leave your computer, which matters when working with confidential documents. The README notes two small exceptions: the app makes a request to check for newer versions and sends a usage count to a statistics counter. Both can be disabled with command-line flags.
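If the page does not open on its own, one quick check is whether anything is accepting connections at that local address. The sketch below uses only the Python standard library. The helper name `is_listening` is hypothetical, and a successful connection only shows that some process is listening on the port, not that the process is Tabula.

```python
import socket


def is_listening(host="127.0.0.1", port=8080, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling `is_listening()` with the defaults probes 127.0.0.1:8080, the address mentioned above. Pass a different `port` if the application was started on another one.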

Tabula only works with text-based PDFs, not scanned images. A quick test is whether you can click and drag to select text in the PDF using a standard PDF viewer. If you can, Tabula should be able to read it. Scanned pages that contain pictures of text require a separate optical character recognition step before Tabula can help.

Tabula is available as a packaged app for Windows and macOS, a snap package for Linux, a plain JAR file that runs with Java on any platform, or a Docker Compose setup. Java 7 or newer is required. The underlying extraction logic lives in a separate command-line library, tabula-java, which continues to receive occasional updates from the community.

The README opens with a note that Tabula is a volunteer project with no active paid development at this time, and the end-user application here is unlikely to see near-term updates.

Where it fits