tika

Java ★ 3.8k updated 2d ago

The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).

Apache Tika is a Java library that reads files in many different formats and pulls out the text and metadata inside them. Feed it a PDF, a Word document, an image, an audio file, or any of the many other supported types, and it returns the plain text content along with information such as the author, creation date, and file type. It does this by wrapping a large collection of existing document parsing libraries in one consistent interface.

Developers can use Tika by adding it as a dependency to a Java project, running it as a command-line tool, or connecting to it as a server. The quick-start example in the README is three lines of Java code: create a Tika object, point it at a file, get back a string of text.
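To illustrate the server option, here is a minimal sketch of a Java client that uses only the standard `java.net.http` package. It rests on several assumptions that are not stated above: that a Tika server is already running at `http://localhost:9998`, that it accepts a file uploaded with `PUT` to a `/tika` path, and that it returns plain text when asked for `text/plain`. The class and method names are hypothetical and chosen for illustration only.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class; not part of Tika itself.
class TikaServerClient {

    // Assumption: a Tika server is listening locally on this address.
    static final String DEFAULT_SERVER = "http://localhost:9998";

    // Builds a PUT request that uploads the document bytes to the assumed
    // /tika endpoint and asks for a plain-text response.
    static HttpRequest buildExtractRequest(String serverUrl, byte[] content) {
        return HttpRequest.newBuilder()
                .uri(URI.create(serverUrl + "/tika"))
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(content))
                .build();
    }

    // Sends the request and returns the response body as a string.
    // Any status other than 200 is reported as an IOException.
    static String extractText(HttpClient client, String serverUrl, byte[] content)
            throws IOException, InterruptedException {
        HttpResponse<String> response = client.send(
                buildExtractRequest(serverUrl, content),
                HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected status from server: " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java TikaServerClient.java <file>");
            return;
        }
        // Read the whole file into memory; fine for a sketch, not for huge files.
        byte[] content = Files.readAllBytes(Path.of(args[0]));
        System.out.println(extractText(HttpClient.newHttpClient(), DEFAULT_SERVER, content));
    }
}
```

With a server running, this could be tried with `java TikaServerClient.java report.pdf`, where `report.pdf` is a placeholder file name. Keeping the request construction in its own method makes it possible to check the method, URL, and headers without a live server.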

The project requires Java 17 or later. Support for older versions ended in April 2025. Building from source uses Maven, and a Maven wrapper script is included so you do not need Maven pre-installed. Docker is used for some integration tests but is optional.

Tika is part of the Apache Software Foundation and is released under the Apache 2.0 open source license. Pre-built downloads are available from the project website and through the Maven Central package repository.