gitmyhub

paper-qa

Python ★ 8.7k updated 9d ago

High accuracy RAG for answering questions from scientific documents with citations

PaperQA2 lets you ask plain-English questions about a folder of research PDFs and get answers with exact citations pointing to the specific paper and page each claim came from, using AI-powered search rather than memorized knowledge.

Python · OpenAI · LiteLLM · pip · setup: moderate · complexity: 2/5

PaperQA2 is a Python tool for asking questions about scientific papers and getting answers that include specific citations pointing back to the source text. You point it at a folder of PDF files, or other document types, and ask a question in plain English. It finds relevant passages, summarizes them, and produces an answer that tells you exactly which paper and which page each claim came from.

The technique behind it is called retrieval-augmented generation, or RAG: an AI language model is combined with a search system rather than relying only on what the model has memorized. PaperQA2 adds several refinements on top of basic RAG. It can run as an agent that iterates, refining its search queries if the first results are not good enough. It fetches metadata about papers automatically, including citation counts and retraction status. And it uses an additional step called contextual summarization to improve the quality of retrieved passages before passing them to the language model. The README reports that this pipeline has exceeded human performance on benchmarks involving scientific question answering, summarization, and contradiction detection.
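The retrieve-then-cite idea described above can be sketched in a few lines. This is a toy illustration only, not PaperQA2's implementation: it swaps in bag-of-words cosine similarity for real embeddings, stops at building the prompt instead of calling a language model, and the `Passage` class, the citation keys, and the sample texts are all hypothetical.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    paper: str  # hypothetical citation key, e.g. "Smith2020"
    page: int
    text: str


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, passages: list[Passage], k: int = 2) -> list[Passage]:
    """Return up to k passages that share vocabulary with the question, best first."""
    q = Counter(tokenize(question))
    scored = [(cosine(q, Counter(tokenize(p.text))), p) for p in passages]
    scored = [(s, p) for s, p in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:k]]


def format_citation(p: Passage) -> str:
    return f"({p.paper}, p. {p.page})"


def build_prompt(question: str, evidence: list[Passage]) -> str:
    """Assemble the text a language model would receive, with citations attached."""
    lines = [f"Question: {question}", "Evidence:"]
    lines += [f"- {p.text} {format_citation(p)}" for p in evidence]
    lines.append("Answer using only the evidence above and keep the citations.")
    return "\n".join(lines)


# Made-up sample data for illustration.
PASSAGES = [
    Passage("Smith2020", 4, "Carbon nanotubes are manufactured at scale using chemical vapor deposition."),
    Passage("Lee2019", 11, "Protein folding is guided by hydrophobic interactions."),
    Passage("Smith2020", 7, "Chemical vapor deposition yields depend on catalyst choice."),
]
QUESTION = "How are carbon nanotubes manufactured?"

hits = retrieve(QUESTION, PASSAGES)
prompt = build_prompt(QUESTION, hits)
print(prompt)
```

Because every passage carries its paper and page through retrieval, the citation in the final prompt is copied from the source record rather than generated by the model, which is the property that makes the answer traceable.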

Installation is through pip, and basic use takes just three steps: install the package, put PDFs in a folder, and run pqa ask with your question. By default the tool uses OpenAI models for both the language model and the embedding step that finds relevant documents, but it supports a wide range of other models through a library called LiteLLM. Local models can also be used if you do not want to send data to an external service.
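For readers who want to script the command-line workflow, here is a minimal sketch that wraps the `pqa ask` command named above. It assumes the package has already been installed (for example with `pip install paper-qa`, assuming the package is published under the repository name) and that the command is meant to be run from inside the folder of PDFs. The helper names `build_ask_command` and `ask_from_folder` are hypothetical.

```python
import shutil
import subprocess
from typing import Optional


def build_ask_command(question: str) -> list[str]:
    """Return the argument list for `pqa ask <question>`."""
    return ["pqa", "ask", question]


def ask_from_folder(question: str, folder: str) -> Optional[str]:
    """Run `pqa ask` with `folder` as the working directory.

    Returns the command's standard output, or None when the `pqa`
    executable is not on PATH (for example, the package is not installed).
    """
    if shutil.which("pqa") is None:
        return None
    result = subprocess.run(
        build_ask_command(question),
        cwd=folder,
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

Passing the question as a single list element avoids shell quoting problems when the question contains apostrophes or other special characters.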

The tool supports PDFs, plain text files, Microsoft Office documents, HTML, and source code files. It can maintain an index of a local document collection and reuse it across sessions without reprocessing everything each time. External vector databases can be plugged in for larger collections.
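Reusing an index across sessions generally means remembering which files have already been processed and skipping them when they have not changed. The sketch below shows one common way to do that, with a JSON manifest keyed by file name and content hash. It is an assumption about the general technique, not a description of how PaperQA2 stores its index, and the function names and manifest layout are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of the file contents, used to detect changes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def update_index(doc_dir: Path, index_path: Path) -> list[str]:
    """Refresh the manifest and return the names of files that were (re)processed.

    Files whose hash matches the stored entry are skipped. Files that were
    deleted from doc_dir are dropped from the manifest.
    """
    index = json.loads(index_path.read_text()) if index_path.exists() else {}
    processed = []
    current = {}
    for path in sorted(doc_dir.iterdir()):
        if not path.is_file():
            continue
        digest = file_digest(path)
        entry = index.get(path.name)
        if entry is None or entry["sha256"] != digest:
            # Placeholder for the expensive work (parsing, chunking, embedding).
            entry = {"sha256": digest, "n_bytes": path.stat().st_size}
            processed.append(path.name)
        current[path.name] = entry
    index_path.write_text(json.dumps(current, indent=2))
    return processed
```

Keep the manifest file outside the document folder, otherwise it would itself be picked up as a document on the next run.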

PaperQA2 is developed by a research organization called FutureHouse, is open source under the Apache 2.0 license, and has an accompanying academic paper describing its architecture and benchmark results. This summary covers only part of the README, which is longer than the portion described here.

Where it fits