langextract

Python ★ 37k updated 1mo ago

A Python library for extracting structured information from unstructured text using LLMs with precise source grounding and interactive visualization.

Python library that uses LLMs to extract structured data from unstructured text documents, with source grounding and interactive visualization.

Python · Google Gemini API · OpenAI API · Ollama · setup: moderate · complexity 2/5

LangExtract is a Python library from Google that uses large language models (LLMs) to pull structured information out of unstructured text documents. "Structured information" means organized, categorized data, such as a table of named entities, a list of medications with dosages, or characters and their relationships, drawn from a free-form document like a clinical note, a legal contract, or a literary text. This bridges the gap between the unstructured world (documents written in natural language) and the structured world (databases, spreadsheets, analytics pipelines) that applications need.

The library works by letting you describe your extraction task using plain English instructions and a few hand-crafted examples that show the model what you want. You provide a text document, your prompt description, and your examples, and LangExtract sends everything to an LLM and returns the extracted entities as structured Python objects. A key feature is precise source grounding: every extracted item is mapped back to its exact character position in the original text. This lets the library generate an interactive HTML visualization where you can see each extraction highlighted in context — making it easy to verify that the model stayed faithful to the source rather than hallucinating.
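To make the idea of source grounding concrete, here is a minimal standard-library sketch. It is not LangExtract's implementation or API: the `GroundedExtraction` class, the `ground` function, and the sample note are hypothetical names invented for illustration, and the matching is plain exact-substring search. It assumes a model has returned (class, text) pairs and shows how each pair can be mapped to a character span, with anything that never appears in the source flagged as ungrounded.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GroundedExtraction:
    """Hypothetical record: one extracted item plus its source span."""
    extraction_class: str
    extraction_text: str
    start: Optional[int]  # offset of the first character, or None if not found
    end: Optional[int]    # offset one past the last character, or None

    @property
    def grounded(self) -> bool:
        return self.start is not None


def ground(source: str, items: list[tuple[str, str]]) -> list[GroundedExtraction]:
    """Map (class, text) pairs to character spans in `source`.

    Searches left to right so repeated mentions resolve to successive
    occurrences; text that never appears verbatim stays ungrounded.
    """
    results = []
    cursor = 0
    for extraction_class, text in items:
        pos = source.find(text, cursor)
        if pos == -1:
            pos = source.find(text)  # fall back to searching from the start
        if pos == -1:
            results.append(GroundedExtraction(extraction_class, text, None, None))
            continue
        end = pos + len(text)
        results.append(GroundedExtraction(extraction_class, text, pos, end))
        cursor = end
    return results


note = "Patient was given 250 mg amoxicillin, then 250 mg ibuprofen."
model_output = [
    ("dosage", "250 mg"),
    ("medication", "amoxicillin"),
    ("dosage", "250 mg"),
    ("medication", "ibuprofen"),
    ("medication", "aspirin"),  # not in the note: imitates a hallucination
]
grounded = ground(note, model_output)

for item in grounded:
    span = f"[{item.start}:{item.end}]" if item.grounded else "UNGROUNDED"
    print(f"{item.extraction_class:<11} {item.extraction_text:<12} {span}")
```

Because every grounded item carries `start` and `end` offsets, a viewer can highlight `note[start:end]` in place, and ungrounded items stand out as candidates for review. A real system would likely need fuzzier alignment than exact substring search, since models often paraphrase or normalize the text they return.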

For long documents, the library handles chunking the text into manageable pieces, processing chunks in parallel, and running multiple passes to improve recall (the proportion of relevant items actually found). It supports Google Gemini models by default — including Gemini Flash and Gemini Pro — but also works with OpenAI models and local open-source models running via Ollama.
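The chunk, parallelize, and multi-pass pattern described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's code: `chunk_text`, `fake_model`, and `extract_long` are hypothetical names, and `fake_model` is a regex stand-in for an LLM call that deliberately skips every other match, depending on the pass, to imitate imperfect recall.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def chunk_text(text: str, max_chars: int) -> list[tuple[int, str]]:
    """Split `text` into (offset, chunk) pairs of at most `max_chars`
    characters, breaking after a space where possible."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            split = text.rfind(" ", start, end)
            if split > start:
                end = split + 1  # keep the space with the left-hand chunk
        chunks.append((start, text[start:end]))
        start = end
    return chunks


def fake_model(chunk: str, pass_index: int) -> list[tuple[int, int]]:
    """Hypothetical stand-in for an LLM call: returns chunk-local spans of
    dosage-like text, but only every other match on a given pass."""
    matches = [m.span() for m in re.finditer(r"\d+ mg", chunk)]
    return [span for i, span in enumerate(matches) if i % 2 == pass_index % 2]


def extract_long(text: str, max_chars: int = 45, passes: int = 2,
                 workers: int = 4) -> list[tuple[int, int]]:
    """Run `passes` extraction passes over all chunks in parallel and
    merge the results into de-duplicated document-level spans."""
    chunks = chunk_text(text, max_chars)
    chunk_texts = [chunk for _, chunk in chunks]
    found: set[tuple[int, int]] = set()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for pass_index in range(passes):
            per_chunk = pool.map(fake_model, chunk_texts,
                                 [pass_index] * len(chunks))
            for (offset, _), spans in zip(chunks, per_chunk):
                for start, end in spans:
                    # Shift chunk-local offsets back into document coordinates.
                    found.add((offset + start, offset + end))
    return sorted(found)


text = ("Take 10 mg in the morning and 20 mg at noon. "
        "Then 30 mg at dusk and 40 mg at night.")
one_pass = extract_long(text, passes=1)
two_pass = extract_long(text, passes=2)
print([text[s:e] for s, e in one_pass])
print([text[s:e] for s, e in two_pass])
```

In this toy setup a single pass recovers only half of the dosage mentions, and merging a second pass recovers the rest, which is the recall benefit the paragraph refers to. Two caveats apply to the sketch: splitting on whitespace can still cut a multi-word item across a chunk boundary, and a real LLM misses items unpredictably rather than in the fixed pattern used here.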

You would use LangExtract when you need to extract entities from clinical notes, radiology reports, contracts, or any domain-specific text without fine-tuning a model. The library is written in Python, primarily targets the Gemini API, and is distributed as a pip package.

Where it fits

Extract medications and dosages from clinical notes and automatically populate a database.
Pull key terms, obligations, and parties from legal contracts for compliance review.
Identify characters, relationships, and plot events from literary texts for analysis.
Extract structured data from domain-specific documents without training a custom model.
