datahub
The Context Platform for your Data and AI Stack
An open-source data catalog platform that connects to 80+ data tools to give companies a searchable map of all their data: where it lives, who owns it, where it came from, and how it flows between systems.
DataHub is an open-source platform for keeping track of all the data assets a company uses. It was originally built at LinkedIn to handle its internal data at large scale, and it has since been adopted by thousands of organizations. The core problem it solves is that modern companies store and process data across many different tools: data warehouses, databases, business intelligence dashboards, machine learning systems, and data pipelines. When data is scattered across dozens of systems, it becomes difficult to know what exists, where it came from, and who is responsible for it.
DataHub acts as a central catalog for all of that. It connects to your existing tools through a collection of more than 80 connectors, pulling in information about tables, columns, dashboards, pipelines, and other data objects. Once that information is collected, it builds a searchable index so that people can find the data they need, and it maps out lineage, meaning which datasets came from which other datasets and which transformations happened along the way. This helps teams understand the impact of a change before making it, and helps trace quality problems back to their source.
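To make the lineage idea concrete, here is a minimal sketch in plain Python. It does not use DataHub's API or data model: the dataset names, the `DOWNSTREAM` edge map, and the helpers `impact_of` and `sources_of` are all hypothetical, chosen only to illustrate how a lineage graph can answer "what breaks if I change this?" and "where did this come from?".

```python
from collections import deque

# Hypothetical lineage edges (not DataHub's data model):
# each dataset maps to the assets that are derived directly from it.
DOWNSTREAM = {
    "raw.orders": ["staging.orders"],
    "raw.customers": ["staging.customers"],
    "staging.orders": ["marts.revenue"],
    "staging.customers": ["marts.revenue"],
    "marts.revenue": ["dashboard.weekly_revenue"],
}


def invert(edges):
    """Flip parent -> children edges into child -> parents edges."""
    upstream = {}
    for parent, children in edges.items():
        for child in children:
            upstream.setdefault(child, []).append(parent)
    return upstream


def reachable(start, edges):
    """Breadth-first walk returning every node reachable from start."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def impact_of(dataset):
    """Impact analysis: everything downstream of a dataset."""
    return reachable(dataset, DOWNSTREAM)


def sources_of(asset):
    """Root-cause tracing: everything upstream of an asset."""
    return reachable(asset, invert(DOWNSTREAM))
```

Under these assumptions, calling `impact_of("raw.orders")` lists every asset that a change to that table could affect, and `sources_of("dashboard.weekly_revenue")` lists every table that feeds the dashboard. A real catalog stores these edges as metadata collected from the connected tools rather than in a hand-written dictionary.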
The platform also handles governance tasks: tracking ownership, applying tags and categories, managing data access policies, and recording an audit trail of how data has been used. These features are aimed at helping organizations comply with regulations and maintain quality standards across their data.
A recent addition is an open-source Analytics Agent that lets users ask questions about their data in plain English. The agent uses the DataHub catalog as context, generates SQL queries, runs them, and returns results along with charts. It also supports connecting to AI assistants such as Claude Desktop or Cursor via the Model Context Protocol.
DataHub can be self-hosted or used as a managed cloud service. It is licensed under Apache 2.0.
Where it fits
- Build a searchable catalog of all your company's databases and dashboards so engineers can find the right dataset without asking on Slack.
- Trace data lineage to see exactly which upstream tables feed a broken dashboard so you can fix the root cause quickly.
- Ask questions about your data in plain English using the Analytics Agent, which generates and runs SQL on your behalf.
- Enforce data ownership and access policies across your entire data stack to help with GDPR or HIPAA compliance.