awesome-database-learning
A list of learning materials for understanding database internals
A curated reading list of research papers, university courses, and blog posts covering database internals, from query optimizers and storage engines to distributed consensus and transaction handling.
This repository is a reading list for people who want to understand how databases work on the inside. It was put together by PingCAP, the company behind the TiDB database, and is aimed at engineers who want to move beyond using databases to understanding how they are built.
The list is organized by topic rather than by difficulty. Each section covers a specific component of a database system, such as query optimization, transaction handling, storage engines, data replication, or consensus algorithms. Within each topic, you will find links to research papers, university course materials, blog posts, and recorded talks. The papers include many classic publications from database conferences going back to the 1970s, alongside more recent work on distributed systems.
Some sections are highly technical. Query optimization, for example, covers papers on optimizer frameworks like Volcano and Cascades, architectures that many production databases use to turn a SQL query into an efficient execution plan. The storage section covers data structures like B-trees and log-structured merge trees, which determine how data is physically laid out and written to disk. There are also sections on concurrency control, network protocols, benchmarking, and formal verification using a specification language called TLA+.
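To give a feel for the kind of idea the storage papers cover, here is a toy Python sketch of a log-structured merge tree. It is not taken from the repository or from any paper on the list; the class name ToyLSM and the memtable_limit parameter are hypothetical, and the sketch keeps its "on-disk" runs in memory and leaves out compaction, deletes, and durability. It only shows the core write path: writes land in a small in-memory buffer, which is flushed as a sorted, immutable run once it fills up.

```python
from bisect import bisect_left


class ToyLSM:
    """Toy LSM tree: a mutable in-memory buffer plus sorted, immutable runs."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit  # hypothetical flush threshold
        self.memtable = {}  # most recent writes
        self.runs = []      # flushed runs, newest last; each is a sorted list of (key, value)

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # A real engine would write this run sequentially to a file on disk.
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Check the newest run first so later writes shadow earlier ones.
        for run in reversed(self.runs):
            i = bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None


db = ToyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)  # buffer is full, so it is flushed as one sorted run
db.put("a", 3)  # the newer value stays in the buffer and shadows the flushed one
```

After these calls, db.get("a") finds the newer value in the buffer, while db.get("b") falls through to the flushed run. Because old runs are never modified in place, writes stay sequential; the trade-off is that reads may have to check several runs, which is the problem compaction addresses in real engines.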
There is no code to run in this repository. It is purely a reference collection. A number of links are in Chinese, reflecting the original audience, but many of the papers and course materials are in English. For someone who wants to go from knowing how to write SQL queries to understanding what happens underneath when those queries run, this list provides a structured path through the academic and engineering literature on the subject.
Where it fits
- Follow a structured path from knowing SQL to understanding what the database actually does when your query runs.
- Find the original papers on B-trees, LSM trees, or log-structured storage before implementing your own storage engine.
- Study Volcano and Cascades optimizer frameworks to understand how production databases turn SQL into execution plans.
- Research distributed consensus algorithms like Raft or Paxos from primary academic sources before building a distributed system.