awesome-sre

★ 13k updated 9mo ago

A curated list of Site Reliability and Production Engineering resources.

A hand-curated collection of links to articles, talks, books, podcasts, and tools covering Site Reliability Engineering, organized by topic for beginners and experienced practitioners alike.

setup: easy, complexity: 1/5

Awesome SRE is a curated list of resources about Site Reliability Engineering and Production Engineering. It contains no software of its own; it is a long, organized collection of links to articles, talks, books, podcasts, and tools gathered from across the web. Lists like this are common on GitHub and usually carry the word "awesome" in their name to signal that they are hand-picked rather than automatically generated.

The README opens by answering the question of what Site Reliability Engineering is, using a quote from Ben Treynor Sloss of Google, who founded the discipline. He describes it as what happens when you ask a software engineer to design an operations function. In plain terms, it is the practice of keeping large online services running reliably by writing software and following engineering methods, rather than handling outages by hand.

The bulk of the page is a table of contents that splits the links into many themed sections. These include Culture, Education, and Books for newcomers learning the field; Reliability, Monitoring and Alerting, On-Call, and Post-Mortem for the day-to-day work of keeping systems healthy and reviewing failures; and Capacity Planning, Service Level Agreements, and Performance for planning and measuring how a service behaves. Further sections gather blogs, newsletters, conferences, Twitter accounts, podcasts, and a set of SRE tools.

Each section is a list of outbound links, many of them to conference talks, company engineering blogs, and well-known industry presentations from organizations such as Google, Facebook, Netflix, Uber, and Dropbox. The repository therefore works as a reading and viewing guide: a starting map for someone who wants to learn how reliability is handled at scale, and a source of deeper material for experienced engineers.

The page notes that contributions are welcome and points to a separate contribution guide for anyone who wants to add a resource. The full README is longer than the portion summarized here.

Where it fits

Find curated conference talks and engineering blog posts to learn what Site Reliability Engineering means in day-to-day practice.
Discover monitoring, alerting, and on-call tools used by companies like Google, Netflix, and Uber.
Build a structured reading plan for a new SRE hire using the organized topic sections on reliability and post-mortems.
