This is my list of resources for folks who want to learn more about site reliability engineering, embracing risk, and building observable systems.

Last updated 2020-08-27.

  • Site Reliability Engineering: Embracing Risk: The Site Reliability Engineering book outlines the Google Way for keeping the lights on. It's an excellent book and you should read all of it if you're interested in building reliable software. This chapter is a great introduction to the idea of "one to zero": understanding that components of a system are inherently unreliable, rather than assuming the building blocks you put together will always work.
  • Friday Deploy Freezes Are Exactly Like Murdering Puppies: Charity Majors is the CTO of Honeycomb, where they work hard to make your systems observable so you can make them reliable. (Follow her on Twitter.) This blog post discusses why deployments are the heartbeat of your company, and why good judgment matters more than prescriptive policies.
  • Deploys: It’s Not Actually About Fridays: Charity writes about how system reliability and rapid, frequent deployments are tightly linked, and how a culture of observability-driven development is key to the success of your systems and your organization.
  • Shipping Software Should Not Be Scary: Charity discusses ownership and agency surrounding code and systems, why your senior engineers must be able to deploy and debug their own code, and why nothing is production except production.
  • The Phoenix Project: In this novel, an IT manager is promoted to VP of IT Operations just in time to take the flak for a big-bang system rewrite and catastrophic production deploy. This book is required reading for anyone looking to manage an engineering organization. It discusses the ways in which your project flow can be seen as an assembly line in a factory, and how disruptions to your flow make it difficult to see the true causes behind your group's failure to deliver.