Principles underlying SRE team work – the patterns, behaviors, and areas of concern that affect the general field of SRE operations.
The first chapter of this section, and the most important section to read if you want to get the broadest picture of what exactly SRE is doing, and how we explain it, is a risk hug. It looks at SRE through the risk lens – its assessment, management, and use of error budgets to provide useful neutral approaches to service management.
Service level goals are another basic conceptual unit for SRE. The industry usually has different chunks in terms of the general title of service level agreements, a tendency that makes it difficult to think about these concepts clearly. Service-level goals try to break away from target metrics from the agreements, examine how SRE uses each of these conditions, and provide some recommendations on how to find useful metrics for your own applications.
A work permit is one of the most important tasks of SRE, and carries the burden of work. We define labor as a value-added, nationwide and repetitive work that compares growth in scope and service.
Whether in Google or elsewhere, monitoring is a vital factor in doing the right thing in production. If you cannot monitor a service, you do not know what is going on, and if you are blind to what is happening, you cannot be reliable. Read distributed monitoring systems for follow-up recommendations and some agnostic implementation best practices.
In Google’s “Automation Evolution,” we look at SRE’s approach to automation, and we will go over some case studies on how SRE implemented automation, successfully and without success.
Most companies regard the release of engineering as a consecutive thought. However, as you will learn in “release engineering”, release engineering is not only critical to the stability of the overall system – since most outages are due to pushing some change. This is also the best way to ensure the editions are consistent.
A key principle in any effective software engineering, and not just reliability-driven engineering, simplicity is a quality that can be especially difficult after being lost. Nevertheless, as the old adage article, a complex system that works necessarily evolved from a simple system that works. Simplicity, goes into this topic in detail.
Further reading from Google SRE
Increasing product speed safely is a fundamental principle for any organization. In the film “Making Push On Green a Reality” [Kle14], released in October 2014, we show that removing humans from the release process can paradoxically reduce the burden of SREs while increasing system reliability.