24 Sep 2019

SRE Risks and Error Budgets

How the SRE discipline reduces the tension between speed and stability, and between product teams and system operators, by quantifying risk and applying error budgets. Striving for 100% service availability is not just impossible, it is unnecessary. Maximum stability limits the speed at which new features can be delivered to users. Extreme availability yields diminishing returns, because the user experience is dominated by less reliable components such as cellular networks or WiFi. We want to reduce the risk of system failure, but we must accept the risk that comes with shipping new products and features.



In the SRE discipline, error budgets are the quantitative measure of how much risk a service is prepared to tolerate. Error budgets are a by-product of the SLOs (service-level objectives) agreed between product owners and systems engineers. Risk and error budgets relate directly to several DevOps principles. Error budgets make the idea that accidents are normal concrete by quantifying failures and risk. Error budgets also enforce the principle that "change needs to be gradual," because non-gradual changes can quickly break the SLO and block further development for the rest of the quarter. This is why we say that class SRE implements DevOps.
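As a rough illustration of how an error budget falls out of an SLO, the sketch below assumes request-based accounting: the budget is the share of requests in a window that may fail without breaking the SLO. The class and method names (`ErrorBudget`, `allowed_failures`, `can_release`) are hypothetical and are not taken from any Google tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical request-based error budget for one SLO window."""

    slo: float           # agreed success ratio, e.g. 0.999 for 99.9%
    total_requests: int  # requests served in the window (e.g. one quarter)

    @property
    def allowed_failures(self) -> int:
        # The budget is whatever the SLO leaves over: (1 - SLO) of all requests.
        return round((1.0 - self.slo) * self.total_requests)

    def remaining(self, failed_requests: int) -> int:
        # Negative values mean the SLO has already been broken.
        return self.allowed_failures - failed_requests

    def can_release(self, failed_requests: int) -> bool:
        # Assumed policy: risky changes are allowed only while budget remains.
        return self.remaining(failed_requests) > 0


budget = ErrorBudget(slo=0.999, total_requests=1_000_000)
print(budget.allowed_failures)    # failures the SLO tolerates in this window
print(budget.remaining(400))      # budget left after 400 failed requests
print(budget.can_release(1_000))  # False once the budget is fully spent
```

Under this assumed policy, a single large change that burns the whole budget halts further releases until the window resets, which is the pressure toward gradual change described above.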

Embracing Risk

You might expect Google to try to build 100% reliable services, ones that never fail. It turns out, however, that past a certain point increasing reliability is worse for a service (and its users), not better. Extreme reliability comes at a cost: maximizing stability limits how fast new features can be developed and how quickly products can be delivered to users, and it dramatically increases their cost, which in turn reduces the number of features a team can afford to offer. Furthermore, users generally do not notice the difference between high reliability and extreme reliability in a service, because the user experience is dominated by less reliable components such as the cellular network or the device they are using. Put simply, a user on a 99% reliable smartphone cannot tell the difference between 99.99% and 99.999% service reliability. With this in mind, rather than simply maximizing uptime, site reliability engineering seeks to balance the risk of unavailability against the goals of rapid innovation and efficient service operations, so that users' overall happiness (with features, service, and performance) is optimized.
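The smartphone example can be checked with a little arithmetic. The sketch below assumes the device and the service fail independently and that the user needs both to work, so the availability the user sees is the product of the two. That series model and the function name `perceived_availability` are assumptions made for illustration.

```python
import math


def perceived_availability(*components: float) -> float:
    """Availability seen by a user who needs every component to work.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return math.prod(components)


phone = 0.99  # the 99% reliable smartphone from the text
four_nines = perceived_availability(phone, 0.9999)
five_nines = perceived_availability(phone, 0.99999)

print(f"service at 99.99%:  user sees {four_nines:.5%}")
print(f"service at 99.999%: user sees {five_nines:.5%}")
print(f"gain from the extra nine: {five_nines - four_nines:.5%}")
```

Under these assumptions the extra nine changes what the user sees by less than one hundredth of a percentage point, while the phone alone accounts for a full percentage point of unavailability.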

Risk Management
Unreliable systems can quickly erode user confidence, so we want to reduce the chance of system failure. However, experience shows that as we build systems, cost does not increase linearly with reliability: an incremental improvement in reliability may cost 100 times more than the previous increment. The cost has two dimensions:

The cost of redundant machine/compute resources
The cost of redundant equipment that, for example, allows us to take systems offline for routine or unforeseen maintenance, or gives us space to store parity code blocks that provide a minimum data durability guarantee.

The opportunity cost
The cost that an organization incurs when it allocates engineering resources to build systems or features that reduce risk rather than features directly visible to or usable by end users. These engineers are no longer working on new features and products for end users.

In SRE, we manage service reliability largely by managing risk. We conceptualize risk as a continuum. We give equal importance to figuring out how to engineer greater reliability into Google systems and to identifying the appropriate level of tolerance for the services we run. Doing so allows us to perform a cost/benefit analysis to determine, for example, where on the (non-linear) risk continuum we should place Search, Ads, Gmail, or Photos. Our goal is to explicitly align the risk taken by a given service with the risk the business is willing to bear. We strive to make a service reliable enough, but no more reliable than it needs to be. That is, when we set a 99.99% availability target, we want to exceed it, but not by much: doing so would waste opportunities to add features to the system, clean up technical debt, or reduce operating costs. In a sense, we view the availability target as both a minimum and a maximum. The main advantage of this framing is that it opens the way for explicit, thoughtful risk-taking.
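To make a 99.99% target more tangible, it can be converted into the downtime it permits. The sketch below assumes time-based availability measured over a 90-day quarter. Both the window length and the function name `allowed_downtime` are illustrative assumptions.

```python
from datetime import timedelta


def allowed_downtime(target: float, window: timedelta) -> timedelta:
    """Downtime a time-based availability target permits within a window."""
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be a ratio between 0 and 1")
    # Whatever the target does not require to be up may be spent as downtime.
    return window * (1.0 - target)


quarter = timedelta(days=90)  # assumed length of one quarter
for target in (0.999, 0.9999, 0.99999):
    minutes = allowed_downtime(target, quarter).total_seconds() / 60
    print(f"{target:.3%} over 90 days allows about {minutes:.2f} minutes of downtime")
```

Seen this way, treating the target as a maximum as well as a minimum means the permitted minutes are there to be spent on launches, experiments, and maintenance rather than left unused.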

Service risk measurement
As a standard practice at Google, we are usually best served by identifying an objective metric that represents the property of a system we want to optimize. By setting a target, we can assess our current performance and track improvements or degradations over time. For service risk, it is not immediately clear how to reduce all of the potential factors to a single metric. Service failures can have many potential effects, including user dissatisfaction, harm, or loss of trust; direct or indirect loss of revenue; damage to brand or reputation; and unwanted press coverage.

24 Sep 2019

SRE Principles

The principles underlying how SRE teams work: the patterns, behaviors, and areas of concern that influence the general domain of SRE operations.

The first chapter of this section, and the most important one to read if you want the broadest picture of what exactly SRE does and how we reason about it, is Embracing Risk. It looks at SRE through the lens of risk: its assessment, its management, and the use of error budgets to provide usefully neutral approaches to service management.

Service level objectives are another foundational conceptual unit for SRE. The industry commonly lumps disparate concepts under the general heading of service level agreements, a tendency that makes it harder to think about these concepts clearly. Service Level Objectives attempts to disentangle indicators from objectives from agreements, examines how SRE uses each of these terms, and provides some recommendations on how to find useful metrics for your own applications.

Eliminating toil is one of SRE's most important tasks. We define toil as mundane, repetitive operational work that provides no enduring value and that scales linearly with service growth.

Whether at Google or elsewhere, monitoring is an essential part of doing the right thing in production. If you cannot monitor a service, you do not know what is happening, and if you are blind to what is happening, you cannot be reliable. Read Monitoring Distributed Systems for recommendations on what and how to monitor, along with some implementation-agnostic best practices.

In The Evolution of Automation at Google, we examine SRE's approach to automation and walk through some case studies of how SRE has implemented automation, both successfully and unsuccessfully.

Most companies treat release engineering as an afterthought. However, as you will learn in Release Engineering, release engineering is not only critical to overall system stability, since most outages result from pushing a change of some kind. It is also the best way to ensure that releases are consistent.

Simplicity is a key principle of any effective software engineering, not just reliability-oriented engineering, and it is a quality that can be extraordinarily difficult to recapture once lost. Nevertheless, as the old adage goes, a complex system that works necessarily evolved from a simple system that works. Simplicity goes into this topic in detail.

Further reading from Google SRE
Increasing product velocity safely is a core principle for any organization. In "Making Push On Green a Reality" [Kle14], published in October 2014, we show that taking humans out of the release process can paradoxically reduce SREs' toil while increasing system reliability.

 
