24 Sep 2019

Hope is not a strategy: you need competent SRE

 

Hope is not a strategy.

- Traditional SRE saying

It is a truth universally recognized that systems do not run themselves. How, then, should a system be run, especially a complex, large-scale computing system?

The Sysadmin Approach to Service Management
Historically, companies have employed system administrators to operate complex computing systems.

This systems administrator, or sysadmin, approach involves assembling existing software components and deploying them to work together to produce a service. Sysadmins are then tasked with running the service and responding to events and updates as they occur. As the system grows in complexity and traffic volume, generating a corresponding increase in events and updates, the sysadmin team grows to absorb the extra work. Because the sysadmin role requires a markedly different skill set than that required of product developers, developers and sysadmins are divided into separate teams: “development” and “operations” or “ops”.

The sysadmin model of service management has several advantages. For companies deciding how to run and staff a service, this approach is relatively easy to implement: as a familiar industry paradigm, there are many examples to learn from and emulate. A relevant talent pool is already widely available. An array of existing tools, software components (off the shelf or otherwise), and integration companies are available to help run those assembled systems, so a beginning sysadmin team does not have to reinvent the wheel and design a system from scratch.

The sysadmin approach and the accompanying development/ops split has a number of disadvantages and pitfalls. These fall broadly into two categories: direct costs and indirect costs.

Direct costs are neither subtle nor ambiguous. Running a service with a team that relies on manual intervention for both change management and event management becomes expensive as the service and/or traffic to the service grows, because the size of the team necessarily scales with the load generated by the system.

The indirect costs of the development/ops split can be subtle, but are often more expensive to the organization than the direct costs. These costs arise from the fact that the two teams are quite different in background, skill set, and incentives. They use different vocabulary to describe situations; they carry different assumptions about both risk and the possibilities for technical solutions; they have different assumptions about the target level of product stability. The split between the groups can easily become one of not just incentives, but also communication, goals, and eventually, trust and respect. This outcome is a pathology.

Traditional operations teams and their counterparts in product development thus often end up in conflict, most visibly over how quickly software can be released to production. At their core, the development teams want to launch new features and see them adopted by users. At their core, the ops teams want to make sure the service doesn’t break while they are holding the pager. Because most outages are caused by some kind of change (a new configuration, a new feature launch, or a new type of user traffic), the two teams’ goals are fundamentally in tension.

Both groups understand that it is unacceptable to state their interests in the baldest possible terms (“We want to launch anything, any time, without hindrance” versus “We don’t want to change anything in the system once it works”). And because their vocabulary and risk assumptions differ, both groups often resort to a familiar form of trench warfare to advance their interests. The ops team attempts to safeguard the running system against the risk of change by introducing launch and change gates. For example, launch reviews may contain an explicit check for every problem that has ever caused an outage in the past, which can become an arbitrarily long list, with not all elements providing equal value. The dev team quickly learns how to respond: fewer “launches” and more “flag flips,” “incremental updates,” or “cherrypicks.” They adopt tactics such as sharding the product so that fewer features are subject to the launch review.

Google’s Approach to Service Management: Site Reliability Engineering
Conflict isn’t an inevitable part of offering a software service. Google has chosen to run our systems with a different approach: our Site Reliability Engineering teams focus on hiring software engineers to run our products and on creating systems to accomplish the work that would otherwise be performed, often manually, by sysadmins.

What exactly is Site Reliability Engineering, as it has come to be defined at Google? My explanation is simple: SRE is what happens when you ask a software engineer to design an operations team. When I joined Google in 2003 and was tasked with running a “Production Team” of seven engineers, my entire life up to that point had been software engineering. So I designed and managed the group the way I would want it to work if I worked as an SRE myself. That group has since matured to become Google’s present-day SRE team, which remains true to its origins as envisioned by a lifelong software engineer.


A key building block of Google’s approach to service management is the composition of each SRE team. Overall, SREs can be broken down into two main categories.

50-60% are Google software engineers, or more precisely, people who have been hired via the standard procedure for Google software engineers. The other 40-50% are candidates who were very close to the Google software engineering qualifications (i.e., 85-99% of the required skill set), and who in addition have a set of technical skills that is useful to SRE but is rare for most software engineers. By far, expertise in UNIX system internals and in networking (Layer 1 to Layer 3) are the two most common types of alternative technical skills we seek.

Common to all SREs is the belief in, and aptitude for, developing software systems to solve complex problems. Within SRE, we track the career progress of both groups closely, and have to date found no practical difference in performance between engineers from the two tracks. In fact, the somewhat diverse background of the SRE team frequently results in clever, high-quality systems that are clearly the product of the synthesis of several skill sets.

The result of our approach to hiring for SRE is that we end up with a team of people who (a) will quickly become bored by performing tasks by hand, and (b) have the skill set necessary to write software to replace their previously manual work.

by Benjamin Treynor Sloss
Edited by Betsy Beyer


SRE Principles

Principles underlying how SRE teams work: the patterns, behaviors, and areas of concern that influence the general domain of SRE operations.

The first chapter of this section, and the most important one to read if you want the broadest picture of what exactly SRE does and how we reason about it, is Embracing Risk. It examines SRE through the lens of risk: its assessment, its management, and the use of error budgets to provide usefully neutral approaches to service management.

Service level objectives are another foundational conceptual unit for SRE. The industry commonly lumps disparate concepts together under the banner of service level agreements, a tendency that makes it harder to think about these concepts clearly. Service Level Objectives attempts to disentangle indicators from objectives from agreements, examines how SRE uses each of these terms, and provides some recommendations on how to find useful metrics for your own applications.

Eliminating toil is one of SRE’s most important tasks, and is the subject of Eliminating Toil. We define toil as mundane, repetitive operational work that provides no enduring value and scales linearly with service growth.

Whether at Google or elsewhere, monitoring is an absolutely essential component of doing the right thing in production. If you can’t monitor a service, you don’t know what’s happening, and if you’re blind to what’s happening, you can’t be reliable. Read Monitoring Distributed Systems for recommendations and some implementation-agnostic best practices.

In The Evolution of Automation at Google, we examine SRE’s approach to automation and walk through some case studies of how SRE has implemented automation, both successfully and unsuccessfully.

Most companies treat release engineering as an afterthought. However, as you’ll learn in Release Engineering, release engineering is not only critical to overall system stability, since most outages result from pushing some kind of change; it is also the best way to ensure that releases are consistent.

Simplicity is a core principle of any effective software engineering, not only reliability-oriented engineering, and it is a quality that can be especially hard to recapture once it is lost. Nevertheless, as the old adage goes, a complex system that works necessarily evolved from a simple system that works. Simplicity covers this topic in detail.

Further reading from Google SRE
Increasing product velocity safely is a core principle for any organization. In “Making Push On Green a Reality” [Kle14], published in October 2014, we show that taking humans out of the release process can paradoxically reduce SREs’ toil while increasing system reliability.

 


SRE Risks and Error Budgets

How the SRE discipline eases the speed/stability tension between product teams and system operators by quantifying risk and enforcing error budgets. Striving for 100% service availability is not just impossible, it is unnecessary. Maximum stability limits how quickly new features can be delivered to users, and extreme availability yields diminishing returns because the user experience is dominated by less reliable components such as cellular networks or WiFi. We want to reduce the risk of system failure, but we must accept risk in order to deliver new products and features.



In the SRE discipline, the error budget is the quantitative measurement of how much risk a service is prepared to tolerate. Error budgets are a by-product of the SLOs (service level objectives) agreed between product owners and systems engineers. Risk and error budgets relate directly to many DevOps principles. Error budgets put a clear number on “accidents” by quantifying failure and risk. Error budgets also enforce the principle that “change should be gradual,” because a non-gradual change can quickly blow through the SLO and block further development for the quarter. This is why we say that class SRE implements DevOps.
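
As a rough illustration of how an error budget follows from an SLO, here is a minimal sketch; the function names, numbers, and the launch-freeze policy are illustrative assumptions, not Google tooling:

    # Hypothetical sketch: deriving and tracking an error budget from an SLO.
    # All numbers, names, and the launch policy below are illustrative assumptions.

    def error_budget_fraction(slo: float) -> float:
        """Fraction of requests allowed to fail, e.g. 0.001 for a 99.9% SLO."""
        return 1.0 - slo

    def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
        """Fraction of the error budget still unspent (can go negative)."""
        allowed_failures = error_budget_fraction(slo) * total_requests
        return (1.0 - failed_requests / allowed_failures) if allowed_failures else 0.0

    # Example: a 99.9% SLO over 1 billion requests allows ~1 million failed requests.
    remaining = budget_remaining(0.999, 1_000_000_000, 250_000)
    print(f"Error budget remaining: {remaining:.0%}")  # 75%

    # A simple, illustrative policy: risky launches pause once the budget is spent.
    if remaining <= 0:
        print("Error budget exhausted: freeze feature launches, focus on reliability.")
    else:
        print("Budget available: feature launches may proceed.")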

Embracing Risk

You might expect Google to try to build 100% reliable services, that is, services that never fail. It turns out that past a certain point, however, increasing reliability is worse for a service (and its users), not better! Extreme reliability comes at a cost: maximizing stability limits how quickly new features can be developed and how quickly products can be delivered to users, and it dramatically increases their cost, which in turn reduces the number of features a team can afford to offer. Furthermore, users typically don’t notice the difference between high reliability and extreme reliability in a service, because the user experience is dominated by less reliable components such as the cellular network or the device they are working with. Put simply, a user on a 99% reliable smartphone cannot tell the difference between 99.99% and 99.999% service reliability! With this in mind, rather than simply maximizing uptime, Site Reliability Engineering seeks to balance the risk of unavailability against the goals of rapid innovation and efficient service operations, so that users’ overall happiness, with features, service, and performance, is optimized.
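
To make the diminishing returns concrete, the following small calculation (assuming a 30-day month; the figures are approximate and purely illustrative) shows how little downtime each additional “nine” actually permits:

    # Illustrative only: allowed downtime per 30-day month for common availability targets.
    MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

    for target in (0.99, 0.999, 0.9999, 0.99999):
        allowed_minutes = (1 - target) * MINUTES_PER_MONTH
        print(f"{target:.3%} availability -> {allowed_minutes:7.1f} minutes of downtime per month")

    # 99% -> ~432 min, 99.9% -> ~43 min, 99.99% -> ~4.3 min, 99.999% -> ~26 seconds.
    # Each extra nine buys a tenfold reduction that a user behind a ~99% reliable
    # phone or network will rarely, if ever, perceive.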

Managing Risk
Unreliable systems can quickly erode users’ confidence, so we want to reduce the chance of system failure. However, experience shows that as we build systems, cost does not increase linearly with reliability: an incremental improvement in reliability may cost 100 times more than the previous increment. The cost has two dimensions:

The cost of redundant machine/compute resources
The cost associated with redundant equipment that, for example, allows us to take systems offline for routine or unforeseen maintenance, or provides space to store parity code blocks that give a minimum data durability guarantee.

The opportunity cost
The cost borne by an organization when it allocates engineering resources to build systems or features that diminish risk instead of features that are directly visible to or usable by end users. These engineers no longer work on new features and products for end users.

In SRE, we manage service reliability largely by managing risk. We conceptualize risk as a continuum. We give equal importance to figuring out how to engineer greater reliability into Google systems and to identifying the appropriate level of tolerance for the services we run. Doing so allows us to perform a cost/benefit analysis to determine, for example, where on the (nonlinear) risk continuum we should place Search, Ads, Gmail, or Photos. Our goal is to explicitly align the risk taken by a given service with the risk the business is willing to bear. We strive to make a service reliable enough, but no more reliable than it needs to be. That is, when we set an availability target of 99.99%, we want to exceed it, but not by much: doing so would waste opportunities to add features to the system, clean up technical debt, or reduce its operating costs. In a sense, we view the availability target as both a minimum and a maximum. The key advantage of this framing is that it unlocks explicit, thoughtful risk-taking.

Measuring Service Risk
As standard practice at Google, we are often best served by identifying an objective metric to represent the property of a system we want to optimize. By setting a target, we can assess our current performance and track improvements or degradations over time. For service risk, however, it is not immediately clear how to reduce all of the potential factors to a single value. Service failures can have many potential effects, including user dissatisfaction, harm, or loss of trust; direct or indirect revenue loss; brand or reputational impact; and unwanted press coverage.
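
One common way to collapse availability into a single number, shown below as a minimal sketch, is a request-based ratio of successful requests to total requests; the function name and the example counts are hypothetical:

    # Sketch of a request-based (aggregate) availability metric.
    # The function name and example counts are hypothetical; a real system would
    # read these values from its monitoring pipeline.

    def aggregate_availability(successful_requests: int, total_requests: int) -> float:
        """Availability as the fraction of well-formed requests that succeeded."""
        if total_requests == 0:
            return 1.0  # no traffic, so nothing failed
        return successful_requests / total_requests

    # Example: 9,993 of 10,000 requests succeeded -> 99.93% availability.
    print(f"{aggregate_availability(9_993, 10_000):.2%}")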


Defining the terms of SRE: SLO, SLI, and SLA


These tools aren’t just useful abstractions. Without them, you cannot tell whether your system is reliable, available, or even useful. If they don’t tie explicitly back to your business objectives, you have no data on whether the choices you make are helping or hurting your business.

As a refresher, here’s a look at SLOs, SLAs, and SLIs, as discussed by AJ Ross, Adrian Hilton, and Dave Rensin of our Customer Reliability Engineering team in the January 2017 blog post, “SLOs, SLIs, SLAs, oh my – CRE life lessons.”

1. Service Level Objective (SLO)

SRE begins with the idea that availability is a prerequisite for success. A system that is unavailable cannot perform its function and will fail by default. Availability, in SRE terms, defines whether a system is able to fulfill its intended function at a point in time. In addition to being used as a reporting tool, the historical measurement of availability can also describe the probability that your system will perform as expected in the future.

When we set out to define the terms of SRE, we wanted to set a precise numerical target for system availability. We call this target the availability Service Level Objective (SLO) of our system. Any discussion we have in the future about whether the system is running sufficiently reliably, and about which design or architectural changes we should make to it, must be framed in terms of our system continuing to meet this SLO.

Keep in mind that the more reliable the service, the more it costs to operate. Define the lowest level of reliability that you can get away with for each service, and state that as your SLO. Every service should have an availability SLO; without one, your team and your stakeholders cannot make principled judgments about whether your service needs to be made more reliable (increasing cost and slowing development) or less reliable (allowing greater development velocity). Excessive availability can also become a problem, because it then becomes the expectation. Don’t make your system overly reliable if you don’t intend to commit to it always being that reliable.

Within Google, we implement periodic downtime in some services to prevent them from being overly available. You might also occasionally try planned-downtime exercises with front-end servers, as we did with one of our internal systems. We found that these exercises can uncover services that are using those servers inappropriately. With that information, you can move workloads to somewhere more suitable and keep the servers at the right availability level.

2. Service Level Agreement (SLA)

At Google, we distinguish an SLO from a service level agreement (SLA). An SLA normally involves a promise to someone using your service that its availability SLO will meet a certain level over a certain period, and if it fails to do so, some kind of penalty will be paid. This might be a partial refund of the subscription fee paid by customers for that period, or additional subscription time added for free. The idea is that going out of SLO will hurt the service team, so they will push hard to stay within it. If you are charging your customers money, you will probably need an SLA.

Because of this, and because of the principle that availability shouldn’t be much better than the SLO, the availability SLO in the SLA is normally a looser objective than the internal availability SLO. This might be expressed in the availability numbers themselves: for instance, an SLA availability SLO of 99.9% over one month, with an internal availability SLO of 99.95%. Alternatively, the SLA might specify only a subset of the metrics that make up the internal SLO.

If you have an SLO in your SLA that is different from your internal SLO, as is almost always the case, it is important for your monitoring to measure SLO compliance explicitly. You want to be able to view your system’s availability over the SLA calendar period and easily see whether it appears to be in danger of going out of SLO. You will also need a precise measurement of compliance, usually from log analysis. Because we have an extra set of obligations (described in the SLA) toward paying customers, we need to measure the queries received from them separately from other queries. That is another benefit of establishing an SLA: it is an unambiguous way to prioritize traffic.

When you define your SLA’s availability SLO, you need to be especially careful about which queries you count as legitimate. For example, if a customer goes over quota because they released a buggy version of their mobile client, you may consider excluding all “out of quota” response codes from your SLA accounting.
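
A minimal sketch of that kind of SLA accounting from logs might look like the following; the log record fields (tier, status) and the use of 429 to stand in for “out of quota” are assumptions for illustration:

    # Hypothetical sketch of SLA compliance from log analysis.
    # The record fields (tier, status) and the 429 "out of quota" convention are assumptions.

    from dataclasses import dataclass

    @dataclass
    class RequestRecord:
        tier: str     # "paid" or "free"
        status: int   # HTTP-style status code

    def sla_availability(records: list[RequestRecord]) -> float:
        """Availability over paid-customer traffic only, ignoring out-of-quota (429) responses."""
        eligible = [r for r in records if r.tier == "paid" and r.status != 429]
        if not eligible:
            return 1.0
        successes = sum(1 for r in eligible if r.status < 500)
        return successes / len(eligible)

    logs = [RequestRecord("paid", 200), RequestRecord("paid", 429),
            RequestRecord("paid", 503), RequestRecord("free", 500)]
    print(f"SLA availability: {sla_availability(logs):.1%}")  # 50.0% over 2 eligible paid requests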

3. Service Level Indicator (SLI)

We also have a direct measurement of a service’s behavior: the frequency of successful probes of our system. This is a Service Level Indicator (SLI). When we evaluate whether our system has been running within SLO for the past week, we look at the SLI to get the service availability percentage. If it falls below the specified SLO, we have a problem and may need to make the system more available in some way.
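
For instance, a probe-based SLI over the past week can be compared against the SLO as in this sketch; the probe counts and the 99.9% SLO are made-up numbers:

    # Illustrative weekly SLI-vs-SLO check based on synthetic probe results.
    # The probe counts and the 99.9% SLO are made-up numbers.

    SLO = 0.999

    probes_sent = 10_080          # e.g. one probe per minute for 7 days
    probes_succeeded = 10_068

    sli = probes_succeeded / probes_sent
    print(f"SLI over the past week: {sli:.4%} (SLO: {SLO:.3%})")

    if sli < SLO:
        print("Below SLO: investigate, and consider spending engineering effort on availability.")
    else:
        print("Within SLO: the remaining error budget can absorb risky changes.")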

Automated dashboards in Stackdriver for GCP services let you group metrics in several ways: by service, by method, and by response code, for each of the 50th, 95th, and 99th percentile charts. You can also view latency charts to quickly spot outliers. If you are building a system from scratch, make sure that SLIs and SLOs are part of your system requirements. If you already have a production system but don’t yet have them clearly defined, then that is your highest priority. And if you are attending Google Cloud Next ’18, we look forward to seeing you there.
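
As a rough sketch of the percentile view such dashboards provide, latency percentiles can be computed from raw request latencies as follows; the sample data is synthetic and this is plain Python, not the Stackdriver/Cloud Monitoring API:

    # Illustrative latency-percentile SLI computation; the sample latencies are made up
    # and this is plain Python, not the Stackdriver/Cloud Monitoring API.

    import random
    import statistics

    random.seed(42)
    latencies_ms = [random.lognormvariate(4.0, 0.6) for _ in range(10_000)]  # synthetic request latencies

    for p in (50, 95, 99):
        value = statistics.quantiles(latencies_ms, n=100)[p - 1]
        print(f"p{p} latency: {value:6.1f} ms")

    # A latency SLO might then read: "99% of requests complete within 300 ms over a
    # rolling 30-day window", with the p99 value above serving as the SLI.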
