24 Sep

Hope is not a strategy. You need competent SRE

 

Hope is not a strategy.

Traditional SRE saying

It is a truth universally acknowledged that systems do not run themselves. How, then, should a system be run, especially a complex computing system that operates at large scale?

The Sysadmin Approach to Service Management
Historically, companies have employed system administrators to operate complex computing systems.

This system administrator, or sysadmin, approach involves assembling existing software components and deploying them to work together to produce a service. Sysadmins are then tasked with running the service and responding to events and updates as they occur. As the system grows in complexity and traffic volume, generating a corresponding increase in events and updates, the sysadmin team grows to absorb the additional work. Because the sysadmin role requires a markedly different skill set than that required of a product's developers, developers and sysadmins are divided into separate teams: “development” and “operations,” or “ops.”

The sysadmin model of service management has several advantages. For companies deciding how to run and staff a service, this approach is relatively easy to implement: as a familiar industry paradigm, there are many examples to learn from and emulate. A relevant talent pool is already widely available. An array of existing tools, software components (off the shelf or otherwise), and integration companies are available to help run those assembled systems, so a novice sysadmin team does not have to reinvent the wheel and design a system from scratch.

The sysadmin approach and the accompanying development/ops split also have a number of disadvantages and pitfalls. These fall broadly into two categories: direct costs and indirect costs.

Direct costs are neither subtle nor ambiguous. Running a service with a team that relies on manual intervention for both change management and event handling becomes expensive as the service and/or its traffic grows, because the size of the team necessarily scales with the load generated by the system.

The indirect costs of the development/ops split can be subtle, but are often more expensive to the organization than the direct costs. These costs arise from the fact that the two teams are quite different in background, skill set, and incentives. They use different vocabulary to describe situations; they carry different assumptions about both risk and the possibilities for technical solutions; they have different assumptions about the target level of product stability. The split between the groups can easily become one of not just incentives, but also communication, goals, and eventually, trust and respect. This outcome is a pathology.

Traditional operations teams and their counterparts in product development thus often end up in conflict, most visibly over how quickly software can be released to production. At their core, the development teams want to launch new features and see them adopted by users. At their core, the ops teams want to make sure the service does not break while they are holding the pager. Because most outages are caused by some kind of change (a new configuration, a new feature launch, or a new type of user traffic), the two teams' goals are fundamentally in tension.

Both groups understand that it is unacceptable to state their interests in the baldest possible terms (“We want to launch anything, any time, without hindrance” versus “We won’t want to ever change anything in the system once it works”). And because their vocabulary and risk assumptions differ, both groups often resort to a familiar form of trench warfare to advance their interests. The ops team attempts to safeguard the running system against the risk of change by introducing launch and change gates. For example, launch reviews may contain an explicit check for every problem that has ever caused an outage in the past; this can be an arbitrarily long list, with not all elements providing equal value. The dev team quickly learns how to respond, with fewer “launches” and more “flag flips,” “incremental updates,” or “cherry picks.” They adopt tactics such as sharding the product so that fewer features are subject to the launch review.

Google’s Approach to Service Management: Site Reliability Engineering
Conflict is not an inevitable part of offering a software service. Google has chosen to run our systems with a different approach: our Site Reliability Engineering teams focus on hiring software engineers to run our products and to create systems that accomplish the work that would otherwise be performed, often manually, by sysadmins.

What exactly is Site Reliability Engineering, as it has come to be defined at Google? My explanation is simple: SRE is what happens when you ask a software engineer to design an operations team. When I joined Google in 2003 and was tasked with running a “Production Team” of seven engineers, my entire life up to that point had been software engineering. So I designed and managed the group the way I would want it to work if I worked as an SRE myself. That group has since matured to become Google’s present-day SRE team, which remains true to its origins as envisioned by a lifelong software engineer.

The main building block of Google’s service management approach is the composition of each SRE team. In general, SREs can be divided into two main categories.

50-60% are Google software engineers or, more precisely, people who were hired through the standard procedure for Google software engineers. The remaining 40-50% are candidates who were very close to Google’s software engineering qualifications (i.e., 85-99% of the required skill set), and who in addition had a set of technical skills that is useful to SRE but rare among most software engineers. By far, UNIX system internals and networking expertise (Layer 1 through Layer 3) are the two most common types of alternative technical skills we seek.

Common to all SREs is the belief in and aptitude for developing software systems to solve complex problems. Within SRE, we track the career progress of both groups closely, and have so far found no practical difference in performance between engineers from the two tracks. In fact, the somewhat diverse background of the SRE team frequently results in clever, high-quality systems that are clearly the product of the synthesis of several skill sets.

The result of our approach to hiring for SRE is that we end up with a team of people who (a) get bored quickly by doing

by Benjamin Treynor Sloss
Edited by Betsy Beyer
