24 Sep

Defining the terms of your SRE

Defining SLO, SLI, and SLA: the terms of site reliability engineering

These tools are not just useful abstractions. Without them, you cannot tell if your system is reliable, available or even useful. If they are not explicitly related to your business goals, you have no data on whether the choices you make are helping or hurting your business.

As a refresher, here is a look at SLOs, SLAs, and SLIs, as discussed by AJ Ross, Adrian Hilton, and Dave Rensin of our Customer Reliability Engineering team in a January 2017 blog post, "SLOs, SLIs, SLAs, oh my – CRE life lessons."

1. Service Level Objective (SLO)

SRE starts with the idea that availability is a prerequisite for success. An unavailable system cannot perform its function and will fail by default. Availability, in SRE terms, defines whether a system is able to fulfill its intended function at a point in time. Besides serving as a reporting tool, a historical availability measurement can also describe the probability that your system will perform as expected in the future.

When we set out to define the terms of SRE, we wanted to set a precise numerical target for system availability. We call this target the availability service level objective (SLO) of our system. Any future discussion about whether the system is running reliably enough, and about which design or architectural changes we should make to it, must be framed in terms of our system continuing to meet this SLO.

Keep in mind that the more reliable the service, the more it costs to operate. Define the lowest level of reliability you can get away with for each service, and state that as your SLO. Every service should have an availability SLO. Without it, your team and stakeholders cannot make principled judgments about whether your service needs to be made more reliable (increasing cost and slowing development) or less reliable (allowing faster development). Excessive availability can become a problem because it becomes the expectation. Do not make your system overly reliable if you do not intend to commit to it being that reliable.
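
To make the cost of each extra "nine" concrete, here is a minimal Python sketch that converts an availability SLO into the downtime it permits over a window. The function name allowed_downtime and the 30-day window are illustrative assumptions, not something the original post defines.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, window: timedelta) -> timedelta:
    """Return how much downtime fits in `window` while still meeting the SLO."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    # The permitted downtime is the fraction of the window not covered by the SLO.
    return window * (1 - slo_percent / 100)


# Assumed 30-day measurement window, for illustration only.
month = timedelta(days=30)
for target in (99.0, 99.9, 99.95, 99.99):
    minutes = allowed_downtime(target, month).total_seconds() / 60
    print(f"{target}% over 30 days allows about {minutes:.1f} minutes of downtime")
```

Each step up in the target shrinks the permitted downtime sharply, which is one way to see why a tighter SLO raises operating cost.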

Within Google, we implement periodic downtime in some services to prevent a service from being overly available. You might also try occasional planned-downtime exercises with front-end servers, as we did with one of our internal systems. We found that these exercises can uncover services that are using those servers inappropriately. With that information, you can move workloads somewhere more suitable and keep servers at the right availability level.

2. Service Level Agreement (SLA)

At Google, we distinguish between an SLO and a service level agreement (SLA). An SLA normally involves a promise to someone using your service that its availability SLO will meet a certain level over a certain period, and that if it fails to do so, some kind of penalty will be paid. This might be a partial refund of the subscription fee customers paid for that period, or additional subscription time added for free. The idea is that going out of SLO will hurt the service team, so they will push hard to stay within SLO. If you charge your customers money, you will probably need an SLA.

Because of this, and because of the principle that availability should not be much better than the SLO, the availability SLO in the SLA is usually a looser objective than the internal availability SLO. This may be expressed in availability numbers: for example, an availability SLO of 99.9% over one month in the SLA, with an internal availability SLO of 99.95%. Alternatively, the SLA may specify only a subset of the metrics that make up the internal SLO.
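
The gap between the two objectives acts as a safety margin. The following Python sketch uses the 99.9% and 99.95% figures from the example above; the classify function and its status strings are hypothetical names chosen for illustration.

```python
SLA_SLO = 99.9        # percent, the objective promised in the SLA (from the example above)
INTERNAL_SLO = 99.95  # percent, the stricter internal objective (from the example above)


def classify(measured_availability: float) -> str:
    """Describe where a measured monthly availability falls relative to both objectives."""
    if measured_availability < SLA_SLO:
        return "SLA breached"
    if measured_availability < INTERNAL_SLO:
        return "internal SLO missed, SLA still met"
    return "within both objectives"


for value in (99.97, 99.92, 99.85):
    print(value, "->", classify(value))
```

A month that misses the internal objective but still meets the SLA is an early warning: the team can react before any penalty is owed.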

If the SLO in your SLA differs from your internal SLO, as it almost always does, it is important that your monitoring explicitly measures compliance with the SLA's SLO. You want to be able to view your system's availability over the SLA calendar period and easily see whether it appears to be in danger of going out of SLO. You will also need a precise measurement of compliance, usually from log analysis. Because we have an extra set of obligations (described in the SLA) to paying customers, we need to measure queries received from them separately from other queries. This is another benefit of establishing an SLA: it is an unambiguous way to prioritize traffic.

When defining the availability SLO in your SLA, be especially careful about which queries you count as legitimate. For example, if a customer goes over quota because they released a buggy version of their mobile client, you might consider excluding all “out of quota” response codes from your SLA accounting.
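
One way to picture that exclusion is the Python sketch below. It rests on two assumptions that the original post does not state: that quota rejections are reported as HTTP 429, and that any response below 500 counts as a success.

```python
OUT_OF_QUOTA = 429  # assumption: quota rejections are reported as HTTP 429


def sla_availability(status_codes):
    """Percent of legitimate requests that succeeded, ignoring quota rejections."""
    # Drop out-of-quota responses so a customer's own bug does not count against us.
    counted = [code for code in status_codes if code != OUT_OF_QUOTA]
    if not counted:
        return None  # no legitimate traffic in the window
    # Assumption: server errors (5xx) are the only failures.
    good = sum(1 for code in counted if code < 500)
    return 100 * good / len(counted)


# Hypothetical sample: 97 successes, 2 quota rejections, 1 server error.
codes = [200] * 97 + [429] * 2 + [503]
print(sla_availability(codes))
```

In this sample the two quota rejections are left out of both the numerator and the denominator, so only the single server error lowers the reported figure.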

3. Service Level Indicator (SLI)

We also have a direct measurement of a service's behavior: the frequency of successful probes of our system. This is a service level indicator (SLI). When we evaluate whether our system has been running within SLO over the past week, we look at the SLI to get the service availability percentage. If it is below the specified SLO, we have a problem and may need to make the system more available.
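
As a rough illustration, the Python sketch below derives an availability SLI from probe outcomes and compares it with an SLO. The function names, the probe counts, and the 99.9% target are assumptions made for the example.

```python
def availability_sli(probe_results):
    """Percent of probes that succeeded; probe_results is an iterable of booleans."""
    results = list(probe_results)
    if not results:
        raise ValueError("no probes recorded")
    return 100 * sum(results) / len(results)


def within_slo(sli, slo_percent):
    """True when the measured SLI meets or exceeds the objective."""
    return sli >= slo_percent


# Hypothetical week of probes: 9,995 successes and 5 failures.
week = [True] * 9995 + [False] * 5
sli = availability_sli(week)
print(f"SLI {sli:.2f}%, within a 99.9% SLO: {within_slo(sli, 99.9)}")
```

The SLI is the measurement and the SLO is the threshold it is judged against, so the same probe data can be checked against more than one objective.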

Automatic dashboards in Stackdriver for GCP services let you group data in several ways: per service, per method, and per response code, for any of the 50th, 95th, and 99th percentile charts. You can also view latency charts on a log scale to find outliers quickly.

If you are building a system from scratch, make sure that SLIs and SLOs are part of your system requirements. If you already have a production system but do not yet have them clearly defined, then defining them is your top priority. If you come to Next ’18, we look forward to seeing you there.

See related content:
Learn how SRE is different from DevOps
Read about SLOs for services with dependencies
