LinkedIn’s Kurt Andersen Uncovers Blindspots in SRE Implementations

Kurt Andersen is an engineer who is fascinated by how entire systems interrelate. Through his work at NASA, IBM, HP, and now LinkedIn, Kurt distills insights on how to make hundreds of constantly moving parts work together. Blameless interviewed Kurt to shine a light on the blind spots that companies often have when implementing SRE.

Besides his role as a senior staff site reliability engineer at LinkedIn, Kurt also sits on the board of USENIX, an organization that hosts a wealth of conferences, including SREcon, that bring together top professionals in the computing world.

Here are the key nuggets of SRE wisdom from Kurt in the interview.

 

SRE = available + secure

Availability has the main spotlight whenever people explain the purpose of Site Reliability Engineering. However, LinkedIn shares the spotlight with an additional emphasis: security. The SRE team at LinkedIn works to keep the site available and secure. Data privacy and integrity are top priorities to LinkedIn’s SRE team.

 

Differentiating DevOps vs. SRE

Many DevOps engineers are still convincing their organizations of the value of continuous integration (CI) and continuous delivery (CD). CI and CD are designed to do things faster, but that does not always mean doing the right things.

SRE teams focus on business success. Organizations with SRE teams tend to already have CI/CD as a staple, rather than a source of resistance. SRE builds on top of CI/CD and ensures that whatever moves fast contributes to business success. (See chapter 22 in the book Seeking SRE for detailed explanations from Kurt.)

 

Key Success Factor to SRE

Culture. A blameless culture is one that encourages learning and continuous improvement.

 

Feature Developers’ Blindspot: Retirement of their Services

Most feature developers don’t plan for the retirement of features. Microservices give you the illusion that you can yank and replace, but that’s not really the case. It’s tough to turn off a microservice without losing an arm or a leg. That’s why it’s important for SREs to have a full life cycle engagement, providing input starting from the design phase, so we can avoid the high cost of fixing bugs (and retiring features) later. When SREs contribute throughout the entire life cycle of products, we can ensure that products are being built for observability, reliability, and resilience from day one.

 

Terminology Confusion: SLO or SLA?

For companies that do not suffer financial penalties for violating Service Level Agreements (SLAs), the internal engineering team tends to use SLO and SLA interchangeably. An SLO, or service level objective, is really an internal metric for services that depend on another service. Distinguishing the two makes communication clearer when an SLA does become important (or tied to dollar-amount penalties).

 

Coming Up with Meaningful SLOs – a Missing Protocol

How would you come up with the best and most reasonable SLO for availability, latency (site speed), error rate, performance relative to traffic load, or how a service performs under stress conditions? You can’t, not at the beginning. It’s hard to get the team’s buy-in for an arbitrary goal unless there’s a clear mechanism for revising the goal.

For example, at Home Depot, SLOs are reviewed every 6 months. Teams can revise to have tighter or looser SLOs (e.g., going from 99% availability to 99.5% or 98%). Each team at an organization can review their SLOs at a tempo that works for them. The key is to have a regular means to adjust rather than signing a lifelong commitment. (See chapter 3 in The Site Reliability Workbook for more details.)
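To make the effect of tightening or loosening an availability SLO concrete, here is a minimal sketch that converts the example targets above (98%, 99%, 99.5%) into allowed downtime. The 30-day window and the function name `allowed_downtime_minutes` are assumptions chosen for illustration; they do not come from the interview or from any particular team's process.

```python
def allowed_downtime_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Return the minutes of downtime an availability target permits.

    The 30-day default window is an assumption for illustration only.
    """
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = window_days * 24 * 60
    # The share of the window NOT covered by the target is the downtime budget.
    return total_minutes * (100 - availability_pct) / 100


if __name__ == "__main__":
    for target in (98.0, 99.0, 99.5):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% availability -> {budget:.0f} minutes per 30 days")
```

Under these assumptions, moving from 99% to 99.5% halves the downtime budget, while loosening to 98% doubles it, which is the kind of trade-off a regular SLO review would weigh.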

 

SLO Challenge: Measuring the Business Impact of Grey Failures

A grey failure refers to a partial failure of a system: for example, a specific feature of LinkedIn that stops responding only in Canada. Calculating the impact of a grey failure is difficult. The estimates are rough, the process is manual, and it’s difficult to take into account any bounce-back effect. When Amazon Prime went down on Prime Day, possibly more customers came back the next day to buy more; however, it’s also possible that what customers wanted to buy had already sold out. Because it’s difficult to quantify the business impact, we currently bucket impact into three categories (minor, major, and critical) and prioritize accordingly.
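The bucketing idea can be sketched in a few lines. Everything specific here is hypothetical: the "affected user-minutes" estimate, the threshold values, and the function names are illustrative assumptions, not LinkedIn's actual method, and the sketch deliberately ignores the bounce-back effect described above.

```python
from enum import Enum


class Impact(Enum):
    MINOR = "minor"
    MAJOR = "major"
    CRITICAL = "critical"


# Hypothetical thresholds, in affected user-minutes. Real values would have
# to come from a team's own judgment of what counts as major or critical.
MAJOR_THRESHOLD = 1_000
CRITICAL_THRESHOLD = 50_000


def affected_user_minutes(users_in_scope: int,
                          failure_fraction: float,
                          duration_minutes: float) -> float:
    """Rough estimate: users in the affected slice x share failing x duration."""
    if not 0.0 <= failure_fraction <= 1.0:
        raise ValueError("failure_fraction must be between 0 and 1")
    return users_in_scope * failure_fraction * duration_minutes


def bucket_impact(user_minutes: float) -> Impact:
    """Map a rough impact estimate onto one of the three categories."""
    if user_minutes >= CRITICAL_THRESHOLD:
        return Impact.CRITICAL
    if user_minutes >= MAJOR_THRESHOLD:
        return Impact.MAJOR
    return Impact.MINOR


if __name__ == "__main__":
    # Made-up numbers: 20,000 users in one region, 10% of them failing, 15 minutes.
    estimate = affected_user_minutes(20_000, 0.10, 15)
    print(bucket_impact(estimate).value)
```

Coarse buckets fit a rough, manual estimate: prioritization only needs the estimate to land on the right side of a threshold, not to be precise.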

 

Vision for SRE

SRE brings an ongoing emphasis and a continual drumbeat on the importance of reliability, much as QA does for unit testing. In an ideal world, every engineer will take reliability into account in everything they do.

 

Written by Charlie Taylor

Comments
  • Subbu
    Nice one.

    For retirement of services: though it is not mentioned in the post, the Bathtub Curve model could be used to determine the approximate time to retire the services.

    SLOs define the agreed(?) target levels of service, measured by one or more SLIs.

    IMHO, just like other terms in the computer industry, SRE is another confusing term (a site reliability talking about service availability), and this has been the case from the ‘dynamic programming’ days to ‘mouse’ and now the ‘cloud’ computing era.

    IT department (co-located) in a company operating a solution/system/service (say, a mail server or expense-application software) for internal employees of company vs IT department (geo distributed) operating a bigger system (geo distributed) for bigger number of customers (internal and external, distributed globally) – and therefore, you would need more expertise, different tools and processes ; fundamentally, you do the same ‘work’. {farming, hunting those days to gene modification to writing program to (let robots) program .

    (Did anyone notice that when you type your name, email, and site in the text boxes below, the foreground is all white and therefore you can’t see what you typed? Is it only for me? And when you delete those things, you just can’t see that they are there, since the boxes are borderless. Very confusing and less user friendly, IMO.)