SREcon17: Brave new world of site reliability engineering

Site reliability engineers run some of the largest websites on the planet—and are inventing a new field of expertise while they do it


Last week, I was fortunate to participate in SREcon17 Americas, a conference organized by USENIX for site reliability engineers. What’s a site reliability engineer (SRE)? Ben Treynor, founder of Google's site reliability team, once explained it’s "what happens when a software engineer is tasked with what used to be called operations."

Filling an important role in DevOps practice, these engineers concentrate on reliability (of course) and scalability (at amazing levels) in highly distributed systems (microservices multiplying like rabbits). They run some of the largest websites on the planet and are inventing a new field of expertise while they do it.

Recordings of the conference sessions will soon be posted, so rather than summarize lots of presentations, let me share some of the culture and spirit observed at this notable gathering.

The remarkable Julia Evans, an SRE with Stripe, opened the conference with a talk called "So you want to be a wizard?" Many presentations at other conferences seem designed to convince the audience that the speaker is a Very Serious Expert, so it’s surprising to read an abstract that begins:

I don't always feel like a wizard. I'm not the most experienced member on my team. Like most people, I sometimes find my work difficult, and I still have a TON TO LEARN.

Yet this honesty conveys the enticing reality of the work. Distributed systems are inherently complex, consisting of myriad components, any combination of which can cause knotty problems, especially at scale. Like a good detective, a dedicated SRE follows the facts where they lead, learning along the way. The combination of humility, curiosity and bravery makes the work seem as compelling as a good episode of CSI.

SRE training, recruiting

Training and recruiting were big topics at the conference because current demand for SRE skills greatly outstrips supply. Asked what to say when a novice SRE makes their first big service-impacting mistake, one panelist suggested: “Congratulations, and welcome to the club.”

Many practitioners are self-taught because DevOps in general and site reliability engineering in particular are such new fields. The best-selling compendium Site Reliability Engineering was published just a year ago.

“Job seeker” badge ribbons were available at registration so hungry companies could find talent. For engineers with a networking background, it’s interesting to compare the Google search trends for DevOps vs. CCIE. This is not to suggest that networking expertise is unimportant; quite the contrary. But putting that expertise in the context of site reliability is now a key focus. The opportunities and excitement were obvious throughout SREcon.

Awesome scale loomed everywhere. Jeff Barber of Facebook recalled when a megabyte seemed large. Now a billion seems like a quotidian number. Casey Rosenthal of Netflix casually mentioned that their streams now account for about a third of the bits on the internet. Even under this staggering load, they purposely kill random parts of their production system as an ongoing test of the hypothesis that their design is resilient to failures. This ongoing Chaos Monkey approach helped ensure that Netflix was not impacted by the recent Amazon S3 outage.
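To make the idea of killing random parts of a system concrete, here is a toy Python sketch of that kind of experiment. It is not Netflix's Chaos Monkey and nothing in it comes from the talk: the Replica and Service classes and the kill_random_replica and run_experiment functions are hypothetical names invented for illustration, and the "service" is just an in-memory list of replicas that retries a request on the next one. The hypothesis being tested is simply that requests keep succeeding while at least one replica is still alive.

```python
import random


class Replica:
    """Hypothetical stand-in for one instance of a service."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class Service:
    """Hypothetical service that retries a request on the next replica."""

    def __init__(self, replica_count):
        self.replicas = [Replica(f"replica-{i}") for i in range(replica_count)]

    def request(self, payload):
        # Try each replica in turn; a dead one is skipped.
        for replica in self.replicas:
            try:
                return replica.handle(payload)
            except ConnectionError:
                continue
        raise RuntimeError("all replicas are down")


def kill_random_replica(service, rng):
    """Take down one randomly chosen live replica; return it, or None if none remain."""
    alive = [r for r in service.replicas if r.alive]
    if not alive:
        return None
    victim = rng.choice(alive)
    victim.alive = False
    return victim


def run_experiment(replica_count, kills, seed=0):
    """Kill replicas one at a time and check after each kill that requests still succeed.

    Returns True if the resilience hypothesis held for every step, False otherwise.
    """
    rng = random.Random(seed)  # seeded so a run can be repeated
    service = Service(replica_count)
    for step in range(kills):
        kill_random_replica(service, rng)
        try:
            service.request(f"req-{step}")
        except RuntimeError:
            return False
    return True


if __name__ == "__main__":
    print(run_experiment(replica_count=5, kills=4))  # one replica survives
    print(run_experiment(replica_count=3, kills=3))  # nothing survives
```

In this sketch the first run leaves one replica standing, so the hypothesis holds, while the second run kills every replica and the hypothesis fails. A real chaos experiment runs against production infrastructure and watches real user-facing metrics, which is far beyond what this toy covers.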

Improving diversity

Diversity at SREcon seemed better than usual. For instance, compared to other technology gatherings, women were noticeably better represented (though still not half). Why such an SRE affinity should arise is unclear, but it’s welcome.

As an example of the accepting tone, a set of badge ribbons available at registration allowed conference attendees to alert others to their pronoun preference (she/her, he/him, they/them, or fill in the blank). And a session on opening the field to everyone was well attended, with discussion that was sometimes awkward, but always optimistic and in good faith.

Incidentally, Susan Fowler, who wrote the explosive blog entry about sexual harassment at Uber, is a site reliability engineer. Now with Stripe, she wrote another popular book about the field, Production-Ready Microservices.

A shout-out to my hosts for the day at Netsil. The company offers tools for service-level monitoring as a black box, ideal for an SRE who might not be able to instrument code (especially for external services). Read more about this approach in their blog, or better yet, download their tools and try them out. They offer versions for just about every environment, from AWS to containers, and almost everyone who tries them learns something new about their systems. (Disclaimer: I serve as an informal advisor to the company. They sensibly pick and choose which advice to take and which to ignore.)

Recordings and slides from the SREcon sessions will be posted in the coming weeks. Explore the program website or follow updates on Twitter. And watch for SREcon events in Singapore (May) and Dublin (August).

Copyright © 2017 IDG Communications, Inc.
