Site Reliability Engineering
I have probably written and rewritten this post a thousand times over my career. I am surprised at how often SRE remains a mystery to businesses, even those with “mature SRE organizations,” as they put it. I have seen it run as an Operations organization, as DevOps by another name, as an IT department, and as a hybrid of all of the above. SRE tends to fall into the buzzword category for organizations looking to transform or change direction without really knowing how.
SREs are really a skilled group of engineers looking to unclog issues in an organization, in a vein similar to those YouTube videos of people unblocking storm drains.
So what is SRE?
SRE is, in concept, simple. It is about engineering reliable software, or treating “Reliability” like a software feature rather than a goal or desired outcome. It is looking at Service Level Objectives and key indicators and asking “What can I do with all of this downtime?” and “How stable must my system really be?” rather than “How do I prevent it at all costs?!”
SRE has a simple goal: identify realistic reliability targets with the help of business and product needs, then help design and implement changes to achieve those targets. And when we exceed those targets, we turn the unused allowance for unreliability into valuable time for system-wide change. We even have words for the targets. Service Level Agreements (SLAs) are the targets agreed with customers. Service Level Objectives (SLOs) are the targets we set internally; they give us room for outages and planned changes. We measure both by identifying Service Level Indicators (SLIs, similar to KPIs) that match our software’s purpose or function. With these tools we can really see what we need to accomplish.
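To make the "room for outages" idea concrete, here is a minimal sketch in Python of how an SLO translates into an allowance of downtime. The function names `error_budget` and `budget_remaining`, the 30-day window, and the 99.9% target are all illustrative assumptions, not values from any particular organization.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Return the downtime an availability SLO permits over a window.

    slo_target is a fraction, for example 0.999 for "three nines".
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return window * (1.0 - slo_target)


def budget_remaining(
    slo_target: float, window: timedelta, downtime: timedelta
) -> timedelta:
    """Budget left after observed downtime (negative means the SLO was missed)."""
    return error_budget(slo_target, window) - downtime


if __name__ == "__main__":
    window = timedelta(days=30)  # assumed rolling window
    target = 0.999               # assumed availability SLO

    budget = error_budget(target, window)
    print(f"Allowed downtime: {budget.total_seconds() / 60:.1f} minutes")

    left = budget_remaining(target, window, timedelta(minutes=10))
    print(f"Left after a 10 minute outage: {left.total_seconds() / 60:.1f} minutes")
```

Under those assumptions, a 99.9% target over 30 days leaves roughly 43 minutes of allowed downtime. Whatever is left over at the end of the window is the time that can be spent on riskier, system-wide change.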
It is as simple as that. There are a lot of processes that can be invoked or created to accomplish this: blameless postmortems, or revitalized processes around monitoring, instrumentation, or incident management. We might work closely with a team and suggest additional shift-left priorities. We might even delve into Operations, or act as operators in a production environment for a time. We might also do feature work on applications or debug alongside a team. It really depends on an SRE’s skill set and what they are comfortable working with. But the goal of all our tasks remains the same: collect, plan, act. The result is improved reliability.
How does SRE work practically?
It really is a three-phase approach.
The first phase is metric analysis and review. It starts with the team: a deep dive, followed by alignment based on metrics and measurable data. If metrics or SLIs are not immediately present, developing SLIs and instrumentation might need to be the focus of this work. SLIs, DevOps metrics, incident metrics, and maybe even incident postmortems may all be part of this investigation phase. All of this is done with the team’s help, with the SREs and the team working together, almost like reinforcements arriving.
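If a team has no SLI yet, a first version can be very small. The sketch below, in Python, computes a request-based availability SLI and checks it against a target. The `RequestWindow` shape, the definition of a "good" request, and the choice to treat a window with no traffic as healthy are assumptions made for illustration; a real SLI should be defined with the team that owns the service.

```python
from dataclasses import dataclass


@dataclass
class RequestWindow:
    """Request counts for one measurement window (hypothetical shape)."""

    total: int  # all requests served in the window
    good: int   # requests that met the team's definition of "good"


def availability_sli(window: RequestWindow) -> float:
    """Fraction of good requests in the window."""
    if window.total == 0:
        # Assumption: no traffic means nothing failed, so report 1.0.
        return 1.0
    return window.good / window.total


def meets_slo(window: RequestWindow, slo_target: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return availability_sli(window) >= slo_target


if __name__ == "__main__":
    last_hour = RequestWindow(total=1000, good=998)  # made-up sample data
    print(f"SLI: {availability_sli(last_hour):.3f}")
    print(f"Meets 99.9% SLO: {meets_slo(last_hour, 0.999)}")
```

Even a rough indicator like this gives the team and the SREs a shared number to align on before any planning starts.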
The second phase is planning and action. We have identified the problems, or where the software’s shortcomings might be, and now we need an action plan. Sitting with the team, we huddle and begin a planning session. What actions can we take to solve these issues? We then embrace the Agile process and capture those actions as work items. This is where the work begins. If the SRE team is needed to embed and help solve the problem, they can. In larger organizations that are correctly aligned, the SREs might be running a few of these projects at a time and need to pivot to other tasks while the Engineering or Dev teams get to work.
The third phase is retrospective and follow-up. It might turn into a mini cycle of phases 1 and 2. It might just be a retrospective and a departure. Whatever the third-phase actions may be, the key is that when the SREs disengage from the team they are working with, they leave with the action steps in mind and documented, so that other likely sources of the same problem can be targeted proactively next.
Sounds Like a Fairytale
Fair.
SRE is for two groups.
- Mature Organizations looking to increase overall reliability
- Organizations that are on fire, desperately trying to keep their employees retained and their bottom line stable.
In my more pragmatic assumptions, most readers of this post are a part of group 2.
SRE relies on DevOps culture being in place. No, not that overly used and abused term “DevOps Engineer”, or that other thing everyone mistakes the term for. DevOps culture. Establishing it may be an SRE organization’s first, and monumental, task.
While I am not going to cover this in depth here, there is a Microsoft Learn course on it that I would suggest you take.
If you need to know more
I consult, but also, I like to teach. If you are a decision maker, reach out and we can talk about on-site trainings or analysis. If you are an SRE or Engineer like myself, and lost in the stream, hit me up on LinkedIn.
Thanks for reading. Cheers!
FAQ
Some frequently asked questions and their answers:
- Is SRE an Operations Organization? - No. It is an Engineering Organization that can focus on Operational tasks, but from an Engineering perspective. Operators operate: they click buttons, make small scripts and tools, and accomplish tasks. Engineers design cost-effective solutions, ranging from helping operators with tools and scripts to full-on automation and application development.
- Do SREs work in Operations? - Yes, but by engineering solutions rather than, as stated above, by carrying out the operations tasks themselves.
- Does an SRE fall on the Dev or Ops side of the Software Delivery Lifecycle? - Honestly, at this point, this is antiquated thinking. One should keep an open mind to the concept of DevOps Culture and shifting left, and lean into the idea that Development teams ship and own their software from cradle to grave. Operations has a purpose, but even they should be focusing on becoming operations engineers rather than strict operators.
- What if the Organization or the organization’s IT processes are the issue or a hang-up? - Then start there. SREs need to be process focused and champions of best practices to increase reliability. At Google they even have “Customer Reliability Engineers”, where SREs focus on a customer coming to Google Cloud and engineer solutions that best adapt that customer to Google technologies. The key to success here is executive buy-in at the highest levels that reliability is indeed a feature of the product. CTO endorsement and empowerment is critical, for example. Then look at the organization and plan with managers, directors, and VPs how to bring things in line, or build a roadmap to accomplish wider goals.
- What if I am scared to do step 4? - I’ve spoken to a lot of executives in my time. They want to hear from you, but don’t have time to waste. Tell them the problem in one sentence, keep communication streamlined and to the point, then give them a brief plan of action. Brevity is KEY. Then request time to meet with them and present options if needed. If you are worried about your skills, level up with soft-skill learning on presentations and emails rather than focusing only on engineering tasks. This is the biggest difference between staff and principal engineers: soft skills.
Additional Resources
- SRE.Google - The canonical site for SRE. Google invented the role and title, and Google invests in keeping it active and alive.
- DevOps Resource Center - This is a resource site for DevOps Culture. Something any Engineer or Developer should know. After all, it is 2024, and we all follow some form of the Software Delivery Lifecycle. We just need to tear down some walls.
- A cat for your troubles