SRE Principles
Mike Romero
Introduction
I thought I would quickly go over the key principles of Site Reliability Engineering (SRE) and why they matter. Even though I no longer hold the title, I wanted to address the theory side of things and how it shapes my actions as a Platform Engineer.
In my previous role, I was an SRE Architect. For full transparency, I left because of economic concerns. While I was there, the sales pipeline seemed to have dried up, and I was not sure I wanted to weather the storm at a consulting company that often let people go for being on the bench too long. NetDocuments has been a great change from that uncertainty, and I am indeed happy to be here.
SRE Principles
So what is SRE? The question I always seem to get asked is, “Isn’t that just DevOps?” The answer, in short, is no. Google puts it best: SRE is “what happens when you ask a software engineer to design an operations function.” It is a discipline that takes aspects of software engineering and applies them to infrastructure and operations problems. This is also why I could transition into being a Platform Engineer: the two roles are very similar in nature, minus the focus on reliability and uptime.
SRE is about planning, observing, reacting, and holistic thinking when it comes to systems at scale. In other words, it is about thinking of systems as a whole rather than as just a collection of individual components. To do this, SREs use simple principles to guide their work:
- Embrace Risk: Understand that risk is inherent in any system and that it can be managed, not eliminated. SREs focus on balancing risk with the need for reliability.
- Service Level Objectives (SLOs): Define clear SLOs to measure the reliability of services. SLOs are the target reliability goals that a service should meet, often expressed as a percentage of successful requests or uptime over a specific period. They help teams understand what is expected and how to prioritize work. To set them, SREs need to understand the design of an application, along with its associated SLAs (Service Level Agreements) and SLIs (Service Level Indicators).
- Error Budgets: Use error budgets to balance the pace of innovation with the need for reliability. An error budget is the amount of allowable downtime or errors within a given period, which helps teams make informed decisions about deploying new features.
- Automation: Automate repetitive tasks to reduce toil and free up time for more valuable work. Automation should always be treated as a first-class citizen, and SREs should strive to automate everything that can be automated to improve efficiency and reliability. With the advent of AI, the speed at which this can be done seems almost limitless.
- Monitoring and Observability: Implement robust monitoring and observability practices to gain insights into system performance and health. This includes collecting metrics, logs, and traces to understand how systems behave in production. SREs use these insights to detect issues early and respond proactively with the associated dev teams that own systems and software.
- Incident Management: Develop effective incident management processes to quickly respond to and recover from incidents. This includes having clear communication channels, post-incident reviews, and continuous improvement practices. Much of this comes through shift-left practices, such as blameless postmortems, so dev teams can learn from failures without fear of blame.
- Capacity Planning: Plan for capacity to ensure that systems can handle expected loads without degradation in performance. SREs use data-driven approaches to forecast capacity needs and avoid outages. This also involves setting realistic scaling expectations and ensuring that systems can scale up or down as needed.
- Continuous Improvement: Foster a culture of continuous improvement by learning from failures and successes. SREs encourage teams to reflect on their work, share knowledge, and implement changes that enhance reliability and performance. It is also important to realize that innovation requires failure, and that failure is a part of the process. SREs should embrace failure as a learning opportunity and not as a setback.
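To make the SLO and error budget bullets above a bit more concrete, here is a minimal Python sketch of the arithmetic. The names (`ErrorBudget`, `allowed_downtime_minutes`) and the example numbers are hypothetical and made up for illustration. They do not come from any particular tool, and a real setup would pull these counts from your monitoring system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Request-based error budget for one SLO window (hypothetical helper)."""

    slo_target: float     # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int   # requests served in the window
    failed_requests: int  # requests that counted as failures in the window

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means the budget is untouched; 0.0 or below means it is spent.
        if self.allowed_failures == 0:
            return 0.0  # a 100% target (or no traffic) leaves no budget
        return 1.0 - self.failed_requests / self.allowed_failures

    @property
    def exhausted(self) -> bool:
        return self.remaining_fraction <= 0.0


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Time-based view: minutes of downtime an availability SLO allows."""
    return (1.0 - slo_target) * window_days * 24 * 60


if __name__ == "__main__":
    # Assumed example: 99.9% SLO, one million requests, 250 failures.
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"Allowed failures: {budget.allowed_failures:.0f}")
    print(f"Budget remaining: {budget.remaining_fraction:.0%}")
    print(f"Downtime allowed in 30 days: {allowed_downtime_minutes(0.999):.1f} min")
```

With those assumed numbers, a 99.9% SLO leaves room for roughly 1,000 failed requests, or about 43 minutes of downtime in a 30-day window. A team that has burned most of that budget has a data-driven reason to slow down releases, and a team with budget to spare has room to ship.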
Conclusion
In conclusion, I just wanted to put together a quick post on SRE and why it is critical to the success of any organization that relies on technology. SRE is not just a role, but a mindset that emphasizes reliability, automation, and continuous improvement. By following these principles, organizations can build resilient systems that meet the needs of their users while also allowing for innovation and growth.
And remember that SRE is not DevOps, so please stop asking me to reset your services or to help you with your CI/CD pipeline. ;)