Introduction
In the modern enterprise, technology remains the ultimate force multiplier for business success. It enables us to serve customers faster, plan logistics down to the second while making full use of capacity, and even predict customer behavior and needs so we can serve them better.
Ultimately, the success (or failure) of these efforts rests on the organizations responsible for ensuring it all works – 99.95% of the time. That’s an allowance of roughly 22 minutes of downtime per month, or about four hours and 23 minutes per year – and a lot of pressure on site reliability engineers to ensure that those high-performance SLA targets are met.
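The arithmetic behind those numbers is worth keeping at hand. Here is a minimal Python sketch that reproduces them; the function name and the simplifying 30-day month are our own assumptions, not anything prescribed by an SLA standard:

```python
def downtime_allowance(sla_percent: float) -> tuple[float, float]:
    """Return (minutes of downtime allowed per month, hours per year) for an SLA."""
    down = 1 - sla_percent / 100
    minutes_per_month = down * 30 * 24 * 60   # simplifying to a 30-day month
    hours_per_year = down * 365 * 24
    return round(minutes_per_month, 1), round(hours_per_year, 2)

print(downtime_allowance(99.95))   # (21.6, 4.38)
```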
In order to achieve a lofty SLA like “3 9s and a 5,” the engineering and operations organizations responsible for building and delivering technology must work closely together, from inception to delivery – following industry best practices to apply every ounce of prevention they can, and standing ready to respond when the inevitable happens and a production incident occurs.
The top five best practices we’ve outlined here are designed to help spark the right conversations and guide your journey as you advance the maturity of your operations and SRE team. Where do you want to be in 30 days, six months, or a year from now? Following the best practices provided here will help drive your availability towards the magic number – 99.95% – and beyond.
What You'll Learn
The Five SRE Best Practices for the Modern Enterprise
- Engage early
- Monitor and observe
- Incident management
- Experiment
- Optimize
Site Reliability Engineering
Site reliability engineering (SRE) has permeated all manner of technology organizations since Benjamin Treynor Sloss first asked the question at Google, “What would happen if we used software engineers to solve operational problems?” It has grown from there, and today SREs impact all phases of the software lifecycle – everything from ensuring Product and Engineering teams consider requirements that support reliability, such as graceful degradation, to validating the automation and architecture of massive internal platforms and services. In fact, I believe the most fundamental requirement for achieving high-performance SLA targets within enterprise organizations begins with a foundational site reliability engineering strategy.
SRE groups can be viewed as the “Guardians of Production.” After all, their primary role is to fight for the user. On its surface, that may inspire visions of SREs constantly at battle with Product and Engineering organizations, persistently fighting to prioritize reducing tech debt and toil over features. If relationships are not carefully maintained, that can happen, but it’s certainly not the desired state. SREs should not drive results with a policy hammer, but with white gloves and a soft touch.
The most fundamental requirement for building a successful SRE organization is trust. Everyone throughout the organization must trust that the SRE group will consistently make the right decision to ensure the success of the product in question. SREs must build trust through partnerships across all levels of the organization, driving results by simultaneously contributing to and maintaining product, automation, and platforms that comprise the portfolio of an organization.
Only when SREs have demonstrated the ability to think outside the box, operate at an incredibly high level, and help the product group meet their goals, will they be trusted to influence decisions and expand their sphere of influence. Once SREs achieve this level of trust, they can begin influencing decisions that drive additional availability in production, and, ultimately, protect the end user’s experience.
This brings us, naturally, to our first best practice, because building this trust requires that product managers engage early with SREs.
Best practice #1: Engage early
Let’s take a scenario that may sound all too familiar: it’s two weeks before launch, and the product team is just starting to consider monitoring requirements. Product owners collect requirements and set goals for an application – for example, how long the home page should take to render. Once those goals are established, we know what we want to measure, but we still need to determine how it will be monitored. Reliability and operational readiness are foundational both to meeting the goals set by product management and to establishing the requirements for monitoring and measuring performance.
Engaging early makes a big impact on your future success. Consider engaging the following stakeholders early on in the process:
- Product Managers
- Line of Business Owners
- Engineering Leads
- Operational Leads
- Platform Leads
While it is certainly possible to implement monitoring capabilities just prior to launch and still meet the minimum requirements for observability, it won’t be cheap to move that fast. Still, dollar for dollar, spending on monitoring has some of the greatest ROI when it comes to ensuring you can meet performance and availability goals long-term – deprioritizing it simply because it is expensive at this point in product development is a short-sighted mistake.
On the other hand, if you had considered monitoring requirements at the very beginning, in a requirements-gathering session with the site reliability engineering team, you would have had more time to implement custom metrics in your code using an open-source framework. That approach is good and cost-effective, but it requires more investment of your team’s time – something you don’t have if you’ve waited until you’re about to deploy to production. This is where engaging the site reliability engineering team early would have helped.
Product management and site reliability engineering may work on different teams, but when it comes to application goals and performance, they need to work hand-in-hand. It is critical to engage SREs in this process before a single line of code is written because they:
- Will be responsible for services that contribute to application reliability and performance in production.
- Are concerned with multiple aspects of a service, including architecture and dependencies, metrics and monitoring, incident response, capacity planning, and more.
- Will prescribe the tools that the organization will use to monitor and measure against goals.
SREs can forge a solid working relationship with product managers early, and streamline the collaboration process, by providing a production readiness checklist that offers product managers guidance for what they need to think about, including:
- What are your goals for availability?
- How will you measure your application to ensure you’re meeting those goals?
- Where is your application architecture documented?
- Have you considered compute redundancy? (N+2, Network QoS, physical compute location)
- What DNS and Domains are required for this application? Is DNS critical to your load balancing strategy?
- Have you estimated traffic and bandwidth requirements?
- Have you load-tested to ensure capacity and performance at X% beyond traffic forecasts?
- What happens if you lose a (node, VM, server, rack, cluster)?
- What happens to your application if you lose network connectivity?
- What happens when your application loses connection to a backend service or API?
- How do you restart, rebuild or scale your application without impacting customer experience?
- How will you monitor to detect any of the above scenarios?
Once the checklist is complete, SREs will know what to monitor and observe to establish the performance thresholds that need to be met.
Best practice #2: Monitor and observe
When it comes to best practices, what’s old is new again: as we set out to build and deliver capabilities that meet high-performance goals, the scientific method should guide all of our decision-making. We must be able to form a hypothesis based on our current observations as they relate to our desired outcome or state, test that hypothesis with an experiment, and then observe the results to form our next hypothesis – continuously improving our systems and capabilities.
The first step in this process? Observing the current behavior of your application – a practice that has come to be known as observability. By ensuring we can observe the inner workings of our systems, we can make effective decisions and form hypotheses that guide roadmaps and the operationalization of our capabilities. But how do we go about observing?
There are a few different monitoring strategies that can be leveraged to achieve our goals:
- Instrumenting your code directly, leveraging metrics and timers that are available within your given language or framework, commonly referred to as “observability” (see the sketch after this list)
- Using a paid Software-as-a-Service Application Performance Monitoring (APM) solution that allows you to easily install a seamless agent
- Using analytics, real user monitoring (RUM) or synthetic transactions to provide insight into the performance of an application from the user’s perspective
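To make the first strategy concrete, here is a minimal sketch of direct instrumentation using the open-source prometheus_client library for Python. The metric names, failure rate, and port are illustrative assumptions, not a prescription:

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names -- choose names that match your own conventions.
REQUEST_LATENCY = Histogram("homepage_render_seconds",
                            "Time taken to render the home page")
REQUEST_ERRORS = Counter("homepage_errors_total",
                         "Home page requests that failed")

@REQUEST_LATENCY.time()          # records one latency observation per call
def render_home_page():
    time.sleep(random.uniform(0.05, 0.2))   # stand-in for real rendering work
    if random.random() < 0.01:              # stand-in for a 1% failure rate
        REQUEST_ERRORS.inc()
        raise RuntimeError("render failed")

if __name__ == "__main__":
    start_http_server(8000)   # exposes /metrics for a Prometheus scraper
    while True:
        try:
            render_home_page()
        except RuntimeError:
            pass
```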
Regardless of the strategy, the end goal is the same: To gain visibility into your application from multiple perspectives to provide you with as much information as possible, as near to real-time as possible. Whichever approach you choose, it’s important to ensure that you can fundamentally state, without question, that you know how your application is performing and what your customer experience is.
When choosing which model to follow, however, there are several factors to consider, primary among them being your desired speed-to-market, your budget, and an honest assessment of the maturity of your organization. In project management, this is the typical “good, cheap, fast” trade-off – if you need something fast and good, it likely won’t be cheap. And if you want something cheap and good, it will not be fast to implement.
The ability to aggregate visibility is also key as you seek to measure specific variables against goals and understand the health of an application. By building dashboards around the key indicators that inform and alert you about operations and dependencies, you can pinpoint problems and identify the dependencies that are impacted – is it a problem with code, the application or a dependency?
Best Practice #3: Incident management
Incident management is just a nice way to say what it really is: crisis management. Here’s where you want to control the chaos. With good monitoring and observability, you should know with relative confidence what is actually happening with applications and resources. But it doesn’t end there, because you also need to assemble a team to respond to the information and events those tools are generating. Teams need a framework to follow so they can manage the incident successfully, because incidents that aren’t managed correctly will wreak havoc on your applications and environment.
There are three categories of events, each demanding a different level of urgency and intervention (a toy routing sketch follows the list):
- Alerts should be reserved for critical events that require immediate human intervention and mitigation, 24/7/365.
- Tickets are low-priority events that can be addressed during normal business hours.
- Logs record that an event has occurred but require no intervention.
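As a toy illustration of those three categories, the sketch below routes events by severity. The severity labels and print statements are hypothetical stand-ins; a real implementation would hand off to your logging, ticketing, and paging systems:

```python
from enum import Enum

class Severity(Enum):
    LOG = 1      # record only, no intervention
    TICKET = 2   # address during business hours
    ALERT = 3    # page a human immediately, 24/7/365

def route_event(message: str, severity: Severity) -> None:
    # Hypothetical sinks -- substitute your real logging, ticketing, and paging tools.
    if severity is Severity.ALERT:
        print(f"PAGE on-call now: {message}")
    elif severity is Severity.TICKET:
        print(f"Open ticket for business hours: {message}")
    else:
        print(f"Write to log only: {message}")

route_event("Disk 80% full on db-01", Severity.TICKET)
route_event("Checkout error rate above SLO", Severity.ALERT)
```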
Early in the maturity timeline, it’s common for a team member working on an incident to make a gut decision on which action to take. And it may work. But that doesn’t mean it was the right action, and it doesn’t mean there wasn’t an action better targeted at the problem. The difference between a well-managed incident and an unmanaged one is the difference between random, gut-influenced changes – which often make things better and worse in equal measure – and coordinated activity that follows the scientific method, where each action is designed to test a hypothesis and improve the situation.
It’s critical to follow the scientific method for everything you do in response to an incident. This will help ensure that you’re making the right decisions to fix the problem by controlling the variables and measuring the results of actions taken. Taking actions based upon a hypothesis of an expected outcome, whether it’s successful or not, helps eliminate potential scenarios and moves the process of repair forward.
Best Practice #4: Experiment
Once you are able to observe your systems and begin to form hypotheses, you can confidently begin conducting experiments to validate those hypotheses and, as necessary, make changes to your application to improve performance, cost, or reliability.
To set up a good performance experiment, you want to mimic, to the best of your ability, how your customers are going to use your product. This will set you up for success on 95% of your use cases and customer interactions. However, it’s important to recognize that when it comes to customers, a familiar military adage perfectly describes usage in production: no battle plan survives first contact with the enemy. In other words, no matter how hard you try to predict exactly how a customer will use your site and its features, you can’t anticipate every scenario that will happen in production. We’ll see in the next section how to deal with those outliers, but don’t let that fact discourage you – non-production performance testing and experimentation are still critical.
For several years, “performance testing” has been a common step in the operationalization of an application or platform. However, far too often, it is seen as a necessary evil. Teams conduct a test without a goal in mind other than ensuring the application stays up past a certain point. On the other hand, more mature organizations form a specific prediction and hypothesis around how a configuration or application change might affect performance, then test that hypothesis with those specific goals in mind.
When determining how often to conduct performance experimentation, it’s critical to think about what the hypothesis is at a high level. If you consider that any change to a production environment has the potential to introduce risk or cause a performance degradation, then ideally you should conduct at least a basic performance test as part of every release. Integrating performance testing into your CI/CD pipeline as part of each change to your environment helps maintain confidence in your current configuration. Likewise, if you change the configuration itself, a performance test is recommended, when possible, to validate that the change had its desired effect.
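As one minimal example of a basic performance test that could run on every release, the sketch below uses only the Python standard library and fails the build when 95th-percentile latency exceeds the stated hypothesis. TARGET_URL, the latency budget, and the sample count are assumptions you would replace:

```python
import statistics
import sys
import time
import urllib.request

TARGET_URL = "http://localhost:8080/"   # assumption: the service under test
P95_BUDGET_SECONDS = 0.5                # hypothesis: p95 stays under 500 ms
SAMPLES = 100

latencies = []
for _ in range(SAMPLES):
    start = time.perf_counter()
    urllib.request.urlopen(TARGET_URL).read()
    latencies.append(time.perf_counter() - start)

p95 = statistics.quantiles(latencies, n=20)[-1]   # 95th percentile
print(f"p95 latency: {p95:.3f}s (budget {P95_BUDGET_SECONDS}s)")
sys.exit(0 if p95 <= P95_BUDGET_SECONDS else 1)   # nonzero exit fails the pipeline
```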
Experimentation is integral to validating hypotheses and making changes to improve application performance, cost, or reliability. Unfortunately, relying on humans alone to perform experiments isn’t the solution – you can’t put enough human resources toward experimentation to make a real difference. Fortunately, as we’ll see next, modern machine learning and automation capabilities can help.
Using chaos engineering to aid with experimentation
Chaos engineering allows us to take a disciplined approach to system integrity testing by simulating and identifying failures before they result in unplanned downtime or a poor user experience. It lets us take our theory or hypothesis – that if a service goes away or a connection gets broken, the system will survive – and test it directly. When it works, great. When it doesn’t, you learn that you need to change the application or infrastructure so that it actually behaves the way you expect: if part of the service breaks, the rest still works. This applies to data centers and networks, as well as the entire stack, to ensure that failovers and redundant paths are operating.
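For a very small taste of what such an experiment can look like in Kubernetes – a sketch, not a substitute for purpose-built chaos tooling – the script below deletes a random pod in an assumed non-production namespace, then checks an assumed health endpoint to test the hypothesis that the service survives. The namespace and URL are placeholders, and it requires kubectl with cluster access:

```python
import random
import subprocess
import time
import urllib.request

NAMESPACE = "staging"                              # assumption: non-production namespace
HEALTH_URL = "http://my-service.staging/healthz"   # assumption: your health endpoint

# List pods, pick a victim, and delete it.
pods = subprocess.run(
    ["kubectl", "get", "pods", "-n", NAMESPACE, "-o", "name"],
    capture_output=True, text=True, check=True,
).stdout.split()
victim = random.choice(pods)
subprocess.run(["kubectl", "delete", "-n", NAMESPACE, victim], check=True)

# Hypothesis: the service stays healthy while Kubernetes replaces the pod.
time.sleep(10)
status = urllib.request.urlopen(HEALTH_URL).status
print(f"Deleted {victim}; health check returned {status}")
```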
Best Practice #5: Optimize
Typically, teams configure an environment to hit a target of 99.9% availability based on the number of requests coming in at any given moment. In a Kubernetes environment, this often means trading cost for performance by requesting significantly more CPU or memory than is actually needed. There is a better alternative, however, and that’s where optimization comes into play.
To start optimizing, we need to look once again to experimentation using performance testing – the tweak-and-run cycle. Even the best enterprises run performance tests only once per week, and barely any manage even that. At that rate, it would take four years to run 208 performance tests and try 208 configuration changes for a single application. The math is simple: you can’t optimize at that rate. There aren’t enough bodies or budget to get the job done.
This is where automation and machine learning come in. Automation can help speed up the tweak and run cycle, running years worth of performance tests over the course of just a few hours, tweaking the configuration with each successive run to see how the application performs and what resource utilization results. Machine learning then provides the ability to analyze all those results and recommend configurations to try next in order to home in on the optimal configuration.
Using experimentation to optimize before deployment is valuable because you can try any scenario you can dream up, and iterate as many times as you’d like. But it’s important to also consider optimization during production because things change and, as we saw in the last section, it isn’t possible to account for every eventuality.
Because we are monitoring and observing our production environment (see Best Practice #2), we can also apply machine learning to that data to tweak and tune our configurations in production. Machine learning can look at actual usage trends to predict and recommend resource settings that will keep our environment running smoothly, minimizing resource usage while still ensuring performance that meets expectations.
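The idea can be illustrated with a greatly simplified sketch – this is not StormForge’s algorithm, just a percentile-plus-headroom heuristic applied to usage samples you would pull from your monitoring system; the sample data and headroom factor are hypothetical:

```python
import statistics

def recommend_request(usage_samples: list[float], headroom: float = 1.15) -> float:
    """Recommend a resource request: 95th percentile of observed usage plus headroom."""
    p95 = statistics.quantiles(usage_samples, n=20)[-1]
    return p95 * headroom

# Hypothetical CPU usage samples (in cores) scraped from monitoring.
cpu_usage = [0.21, 0.25, 0.19, 0.30, 0.27, 0.24, 0.22, 0.35, 0.26, 0.23,
             0.28, 0.31, 0.20, 0.29, 0.25, 0.27, 0.24, 0.33, 0.26, 0.22]
print(f"Recommended CPU request: {recommend_request(cpu_usage):.2f} cores")
```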
By introducing new levels of automation and machine learning in both pre-production and production, teams can efficiently and effectively optimize applications and resources, and meet performance, reliability, and cost goals.
Kubernetes Optimization with StormForge
StormForge has introduced a platform that enables teams to optimize Kubernetes applications in pre-production and production. The platform includes both experimentation-based optimization (Optimize Pro), and observation-based optimization (Optimize Live). StormForge enables you to run experiments and tweak applications in pre-production to get as near your target as possible prior to launch, and then observe and validate applications in production for continuous verification and improvement.
Optimize Pro: StormForge Optimize Pro leverages machine learning for multi-dimensional optimization to enable intelligent business trade-offs. Optimize Pro works proactively through a process of rapid experimentation in your non-production environment, with built-in performance testing and predictive what-if analysis that allows you to understand how your application will perform and how much it will cost to run before deployment. This allows your team to deliver efficient and reliable applications without time-consuming trial-and-error tuning.
Optimize Live: StormForge Optimize Live leverages machine learning for continuous optimization of production environments. Optimize Live uses machine learning to analyze existing observability data and make recommendations for CPU and memory requests and limits to improve efficiency. Recommendations can be automatically implemented or manually approved. Optimize Live is easy to deploy and delivers fast time-to-value, leveraging data already being collected.
Conclusion
Every organization’s journey to maturity is as unique as its business. But there is one thing all organizations share: a goal to achieve the highest level of service possible given the resources and skills of its team. Regardless of where your operations and team are on this journey, the five SRE best practices we’ve outlined will help you create a more resilient environment for operational maturity and a better customer experience.
Five SRE Best Practices for the Modern Enterprise
- Engage early
- Monitor and observe
- Incident management
- Experiment
- Optimize
Explore StormForge
Discover how StormForge can help automatically improve application performance and cost efficiency in cloud native environments.