This blog originally appeared on The New Stack.

Earlier in my career as an IT consultant, I was involved in performance testing for a large retailer before the integration of inventory, distribution and fulfillment systems for online channels. As you can imagine, the peak holiday season was critical, both in terms of maximizing revenue and meeting customer expectations for service.

The retailer had implemented a then-new omnichannel order management system and needed to ensure that it could withstand the stress of a high volume of orders. Our team needed to test the systems, identify areas to optimize and do both quickly, working as if the business depended on it, because it did.

Fast forward to now, and the challenge we faced then still exists: how to ensure optimal performance and resource utilization without introducing duplicated effort, inefficiency, extra cost and business risk.

Or, to put it simply, how can we learn more about our critical business systems so that we can operate them with confidence and forecast risk?

Learning from What You See or What You Control

We have two primary ways of acquiring knowledge: through observation or through experimentation. Either we observe how our systems are behaving, look at the data for clues and hopefully come to accurate conclusions, or we experiment in a controlled environment with a full ability to manipulate input values, make note of output values and draw more confident conclusions to get meaningful outcomes.

With each method, the end game is the same: We want to be sure of our conclusions so we can make informed decisions based on knowledge we believe is accurate. However, compared with experimentation, an observational approach to learning increases your chances not only of being wrong, but also of worsening the situation you set out to improve in the first place.

Observation is at the mercy of the individual, their skill set and their ability to draw the correct conclusions. Observational learning is not only hard to scale, it’s also inherently risky, and valuable time can be wasted trying to get it right. On the other hand, controlled experimentation enables you to learn quickly and cost-effectively without introducing personal bias.

As IT professionals, we participate in both types of learning during our day-to-day activities. For example:

  • Outage troubleshooting: teams investigate logs and metrics to learn why a system failed so they can avoid future outages.
  • Application debugging: engineers look for erroneous behavior caused by faulty code logic or unanticipated corner cases, form initial hypotheses and then run experiments to test them.
  • Performance and cost optimization: teams run experiments for given scenarios using a fully controlled feedback loop of iterative tuning (setting input variables, measuring outcomes and adjusting the inputs to converge on desired outcomes).

Each of these activities creates an opportunity to add to an organization’s collective knowledge base, but not all of them lead to undisputed outcomes, or at least not as quickly or with as little stress as you’d like. The key is to determine the most effective approach to gaining knowledge and then use what you learn to inform actions and improve the accuracy of the conclusions that result.
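
To make the feedback loop in the third item concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: measure_latency_ms is a hypothetical stand-in for a real load test, its latency formula is invented, and tune_cpu is simply one naive way to close the loop, not how any particular tool works.

```python
def measure_latency_ms(cpu_millicores: int) -> float:
    """Hypothetical stand-in for a real load test (invented formula)."""
    # Assumed toy model: a fixed 20 ms overhead plus a part that shrinks with CPU.
    return 20.0 + 50_000 / cpu_millicores


def tune_cpu(target_ms, start=100, step=100, max_cpu=4000):
    """Raise the CPU setting until measured latency meets the target."""
    history = []  # every (input, outcome) pair adds to what we know
    cpu = start
    while cpu <= max_cpu:
        latency = measure_latency_ms(cpu)  # measure the outcome
        history.append((cpu, latency))
        if latency <= target_ms:           # converged on the desired outcome
            return cpu, history
        cpu += step                        # adjust the input and try again
    raise RuntimeError("target not reached within the allowed CPU range")


cpu, history = tune_cpu(target_ms=125.0)
print(cpu, len(history))
```

With these made-up numbers, the loop stops at 500 millicores after five measurements. The point is less the answer than the history: every run is a controlled input paired with a measured outcome, which is what separates an experiment from an observation.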

Observational vs. Experimental Research

Although the processes and tools differ, both observational and experimental research are established and accepted ways to gain knowledge about the world around us.

  • Observational Research: With no control over variables or experiment design, observational research relies on best-effort analysis based upon an observer’s logic skills. Independent variables can be difficult to isolate, which makes it harder to establish cause-and-effect relationships and leaves the analysis susceptible to bias and incorrect correlations. This approach is often time-consuming, but it may sometimes be the only choice. For IT teams, it can result in inefficient and reactive problem-solving, for example in the cases of outage troubleshooting and application debugging. Without a way to confidently predict the outcome of the team’s efforts, problems tend to arise before solutions are in place.
  • Experimental Research: Unlike observational research, an experimental approach enables significantly greater control, with experiments designed for full variable isolation. In this way, a properly designed experiment can provide conclusive and confident results. This approach enables IT teams to proactively improve systems and processes, such as application debugging and optimization. However, to take advantage of an automated experimental research approach, organizations need a convenient, cost-effective solution.

For IT teams, experimental research greatly increases the likelihood of learning under preferred conditions. It brings the advantages of the scientific method to IT challenges: proactive, formal experiments with full control of variables. While reactive observational research situations are unavoidable, they can and should be used to form hypotheses for future experimental research.

Part of the challenge for IT in the past was that the tools didn’t exist to help teams easily implement and adopt an experimental research approach for IT infrastructure. The sheer number of variables and the manual effort required to effectively manipulate configurations didn’t allow teams to scale this approach. The addition of cloud platforms sharply increased the number of variables in play, making it nearly impossible, and too costly, to apply experimental research to modern infrastructure by hand. Instead, IT teams overprovisioned resources to ensure performance and capacity goals were met, even if that resulted in skyrocketing costs.

Modern Tools Accelerate Learning for Better Decision-Making

Automation and new tools (such as our own here at StormForge) have changed the game. These solutions help IT proactively acquire knowledge to drive outcomes such as efficiency and intelligent business trade-offs between cost and performance, without time-consuming, ineffective trial and error. With automation in place of manual operation, multiple variables can be manipulated at once rather than one at a time. Machine learning can also help sidestep logical fallacies and establish stronger links between cause and effect.
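
Why does changing several variables at once matter? A small Python sketch can show it, under loudly stated assumptions: the two settings, their allowed values and the score function are all invented for illustration (lower is better), and the exhaustive search here is a plain stand-in, not a description of any particular product.

```python
from itertools import product

VALUES = (1, 2, 3)  # hypothetical allowed values for each setting


def score(setting_a, setting_b):
    """Invented objective, lower is better; the two settings interact."""
    return (setting_a - setting_b) ** 2 + (setting_a - 3) ** 2


def one_at_a_time(baseline=(1, 1)):
    """Tune each setting alone while the other stays at its baseline."""
    base_a, base_b = baseline
    best_a = min(VALUES, key=lambda a: score(a, base_b))
    best_b = min(VALUES, key=lambda b: score(base_a, b))
    return best_a, best_b


def all_at_once():
    """Try every combination, so interactions between settings are visible."""
    return min(product(VALUES, VALUES), key=lambda ab: score(*ab))


manual = one_at_a_time()
combined = all_at_once()
print(manual, score(*manual))
print(combined, score(*combined))
```

In this toy case, tuning one setting at a time settles on (2, 1) with a score of 2, while searching the combinations finds (3, 3) with a score of 0. Trying every combination stops being practical as the number of settings grows, which is where automation and smarter search earn their keep.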

Observation is still imperative for IT, and observability tools play a vital role in the collection of data. But tools like ours go beyond troubleshooting to enable experiments that empower engineers to proactively make smart resource decisions. Those decisions minimize the cost of running applications, and the time and effort spent deciding, while ensuring business goals are met. Now IT can dramatically reduce the time it takes to get answers and do it before problems arise. It doesn’t take a scientist to see the value that brings to the IT equation, and to the business as a whole.