How Enterprises Can Prevent Cloud Disasters

2024.10.22

Large companies work very hard to ensure that their services don't fail, for the simple reason that major outages can damage the brand and drive customers to competing products with a better record of stability.

Building reliable Internet services is a complex technical problem, but it’s also a human challenge for company leaders. The difficulty in motivating engineering teams to work on reliability is that such work is often viewed as less sexy than developing new features.

Incentives rule the roost in large-scale operations. Top tech companies employ tens of thousands of people and run hundreds of Internet services. Over the years, they have come up with clever ways to ensure that engineers build reliable systems. This article discusses the people-management techniques that the most successful tech companies in history have adopted in large-scale environments. Whether you are an employee or a leader, you can apply these techniques at your own company.

Spin the Wheel of Fortune

AWS's Operations Review is a weekly meeting open to the entire company. At each meeting, a "lucky wheel" is spun to randomly select one of hundreds of AWS services for a live review. The selected team must answer pointed questions about its dashboards and metrics from experienced operations leaders. The meeting is attended by hundreds of employees, dozens of directors, and several vice presidents.

This gives every team an incentive to maintain basic operational capabilities. Even if the probability of a team being selected is low (less than 1% at AWS), as a team manager or technical lead, you certainly don’t want to appear clueless in front of half the company, especially on your “bad luck” day.
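To see why a small weekly probability still carries weight, it helps to add it up over a year. The sketch below is only an illustration: it assumes the wheel is spun once a week, that spins are independent, and that the weekly chance is the 1% upper bound quoted above. The function name is hypothetical.

```python
def chance_selected_at_least_once(weekly_probability: float, weeks: int = 52) -> float:
    """Probability of being picked in at least one of `weeks` independent spins."""
    if not 0.0 <= weekly_probability <= 1.0:
        raise ValueError("weekly_probability must be between 0 and 1")
    # Complement rule: 1 minus the chance of never being picked.
    return 1.0 - (1.0 - weekly_probability) ** weeks


# Assumption: 1% per weekly meeting (an upper bound), over a 52-week year.
yearly = chance_selected_at_least_once(0.01)
print(f"Chance of at least one review in a year: {yearly:.1%}")
```

Under those assumptions the yearly figure comes out at roughly 40%, which is far harder to ignore than the weekly one.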

It is important to review reliability metrics regularly. Leaders who take an interest in the health of their operations set the tone for the entire business. “Spinning the Wheel of Fortune” is just one of the tools to achieve this goal.

But what should you do during these operational reviews? That brings us to the next key point.

Set Quantifiable Reliability Goals

You might want “high uptime” or “five nines” (99.999% availability), but what does that mean to your customers? Real-time interactions (like chat) tolerate far less latency than asynchronous workloads (like training a machine learning model or uploading a video). Your goals should reflect what your customers care about.
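One way to make an availability percentage concrete is to convert it into a downtime budget. The sketch below assumes a 365-day year and treats availability as the fraction of time the service is up; the function name is hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability_percent: float, days: int = 365) -> float:
    """Minutes of downtime allowed over `days` for a given availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * days * MINUTES_PER_DAY


# Compare three, four, and five nines over one year.
for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_budget_minutes(target):.2f} minutes per year")
```

Five nines leaves a little over five minutes of downtime per year, which is why the target only makes sense if customers genuinely need it.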

When reviewing your team’s metrics, have them describe their quantifiable reliability goals. Make sure they can explain why they chose those goals, and that you understand the reasoning. Then have them use dashboards to demonstrate that the goals were achieved. Setting quantifiable goals helps you prioritize reliability efforts in a data-driven way.
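As a rough illustration of what "quantifiable" can look like, the sketch below checks a latency goal of the form "at least X% of requests finish within Y milliseconds" against a list of measurements. The class, the function names, and the sample numbers are all made up for the example; a real team would read these values from its monitoring system.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LatencyGoal:
    threshold_ms: float     # a request counts as "good" if it finishes within this time
    target_fraction: float  # fraction of requests that must be good, e.g. 0.9


def fraction_within(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Fraction of requests that completed within the threshold."""
    if not latencies_ms:
        raise ValueError("need at least one measurement")
    good = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return good / len(latencies_ms)


def goal_met(latencies_ms: Sequence[float], goal: LatencyGoal) -> bool:
    """True if the measured fraction of good requests reaches the target."""
    return fraction_within(latencies_ms, goal.threshold_ms) >= goal.target_fraction


# Hypothetical measurements for a chat service, in milliseconds.
sample = [120, 95, 310, 101, 88, 140, 99, 2050, 110, 105]
chat_goal = LatencyGoal(threshold_ms=300, target_fraction=0.9)
print(f"good fraction: {fraction_within(sample, chat_goal.threshold_ms):.0%}, "
      f"goal met: {goal_met(sample, chat_goal)}")
```

In this sample, two of the ten requests exceed 300 ms, so the 90% goal is missed. A gap like that is exactly what a review should surface and ask about.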

It’s important to focus on problem detection. If you see anomalies on a team’s dashboard, ask them what the problem is and whether their on-call staff has been notified. Ideally, the team should be aware of a problem before the customer is.

Embrace the Chaos

One of the most revolutionary shifts in thinking about cloud resilience is injecting failures into production environments. Netflix formalized this concept as “chaos engineering” — and it’s as cool as its name.

Netflix wanted to incentivize its engineers to build fault-tolerant systems, rather than achieve this through micromanagement. They believed that engineers would be forced to build fault-tolerant systems if systemic failures were the norm rather than the exception. While it took some time to achieve this, at Netflix, production environments ranging from single servers to entire availability zones are routinely "knocked out". Each service is expected to automatically absorb these failures without affecting service availability.
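The idea can be sketched with a toy, in-memory model. This is not Netflix's tooling, and every name here is hypothetical. Real chaos experiments terminate actual infrastructure, whereas this example only removes an entry from a set and then checks whether the service still has a healthy instance left.

```python
import random
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Service:
    name: str
    healthy_instances: set = field(default_factory=set)

    def is_available(self) -> bool:
        # Simplifying assumption: one healthy instance is enough to serve traffic.
        return len(self.healthy_instances) > 0


def knock_out_random_instance(service: Service, rng: random.Random) -> Optional[str]:
    """Remove one healthy instance at random; return its id, or None if none remain."""
    if not service.healthy_instances:
        return None
    # Sort first so the choice is reproducible for a given seed.
    victim = rng.choice(sorted(service.healthy_instances))
    service.healthy_instances.remove(victim)
    return victim


rng = random.Random(42)  # fixed seed so the experiment can be replayed
checkout = Service("checkout", {"i-1", "i-2", "i-3"})
victim = knock_out_random_instance(checkout, rng)
print(f"Knocked out {victim}; still available: {checkout.is_available()}")
```

A service with a single instance would fail this check on the first injected failure, which is the kind of weakness routine failure injection is meant to expose early.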

This strategy is expensive and complex, but if you are releasing a product where high uptime is absolutely essential, injecting failures into production is a very effective way to get something like a "proof of correctness". If your product requires this, introduce this strategy early. It won't be easier or cheaper in the future.

If chaos engineering seems a bit too radical, at least require the team to conduct a "dry day" (a simulated outage drill, often called a "game day") once or twice a year, or before any major feature launch. A dry day has three designated roles: the first simulates the outage, the second fixes it without prior knowledge of the problem, and the third observes and takes detailed notes. Afterwards, the entire team should get together to review the simulated incident (see below). Dry days not only reveal shortcomings in how the system handles outages, but also expose gaps in how engineers respond to them.

Establish a Strict Review Process

A company's review process reflects its culture. Top tech companies require teams to write reviews of major outages. The reports should describe the incident, explore the root causes, and suggest preventive measures. Reviews should be rigorous and held to a high standard, but the process should not blame individuals. Review writing is a corrective action, not a punitive action. If an engineer made a mistake, there was an underlying problem that allowed it to happen. Maybe you need a better testing process or better protection for critical systems. Dig deep into those systemic vulnerabilities and fix them.

Designing a robust review process could be an article in its own right, but what is certain is that having such a process in place will significantly reduce the chance of the same outage happening again.

Reward Reliability Work

If engineers believe that only developing new features will lead to pay raises and promotions, reliability work will fall by the wayside. Most engineers, regardless of seniority, should contribute to operational excellence. Reward reliability improvement work in performance reviews. Hold senior engineers accountable for the stability of the systems they oversee.

While this advice may seem obvious, it’s easy to overlook.

Conclusion

This article has explored some essential tools for building reliability into your company culture. Startups and early-stage companies don’t often prioritize reliability. This is understandable: your company must focus on validating product-market fit to ensure survival. Once you have repeat customers, however, your company’s future will depend on maintaining their trust. Humans earn trust through reliability, and the same is true for Internet services.