Six lessons for keeping systems simple
At AWS re:Invent, CTO Werner Vogels shared some of Amazon's fundamental lessons learned in system design.
Translated from Werner Vogels' 6 Lessons for Keeping Systems Simple by Joab Jackson.
"Entities should not be multiplied without necessity" – William of Ockham, Occam's razor
Say what you will about Amazon Web Services, the cloud giant has managed to scale its systems and services for more than two decades, constantly introducing new features while keeping the user experience manageable.
In his customary Thursday keynote, AWS CTO Werner Vogels shared some of the lessons about keeping things simple that he has learned over two decades at the cloud giant.
Complexity always creeps into system design, so engineers must manage this complexity diligently.
The goal has always been to scale systems safely even as they grow more complex over time, he says.
From nimble to deadly
The goal is not to eliminate complexity, but to manage it effectively.
The danger in enhancing any system is that the changes introduce unnecessary complexity, which makes the system harder to maintain and less enjoyable to use.
But as Larry Tesler points out, complexity can't be eliminated, it's just transferred.
There is good complexity, which is added to help the system evolve. Then there is the kind of complexity that creeps in without architectural oversight, which only slows users down and makes the system harder to maintain, he explains.
Complexity depends not only on the number of components in the system, but also on how those components are arranged.
"You can't reduce the complexity of a given task below a certain point. Once you reach this point, you can only transfer the burden." – Larry Tesler
Here, Vogels uses bicycle design as an example.
In fact, the simplest form of a bicycle is a unicycle, which has only one wheel. Unicycles are very nimble ("they can turn in place"), but they are difficult to ride.
At the other extreme are tricycles, which are easy to ride but bulky and difficult to maneuver, as anyone who has ever ridden one knows.
The best design is actually a two-wheeled bike, which combines ease of use and flexibility.
"The bike has more components, but from a holistic point of view, it's in its simplest form," he says. It's no accident that the two-wheeled bicycle is the most popular form of cycle.
Put complexity where it's needed.
An example of this is Amazon S3 (Simple Storage Service), which originally provided only eventual consistency, not strong consistency. This meant that when a customer wrote an object to an S3 bucket, it would eventually become visible to readers, but not necessarily immediately.
However, customers wanted strong consistency and had begun to build workarounds to ensure strong consistency based on S3 in their own designs, which led to unnecessary complexity in their own systems.
AWS ultimately redesigned S3 to achieve strong consistency, "moving complexity to where it's needed," away from the customer.
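To make the burden on customers concrete, here is a minimal Python sketch of the kind of client-side workaround described above. It does not use the real S3 API: `EventuallyConsistentStore` is a hypothetical in-memory toy in which a write only becomes visible after a few stale reads, and `read_after_write` is an assumed retry helper of the sort a customer might have written.

```python
from typing import Dict, Optional, Tuple


class EventuallyConsistentStore:
    """Hypothetical toy store: a write becomes visible only after `lag` stale reads."""

    def __init__(self, lag: int = 2) -> None:
        self._lag = lag
        self._visible: Dict[str, bytes] = {}
        self._pending: Dict[str, Tuple[bytes, int]] = {}

    def put(self, key: str, value: bytes) -> None:
        # The write is accepted, but not yet visible to readers.
        self._pending[key] = (value, self._lag)

    def get(self, key: str) -> Optional[bytes]:
        # Each read moves a pending write one step closer to being visible.
        if key in self._pending:
            value, remaining = self._pending[key]
            if remaining <= 0:
                self._visible[key] = value
                del self._pending[key]
            else:
                self._pending[key] = (value, remaining - 1)
        return self._visible.get(key)


def read_after_write(store: EventuallyConsistentStore, key: str,
                     expected: bytes, max_attempts: int = 5) -> int:
    """Client-side workaround: re-read until the expected value shows up.

    Returns the number of reads it took, or raises TimeoutError.
    """
    for attempt in range(1, max_attempts + 1):
        if store.get(key) == expected:
            return attempt
    raise TimeoutError(f"{key!r} not visible after {max_attempts} reads")
```

Every caller that needs to read its own writes has to carry a helper like this, along with its retry limits and failure handling. With strong consistency in the storage service itself, a plain read is enough and that code disappears from the customer's system.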
So, how do system engineers manage complexity in their own systems?
Vogels offers six tips:
1. Build a system that can grow
Software systems that don't move forward will die. Even the most stable software will see the world move forward without it.
Also, who doesn't want to see their favorite apps have new features or run faster?
Your system will evolve over time, and you'll need to modify your architecture, Vogels advises.
"Every time you change orders of magnitude, you need to revisit your architecture," Vogels says. When S3 was launched, design engineers knew they were going to change the architecture. Over time, it has amassed an astonishing range of features despite the small footprint of its API.
"You'll see that every year we add new features without any impact on what we offer to our customers," he said.
The same is true of AWS's networking services, which have evolved significantly in response to user requests.
"We knew that whatever the networking capabilities we offered you in 2006 would definitely change radically by 2010 and by 2020," Vogels said.
You'll need to develop a strategy to deal with complexity, which will grow in your system over time.
Evolvability is a prerequisite for managing complexity, Vogels says.
2. Break complexity into pieces
Complexity can creep into your application gradually, like the heat on the proverbial frog in a pot of slowly warming water.
"Small changes may seem easy to manage and absorb at first, but if you ignore the warning signs, the system becomes more complex and more difficult to manage and understand," he said. The answer is to break down the system into multiple, more manageable components.
Amazon CloudWatch started out as a simple monitoring service. As more features were added, each one was loaded onto the landing page until the page became very busy. AWS then redesigned it to keep only the core functionality, and the other features were migrated to their own environments.
How big should the service be? It should fit in an engineer's brain.
"If you can't remember it, your service is usually too big," Vogels says.
3. Align the architecture with business needs
Vogels recommends building business-focused components with "intelligent endpoints," "fine-grained interfaces," and sophisticated APIs. Decouple them so that they can "develop independently."
Enterprise technology isn't built for itself. It's built with the customer in mind. As a result, system architects need to work closely with the business units they serve.
Collaboration is crucial. The business might say it needs 100% uptime. It's doable, but it's costly. As a result, system designers may need to indicate how expensive 100% reliability will be.
"Then you can have a conversation," he said.
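One hedged way to frame that conversation is as a downtime budget: how many minutes per year each availability target actually allows. The short Python sketch below is only an illustration. The function name and the list of targets are assumptions, not figures from the keynote, and it says nothing about what each target costs to achieve.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the downtime a given availability target permits per year."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Illustrative targets only; each extra "nine" shrinks the budget tenfold.
    for target in (99.0, 99.9, 99.99, 99.999, 100.0):
        minutes = downtime_minutes_per_year(target)
        print(f"{target:>7}% uptime allows {minutes:8.1f} minutes of downtime per year")
```

A 99.9% target leaves roughly 525.6 minutes (under nine hours) a year, while 99.999% leaves about 5.3 minutes, and 100% leaves nothing at all. Putting numbers like these next to the engineering cost of each step is one way for architects and the business to agree on a target.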
Vogels has famously said that everything fails, all the time. So the trick is to plan for failure.
4. Organize your work into cells
As an application grows in popularity and features, operating it also becomes more complex. Consider using a cell-based architecture to keep this growing operational complexity manageable.
"The time to build a system is often much less than the time to run it. Therefore, it is critical to invest in manageability in advance," says Vogels.
Operations must also be broken down into smaller building blocks. Doing so reduces the scope of impact (the blast radius) of any failure, which is essential to minimizing downtime.
"Cells create order in complex systems," he said. "They isolate the problem to a specific cell without affecting other cells."
A router and a control plane are required to route requests to individual cells. Routing keys can be based on region ID, host ID, or customer ID.
"Breaking it down into cells will, over time, help you maintain reliability and security for your customers," he said.
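The routing idea can be sketched in a few lines of Python. This is a minimal illustration, not AWS's implementation: the `Cell` and `CellRouter` classes are hypothetical, the control plane is left out, and hashing the customer ID modulo the number of cells is an assumed mapping (a production router would more likely use a lookup table so customers can be moved between cells).

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Cell:
    """Hypothetical isolated unit of capacity with its own failure state."""
    name: str
    healthy: bool = True
    handled: List[Tuple[str, str]] = field(default_factory=list)

    def handle(self, customer_id: str, request: str) -> str:
        if not self.healthy:
            raise RuntimeError(f"{self.name} is unavailable")
        self.handled.append((customer_id, request))
        return f"{self.name} handled {request!r} for {customer_id}"


class CellRouter:
    """Thin routing layer: maps each customer ID to exactly one cell."""

    def __init__(self, cells: List[Cell]) -> None:
        if not cells:
            raise ValueError("at least one cell is required")
        self._cells = list(cells)

    def cell_for(self, customer_id: str) -> Cell:
        # Assumed mapping: a stable hash of the routing key, modulo cell count.
        digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self._cells)
        return self._cells[index]

    def route(self, customer_id: str, request: str) -> str:
        return self.cell_for(customer_id).handle(customer_id, request)
```

Because a given customer ID always maps to the same cell, a failure in one cell only affects the customers routed to it. Requests for everyone else keep succeeding, which is the blast-radius reduction the section describes.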
5. Design predictable systems
Uncertainty is difficult to deal with. So, design your system ahead of time to reduce uncertainty.
AWS runs Hyperplane, the system behind its customer-facing load balancers, to handle the configuration changes made by millions of customers.
Surprisingly, AWS doesn't use an event-driven architecture for this service. Contrary to popular belief, Vogels says, an event-driven design is a poor fit for this task, because the rate of change requests from users is unpredictable.
Instead, AWS writes the changes to files in S3, which the load balancers fetch at a fixed polling interval.
"Simplicity requires discipline" – Werner Vogels, CTO at AWS
"It's a model we call constant work," Vogels says. This approach avoids spikes, backlogs, and bottlenecks, and also enables the system to self-heal because "S3 is impermeable."
AWS's Route 53 domain name service uses the same principle on its health checker: polling instead of queuing.
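The polling pattern described above can be sketched as follows. This is a simplified Python illustration under stated assumptions: `ConfigStore` is a hypothetical in-memory stand-in for the configuration file in S3, `LoadBalancer` and `run` are invented names, and a real system would also keep the file size fixed so that each fetch costs the same.

```python
import copy
import time
from typing import Any, Callable, Dict


class ConfigStore:
    """Hypothetical stand-in for the S3 file holding the full configuration."""

    def __init__(self) -> None:
        self._snapshot: Dict[str, Any] = {}

    def write(self, key: str, value: Any) -> None:
        # Customer changes land here; nothing is pushed to the load balancers.
        self._snapshot[key] = value

    def read_all(self) -> Dict[str, Any]:
        return copy.deepcopy(self._snapshot)


class LoadBalancer:
    def __init__(self, store: ConfigStore) -> None:
        self._store = store
        self.config: Dict[str, Any] = {}
        self.polls = 0

    def poll_once(self) -> None:
        # Same work every cycle: fetch and apply the whole snapshot,
        # whether zero or a thousand changes arrived since the last poll.
        self.config = self._store.read_all()
        self.polls += 1


def run(lb: LoadBalancer, cycles: int, interval_seconds: float = 1.0,
        sleep: Callable[[float], None] = time.sleep) -> None:
    """Poll on a fixed schedule, independent of how busy customers are."""
    for _ in range(cycles):
        lb.poll_once()
        sleep(interval_seconds)
```

The load balancer's workload depends only on the polling interval, not on the rate of customer changes, so a burst of writes cannot create a backlog. Because each poll replaces the entire local configuration, a load balancer that has missed updates or holds corrupted state is corrected on its next cycle.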
6. Automate everything
To manage complexity, automate it.
AWS uses automation to accomplish many tasks, even including building new regions, which is fully automated.
It's not a question of what to automate, but what not to automate. Only those decisions that really require human involvement should have human intervention. Everything else should be automated.
"Automation should be the default, and human involvement the exception," says Vogels. "Manual input should only be needed in areas where human judgment is really needed."
Security at AWS is highly automated, and includes processes such as automated threat intelligence. AWS handles "trillions" of DNS requests every day and identifies at least 100,000 malicious domain names daily through an automated process, a task that could not be done manually.
Support tickets are another area that lends itself to automation, through agents. Agents are best suited to very narrow use cases, where they are given a set of tools and work through a problem using a process called "serverless prompt chaining".
If the agent is unable to resolve the issue, it simply brings it to the attention of a human.
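The shape of such a narrow agent, a chain of small steps with a human as the fallback, can be sketched in Python. This is only an illustration: plain functions stand in for the model prompts and tools, no serverless platform is involved, and `Ticket`, `classify`, `resolve_password_reset`, and `handle` are all hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Ticket:
    text: str
    notes: List[str] = field(default_factory=list)
    resolution: Optional[str] = None


# A step returns True if it could do its job, False if it is out of its depth.
Step = Callable[[Ticket], bool]


def classify(ticket: Ticket) -> bool:
    """Stand-in for a first prompt that labels the ticket."""
    if "password" in ticket.text.lower():
        ticket.notes.append("category=password_reset")
        return True
    return False


def resolve_password_reset(ticket: Ticket) -> bool:
    """Stand-in for a second prompt that acts on the label with one narrow tool."""
    if "category=password_reset" not in ticket.notes:
        return False
    ticket.resolution = "Sent password reset link"
    return True


def handle(ticket: Ticket, steps: List[Step],
           escalate: Callable[[Ticket], None]) -> str:
    """Run the chain in order; hand the ticket to a human at the first failure."""
    for step in steps:
        if not step(ticket):
            escalate(ticket)
            return "escalated"
    return "resolved"
```

Each step sees the output of the previous one through the ticket's notes, and the chain stops as soon as a step cannot proceed. Keeping the steps this narrow is what makes the escalation rule simple: anything outside the agent's small set of tools goes to a person.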
"Automate everything that doesn't require a high degree of judgment," Vogels says.