Amazon's Practice: The Difficulty of Distributed Systems

2024.09.27

Problem 1: Non-standardization of heterogeneous systems

There are many standardization problems in software development and operations. For example, there is no unified standard for software and applications, communication protocols and data formats vary, and development and operation-and-maintenance processes and methods are inconsistent. Different software and programming languages bring different compatibility issues, as well as different standards for development, testing, and operations. These differences lead to different ways of developing and operating systems, which increases the complexity of the architecture.

For example, some software is configured by editing a .conf file, while other software exposes a management API. In terms of communication, different software uses different protocols, and even within the same network protocol, data formats can differ. Different technical teams use different technology stacks, which further diversifies development and operation methods. These differences make a distributed system architecture extremely complex, so the architecture needs corresponding specifications.

For example, I have noticed that many service APIs do not use HTTP error status codes when returning errors. Instead, they return HTTP 200 and embed error information like "error, bla bla error message" in the JSON string in the response body (HTTP body). This design is not only counterintuitive, it also greatly complicates monitoring. Today we should use specifications such as Swagger (OpenAPI) to standardize APIs.
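A minimal sketch of the point, assuming a hypothetical order-lookup handler written in Go: errors are reported through real HTTP status codes, so monitoring can count 4xx/5xx responses directly instead of parsing error strings out of 200 responses.

```go
// A sketch only; the /orders endpoint and error type are illustrative assumptions.
package main

import (
	"encoding/json"
	"errors"
	"net/http"
)

var errNotFound = errors.New("order not found") // hypothetical business error

func getOrder(w http.ResponseWriter, r *http.Request) {
	order, err := loadOrder(r.URL.Query().Get("id"))
	if errors.Is(err, errNotFound) {
		// Return the proper status code so monitoring can see the error class directly.
		http.Error(w, err.Error(), http.StatusNotFound)
		return
	}
	if err != nil {
		http.Error(w, "internal error", http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(order) // 200 only for genuine success
}

func loadOrder(id string) (map[string]string, error) {
	if id == "" {
		return nil, errNotFound
	}
	return map[string]string{"id": id, "status": "paid"}, nil
}

func main() {
	http.HandleFunc("/orders", getOrder)
	http.ListenAndServe(":8080", nil)
}
```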

Another example is configuration management. Many companies use only a simple key-value format for software configuration. Although this is very flexible, it is also easy to abuse: non-standard configuration names, non-standard values, and even front-end display content embedded directly in the configuration. A reasonable configuration management scheme should be divided into three layers: the bottom layer relates to the operating system, the middle layer to the middleware, and the top layer to the business application. The configuration of the bottom and middle layers should not be modified by users at will; instead, users should choose from templates. For example, operating-system-related configuration should be packaged into standard templates for users to select, rather than edited freely. Only when configuration management is standardized can we effectively control the complexity of the various systems.
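A rough sketch of this layering, with hypothetical field and template names chosen only for illustration: the OS and middleware layers are selected by template name rather than edited directly, and only the application layer is freely owned by the business team.

```go
package config

// OSConfig and MiddlewareConfig are picked from named templates; users do not
// edit their contents directly.
type OSConfig struct {
	TemplateName string // e.g. "linux-web-standard" (hypothetical)
	FileLimit    int
	TCPBacklog   int
}

type MiddlewareConfig struct {
	TemplateName string // e.g. "tomcat-8g-default" (hypothetical)
	HeapSizeMB   int
	MaxThreads   int
}

// AppConfig is the only layer the business team changes freely, and even here
// keys should follow a naming convention rather than arbitrary key-value pairs.
type AppConfig struct {
	ServiceName    string
	TimeoutMillis  int
	FeatureToggles map[string]bool
}

// LayeredConfig ties the three layers together for one deployment.
type LayeredConfig struct {
	OS         OSConfig
	Middleware MiddlewareConfig
	App        AppConfig
}
```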

For example, a data communication protocol usually consists of a protocol header and a protocol body. The protocol header defines the basic protocol data, while the protocol body carries the business data. We need strict standards for the definition of the protocol header so that every team using the protocol follows the same set of rules. This standardization helps us monitor, schedule, and manage requests more effectively.
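A minimal sketch of the header/body split; the specific header fields below (request ID, source service, trace ID, and so on) are assumptions for illustration, not any particular company's standard.

```go
package protocol

import (
	"encoding/json"
	"time"
)

// Header carries the standardized, business-agnostic fields that every team
// must fill in the same way, so requests can be monitored, routed, and traced
// uniformly across services.
type Header struct {
	RequestID     string    `json:"request_id"`
	SourceService string    `json:"source_service"`
	TargetService string    `json:"target_service"`
	Timestamp     time.Time `json:"timestamp"`
	TraceID       string    `json:"trace_id"`
	Version       string    `json:"version"`
}

// Message wraps the standardized header around an arbitrary business payload.
type Message struct {
	Header Header          `json:"header"`
	Body   json.RawMessage `json:"body"` // business data, opaque to the infrastructure
}
```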

Through these standardization measures, we can better control the complexity of distributed systems and improve the maintainability and stability of the system.

Problem 2: Service dependency issues in system architecture

In a traditional monolithic application, if the server goes down, the entire application stops running. This does not mean, however, that a distributed architecture is immune to similar problems. In a distributed system there are usually dependencies between services; when a service in the dependency chain fails, it can trigger a "domino effect" and set off a chain reaction. So in a distributed architecture, service dependencies can also cause many problems.

A typical situation is that a non-critical service is depended on by a critical service, which makes the non-critical service just as important as the critical one. The "barrel effect" often appears in service dependency chains: the service level agreement (SLA) of the whole system is determined by the worst-performing service. This falls within the scope of service governance. Effective service governance requires not only that we define the importance of each service, but also that we clearly define and describe the main call paths of critical services. Without such governance and management, it is difficult to operate and manage the whole system effectively.
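A back-of-the-envelope sketch of why a dependency chain drags the SLA down: when a critical request must pass through a chain of services in series, the end-to-end availability is the product of the links, so it is worse than even the weakest single link. The numbers are made up.

```go
package main

import "fmt"

func main() {
	// Hypothetical availability of each service on a critical call path.
	chain := []float64{0.9999, 0.999, 0.995, 0.9999}

	endToEnd := 1.0
	for _, a := range chain {
		endToEnd *= a // serial dependencies multiply, so availability only shrinks
	}
	fmt.Printf("end-to-end availability: %.4f\n", endToEnd) // ~0.9938, worse than any single link
}
```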

It is important to note that although many distributed architectures achieve business isolation at the application level, they do not do so at the database level. If a non-critical business overloads the shared database, it can drag down the entire system and make the whole site unavailable. Therefore, corresponding isolation is also required at the database level, and the best practice is to give each business line its own database. This is also one of Amazon's practices: systems are forbidden to access each other's databases directly and may only interact through service interfaces. This approach meets the requirements of a microservice architecture.
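A minimal sketch of what "interact only through service interfaces" looks like in code, under the assumption of two hypothetical services: "order" owns its own database, and "report" never opens a connection to that database, it only calls the order service's API.

```go
package report

import (
	"encoding/json"
	"net/http"
)

// GetOrderStatus asks the order service over its public interface; the order
// database's connection string is never shared with this service.
func GetOrderStatus(orderID string) (string, error) {
	resp, err := http.Get("http://order-service/orders?id=" + orderID) // hypothetical endpoint
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	var out struct {
		Status string `json:"status"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	return out.Status, nil
}
```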

In short, we need not only to split the services, but also to allocate an independent database to each service to avoid interference between services. Only then can we truly achieve business isolation and improve the reliability and stability of the system.

Problem 3: The probability of failure is greater

In a distributed system, because far more machines and services are involved, the probability of failure is higher than in a traditional monolithic application. Isolation can reduce the scope of a failure, but with so many components and such a complex structure, failures occur more frequently. Moreover, the system is harder to manage and its overall architecture is harder to grasp, so mistakes are easier to make. This is close to a nightmare for the operation and maintenance of distributed systems. Over time, we gradually realize the following:

  1. Failures themselves are not terrible; long recovery times are the real problem. In a distributed system failures are almost inevitable, but if recovery takes too long, the impact on the business is serious (the calculation after this list illustrates why recovery time dominates).
  2. Failures themselves are not terrible; an overly large blast radius is. When designing distributed systems, we need to limit the impact of a failure as much as possible and avoid chain reactions caused by single points of failure.
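A rough illustration of the first point, using the standard availability formula A = MTBF / (MTBF + MTTR) with hypothetical figures: with the same failure frequency, cutting mean time to recovery from 60 minutes to 5 minutes moves availability from about 99.86% to about 99.99%.

```go
package main

import "fmt"

// availability computes A = MTBF / (MTBF + MTTR).
func availability(mtbfHours, mttrHours float64) float64 {
	return mtbfHours / (mtbfHours + mttrHours)
}

func main() {
	const mtbf = 720.0 // assume one failure per month, in hours
	fmt.Printf("MTTR 60 min: %.5f\n", availability(mtbf, 1.0))      // ~0.99861
	fmt.Printf("MTTR  5 min: %.5f\n", availability(mtbf, 5.0/60.0)) // ~0.99988
}
```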

Operations teams face tremendous pressure in distributed systems, dealing with failures of all sizes almost all the time. Many large companies try to cope by adding huge numbers of monitoring indicators, sometimes tens of thousands of monitoring points. This is really a "brute force" strategy. On the one hand, so much information causes overload and makes it hard to extract valuable insights; on the other hand, the service level agreement (SLA) requires us to define the key metrics, the most important performance and status indicators, yet many companies ignore this in practice.

This practice reflects lazy operational thinking, because it focuses only on the "firefighting stage" rather than the "fire prevention stage". As the saying goes, "prevention is better than firefighting": we need to consider failures in advance when designing and operating systems, which is what "design for failure" means. In system design we should consider how to mitigate the impact of failures, and where failures cannot be avoided entirely, recover from them as quickly as possible through automated means to minimize the impact.
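One small "design for failure" sketch, assuming a stand-in remote call that fails transiently: instead of assuming the dependency never fails, the call is wrapped in a bounded retry with exponential backoff so transient failures recover automatically rather than paging a human.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// callDependency is a placeholder for a real remote call that may fail transiently.
func callDependency() error {
	return errors.New("temporary failure")
}

// callWithRetry retries a bounded number of times, backing off exponentially.
func callWithRetry(attempts int, baseDelay time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = callDependency(); err == nil {
			return nil
		}
		time.Sleep(baseDelay * time.Duration(1<<i)) // back off: base, 2x, 4x, ...
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}

func main() {
	if err := callWithRetry(3, 100*time.Millisecond); err != nil {
		fmt.Println(err)
	}
}
```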

As the number of machines and services in the system grows, human limitations gradually become the management bottleneck. People cannot manage a complex system in every dimension, so automation becomes essential. "People manage code, code manages machines; people do not manage machines directly." This means managing the complexity of distributed systems through automation and code governance: people focus on managing code and policies, and concrete system operations are left to automated systems. This not only improves the stability and controllability of the system, but also significantly reduces the burden on the operations team.

Problem 4: Multi-layer architecture has greater O&M complexity

Generally speaking, we can divide the system into four layers: the base layer, the platform layer, the application layer, and the access layer.

  1. The base layer includes infrastructure such as machines, networks, and storage devices.
  2. The platform layer refers to the middleware layer, which includes software such as Tomcat, MySQL, Redis, and Kafka.
  3. The application layer includes various business software and functional services.
  4. The access layer handles incoming user requests, for example gateways, load balancing, CDN, and DNS.

For these four layers, we need to make one thing clear: problems at any layer will affect the operation of the entire system. Without a unified view and management mechanism, the operation and maintenance of each layer will be fragmented, leading to greater complexity.

Many companies divide their teams by skill set, such as product development, middleware development, business operations, and system operations. This division of labor causes each team to focus only on the part it is responsible for, leading to a lack of coordination and poor information flow across the system. When a problem occurs in one link, the whole system behaves like falling dominoes: one failure triggers a chain reaction, and the scope of impact keeps expanding.

Because there is no unified operation and maintenance view, teams cannot clearly see how a service call flows through each layer and each resource, so when a failure occurs, locating the problem and coordinating across teams consumes a great deal of time and energy. I ran into a similar situation while working on a cloud platform: from the access layer to the load balancer, to the service layer and the operating system layer, the KeepAlive parameters of each link were set inconsistently, so the software's actual runtime behavior did not match the documentation. Engineers were repeatedly frustrated during troubleshooting, thinking they had solved one problem only to have a new one appear. Only after repeated rounds of troubleshooting and debugging were all the KeepAlive parameters finally made consistent, which cost a great deal of time.
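A simplified sketch of that class of mismatch, using Go's standard HTTP types purely as an example (not the cloud platform described above): if the client keeps idle connections alive longer than the server does, requests can land on connections the server has already closed. The fix is to make the two timeouts consistent, with the client's idle timeout no longer than the server's; the durations below are illustrative.

```go
package main

import (
	"net/http"
	"time"
)

func main() {
	// Server side: how long an idle keep-alive connection stays open.
	srv := &http.Server{
		Addr:        ":8080",
		IdleTimeout: 60 * time.Second,
	}

	// Client side: hold idle connections for less time than the server does,
	// otherwise the client may reuse a connection the server has already closed.
	client := &http.Client{
		Transport: &http.Transport{
			IdleConnTimeout: 50 * time.Second,
		},
	}
	_ = client

	srv.ListenAndServe()
}
```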

Clearly, the division of labor itself is not the problem; what matters is whether collaboration after the division of labor is unified and standardized. This deserves particular attention. In system operation and maintenance, coordination and consistency across layers, standardized management, and a grasp of the overall view of the system are the keys to avoiding operational chaos, improving efficiency, and keeping the system stable.