Backpressure in Distributed Systems
Even the strongest, most well-designed dam cannot withstand the destructive power of an uncontrolled flood. Similarly, in distributed systems, unrestricted callers can overwhelm the entire system and cause cascading failures. Without proper protection, a retry storm has the potential to bring down an entire service. This article explores when a service should consider applying backpressure to its callers, how to apply it, and what callers can do to handle it.
Backpressure
As the name implies, backpressure is a mechanism by which a system limits the rate at which data is consumed or produced, in order to prevent itself or its downstream components from being overloaded. The backpressure a system applies to its callers is not always explicit, such as throttling or load shedding; it is sometimes implicit, such as increased request latency when the system slows down under load. Both implicit and explicit backpressure are intended to slow down the caller, either because the caller is misbehaving or because the service itself is in a bad state and needs time to recover.
When Backpressure Is Required
The following is an example of when a system needs to apply backpressure. In this example, a control plane service is built from three main components: a front end that receives client requests, an internal queue that buffers those requests, and a consumer application that reads messages from the queue and writes them to a database for persistence.
(1) Mismatch between producers and consumers
Imagine a scenario where clients hit the front end so frequently that the internal queue fills up, or where the worker threads writing to the database are so busy that the queue fills up behind them. In that case requests can no longer be enqueued, so rather than silently abandoning customer requests, it is better to notify the customer up front, as sketched below. This mismatch can happen for a variety of reasons, such as a surge in incoming traffic, or a brief outage in which the consumer service was down for a while and now needs extra capacity to work through the backlog that built up during the downtime.
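As an illustration, here is a minimal sketch of a front end that rejects requests early once its internal queue is full. The names (handle_request, REJECTED_QUEUE_FULL) and the fixed queue size are hypothetical, not from the original article:

import queue

request_queue = queue.Queue(maxsize=1000)  # bounded buffer between front end and consumer

def handle_request(request):
    try:
        # Enqueue without blocking; raises immediately when the queue is full
        request_queue.put_nowait(request)
        return "ACCEPTED"
    except queue.Full:
        # Explicit backpressure: tell the caller instead of silently dropping the request
        return "REJECTED_QUEUE_FULL"  # e.g., an HTTP 429 Too Many Requests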
(2) Resource constraints and cascading failures
Imagine a scenario where the queue is close to 100% of its capacity, while it usually sits around 50%. To match the increased incoming rate, the consumer application could be scaled up to write to the database at a higher rate. However, the database then crashes because it cannot handle the increase (for example, it hits its limit on writes per second). This failure can bring down the entire system and increase the mean time to recovery (MTTR). In such cases, applying backpressure at the right places becomes critical; a rough capacity check is shown below.
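To make the failure mode concrete, here is a small back-of-the-envelope check; all numbers are hypothetical. The consumer fleet must be capped so that its aggregate write rate stays under the database limit:

DB_MAX_WRITES_PER_SEC = 1000   # assumed database write limit
WRITES_PER_CONSUMER = 250      # assumed write rate of a single consumer instance

# Scaling beyond this many consumers pushes the database past its limit,
# so the scale-up itself must be backpressured
max_consumers = DB_MAX_WRITES_PER_SEC // WRITES_PER_CONSUMER
print(max_consumers)  # 4 -- a 5th consumer risks overloading the database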
(3) Missing the Service Level Agreement (SLA)
Consider a scenario where data written to the database is processed every 5 minutes, and another application listens for this data to keep itself up to date. Now, if the system cannot meet that SLA for some reason, for example because the queue is 90% full and may take 10 minutes to drain, it is best to apply backpressure. Customers can be notified that they will miss the SLA and advised to try again later, or backpressure can be applied by shedding non-urgent requests from the queue so that critical events/requests still meet the SLA, as in the sketch below.
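One way to implement the second option is a priority-aware drain step that sheds low-priority work when the SLA is at risk. This is a minimal sketch under assumed names (Request, priority, sla_at_risk); the original article does not prescribe a specific mechanism:

from collections import namedtuple

Request = namedtuple("Request", ["id", "priority"])  # assumed request shape

def drain(queue_items, sla_at_risk):
    # When the SLA is at risk, shed non-urgent requests and keep only critical ones
    if sla_at_risk:
        kept = [r for r in queue_items if r.priority == "critical"]
        print(f"shed {len(queue_items) - len(kept)} non-urgent requests to protect the SLA")
        return kept
    return queue_items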
The Challenge of Backpressure
Based on the above, it may seem that backpressure should always be applied, and that is indeed largely the case. The main challenge is not whether to apply backpressure, but how to determine the right points at which to apply it, and which mechanism to use so that specific service/business requirements are met.
Backpressure forces a trade-off between throughput and stability, and the challenge of load prediction further complicates this trade-off.
Determining the Backpressure Points
(1) Find bottlenecks/weak links
Every system has bottlenecks. Some bottlenecks are self-sustaining and self-protecting, while others are not. Imagine a system in which a large data plane cluster (thousands of hosts) relies on a small control plane cluster (fewer than 5 hosts) to receive configurations stored in a database. The large cluster can easily overwhelm the small one, so to protect itself, the small cluster should have a mechanism for applying backpressure to its callers. Another common weak link in an architecture is any centralized component that makes decisions for the entire system, such as an anti-entropy scanner. If it fails, the system may never reach a stable state, and in the worst case the entire service can go down.
(2) Using system dynamics: monitors/indicators
Another common way to find backpressure points is to set up appropriate monitors/metrics. Continuously monitor system behavior, including queue depth, CPU/memory utilization, and network throughput, and use this real-time data to identify emerging bottlenecks and adjust backpressure points accordingly. Creating a comprehensive view through metrics or observers (such as performance canaries across different system components) is another way to understand whether the system is under stress and should apply backpressure to its users/callers. These performance canaries can be isolated to different aspects of the system to pinpoint bottlenecks. In addition, a real-time dashboard of internal resource usage is another good way to leverage system dynamics to discover critical points and take proactive measures, as in the simple health check sketched below.
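For illustration, a trivial stress detector over such metrics might look like the following; the thresholds and metric names are assumptions made for this sketch:

# Hypothetical thresholds; in practice these come from load testing
QUEUE_DEPTH_LIMIT = 0.8   # fraction of queue capacity
CPU_LIMIT = 0.85          # fraction of CPU utilization

def under_stress(queue_depth_ratio, cpu_ratio):
    # Signal that backpressure should be applied when any metric crosses its limit
    return queue_depth_ratio > QUEUE_DEPTH_LIMIT or cpu_ratio > CPU_LIMIT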
(3) Boundary: Principle of Least Surprise
The surfaces most visible to clients are the ones where they interact with the service. Typically, these are the APIs that clients call to have their requests served. This is also where clients are least surprised when backpressure occurs, since it is a clear indication that the system is under stress; it can take the form of throttling or load shedding. The same principle applies within services themselves, across the subcomponents and interfaces through which they interact with each other. These surfaces are the best places to apply backpressure, helping to minimize confusion and making the system behave more predictably.
How to Apply Backpressure in Distributed Systems
The previous section discussed how to find the right points at which to apply backpressure. Once these points have been identified, here are some ways to apply backpressure in practice.
Building Explicit Flow Control
The idea is to let the caller see the size of the queue and control its call rate based on it. By knowing the queue size (or whatever resource is becoming the bottleneck), the caller can increase or decrease its call rate to avoid overloading the system. This technique is particularly useful where multiple internal components work together and need to minimize their impact on each other. The following formula can be used to calculate the caller's rate at any time. Note: the actual call rate will also depend on a variety of other factors, but this formula gives a good starting point.
CallRate_new = CallRate_normal * (1 - (Q_currentSize / Q_maxSize))
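In code, the formula amounts to scaling the normal call rate by the queue's remaining headroom. A minimal sketch, assuming the service exposes its current and maximum queue size to the caller:

def adjusted_call_rate(normal_rate, q_current_size, q_max_size):
    # Scale the call rate down linearly as the queue fills up: at 0% full the
    # caller runs at its normal rate, at 100% full it stops calling entirely
    return normal_rate * (1 - q_current_size / q_max_size)

print(adjusted_call_rate(100, 50, 100))  # 50.0 calls/sec when the queue is half full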
Reversal of Responsibility
In some systems, the flow of work can be reversed: instead of the caller pushing requests directly to the service, the service requests work from the caller when it is ready to serve it. This technique gives the receiving service full control over how much work it takes on, and lets it dynamically change the request size based on its latest state. A token bucket strategy can be employed, in which the receiving service fills a bucket with tokens and tells the caller when, and how many, tokens it may spend sending work. Here is an example algorithm:
# Service side: request work from the caller when it has capacity
if tokens_available > 0:
    work_request_size = min(tokens_available, work_request_size_max)  # request work, up to a maximum limit
    send_request_to_caller(work_request_size)

# Caller side: send work only if it has been granted enough tokens
if tokens_available >= work_request_size:
    send_work_to_service(work_request_size)
    tokens_available -= work_request_size

# Tokens are replenished at a fixed rate, capped at the bucket size
tokens_available = min(tokens_available + token_refresh_rate, token_bucket_size)
Proactive Adjustment
Sometimes it is known ahead of time that the system will soon be overwhelmed, and proactive measures can be taken, such as asking callers to reduce their call volume and then slowly ramping it back up. Imagine a scenario where a downstream service has been down and rejecting all requests. During this time, all work accumulated in a queue, which is now ready to be drained according to the SLA. However, draining the queue faster than the normal rate may cause the downstream service to fail again. To solve this, the number of requests from callers can be proactively limited, or callers can be asked to reduce their call volume, with the limit then relaxed slowly, as in the ramp-up sketch below.
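A simple way to relax the limit slowly is a multiplicative ramp-up toward the normal rate after recovery. The growth factor, interval, and rates in this sketch are illustrative assumptions, not from the original article:

import time

def ramp_up_rate(current_rate, normal_rate, growth_factor=1.1):
    # Grow the drain rate by 10% per interval instead of jumping straight
    # back to the normal rate after the downstream service recovers
    return min(current_rate * growth_factor, normal_rate)

rate = 10.0          # start draining at a deliberately low rate (requests/sec)
NORMAL_RATE = 100.0  # assumed steady-state rate
while rate < NORMAL_RATE:
    time.sleep(1)    # one adjustment interval
    rate = ramp_up_rate(rate, NORMAL_RATE)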
Rate Limiting
Rate limiting caps the number of requests a service will handle and drops requests that exceed that cap. It can be implemented at the service level or at the level of individual APIs. Rate limiting is direct feedback to the caller, prompting it to reduce its number of calls. Priority-based or fairness-based rate limiting strategies can further ensure that the impact on customers is minimized. A minimal per-API limiter is sketched below.
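For example, a fixed-window counter is one of the simplest ways to enforce a per-API limit; the window length and default limit here are hypothetical:

import time
from collections import defaultdict

WINDOW_SECONDS = 1
LIMITS = defaultdict(lambda: 100)  # assumed default of 100 requests/sec per API

windows = {}  # api_name -> (window_start, request_count)

def allow(api_name):
    now = time.monotonic()
    start, count = windows.get(api_name, (now, 0))
    if now - start >= WINDOW_SECONDS:
        start, count = now, 0  # start a new window
    if count >= LIMITS[api_name]:
        return False  # reject; the caller should back off, e.g., on an HTTP 429
    windows[api_name] = (start, count + 1)
    return True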
Load Shedding
While throttling drops requests when certain predefined limits are violated, load shedding drops requests based on the overall state of the service: even requests the service has already committed to serving can be dropped if it comes under too much pressure. This is usually a last resort for the service to protect itself, while letting the caller know it has done so. A sketch follows.
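As an illustration, a shedding check might consult overall service health rather than a fixed request counter. The health signals, thresholds, and shed fraction here are assumptions made for the sketch:

import random

def should_shed(queue_depth_ratio, cpu_ratio, shed_fraction=0.5):
    # Under severe pressure, probabilistically drop requests -- even ones the
    # service has already accepted -- so that the remainder can still be served
    overloaded = queue_depth_ratio > 0.95 or cpu_ratio > 0.95
    return overloaded and random.random() < shed_fraction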
In Conclusion
Backpressure is a significant concern in distributed systems: handled poorly, it can severely impact performance and stability. Understanding its causes and consequences, and mastering effective management techniques, is critical to building robust, high-performance distributed systems. Implemented properly, backpressure enhances stability, reliability, and scalability, thereby improving the user experience. Handled improperly, it can erode customer trust and even destabilize the system. Proactively addressing backpressure through careful system design and monitoring is key to maintaining system health. While implementing backpressure may involve trade-offs, such as a possible impact on throughput, the benefits are significant in terms of overall system resilience and user satisfaction.
Original title: Backpressure in Distributed Systems, Author: Rajesh Pandey