Let's Talk About the Three Rules of Software Architecture Scalability

For most commercial and government systems, scalability is not a primary requirement in the early stages of development and deployment. Usability and useful new features drive the development cycle: as long as the system can handle its normal load, new features keep being added to increase its business value. At some point, however, performance and scalability become urgent issues that can even threaten the system's survival. When that happens, it is the architect's responsibility to evolve the system into one that is responsive and scalable.

The relationship between cost and scalability

One of the core principles of system scalability is the ability to easily add new resources to handle increased load. For many systems, a simple and effective approach is to deploy multiple stateless server instances and use a load balancer to distribute requests among them, as shown in the following figure.
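As a concrete illustration, here is a minimal round-robin distribution sketch in Java. The instance addresses and the class name RoundRobinBalancer are hypothetical, and a real deployment would use the cloud provider's load balancer rather than hand-rolled code; the sketch only shows why statelessness matters: any instance can serve any request, so distribution can be a simple rotation.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Minimal round-robin load balancer sketch: because the server instances are
// stateless, any instance can handle any request, so a simple rotation works.
public class RoundRobinBalancer {
    private final List<String> instances;       // hypothetical instance addresses
    private final AtomicLong counter = new AtomicLong();

    public RoundRobinBalancer(List<String> instances) {
        this.instances = List.copyOf(instances);
    }

    // Pick the next instance for an incoming request.
    public String nextInstance() {
        int index = (int) (counter.getAndIncrement() % instances.size());
        return instances.get(index);
    }

    public static void main(String[] args) {
        RoundRobinBalancer lb = new RoundRobinBalancer(
                List.of("10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"));
        for (int i = 0; i < 6; i++) {
            System.out.println("request " + i + " -> " + lb.nextInstance());
        }
    }
}
```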

If these resources are deployed on a cloud platform, the cost has two parts: the virtual machine cost of each deployed server instance, and the load balancer cost, which is determined by the number of new and active requests and the amount of data processed.

In this scenario, as the request load grows, the deployed virtual machines need more processing power, which drives up their cost, and the load balancer cost grows in proportion to the request volume and data size. Cost and scale are therefore intertwined: design decisions made for scalability inevitably affect the cost of deployment. Ignore this, and you may receive an unexpectedly large deployment bill.
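To make that relationship concrete, here is a rough back-of-the-envelope sketch. Every number in it (the VM hourly price, the per-request and per-gigabyte load balancer charges, the monthly traffic) is an assumption for illustration, not any real provider's pricing; the point is only that both cost components grow with load.

```java
// Back-of-the-envelope deployment cost sketch. All prices and traffic figures
// are hypothetical: VM cost scales with instance count, load balancer cost
// scales with request volume and data processed.
public class CostSketch {
    public static void main(String[] args) {
        int instances = 8;
        double vmHourly = 0.10;             // assumed $/hour per VM
        double lbPerMillionRequests = 0.60; // assumed $ per million requests
        double lbPerGb = 0.008;             // assumed $ per GB processed

        double monthlyRequestsMillions = 500; // assumed monthly request volume
        double monthlyGbProcessed = 2_000;    // assumed monthly data volume

        double vmCost = instances * vmHourly * 24 * 30;
        double lbCost = monthlyRequestsMillions * lbPerMillionRequests
                + monthlyGbProcessed * lbPerGb;

        System.out.printf("VM cost: $%.2f, LB cost: $%.2f, total: $%.2f%n",
                vmCost, lbCost, vmCost + lbCost);
    }
}
```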

So how can we reduce costs? There are two main levers. First, use an elastic load balancer that adjusts the number of server instances to match the real-time request load. Second, increase the capacity of each server instance, which is usually done by tuning server deployment parameters such as the number of threads, the number of connections, and the heap size. Carefully chosen parameter settings can significantly improve performance and therefore capacity.
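The sketch below illustrates the second lever using Java's standard ThreadPoolExecutor. The thread counts and queue length are placeholder values, not recommendations; suitable settings have to come from measuring the service under realistic load.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch of the kind of per-instance tuning the text refers to: the thread
// counts and queue length bound how much concurrent work one server instance
// accepts. The numbers are illustrative only.
public class TunedRequestExecutor {
    public static ThreadPoolExecutor create() {
        return new ThreadPoolExecutor(
                50,                                   // core threads
                200,                                  // maximum threads when the queue fills
                30, TimeUnit.SECONDS,                 // idle non-core threads are reclaimed
                new ArrayBlockingQueue<>(1_000),      // bounded request queue
                new ThreadPoolExecutor.AbortPolicy()  // reject rather than queue forever
        );
    }
}
```

The bounded queue and rejection policy are deliberate: they cap how much work one instance accepts instead of letting requests pile up without limit.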

Be aware of system bottlenecks

Scaling a system essentially means increasing its capacity. In the example above, we increased request-processing capacity by deploying more server instances. However, software systems are made up of multiple interdependent processing elements or microservices, so increasing the capacity of one microservice inevitably shifts more load onto others, and one of them will eventually become the constraint. In our load-balancing example, assume that all the server instances connect to the same shared database. As the number of deployed servers increases, the request load on the database increases as well (see the figure below).

At some point the database reaches its capacity and database access starts to show significant latency. The database is now a bottleneck, and adding more server processing power will not help. To scale further, you need to increase the capacity of the database in some way: optimize queries, add more CPUs or memory, or replicate or shard the database, among many other options. Any shared resource in the system can become a bottleneck in this way.
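The following sketch shows one way the arithmetic turns against you. The per-instance pool size and the database connection limit are assumed values; the point is that aggregate demand on the shared database grows linearly with the number of server instances, no matter how well each individual instance behaves.

```java
// Sketch of why a shared database becomes the bottleneck: each new server
// instance brings its own connection pool, so total demand on the database
// grows with the instance count. All numbers are assumptions for illustration.
public class DatabaseConnectionMath {
    public static void main(String[] args) {
        int connectionsPerInstance = 20;  // assumed pool size per server instance
        int databaseMaxConnections = 200; // assumed database connection limit

        for (int instances = 2; instances <= 16; instances *= 2) {
            int demanded = instances * connectionsPerInstance;
            System.out.printf("%2d instances -> %3d connections demanded (db limit %d)%s%n",
                    instances, demanded, databaseMaxConnections,
                    demanded > databaseMaxConnections ? "  <-- bottleneck" : "");
        }
    }
}
```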

When adding capacity to one part of your architecture, you need to carefully consider the capacity of everything downstream to make sure you don't suddenly overwhelm it. Doing so can quickly lead to cascading failures (see the next rule) and bring down the entire system. Databases, message queues, long-latency network connections, thread and connection pools, and shared microservices are all potential bottlenecks, and you can bet that high traffic loads will expose them quickly. The key is to keep the system from collapsing abruptly when a bottleneck is hit, and to be able to deploy additional capacity quickly.

Slow services are more harmful than broken services

Under normal conditions, the system should provide stable, low-latency communication between its microservices and databases. As long as the load stays within the levels the deployment was configured for, performance is predictable, consistent, and fast, as shown in the following figure.

Once client load exceeds normal levels, request latency between microservices begins to climb. In particular, if the incoming request load persistently exceeds the capacity of a downstream service (service B, for example), uncompleted requests pile up in the upstream microservice A, because A keeps receiving requests faster than the slowed-down B allows it to complete them.
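A tiny sketch makes the accumulation visible. The arrival and completion rates below are assumptions; what matters is that any sustained gap between them makes the backlog in service A grow without bound.

```java
// Sketch of how requests pile up in an upstream service (A) when a downstream
// service (B) limits throughput below the incoming request rate. The rates are
// assumed values; any sustained gap grows the backlog until something fails.
public class BacklogGrowth {
    public static void main(String[] args) {
        double arrivalsPerSecond = 120;   // assumed incoming load on service A
        double completionsPerSecond = 80; // assumed throughput, limited by slow service B

        double backlog = 0;
        for (int second = 1; second <= 10; second++) {
            backlog += arrivalsPerSecond - completionsPerSecond;
            System.out.printf("t=%2ds  pending requests in A: %.0f%n", second, backlog);
        }
    }
}
```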

When a service is overwhelmed, whether by thrashing or by resource exhaustion, it stops responding to its clients, and those clients in turn get stuck. The direct result is a cascading failure: slow services cause requests to back up along the request path until the entire system fails. Architectural patterns such as circuit breakers and bulkheads are designed to prevent this. A circuit breaker throttles the request load to a service, or cuts it off entirely, when the service's latency exceeds a specified threshold. A bulkhead partitions resources so that the failure of a single downstream dependency cannot exhaust the resources an upstream microservice needs to keep serving its other dependencies. Used together, these measures help build resilient, highly scalable architectures.
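To ground the two patterns, here is a deliberately minimal sketch of each, using only the Java standard library. The class names, thresholds, and fallback handling are illustrative assumptions; production systems would normally reach for an established resilience library (for example Resilience4j) rather than code like this.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Bulkhead sketch: a semaphore caps how many concurrent calls one downstream
// dependency may consume, so a slow dependency cannot tie up every thread
// in the calling service.
class Bulkhead {
    private final Semaphore permits;

    Bulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    <T> T call(Supplier<T> downstreamCall, T fallback) throws InterruptedException {
        if (!permits.tryAcquire(100, TimeUnit.MILLISECONDS)) {
            return fallback;               // dependency saturated: fail fast
        }
        try {
            return downstreamCall.get();
        } finally {
            permits.release();
        }
    }
}

// Circuit breaker sketch: after a run of failures the breaker "opens" and
// rejects calls immediately for a cooldown period, shedding load from the
// struggling downstream service.
class CircuitBreaker {
    private final int failureThreshold;
    private final long cooldownMillis;
    private int consecutiveFailures = 0;
    private long openedAt = -1;

    CircuitBreaker(int failureThreshold, long cooldownMillis) {
        this.failureThreshold = failureThreshold;
        this.cooldownMillis = cooldownMillis;
    }

    synchronized <T> T call(Supplier<T> downstreamCall, T fallback) {
        if (openedAt >= 0 && System.currentTimeMillis() - openedAt < cooldownMillis) {
            return fallback;               // open: reject without calling downstream
        }
        try {
            T result = downstreamCall.get();
            consecutiveFailures = 0;       // success closes the breaker again
            openedAt = -1;
            return result;
        } catch (RuntimeException e) {
            if (++consecutiveFailures >= failureThreshold) {
                openedAt = System.currentTimeMillis();
            }
            return fallback;
        }
    }
}
```

The bulkhead fails fast when its permit pool is exhausted, and the breaker stops calling a dependency altogether for a cooldown period once failures accumulate; both keep a slow downstream service from dragging its callers down with it.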