Five tips for building highly scalable cloud-native applications

How can cloud-native design improve the performance, availability, and cost efficiency of the Apache Kafka engine? Focusing on this question, the author introduces the redesign process and key technical innovations behind an event stream processing platform called Kora. The article begins by laying out the must-have features of a successful cloud-native platform: multi-tenancy from the ground up, ease of scaling, data-driven management, secure isolation across customers, and support for rapid innovation. Kora is designed to meet these needs while maintaining compatibility with the existing Kafka protocol and providing a consistent experience across multi-cloud environments.

When we set out to rebuild the engine that hosts the core of the Apache Kafka service, we knew we had to meet several unique requirements for a successful cloud-native platform. These systems need to be multi-tenant from the ground up, easily scale to serve thousands of customers, and be managed primarily by data-driven software rather than human operators. They should also ensure strong isolation and security across customers in an environment where engineers can continue to innovate quickly, even with unpredictable workloads.

Last year, we introduced a redesign of the Kafka engine. Much of what we designed and implemented will also apply to other teams building large-scale distributed cloud systems, such as databases or storage systems. We are sharing our experience with the wider community in the hope that it will be helpful to those working on other projects.

1. Key considerations for Kafka engine redesign

Our high-level goals are likely similar to those of your own cloud-based systems: improve performance and resiliency, increase cost efficiencies for ourselves and our customers, and provide a consistent experience across multiple public clouds. We have an additional requirement to remain 100% compatible with the current version of the Kafka protocol.

Our redesigned Kafka engine, called Kora, is an event stream processing platform running tens of thousands of clusters in more than 70 regions on AWS, Google Cloud, and Azure. You may not achieve this scale immediately, but many of the techniques described below still apply.

Here are five key innovations we implemented in the new Kora design. If you want to learn more about any of them, we published a white paper on the topic that won the Best Industry Paper Award at the 2023 International Conference on Very Large Data Bases (VLDB).

2. Use logical “cells” for scalability and isolation

To build a system that is highly available and horizontally scalable, you need an architecture built using scalable and composable building blocks. Specifically, the work done by a scalable system should increase linearly with the size of the system. The original Kafka architecture did not meet this criterion because much of the workload growth was non-linear with system size.

For example, as cluster size increases, the number of connections grows quadratically, since all clients typically need to talk to all brokers. Similarly, because each broker typically has followers on all other brokers, replication overhead also grows quadratically. The end result is that adding brokers increases overhead disproportionately to the additional compute/storage capacity they bring.
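To make that growth concrete, here is a small back-of-the-envelope sketch (our illustration, not Kora code) that counts client connections plus replication links for a flat cluster versus one divided into the fixed-size cells introduced below; the flat count grows quadratically with broker count, while the celled count grows linearly:

```python
# Illustrative model only: counts "links" (client connections plus
# replication follower links) for a flat cluster vs. a celled one.

def flat_links(brokers: int, clients: int) -> int:
    # Every client talks to every broker, and every broker
    # replicates to every other broker: quadratic growth.
    return clients * brokers + brokers * (brokers - 1)

def celled_links(brokers: int, clients: int, cell_size: int) -> int:
    # Clients and replication are confined to a single cell, so the
    # per-cell cost is constant and the total grows linearly.
    cells = brokers // cell_size
    clients_per_cell = clients // cells
    per_cell = clients_per_cell * cell_size + cell_size * (cell_size - 1)
    return cells * per_cell

for n in (24, 48, 96):  # assume 10 clients per broker, cells of 6 brokers
    print(n, flat_links(n, n * 10), celled_links(n, n * 10, cell_size=6))
```

Doubling the broker count roughly quadruples the flat-cluster link count but only doubles the celled one.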

The second challenge is ensuring isolation between tenants. In particular, a misbehaving tenant can hurt the performance and availability of every other tenant in the cluster. Even with effective throttling and rate limiting, there will likely always be some load patterns that are problematic. And even when clients are well behaved, a node's storage can degrade; with load randomly distributed across the cluster, this would affect all tenants and potentially all applications.

We solve these challenges using a logical building block called a cell. We divide the cluster into a set of cells that span availability zones. A tenant is isolated to a single cell, meaning every replica of each partition owned by that tenant is assigned to brokers in that cell. This also means replication traffic is confined to the brokers within the cell. Adding brokers to a cell raises the same issues as before at the cell level, but now we have the option of creating new cells in the cluster without adding overhead. Cells also give us a way to deal with noisy tenants: we can move a tenant's partitions into their own isolated cell.
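Here is a minimal sketch of what cell-based placement can look like, assuming a simplified model in which each tenant is pinned to one cell and replicas may only land on that cell's brokers. The names and structure are illustrative, not Kora's actual implementation:

```python
# Hypothetical cell-based placement model; cells are indexed by cell_id.
from dataclasses import dataclass, field

@dataclass
class Cell:
    cell_id: int
    brokers: list[int]          # broker ids spanning availability zones
    load: float = 0.0           # running estimate of the cell's load

@dataclass
class Placement:
    cells: list[Cell]                                   # cells[i].cell_id == i
    tenant_cell: dict[str, int] = field(default_factory=dict)

    def assign_tenant(self, tenant: str) -> Cell:
        # Pin a new tenant to the least-loaded cell.
        cell = min(self.cells, key=lambda c: c.load)
        self.tenant_cell[tenant] = cell.cell_id
        return cell

    def replica_brokers(self, tenant: str, replication_factor: int) -> list[int]:
        # Replicas never leave the tenant's cell, so replication
        # traffic stays inside it.
        cell = self.cells[self.tenant_cell[tenant]]
        return cell.brokers[:replication_factor]

    def quarantine(self, tenant: str, isolated_cell_id: int) -> None:
        # A noisy tenant is moved to its own cell without touching
        # any other tenant's partitions.
        self.tenant_cell[tenant] = isolated_cell_id
```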

To measure the effectiveness of this solution, we set up an experimental cluster of 24 brokers divided into six broker cells (see our white paper for full configuration details). When we ran the benchmark, cluster load (a custom metric we designed to measure the load on a Kafka cluster) was 53% when using cells and 73% when not using cells.

3. Optimize the use of different storage types through layered architecture

By layering the architecture to optimize the use of different storage types, we improve performance and reliability while reducing costs. This stems from the way we separate compute and storage, primarily in two ways: using object storage to hold cold data, and using block storage instead of instance storage to hold more frequently accessed data.

This layered architecture also allows us to increase resiliency: reassigning partitions becomes much easier when only hot data needs to be moved. Using EBS volumes instead of instance storage further improves durability, because the lifecycle of the storage volume is decoupled from the lifecycle of the associated virtual machine.

Most importantly, tiering allows us to significantly improve cost and performance. Cost drops because object storage is a more affordable and reliable option for storing cold data. Performance improves because, once data is tiered, the hot data can be placed on high-performance storage volumes that would be prohibitively expensive without tiering.
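The read path of such a tiered log can be sketched as below. This is a toy in-memory model under our own assumptions (a single hot-offset watermark, dicts standing in for the block and object stores), not Kora's actual interfaces:

```python
class TieredLog:
    """Toy model: hot records on 'block' storage, cold ones on 'object'."""

    def __init__(self) -> None:
        self.block: dict[int, bytes] = {}    # stands in for EBS-style block storage
        self.objects: dict[int, bytes] = {}  # stands in for S3/GCS/Azure Blob
        self.hot_offset = 0                  # first offset still held on block storage

    def append(self, offset: int, record: bytes) -> None:
        self.block[offset] = record

    def read(self, offset: int) -> bytes:
        # Hot reads hit high-performance block storage; cold reads
        # fall back to cheaper, durable object storage.
        if offset >= self.hot_offset:
            return self.block[offset]
        return self.objects[offset]

    def tier(self, new_hot_offset: int) -> None:
        # Offload everything below the watermark to object storage and
        # free the local copy; only hot data is left to rebalance.
        for off in [o for o in self.block if o < new_hot_offset]:
            self.objects[off] = self.block.pop(off)
        self.hot_offset = new_hot_offset
```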

4. Use abstractions to unify the multi-cloud experience

For any service planning to operate on multiple clouds, providing a unified, consistent customer experience across them is critical, but this is challenging on several fronts. Cloud services are complex, and even when they follow standards there are still differences across clouds and instances. Instance types, instance availability, and even the billing models for similar cloud services can differ in subtle but important ways. For example, Azure block storage does not allow disk throughput and IOPS to be configured independently, so provisioning more IOPS requires provisioning a larger disk. In contrast, AWS and GCP let you adjust these variables independently.

Many SaaS providers shy away from this complexity, leaving customers to worry about the configuration details needed to achieve consistent performance. Obviously, this is not ideal, so in Kora we developed abstractions to smooth out these differences.

We introduced three abstractions that let customers ignore implementation details and focus on higher-level application properties. These abstractions go a long way toward simplifying the service and limiting the questions customers must answer for themselves.
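As one hedged illustration of what such an abstraction can look like, consider a disk-provisioning interface where the caller states the throughput and IOPS it needs and a per-cloud implementation maps the request onto whatever the provider actually exposes. The classes and tier numbers below are invented for the example, not real vendor specifications or Kora code:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class DiskSpec:
    size_gib: int
    throughput_mibps: int
    iops: int

class DiskProvisioner(ABC):
    @abstractmethod
    def provision(self, spec: DiskSpec) -> dict:
        ...

class IndependentDialsProvisioner(DiskProvisioner):
    # Models clouds (like AWS/GCP) where throughput and IOPS are
    # configured independently of disk size.
    def provision(self, spec: DiskSpec) -> dict:
        return {"size_gib": spec.size_gib,
                "throughput_mibps": spec.throughput_mibps,
                "iops": spec.iops}

class SizeCoupledProvisioner(DiskProvisioner):
    # Models clouds (like Azure block storage) where IOPS scale with
    # disk size; the tier numbers are made up for illustration.
    SIZE_TO_IOPS = [(128, 500), (512, 2300), (1024, 5000)]

    def provision(self, spec: DiskSpec) -> dict:
        # Over-provision capacity until the size tier delivers the
        # IOPS the caller asked for.
        size = spec.size_gib
        for tier_size, tier_iops in self.SIZE_TO_IOPS:
            if tier_iops >= spec.iops:
                size = max(size, tier_size)
                break
        return {"size_gib": size, "iops_implied_by_size": True}
```

Behind an interface like this, the customer-facing question stops being "which disk tier on which cloud" and becomes simply "how much throughput do I need".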

5. Automated mitigation loops to combat performance degradation

Handling failures is critical to reliability. Even in the cloud, failures are inevitable, whether caused by cloud provider outages, software bugs, disk corruption, misconfiguration, or something else. They may be complete or partial failures, but either way they must be resolved quickly to avoid impacting performance or access to the system.

Unfortunately, if you are operating a cloud platform at scale, manually detecting and handling these failures is impossible. It takes up too much operator time and means issues may not be resolved quickly enough to maintain service level agreements.

To solve this problem, we built a solution that handles all such cases of infrastructure degradation. Specifically, we built a feedback loop with a degradation detection component that collects metrics from the cluster and uses them to decide whether any component is failing and whether action needs to be taken. This lets us handle hundreds of degradations each week automatically, without any manual operator intervention.

We implement multiple feedback loops to track broker performance and take action when necessary. When an issue is discovered, the broker is marked with a distinct health state, and an appropriate mitigation strategy is applied for each state. These feedback loops address local disk issues, external connectivity issues, and broker degradation, respectively. Each loop consists of three components (a sketch of one such loop follows this list):

  • Monitoring: A method of tracking each broker's performance from an external perspective. We probe frequently to keep this view current.
  • Aggregation: In some cases we aggregate metrics to ensure a degradation is noticeable relative to the other brokers.
  • React: Kafka-specific mechanisms for excluding a broker from the replication protocol or migrating leadership away from it.
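Here is a minimal sketch of one such monitor/aggregate/react loop. The probe, the outlier rule, and the health states are our assumptions for illustration, not Kora's actual detection logic:

```python
# Toy monitor/aggregate/react loop; everything here is illustrative.
import statistics
from dataclasses import dataclass
from enum import Enum

class BrokerHealth(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"   # would trigger mitigation, e.g. leadership demotion

@dataclass
class Broker:
    id: str
    last_probe_ms: float
    def probe_ms(self) -> float:
        # Stand-in for an external probe, e.g. a produce/consume round trip.
        return self.last_probe_ms

def monitor(brokers: list[Broker]) -> dict[str, float]:
    # Monitoring: observe each broker from an external vantage point.
    return {b.id: b.probe_ms() for b in brokers}

def aggregate(latencies: dict[str, float], factor: float = 3.0) -> set[str]:
    # Aggregation: a degradation must stand out relative to peers,
    # so compare each broker against the cluster median.
    median = statistics.median(latencies.values())
    return {bid for bid, ms in latencies.items() if ms > factor * median}

def react(brokers: list[Broker], health: dict[str, BrokerHealth]) -> None:
    # React: mark outliers; the marked state drives the mitigation,
    # e.g. excluding the broker from replication or moving leadership.
    for bid in aggregate(monitor(brokers)):
        health[bid] = BrokerHealth.DEGRADED

brokers = [Broker("b1", 5.0), Broker("b2", 6.0), Broker("b3", 40.0)]
health: dict[str, BrokerHealth] = {}
react(brokers, health)
print(health)  # b3 stands out against the median and gets marked DEGRADED
```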

Altogether, our automated mitigation detects and mitigates thousands of partial degradations each month across all three major cloud providers, saving valuable operator time while keeping the impact on customers minimal.

6. Balance stateful services for performance and efficiency

Balancing load across servers in any stateful service is a difficult problem that directly affects the quality of service experienced by customers. Uneven load distribution causes customers to be limited by the latency and throughput provided by the busiest servers. Stateful services typically have a set of keys, and you need to balance the distribution of those keys so that the overall load is evenly distributed across the servers, so customers get maximum performance and minimum cost from the system.

For example, Kafka runs a set of stateful brokers, and the distribution of partitions and their replicas across those brokers must be kept balanced. Depending on customer activity, the load on these partitions can spike or drop in unpredictable ways. This requires a set of metrics and heuristics for deciding how to place partitions to maximize efficiency and utilization. We achieve this with a balancing service that tracks a set of metrics across brokers and continuously works in the background to reassign partitions.

Rebalancing assignments must be done carefully, though. Overly aggressive rebalancing can degrade performance and increase costs because of the extra work the reassignments create. Rebalancing too slowly lets the system degrade noticeably before the imbalance is corrected. We had to experiment with many heuristics to converge on an appropriate level of reactivity for a range of workloads.
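To illustrate the trade-off, here is a sketch of a single background balancing pass under a deliberately simplified load model. The threshold and the move-selection rule are assumptions for the example, not Kora's actual heuristics:

```python
# Illustrative balancing pass: act only when imbalance exceeds a
# threshold, and prefer the cheapest move that helps.

def rebalance_step(broker_load: dict[str, float],
                   partitions: dict[str, list[tuple[str, float]]],
                   threshold: float = 1.2) -> tuple[str, str, str] | None:
    """broker_load: broker -> total load.
    partitions: broker -> [(partition, partition_load)].
    Returns a (partition, source, destination) move, or None."""
    hot = max(broker_load, key=lambda b: broker_load[b])
    cool = min(broker_load, key=lambda b: broker_load[b])
    if broker_load[hot] <= threshold * broker_load[cool]:
        # Imbalance is too small: moving data now would cost more
        # than it saves (the "too aggressive" failure mode).
        return None
    # Pick the smallest partition whose move still narrows the gap,
    # keeping the amount of data copied as low as possible.
    for name, load in sorted(partitions[hot], key=lambda p: p[1]):
        if broker_load[hot] - load >= broker_load[cool] + load:
            return (name, hot, cool)
    return None

# Example: b1 is ~2x hotter than b2, so one partition move is proposed.
move = rebalance_step({"b1": 10.0, "b2": 5.0},
                      {"b1": [("p1", 1.0), ("p2", 4.0)], "b2": [("p3", 5.0)]})
print(move)  # ('p1', 'b1', 'b2')
```

Tuning `threshold` is exactly the reactivity dial described above: lower values rebalance eagerly at the cost of extra data movement, while higher values tolerate imbalance longer.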

The impact of effective balancing can be huge. One of our customers saw a roughly 25% reduction in load after we enabled rebalancing for them. Similarly, another customer saw a significant reduction in latency thanks to rebalancing.

7. Advantages of Well-Designed Cloud-Native Services

If you are building cloud-native infrastructure for your organization, whether using new code or using existing open source software like Kafka, we hope the techniques described in this article will help you achieve your performance, availability, and cost-efficiency goals.

To test Kora's performance, we ran small-scale experiments on identical hardware, comparing Kora and our full cloud platform with open-source Kafka. We found that Kora provides far greater elasticity, scaling up to 30 times faster; availability more than 10 times better, judged against the failure rates of our self-managed customers and other cloud services; and significantly lower latency than self-managed Kafka. While Kafka remains the best choice for running an open-source data streaming system yourself, Kora is a great choice for those looking for a cloud-native experience.