Beyond Sidecar: In-depth analysis of Istio Ambient Mode's traffic mechanism

The following is a talk I gave at KCD Beijing. It covers Istio's new data plane mode, Ambient Mode, whose core idea is to eliminate the sidecar and so reduce resource overhead and operational complexity. This article walks through the background behind Ambient Mode, its core components, its traffic path mechanism, and how it compares with the existing Sidecar mode, to help you quickly evaluate and get started with it.

 

Why should you pay attention to Ambient Mode?

 

First, let's think about a question: why should you pay attention to, or even try, this new mode? Sidecars have served the service mesh well for years, so why go "sidecarless"?

 

Let's take a look at some of the problems and challenges facing service meshes today.

 

Challenges of service meshes

Resource overhead and operational complexity brought by sidecar proxies

 

When upgrading or restarting Envoy, all Pods usually need to be restarted

 

Growing sensitivity to performance and proxy overhead

 

Question: Is there a way to reduce the intrusion into each Pod and its additional resource consumption while retaining the core capabilities of the service mesh (security, observability, and traffic control)?

 

Several deployment modes of the service mesh

 

The service mesh architecture has been exploring various possibilities for proxy deployment locations. For example:

 

Sidecar: Run an Envoy in each Pod.

 

Ambient: Move the proxy out of the Pod to the node level (the mode discussed in this article).

 

Cilium Mesh: Use eBPF to do L4 in the kernel space, and then combine with Envoy to provide L7 functions.

 

gRPC: Directly integrate mesh capabilities into the SDK.

 

These modes have different focuses on functionality, security, performance, and management complexity. The Istio Ambient mode proposes a new attempt to address the high resource consumption and maintenance costs brought by Sidecar.

 

The birth of Ambient Mode

Istio's new-generation architecture removes the sidecar and achieves a lightweight data plane through ztunnel + Waypoint Proxy.

 

Saves resources and reduces operational complexity.

 

Still supports mTLS, policy control, and provides an optional Waypoint Proxy for traffic that requires L7 functions.

 

Deployment Mode Quadrant

 

The following table is a brief summary of some of the more common service mesh deployment modes:

 

Istio Ambient Mode Core Concepts

Next, we will officially enter the second part and take a closer look at the specific components of Ambient Mode, including ztunnel, Waypoint Proxy, and the role played by Istio CNI.

 

Core components of Ambient Mode

1. ztunnel (L4)

 

Runs as a node-level proxy

 

Responsible for transparent traffic interception and mTLS encryption

 

Applicable to most L4 forwarded traffic

 

2. Waypoint proxy (L7)

 

Optional; can be configured flexibly at namespace, Service, or Pod granularity

 

Handles higher-level functions for HTTP/gRPC traffic (authentication, routing, observability, etc.)

 

3. Istio CNI

 

Replaces the istio-init container and is responsible for traffic hijacking

 

Compatible with Sidecar mode and Ambient mode

 

Allow traffic redirection for Pods in non-privileged mode

 

Overall architecture of Ambient mode

Istio Ambient mode architecture


 

In Ambient mode, the Istio data plane can be divided into two layers:

 

1. Security layer (ztunnel): A lightweight L4 proxy is deployed on each node.

 

2. Optional L7 layer (Waypoint Proxy): Deployed only when HTTP/gRPC processing is needed.

 

The Control Plane is still provided by Istiod, which is mainly responsible for issuing configurations and certificates to ztunnel and Waypoint.

 

Waypoint Proxy deployment strategy

Namespace level (default): Applicable to all workloads under the namespace

 

Service level: Only certain key services require L7

 

Pod level: More granular control

 

Cross-namespace: A Waypoint can be shared across namespaces through the Gateway resource
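
To make the granularity levels above concrete, here is a minimal sketch of how a Waypoint could be chosen for a workload. It is not Istio's implementation: the `Workload` type, the three lookup tables, and the "most specific level wins" precedence are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Workload:
    namespace: str
    service: str
    pod: str


def resolve_waypoint(
    workload: Workload,
    pod_waypoints: dict,
    service_waypoints: dict,
    namespace_waypoints: dict,
) -> Optional[str]:
    """Return the Waypoint name for a workload, or None if it stays L4-only.

    Assumption of this sketch: the most specific binding wins
    (Pod, then Service, then namespace).
    """
    candidates = (
        (pod_waypoints, (workload.namespace, workload.pod)),
        (service_waypoints, (workload.namespace, workload.service)),
        (namespace_waypoints, workload.namespace),
    )
    for table, key in candidates:
        if key in table:
            return table[key]
    return None


cart = Workload(namespace="shop", service="cart", pod="cart-7d9f")
ns_level = {"shop": "shop-waypoint"}
svc_level = {("shop", "cart"): "cart-waypoint"}
print(resolve_waypoint(cart, {}, svc_level, ns_level))
```

A workload with no binding at any level returns `None`, which corresponds to traffic that is handled by ztunnel alone.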

 

Istio CNI

Traffic interception: Replaces the istio-init container to make installation clearer and simpler.

 

Supports two modes: Compatible with Sidecar mode and Ambient mode.

 

Non-privileged mode compatibility: Allows Pods to run in non-privileged mode to enhance security.

 

CNI Chaining: Expand the node's CNI configuration by adding Istio CNI.

 

Traffic redirection within Pod (Ambient mode):

 

Use iptables REDIRECT rules within the Pod's network namespace.

 

Establish socket interception and proxy traffic within the Pod.

 

This diagram illustrates how Istio CNI works together with the cluster's own Kubernetes network plugin (such as Calico or Cilium). It modifies the node's CNI configuration to add itself to the plugin chain. After Kubernetes allocates the Pod IP, the Istio CNI interception logic runs immediately and injects traffic rules into the Pod's network namespace, setting different iptables rules for Pods in different modes. This forms a chained process with the other CNI configuration (including network policies), and the plugins do not conflict with each other.
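
CNI chaining can be pictured as appending one more entry to the `plugins` list of the node's CNI configuration. The sketch below models that on an in-memory dictionary; the sample configuration (its name, version, and existing plugins) is made up for illustration, and the real Istio CNI installer writes additional fields that are omitted here.

```python
import copy
import json


def chain_istio_cni(conflist: dict) -> dict:
    """Return a copy of a CNI conflist with an istio-cni entry appended.

    The function is idempotent: an existing istio-cni entry is not duplicated.
    """
    chained = copy.deepcopy(conflist)
    plugins = chained.setdefault("plugins", [])
    if not any(p.get("type") == "istio-cni" for p in plugins):
        # Appended last, so it runs after the primary plugin has set up the Pod IP.
        plugins.append({"type": "istio-cni", "name": "istio-cni"})
    return chained


# Hypothetical node configuration with a primary plugin and a helper plugin.
base = {
    "cniVersion": "0.3.1",
    "name": "k8s-pod-network",
    "plugins": [{"type": "calico"}, {"type": "portmap"}],
}
chained = chain_istio_cni(base)
print(json.dumps(chained, indent=2))
```

Because the entry is added at the end of the chain, the primary network plugin keeps ownership of IP allocation and the existing entries are left untouched.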

 

How the Istio CNI plugin works

This diagram details what Istio CNI does when a Pod is started:

 


 

1. It enters the Pod's network namespace and creates a set of iptables rules that redirect traffic to the socket ztunnel listens on.

 

2. Init containers no longer need to be injected into each Pod, and the Pod needs no elevated privileges, which makes the overall deployment cleaner and safer.

 

3. ztunnel creates its listening sockets inside the Pod's network namespace, and does so separately for each Pod on the node.

 

Traffic Path and Key Mechanisms

After introducing the components, let's take a look at the core topic, the traffic path. How do ztunnel and Waypoint intercept and forward traffic? We will analyze this from two angles: transparent traffic interception and the HBONE protocol.

 

Transparent Traffic Interception

In Ambient mode, Istio CNI injects iptables rules into the Pod's network namespace to transparently redirect outbound traffic to the ztunnel process on the node. ztunnel then decides whether to forward the traffic directly at L4 or to send it to a Waypoint Proxy for further L7 processing.

 

As shown in the figure, the kubelet starts a Pod on the node, and the Istio CNI agent observes this event. The agent enters the Pod's network namespace, sets iptables rules that redirect traffic to a local socket, and sends the file descriptor (FD) of the Pod's network namespace to ztunnel. Once ztunnel has the FD, it can create sockets inside the Pod's network namespace.
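
Handing a file descriptor from one process to another is a standard Unix domain socket feature (ancillary `SCM_RIGHTS` data). The sketch below shows only that general mechanism, not Istio's actual code: a pipe stands in for the Pod's network namespace FD, a `socketpair` stands in for the agent-to-ztunnel socket, and the function names are invented. It needs Python 3.9+ on a Unix-like system.

```python
import os
import socket


def send_netns_fd(sock: socket.socket, fd: int, pod_id: bytes) -> None:
    """Send one file descriptor plus a small label over a Unix socket."""
    socket.send_fds(sock, [pod_id], [fd])


def recv_netns_fd(sock: socket.socket) -> tuple[bytes, int]:
    """Receive the label and a new descriptor for the same open file."""
    data, fds, _flags, _addr = socket.recv_fds(sock, 1024, 1)
    return data, fds[0]


def demo_fd_handoff() -> tuple[bytes, bytes]:
    # Stand-ins: socketpair for the agent/ztunnel channel, pipe for the netns FD.
    agent_end, ztunnel_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()
    try:
        send_netns_fd(agent_end, write_end, b"pod-a")
        pod_id, received_fd = recv_netns_fd(ztunnel_end)
        # The receiver can now use the descriptor as if it had opened it itself.
        os.write(received_fd, b"hello")
        os.close(received_fd)
        return pod_id, os.read(read_end, 5)
    finally:
        os.close(read_end)
        os.close(write_end)
        agent_end.close()
        ztunnel_end.close()


print(demo_fd_handoff())
```

The receiving side gets its own descriptor number that refers to the same kernel object, which is what lets a node-level process act inside a namespace it did not create.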

 

When the Pod sends traffic, it believes it is connecting directly to the target address, but the iptables rules redirect the traffic to the ztunnel process on the node. ztunnel then decides whether the traffic needs a Waypoint as an L7 proxy. If not, the traffic is encrypted and forwarded directly to the target Pod at L4; if L7 processing such as authentication is required, the traffic is tunneled to the Waypoint.

 

Transparent Traffic Interception


 

Overview of the Packet Lifecycle

1. Pod → ztunnel: Pod traffic is first redirected by the rules Istio CNI installed to the ztunnel on the same node.

 

2. ztunnel: Resolves the target address and performs mTLS encryption.

 

3. (If L7 policy is required) ztunnel → Waypoint Proxy: HTTP authentication, routing, and other operations.

 

4. Waypoint Proxy: After completing L7 processing, it sends it back to ztunnel.

 

5. ztunnel: Unpacks the traffic or continues forwarding it to the ztunnel on the target node.

 

6. Reach the target Pod: The ztunnel on the target node finally delivers the traffic to the target Pod.
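
The lifecycle above can be condensed into a small function that lists the hops a request takes in the cross-node case. This is a simplified model with invented hop names, not a description of ztunnel's internals.

```python
def packet_path(needs_l7: bool) -> list[str]:
    """List the hops for a cross-node request in this simplified model."""
    hops = ["source pod", "source ztunnel"]
    if needs_l7:
        # Only traffic that needs L7 policy takes the extra Waypoint hop.
        hops.append("waypoint proxy")
    hops += ["destination ztunnel", "destination pod"]
    return hops


for needs_l7 in (False, True):
    print(" -> ".join(packet_path(needs_l7)))
```

The only difference between the two paths is the single Waypoint hop, which is why L4-only traffic is cheaper in this model.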

 

HBONE protocol

In Ambient mode, ztunnel and Waypoint use the HBONE protocol (HTTP/2 + CONNECT) to establish a secure tunnel between them. HBONE provides mTLS encryption and multiplexing, which reduces connection overhead and simplifies proxy forwarding.

 

HBONE protocol


 

The following is a simplified HBONE CONNECT request example, which uses header information such as x-envoy-original-dst-host and x-istio-auth-userinfo to pass the content required for routing and identity authentication.

 

In this example, assume that ztunnel A needs to send traffic to the target node B. The outer TCP connection runs from ztunnel_A_IP:52368 to Node_B_IP:15008. This is the tunnel connection between ztunnel A and ztunnel B, and 15008 is the HBONE listening port.

 

After entering the HTTP/2 layer, there will be a CONNECT request, and the :authority in it shows Pod_B_IP:9080, indicating that the actual connection is to Pod B's port 9080. The same information can be seen in x-envoy-original-dst-host.

 

At the same time, we see some custom headers, such as x-forwarded-proto, x-istio-attributes, etc., which are used to bring more context and security verification information to the target ztunnel or subsequent proxy.

 

This can be understood as follows: HTTP/2 CONNECT adds an "inner" tunnel to the traffic. The application-layer request (such as /api/v1/users?id=123) is encapsulated in that tunnel, then unpacked on the ztunnel B side and forwarded to the real Pod B.
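
The relationship between the outer connection and the inner CONNECT request can be sketched as plain data. The snippet below is illustrative only: the IP addresses are made up, the function names are invented, and only headers named in this article are included, without the extra context headers a real request carries.

```python
HBONE_PORT = 15008  # HBONE listening port mentioned in the text


def outer_connection(src_ip: str, src_port: int, dst_node_ip: str):
    """The TCP (mTLS) connection between the two tunnel endpoints."""
    return (src_ip, src_port), (dst_node_ip, HBONE_PORT)


def build_hbone_connect(dst_pod_ip: str, dst_port: int) -> list[tuple[str, str]]:
    """Headers of the inner HTTP/2 CONNECT request.

    An HTTP/2 CONNECT request carries :method and :authority but no
    :scheme or :path pseudo-headers.
    """
    authority = f"{dst_pod_ip}:{dst_port}"
    return [
        (":method", "CONNECT"),
        (":authority", authority),
        ("x-envoy-original-dst-host", authority),
    ]


# Made-up addresses standing in for ztunnel_A_IP, Node_B_IP, and Pod_B_IP.
print(outer_connection("10.0.1.5", 52368, "10.0.2.1"))
for name, value in build_hbone_connect("10.0.2.7", 9080):
    print(f"{name}: {value}")
```

The outer tuple says which node the bytes travel to, while `:authority` says which Pod and port the tunneled stream is ultimately meant for.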

 

The whole process is transparent to the application. For us, though, the CONNECT request headers show how Ambient mode handles traffic routing and identity authentication at the HTTP/2 layer. This is why HBONE is more flexible than traditional Sidecar-to-Sidecar communication and makes mTLS and L7 extension more convenient.

 

Encrypted traffic on the same node

 

If the source Pod and the destination Pod happen to be on the same node, the traffic still goes through ztunnel's L4 encryption process. The figure shows that ztunnel is deployed on each node as a DaemonSet and uses hostNetwork to share the host's network. Istio CNI redirects the Pod's outbound traffic to port 15001 of ztunnel. Because the source and destination Pods are on the same node, ztunnel completes encryption and decryption internally and sends the traffic to the destination Pod.

 

If L7 traffic processing such as authentication is required, ztunnel establishes an HBONE tunnel with the Waypoint proxy, and the traffic is forwarded to the destination Pod through the Waypoint proxy.

 

Cross-node encrypted traffic (L4)

This is a cross-node situation, which is the most common scenario:

 

The ztunnel on the source node encrypts the traffic through the HBONE tunnel and sends it to the ztunnel on the target node; the target ztunnel unpacks it and passes the plaintext traffic to the target Pod. As long as the traffic is pure L4 and needs no L7 processing, no Waypoint is involved, which saves one proxy hop.

 

Encrypted traffic across nodes (L4)

 


 

Cross-node encrypted traffic (L7)

When we need L7 processing, the traffic will pass through Waypoint one more time. That is:

 

The source ztunnel first tunnels the traffic to Waypoint;

 

Waypoint performs authentication, routing, etc. at the HTTP layer;

 

Waypoint then uses HBONE to send the traffic to the target ztunnel;

 

The target ztunnel unpacks the traffic and sends it to the target Pod.

 

This process has one more proxy than L4, but the advantage is that only specific traffic will be parsed by the L7 proxy, reducing unnecessary overhead.

 

Backstop traffic (prevent traffic from escaping)

Traffic from outside the Istio mesh may reach a Pod directly through its Pod IP and port. To keep such traffic from escaping ztunnel's control, it must be intercepted as well. When traffic targets an application port, the rules check whether the packet carries the 0x539 mark. If it does not, the packet is redirected to plaintext port 15006, where ztunnel listens; once ztunnel has processed it, the traffic carries the 0x539 mark and can reach the application's target port. If the destination port is 15008, the traffic is already HBONE traffic and is handled by ztunnel's HBONE listener.
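
The inbound decision described here can be summarized as a small classification function. This is a reading of the paragraph above rather than a copy of the real iptables rules, which are more detailed; the function name and the returned labels are invented for the sketch.

```python
ZTUNNEL_MARK = 0x539            # packet mark set on traffic ztunnel has processed
INBOUND_PLAINTEXT_PORT = 15006  # ztunnel's plaintext inbound port
HBONE_PORT = 15008              # ztunnel's HBONE port


def classify_inbound(dst_port: int, mark: int) -> str:
    """Decide what happens to a packet arriving in the Pod's network namespace."""
    if dst_port == HBONE_PORT:
        # Already tunneled: goes to ztunnel's HBONE listener.
        return "hbone-listener"
    if mark == ZTUNNEL_MARK:
        # ztunnel has already processed this traffic: let it reach the app.
        return "deliver-to-app"
    # Unmarked plaintext aimed at an app port: send it through ztunnel first.
    return f"redirect-to-{INBOUND_PLAINTEXT_PORT}"


print(classify_inbound(8080, 0))
print(classify_inbound(8080, ZTUNNEL_MARK))
print(classify_inbound(HBONE_PORT, 0))
```

The mark is what prevents a redirect loop: traffic leaving ztunnel for the application port is marked, so the same rules let it through on the second pass.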

 

For most traffic that only needs TCP-level encryption and forwarding, Ambient Mode only uses ztunnel; it only passes through Waypoint when HTTP-level policies are required.

 

Ambient Mode vs. Sidecar Mode

Now that we understand Ambient, we should compare it with the original Sidecar mode to see which functions are not yet complete and which scenarios suit Ambient better.

 

Limitations of Ambient Mode

Compared with the traditional Sidecar mode, Ambient still has some gaps:

 

When using Sidecar and Ambient together, it is difficult to make precise proxy customization for a single Pod (such as EnvoyFilter).

 

Support for multi-cluster, multi-network, and virtual machine workloads is not yet complete, so use it with caution in production environments.

 

Some deep customizations (such as WASM plugins) cannot be implemented directly one-to-one in Ambient.

 

Selection recommendations

1. If you already have a Sidecar architecture and rely on a large number of mature features: you can continue to use Sidecar first.

 

2. If you pursue resource saving, simplified maintenance, and most traffic only requires L4: you can try Ambient Mode.

 

3. If some applications still need to retain Sidecar, you can consider hybrid deployment, but you need to plan the boundaries and strategies of Sidecar/Ambient.

 

Summary

Finally, let's summarize the advantages and disadvantages of Ambient Mode and the scenarios it currently suits.

 

Key points review

1. Ambient Mode: Remove Sidecar, reduce the proxy burden of each Pod, and significantly reduce resource and maintenance costs.

 

2. ztunnel + Waypoint architecture: Waypoint is enabled only when L7 functions are required, and other traffic is quickly forwarded in L4 mode.

 

3. Although Ambient Mode has officially been announced as GA, multi-cluster, VM, and multi-network scenarios still need further observation and testing.

 

4. Applicable scenarios: Large-scale clusters with mainly L4 traffic; teams that are sensitive to resource and management costs should take a close look.