Beyond Sidecar: In-depth analysis of Istio Ambient Mode's traffic mechanism
The following is a talk I gave at KCD Beijing. It covers Istio's new data
plane mode, Ambient Mode, whose core idea is to eliminate the Sidecar and
thereby reduce resource overhead and maintenance complexity. This article
walks through the background behind Ambient Mode, its core components, its
traffic path mechanism, and how it compares with the existing Sidecar mode,
so that you can quickly evaluate and get started with these new features.
Why should you pay attention to Ambient Mode?
First, let's think about a question: why should you pay attention to, or
even try, this new mode? Sidecar has always worked well in the service
mesh, so why go "Sidecar-less"?
Let's take a look at some of the problems and challenges facing service
meshes today.
Challenges of service meshes
• Resource overhead and operational complexity brought by sidecar proxies
• When upgrading or restarting Envoy, all Pods usually need to be restarted
• Increasingly strict requirements on performance and overhead
Thinking: Is there a way to reduce the intrusion into each Pod and its
additional resource consumption while retaining the core capabilities of
the service mesh (security, observability, and traffic control)?
Several deployment modes of the service mesh
The service mesh architecture has been exploring various
possibilities for proxy deployment locations. For example:
• Sidecar: Run an Envoy in each Pod.
• Ambient: Split the proxy from the Pod to the node level (the mode to be discussed in this article).
• Cilium Mesh: Use eBPF to do L4 in the kernel space, and then combine with Envoy to provide L7 functions.
• gRPC: Directly integrate mesh capabilities into the SDK.
These modes make different trade-offs among functionality, security,
performance, and management complexity. Istio Ambient mode is a new attempt
to address the high resource consumption and maintenance costs that Sidecar
brings.
The birth of Ambient Mode
• Istio's new-generation architecture removes the Sidecar and achieves a lightweight data plane through ztunnel + Waypoint Proxy.
• Saves resources and reduces operational complexity.
• Still supports mTLS and policy control, and provides an optional Waypoint Proxy for traffic that requires L7 functions.
Deployment Mode Quadrant
The following table is a brief summary of some of the more
common service mesh deployment modes:
Istio Ambient Mode Core Concepts
Next, we will officially enter the second part and take a
closer look at the specific components of Ambient Mode, including ztunnel,
Waypoint Proxy, and the role played by Istio CNI.
Core components of Ambient Mode
1. ztunnel (L4)
• Runs as a node-level proxy
• Responsible for transparent traffic interception and mTLS encryption
• Applicable to most L4 forwarded traffic
2. Waypoint proxy (L7)
• Optional (can be configured flexibly at namespace, Service, or Pod granularity)
• Handles higher-level functions for HTTP/gRPC traffic (authentication, routing, observability, etc.)
3. Istio CNI
• Replaces the istio-init container and is responsible for traffic hijacking
• Compatible with both Sidecar mode and Ambient mode
• Allows traffic redirection for Pods running in non-privileged mode
Overall architecture of Ambient mode
Istio Ambient mode architecture
In Ambient mode, the Istio data plane can be divided into
two layers:
1. Security layer (ztunnel): A lightweight L4 proxy is
deployed on each node.
2. Optional L7 layer (Waypoint Proxy): Deploy only when
HTTP/gRPC proxy is needed.
The Control Plane is still provided by Istiod, which is
mainly responsible for issuing configurations and certificates to ztunnel and
Waypoint.
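To make the two-layer split concrete, here is a minimal sketch that counts data plane proxies under each mode. It relies only on what is stated above (one Envoy per Pod in Sidecar mode; one ztunnel per node plus optional Waypoint proxies in Ambient mode). The function name and the cluster sizes are hypothetical, and the count says nothing about how much CPU or memory each proxy uses.

```python
def proxy_count(pods: int, nodes: int, waypoints: int = 0) -> dict:
    """Count data plane proxies per mode (illustrative only)."""
    return {
        # Sidecar mode: one Envoy runs in every Pod.
        "sidecar": pods,
        # Ambient mode: one ztunnel per node, plus any optional Waypoint proxies.
        "ambient": nodes + waypoints,
    }


# Hypothetical cluster: 500 Pods on 20 nodes, with 3 Waypoint proxies.
print(proxy_count(pods=500, nodes=20, waypoints=3))
# → {'sidecar': 500, 'ambient': 23}
```

The ambient count grows with nodes and with the number of workloads that opt into L7, not with the number of Pods.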
Waypoint Proxy deployment strategy
• Namespace level (default): Applies to all workloads in the namespace
• Service level: Only certain key services require L7
• Pod level: More granular control
• Cross-namespace: A Waypoint can be shared across namespaces through Gateway resources
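As a rough illustration of these strategies, the sketch below builds a Waypoint as a Kubernetes Gateway resource and the label that opts a namespace or Service into it. The field values (the `istio-waypoint` gateway class, the `HBONE` listener on port 15008, and the `istio.io/waypoint-for` and `istio.io/use-waypoint` labels) reflect my understanding of the Istio documentation and should be checked against your Istio version. The helper names and the `default` namespace are hypothetical.

```python
import json


def waypoint_gateway(name: str, namespace: str, waypoint_for: str = "service") -> dict:
    """Build a Gateway manifest describing a Waypoint proxy (assumed field values)."""
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "Gateway",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Which traffic the waypoint handles, e.g. "service" or "workload".
            "labels": {"istio.io/waypoint-for": waypoint_for},
        },
        "spec": {
            "gatewayClassName": "istio-waypoint",
            "listeners": [{"name": "mesh", "port": 15008, "protocol": "HBONE"}],
        },
    }


def use_waypoint_label(waypoint_name: str) -> dict:
    """Label to put on a namespace or Service so its traffic uses the waypoint."""
    return {"istio.io/use-waypoint": waypoint_name}


if __name__ == "__main__":
    print(json.dumps(waypoint_gateway("waypoint", "default"), indent=2))
    print(use_waypoint_label("waypoint"))
```

Putting the label on a namespace corresponds to the namespace-level strategy. Putting it on a single Service narrows L7 processing to that Service.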
Istio CNI
• Traffic interception: Replaces the istio-init container to make installation clearer and simpler.
• Supports two modes: Compatible with Sidecar mode and Ambient mode.
• Non-privileged mode compatibility: Allows Pods to run in non-privileged mode to enhance security.
• CNI Chaining: Extends the node's CNI configuration by adding Istio CNI.
• Traffic redirection within the Pod (Ambient mode):
  Uses iptables REDIRECT rules within the Pod's network namespace.
  Establishes socket interception and proxies traffic within the Pod.
This diagram illustrates how Istio CNI is combined with Kubernetes' own
network plug-ins (such as Calico, Cilium, etc.). It modifies the local CNI
configuration and adds itself to the CNI chain. After Kubernetes allocates
the Pod IP, the Istio CNI interception logic runs immediately and injects
the traffic rules into the Pod's network namespace, setting different
iptables rules for Pods in different modes. This forms a chained process
with the other CNI configurations (including network policies), and the
plug-ins do not conflict with each other.
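The chaining step can be pictured as appending one more entry to the `plugins` list of the node's CNI configuration. The sketch below is a simplified model, not Istio's installer code: the `istio-cni` entry is reduced to its `type` field (a real entry carries more settings), and the base configuration is a hypothetical Calico setup.

```python
import copy


def chain_istio_cni(conflist: dict) -> dict:
    """Return a copy of a CNI conflist with an istio-cni entry appended to the chain."""
    chained = copy.deepcopy(conflist)
    existing_types = [plugin.get("type") for plugin in chained["plugins"]]
    # Append only once, so running the step again does not duplicate the entry.
    if "istio-cni" not in existing_types:
        chained["plugins"].append({"type": "istio-cni"})
    return chained


# Hypothetical node configuration before Istio CNI is installed.
base = {
    "cniVersion": "0.3.1",
    "name": "k8s-pod-network",
    "plugins": [{"type": "calico"}, {"type": "portmap"}],
}

chained = chain_istio_cni(base)
print([plugin["type"] for plugin in chained["plugins"]])
# → ['calico', 'portmap', 'istio-cni']
```

Because plugins in a chain run in order, the primary plugin has already set up the Pod's network and IP by the time the Istio entry runs, which matches the sequence described above.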
How the Istio CNI plugin works
This diagram details what Istio CNI does when a Pod is
started:
1. It enters the Pod's network namespace and creates a set of iptables
rules that hijack traffic to the socket on which ztunnel listens.
2. Init containers no longer need to be injected into each Pod, and the
Pod needs no extra privileges, which makes the overall deployment cleaner
and safer.
3. ztunnel creates a socket inside the Pod's network namespace, one for
each Pod on the node.
Traffic Path and Key Mechanisms
After introducing the components, let's take a look at the core "traffic
path". How do ztunnel and Waypoint intercept and forward traffic? We will
analyze it from two perspectives: transparent traffic interception and the
HBONE protocol.
Transparent Traffic Interception
In Ambient mode, Istio CNI injects iptables rules into the Pod's network
namespace to transparently redirect outbound traffic to the ztunnel process
on the node. ztunnel then decides whether to forward the traffic directly
at L4 or to send it to a Waypoint Proxy for further L7 processing.
As shown in the figure, the kubelet starts a Pod on the node, and the Istio
CNI agent observes this event. The agent enters the Pod's network
namespace, sets iptables rules that redirect traffic to a local socket, and
sends the file descriptor (FD) of the Pod's network namespace to ztunnel.
Once ztunnel has the FD, it can create sockets inside the Pod's network
namespace.
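The hand-off of a file descriptor between two processes can be demonstrated with standard Unix descriptor passing. The sketch below is an illustration under assumptions, not Istio's code: a `socketpair` stands in for the local socket between the CNI agent and ztunnel, a pipe stands in for the Pod's network namespace FD, and the `pod-netns` message is made up. It needs Python 3.9+ on a Unix-like system.

```python
import os
import socket


def pass_fd_over_unix_socket() -> tuple:
    """Send an FD from an 'agent' socket to a 'ztunnel' socket and use it there."""
    agent_sock, ztunnel_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()  # stand-in for the Pod's netns FD
    received_fd = -1
    try:
        # Agent side: send a short message with the FD attached as ancillary data.
        socket.send_fds(agent_sock, [b"pod-netns"], [read_end])
        # ztunnel side: receive the message and a new FD for the same open file.
        msg, fds, _flags, _addr = socket.recv_fds(ztunnel_sock, 1024, 1)
        received_fd = fds[0]
        # Show that the received FD really refers to the same pipe.
        os.write(write_end, b"hello")
        data = os.read(received_fd, 5)
        return msg, data
    finally:
        for fd in (read_end, write_end, received_fd):
            if fd >= 0:
                os.close(fd)
        agent_sock.close()
        ztunnel_sock.close()


if __name__ == "__main__":
    print(pass_fd_over_unix_socket())
```

The receiver gets its own descriptor number for the same underlying object, which is what lets a process outside the Pod operate on something that belongs to the Pod.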
When the Pod sends traffic, it would normally connect directly to the
target address, but the iptables rules redirect the traffic to the ztunnel
process on the node. ztunnel then decides whether this traffic needs to go
through a Waypoint as an L7 proxy. If not, the traffic is encrypted and
forwarded directly to the target Pod at L4; if L7 processing such as
authentication is required, the traffic is tunneled to the Waypoint.
Transparent Traffic Interception
Overview of the Packet Lifecycle
1. Pod → ztunnel: Pod traffic is first intercepted by the rules that Istio CNI installed and redirected to the ztunnel on the local node.
2. ztunnel: resolves the target address and performs mTLS encryption.
3. (If L7 policy is required) ztunnel → Waypoint Proxy: HTTP authentication, routing, and other operations.
4. Waypoint Proxy: after completing L7 processing, it sends the traffic back to ztunnel.
5. ztunnel: unpacks the traffic or continues forwarding it to the ztunnel on the target node.
6. Reaching the target Pod: the ztunnel on the target node finally delivers the traffic to the target Pod.
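The lifecycle above can be condensed into a small function that lists the hops a connection takes. This is a simplified model with hypothetical names: it treats the Waypoint as a single optional hop between the two ztunnels and collapses steps 4 and 5 into the hop to the destination-side ztunnel.

```python
def packet_path(needs_l7: bool) -> list:
    """Return the ordered list of hops for one connection (illustrative model)."""
    path = ["source Pod", "source ztunnel"]
    if needs_l7:
        # Only traffic that requires L7 policy is tunneled through a Waypoint.
        path.append("Waypoint Proxy")
    path += ["destination ztunnel", "destination Pod"]
    return path


print(" → ".join(packet_path(needs_l7=False)))
# → source Pod → source ztunnel → destination ztunnel → destination Pod
print(" → ".join(packet_path(needs_l7=True)))
# → source Pod → source ztunnel → Waypoint Proxy → destination ztunnel → destination Pod
```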
HBONE protocol
In Ambient mode, the HBONE protocol (HTTP-Based Overlay Network
Environment, built on HTTP/2 + CONNECT) is used between ztunnel and
Waypoint to establish a secure tunnel. It provides mTLS encryption and
multiplexing, reduces additional connection overhead, and simplifies the
proxy forwarding process.
HBONE protocol
The following is a simplified HBONE CONNECT request example,
which uses header information such as x-envoy-original-dst-host and
x-istio-auth-userinfo to pass the content required for routing and identity
authentication.
In this example, assuming that ztunnel A needs to send
traffic to the target node B, we can see that the outer TCP connection is
actually from ztunnel_A_IP:52368 to Node_B_IP:15008. This is the tunnel port
between ztunnel A and ztunnel B, and 15008 is the HBONE listening port.
After entering the HTTP/2 layer, there will be a CONNECT
request, and the :authority in it shows Pod_B_IP:9080, indicating that the
actual connection is to Pod B's port 9080. The same information can be seen in
x-envoy-original-dst-host.
At the same time, we see some custom headers, such as
x-forwarded-proto, x-istio-attributes, etc., which are used to bring more
context and security verification information to the target ztunnel or
subsequent proxy.
This can be understood as follows: HTTP/2 CONNECT adds an "inner" tunnel
that encapsulates the application-layer request (such as
/api/v1/users?id=123), which is then unpacked on the ztunnel B side and
forwarded to the real Pod B.
The whole process is transparent to the application, but for
us, by looking at this CONNECT request header, we can understand how the
Ambient mode does traffic routing and identity authentication at the HTTP/2
layer. This is why HBONE is more flexible than traditional Sidecar-to-Sidecar
communication and is more convenient for mTLS and L7 expansion.
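As a rough reconstruction of the request described in this section, the sketch below assembles the outer connection target and the CONNECT headers for which the text gives concrete values. It is not captured traffic: the function name and the IP addresses are hypothetical, and the other headers mentioned above are omitted because their values are not given here.

```python
HBONE_PORT = 15008  # HBONE listening port, as described in the text


def hbone_connect(node_ip: str, pod_ip: str, pod_port: int) -> dict:
    """Describe one HBONE tunnel: the outer TCP target and the inner CONNECT headers."""
    inner_target = f"{pod_ip}:{pod_port}"
    return {
        # Outer layer: the TCP connection terminates at the HBONE port.
        "outer_tcp_destination": f"{node_ip}:{HBONE_PORT}",
        # Inner layer: HTTP/2 CONNECT names the real destination.
        "headers": {
            ":method": "CONNECT",
            ":authority": inner_target,
            "x-envoy-original-dst-host": inner_target,
        },
    }


# Hypothetical addresses standing in for Node_B_IP and Pod_B_IP.
request = hbone_connect(node_ip="192.0.2.20", pod_ip="10.0.2.7", pod_port=9080)
print(request["outer_tcp_destination"])   # → 192.0.2.20:15008
print(request["headers"][":authority"])   # → 10.0.2.7:9080
```

The outer address says where the tunnel terminates, and `:authority` says where the tunneled bytes should finally go.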
Encrypted traffic on the same node
If the source Pod and the destination Pod happen to be on the same node,
the traffic still goes through ztunnel's L4 encryption process. As shown
here, ztunnel is deployed on every node as a DaemonSet and uses hostNetwork
to share the host's network. Istio CNI redirects the Pod's outbound traffic
to port 15001 of ztunnel. When the source and destination Pods are on the
same node, ztunnel completes encryption and decryption internally and
delivers the traffic to the destination Pod.
If L7 processing such as authentication is required, ztunnel establishes an
HBONE tunnel to the Waypoint proxy, and the traffic reaches the destination
Pod through the Waypoint proxy.
Cross-node encrypted traffic (L4)
This is a cross-node situation, which is the most common
scenario:
The ztunnel on the source node encrypts the traffic through the HBONE
tunnel and sends it to the ztunnel on the target node; the ztunnel on the
target node unpacks it and passes the plaintext traffic to the target Pod.
As long as the traffic is pure L4 and requires no L7 processing, no
Waypoint is involved, which saves one proxy hop.
Encrypted traffic across nodes (L4)
Encrypted traffic across nodes (L7)
When we need L7 processing, the traffic passes through Waypoint as one
additional hop. That is:
• The source ztunnel first tunnels the traffic to Waypoint;
• Waypoint performs authentication, routing, etc. at the HTTP layer;
• Waypoint then uses HBONE to send the traffic to the target ztunnel;
• The target ztunnel unpacks the traffic and sends it to the target Pod.
This process has one more proxy than the L4 path, but the advantage is that
only specific traffic is parsed by the L7 proxy, which avoids unnecessary
overhead.
Backstop traffic (prevent traffic from escaping)
Traffic from outside the Istio mesh may reach a Pod directly through the
Pod IP and port. To prevent this traffic from escaping ztunnel's control,
it must be intercepted as well. If the traffic targets an application port,
the rules check whether the packet carries the 0x539 mark. If it does not,
the packet is redirected to the plaintext port 15006, on which ztunnel
listens. After ztunnel processes it, the traffic carries the 0x539 mark and
can then reach the application's target port; if the destination of the
traffic is 15008, then the target port of the application can be
determined; if the destination of the traffic is 150087
For most traffic that only needs TCP-level encryption and
forwarding, Ambient Mode only uses ztunnel; it only passes through Waypoint
when HTTP-level policies are required.
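Returning to the backstop rule described earlier in this section, the mark check for application ports can be sketched as a small decision function. This models only the case the text spells out (an application port with or without the 0x539 mark). The port 15008 case is deliberately left out because the description of it above is incomplete. The function and constant names are hypothetical, and the real logic lives in iptables rules, not in Python.

```python
ZTUNNEL_MARK = 0x539             # packet mark set on traffic that ztunnel has processed
ZTUNNEL_PLAINTEXT_PORT = 15006   # plaintext port on which ztunnel listens


def inbound_decision(dst_port: int, mark: int, app_ports: set) -> str:
    """Decide what happens to an inbound packet aimed at a Pod (partial model)."""
    if dst_port in app_ports:
        if mark == ZTUNNEL_MARK:
            # Already handled by ztunnel: let it reach the application.
            return "deliver to application"
        # Unmarked traffic must not bypass ztunnel.
        return f"redirect to ztunnel port {ZTUNNEL_PLAINTEXT_PORT}"
    return "not modeled in this sketch"


print(inbound_decision(9080, mark=0, app_ports={9080}))
# → redirect to ztunnel port 15006
print(inbound_decision(9080, mark=0x539, app_ports={9080}))
# → deliver to application
```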
Ambient Mode vs. Sidecar Mode
Now that we understand Ambient, we still need to compare it with the
original Sidecar mode to see which capabilities are not yet mature and
which scenarios suit Ambient better.
Limitations of Ambient Mode
Compared with the traditional Sidecar mode, Ambient still has some
limitations:
• When using Sidecar and Ambient together, it is difficult to make precise proxy customizations for a single Pod (such as EnvoyFilter).
• Support for multi-cluster, multi-network, and virtual machine workloads is not yet complete, so use it with caution in production environments.
• Some deep customizations (such as WASM plugins) cannot be carried over one-to-one to Ambient.
Selection recommendations
1. If you already have a Sidecar architecture and rely on a
large number of mature features: you can continue to use Sidecar first.
2. If you pursue resource saving, simplified maintenance,
and most traffic only requires L4: you can try Ambient Mode.
3. If some applications still need to retain Sidecar, you
can consider hybrid deployment, but you need to plan the boundaries and
strategies of Sidecar/Ambient.
Summary
Okay, finally let's summarize the advantages and
disadvantages of Ambient Mode and what scenarios it is currently suitable for.
Key points review
1. Ambient Mode: Remove Sidecar, reduce the proxy burden of
each Pod, and significantly reduce resource and maintenance costs.
2. ztunnel + Waypoint architecture: Waypoint is enabled only
when L7 functions are required, and other traffic is quickly forwarded in L4
mode.
3. Although Istio has officially announced that Ambient Mode is GA,
further observation and testing are still needed for multi-cluster, VM,
and multi-network scenarios.
4. Applicable scenarios: large-scale clusters with mainly L4 traffic.
Teams that are sensitive to resource and operational costs should pay
particular attention to it.