Spiderpool: How to solve the problem of zombie IP recycling
How do you reclaim zombie IPs in an Underlay network? Spiderpool, an open source cloud-native networking project, offers a solution. Let's take a look.
01 Underlay Network Solutions
Why is an Underlay network solution needed? In data center private clouds, many use cases call for an Underlay network:
- Low latency and high throughput: In scenarios that demand low latency and high throughput, an Underlay network solution usually outperforms an Overlay one. Because the Underlay network is built directly on the physical network, it provides faster and more stable network transmission.
- Migrating traditional host applications to the cloud: In data centers, many traditional host applications still rely on conventional networking patterns, such as service exposure and discovery or multi-subnet interconnection. Underlay networking solutions meet the needs of these applications better.
- Data center network management: Data center operators often need to apply security controls to applications, such as firewalls and VLAN isolation, and to monitor the cluster network with traditional observability tooling. These needs are easier to satisfy with an Underlay networking solution.
- Independent host NIC planning: Some workloads, such as KubeVirt, storage projects, and logging projects, require dedicated host NIC planning to guarantee bandwidth isolation on the underlying subnet. Underlay networking solutions support these requirements well, improving application performance and reliability.
As data center private clouds continue to grow in popularity, the Underlay network, an important part of data center network architecture, has been widely deployed to provide more efficient network transmission and better network topology management.
02 Zombie IP Issues in the Underlay Network
What is a zombie IP? When IP addresses assigned to pods are still recorded in IPAM but the pods themselves no longer exist in the Kubernetes cluster, those IPs are called zombie IPs.
In real production, zombie IPs in a cluster are hard to avoid. For example:
- When pods are deleted, a network exception, a crash of the CNI binary, or a similar problem can cause the cni delete call to fail, so the CNI never reclaims the pod's IP address.
- After a node goes down unexpectedly, its pods remain stuck in the Terminating state forever, and the IP addresses they occupy cannot be released.
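Pods stuck in the second state can be spotted by their deletionTimestamp: deletion has started, but the pod object never goes away. A minimal shell sketch, assuming kubectl and jq are available (the field paths follow the standard Kubernetes Pod API):

```shell
# jq filter: pods whose deletion has started (deletionTimestamp is set)
# but which still exist in the API server.
stuck_filter='.items[] | select(.metadata.deletionTimestamp != null) | .metadata.name'

# Against a live cluster:
#   kubectl get pods -A -o json | jq -r "$stuck_filter"
# Demonstrated here on a mock API response:
echo '{"items":[{"metadata":{"name":"stuck-pod","deletionTimestamp":"2024-01-01T00:00:00Z"}},{"metadata":{"name":"healthy-pod"}}]}' \
  | jq -r "$stuck_filter"
# -> stuck-pod
```

Any pod this filter reports after a node outage is a candidate whose IP is still held in IPAM.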
In a Kubernetes cluster using an Underlay network, zombie IPs can cause the following problems:
- Limited IP resources under the Underlay network: In large-scale clusters the number of pods can be very large, and IPAM assigns each pod instance an IP from a designated Underlay subnet for network communication. Zombie IPs can therefore waste a large number of IP addresses, or even exhaust the available Underlay IPs.
- Fixed-IP requirements block new pods: Suppose an IP pool with 10 IP addresses is pinned to an application with 10 replicas. If zombie IPs appear, the old pods' IPs cannot be reclaimed, and new pods fail to start because no available IPs are left. This threatens the stability and reliability of the application, and may even stop it from working at all.
03 The Spiderpool Solution
Spiderpool (https://github.com/spidernet-io/spiderpool) is a Kubernetes Underlay network solution. By providing lightweight meta plugins and IPAM plugins, Spiderpool flexibly integrates with and enhances existing CNI projects in the open source community, and greatly simplifies IPAM operation and maintenance under the Underlay network. It makes multi-CNI collaboration truly practical, and supports bare metal, virtual machine, public cloud, and other environments.
Spiderpool reclaims zombie IPs in the Underlay network through the following mechanisms:
- For pods in the Terminating state, Spiderpool automatically releases their IP addresses once the pod's spec.terminationGracePeriodSeconds has elapsed. This behavior is controlled by the environment variable SPIDERPOOL_GC_TERMINATING_POD_IP_ENABLED, and it handles the unplanned node-downtime failure scenario.
- In failure scenarios such as a failed cni delete, a pod that once held an assigned IP may be destroyed while that IP remains recorded in IPAM, forming a zombie IP. Spiderpool automatically reclaims such zombie IPs through both periodic and event-driven scanning.
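To tie this back to the fixed-IP scenario above: in Spiderpool, a 10-IP pool pinned to a 10-replica application is declared as a SpiderIPPool resource. The sketch below is illustrative only; the pool name, subnet, and IP range are hypothetical, and the fields follow the spiderpool.spidernet.io v2beta1 CRD, which may differ in other releases:

```shell
# Hypothetical sketch: a pool with exactly 10 IPs for a 10-replica app.
# Verify the apiVersion and field names against your Spiderpool version.
cat <<EOF | kubectl apply -f -
apiVersion: spiderpool.spidernet.io/v2beta1
kind: SpiderIPPool
metadata:
  name: demo-pool                # hypothetical pool name
spec:
  subnet: 172.18.0.0/16          # hypothetical underlay subnet
  ips:
    - 172.18.41.1-172.18.41.10   # exactly 10 addresses for 10 replicas
EOF
```

In such a 1:1 pool, a single unreclaimed zombie IP is enough to leave a rebuilt replica with no free address, which is why the garbage-collection mechanisms above matter.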
04 Equal scale IP allocation test
IPAM must allocate IP addresses accurately, and Spiderpool also needs robust reclamation of failed IPs, so the author ran the following 1:1 IP allocation test to verify both. The test is based on version 0.3.1 of the CNI Specification, using Macvlan with Spiderpool (v0.6.0) as the solution under test, compared against several other open source networking solutions: Macvlan with Whereabouts (v0.6.2), Kube-OVN (v1.11.8), and Calico with calico-ipam (v3.26.1). The test scenarios are as follows:
1. Create 1000 pods and limit the number of available IPv4/IPv6 addresses to 1000, so that available IP addresses and pods are 1:1.
2. Rebuild the 1000 pods at once with the following command, and record the time it takes for all 1000 rebuilt pods to reach Running. This verifies whether, with fixed IP addresses, each IPAM plugin can quickly redistribute the limited IP resources among the concurrently rebuilt pods, handling IP reclamation, preemption, and conflicts to keep application recovery fast.
~# kubectl get pod | grep "prefix" | awk '{print $1}' | xargs kubectl delete pod
3. Power all nodes off and back on to simulate fault recovery, and record the time it takes for the 1000 pods to reach Running again.
4. Delete all deployments and record the time it takes for all pods to disappear completely.
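One way to time steps 2 and 3 is to poll kubectl until every pod reports Running. A minimal sketch under that assumption (the helper parses the STATUS column of `kubectl get pod --no-headers` output; the 1000-pod target and 5-second interval are taken from the test setup):

```shell
# Count lines whose third column (STATUS) is "Running", i.e.
# `kubectl get pod --no-headers` output piped in on stdin.
count_running() {
  awk '$3 == "Running" { n++ } END { print n + 0 }'
}

# Against a live cluster (not executed here):
#   start=$(date +%s)
#   until [ "$(kubectl get pod --no-headers | count_running)" -eq 1000 ]; do
#     sleep 5
#   done
#   echo "all 1000 pods Running after $(( $(date +%s) - start ))s"
```

The same helper works for step 3 after the nodes power back on; for step 4, one would instead poll until `kubectl get pod --no-headers | wc -l` reaches zero.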
The test data is as follows:
Spiderpool and Kube-OVN allocate IPs for all pods across all cluster nodes from the same CIDR, so IP allocation and release face fierce contention and the performance challenge is greater. Whereabouts and Calico give each node its own small IP block, so contention is low and the performance challenge is smaller. Yet the experimental data shows that although Spiderpool's IPAM design is at a disadvantage here, its IP allocation performance is excellent.
In the Macvlan + Whereabouts test, 922 pods reached Running at a fairly uniform rate within 14m25s in the creation scenario; after that the growth rate dropped sharply, and the full 1000 pods took 21m49s to reach Running. In the rebuild scenario, after 55 pods reached Running, Whereabouts could no longer assign IPs to pods. Since IP addresses and pods are 1:1 in this test, an IPAM component that fails to properly reclaim IPs leaves new pods unable to start for lack of available IPs.
05 Summary
The tests above show that Spiderpool performs well across all scenarios. Although Spiderpool is an IPAM solution for Underlay networks, and it faces more complex IP address preemption and conflict problems, its IP allocation and reclamation capabilities are no weaker than those of mainstream Overlay CNIs such as Calico.