Real-world case: Switches aggregated into a 2 Gbps port
2025.04.28
The switch aggregation port has no bandwidth-aware balancing strategy, so whenever a PC's traffic happens to be hashed onto port 9, packet loss can occur. The cases shared in this series deal with wired and wireless network problems.
I. Background
The client is a studio whose main business is self-media. It places very high demands on the network and moves a large amount of upstream and downstream traffic every day. The incoming broadband line is 10 Gbps, and the router and core switch are also 10G devices. The aggregation and access switches are interconnected through link aggregation. The equipment is from a vendor we will call J.
The simplified on-site topology is as follows:
The planned configuration is:
Simple flat network, network segment: 10.0.0.0/16
All routers and switches are managed devices
II. Symptoms
The client's staff mainly use terminals (PC1...PC6, etc.) to run high-volume download tasks. Two problems appear in use: the 2 Gbps aggregated link (LAG) between the aggregation and access switches cannot be fully utilized, and the PCs experience lag when accessing the Internet while heavy traffic is running.
III. Troubleshooting and analysis
Step 1: Confirm that the total throughput cannot reach 2Gbps
First, check the throughput of the two member links of the switch aggregation port. The two physical links carry 996 Mbps + 250 Mbps, which is about 1.25 Gbps in total.
This shows that one link is already saturated while the other is far from its limit, so the total throughput cannot reach 2 Gbps. It also follows that if more traffic keeps landing on port 9 (the saturated link), packet loss may occur.
Step 2: Confirm the packet loss situation under high traffic
While high-throughput traffic is running, the PC pings an external address and the internal gateway, with the following results:
Packet loss is indeed occurring, and most likely the PC's traffic is landing on the 1 Gbps physical link that is already nearly saturated.
In summary, our preliminary judgment is that the problem is caused by uneven traffic distribution across the aggregated links. Since the switch exposes only limited information, we need to analyze and solve the problem from how aggregation works.
Step 3: Examine the aggregation algorithm
The aggregation algorithm simply hashes on fields such as IP or MAC addresses, so a session with a fixed four-tuple always travels over one fixed physical link. This is why the total traffic cannot ramp up. Based on the topology:
The analysis is as follows:
Session 1 comes from PC5: the peak may be 700 Mbps, and it happens to be hashed to aggregation physical port 9
Session 2 comes from PC6: the peak may be 300 Mbps, and it happens to be hashed to aggregation physical port 10
Session 3 comes from PC4: the peak may be 600 Mbps, and it happens to be hashed to aggregation physical port 9
Session 4 comes from PC5: the peak may be 300 Mbps, and it happens to be hashed to aggregation physical port 10
So the total traffic is:
Physical port 9 can carry at most 1 Gbps, so it cannot reach the 1.3 Gbps offered (600 + 700)
Physical port 10 can carry the expected 600 Mbps (300 + 300), because the port is 1 Gbps and is not yet the bottleneck
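The arithmetic above can be checked with a short Python sketch. The session peaks and port assignments are the hypothetical values from the analysis, not measurements, and the names (sessions, per_port_load) are made up for illustration.

```python
# Hypothetical session peaks (Mbps) and the member port each one hashes to,
# copied from the analysis above. These are illustrative, not measured.
LINK_CAPACITY_MBPS = 1000

sessions = [
    ("Session 1", "PC5", 700, 9),
    ("Session 2", "PC6", 300, 10),
    ("Session 3", "PC4", 600, 9),
    ("Session 4", "PC5", 300, 10),
]

def per_port_load(sessions, capacity=LINK_CAPACITY_MBPS):
    """Return {port: (offered, carried, excess)} in Mbps."""
    offered = {}
    for _name, _pc, peak, port in sessions:
        offered[port] = offered.get(port, 0) + peak
    return {
        port: (load, min(load, capacity), max(load - capacity, 0))
        for port, load in offered.items()
    }

loads = per_port_load(sessions)
for port, (offered, carried, excess) in sorted(loads.items()):
    print(f"port {port}: offered {offered}, carried {carried}, excess {excess} (Mbps)")

total_carried = sum(carried for _, carried, _ in loads.values())
print(f"LAG total carried: {total_carried} of {2 * LINK_CAPACITY_MBPS} Mbps")
```

With these assumed numbers, port 9 is offered 300 Mbps more than it can carry while port 10 still has 400 Mbps of headroom, so the LAG as a whole stays below 2 Gbps.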
Basic conclusion:
The switch aggregation port has no bandwidth-aware balancing strategy, so whenever a PC's traffic happens to be hashed onto port 9, packet loss can occur. This matches the observed symptoms: the total throughput cannot reach 2 Gbps, and PC traffic suffers packet loss under high throughput.
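To illustrate why a fixed four-tuple stays pinned to one member link, here is a minimal Python sketch of hash-based link selection. It does not reproduce vendor J's actual algorithm, which is not known here. The CRC32 hash, the function name pick_member_port, and the example addresses and ports are all assumptions chosen only for illustration.

```python
import zlib

def pick_member_port(src_ip, dst_ip, src_port, dst_port, member_ports):
    """Pick a LAG member port from a hash of the four-tuple.

    Illustrative only: real switches use their own hash fields and functions.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    return member_ports[zlib.crc32(key) % len(member_ports)]

members = [9, 10]
# Hypothetical flow: a PC in 10.0.0.0/16 talking to an external server.
flow = ("10.0.0.15", "203.0.113.7", 51000, 443)

# The choice depends only on the four-tuple, never on how busy a link is,
# so repeated lookups for the same flow always return the same port.
choices = {pick_member_port(*flow, members) for _ in range(1000)}
print(choices)
```

The set printed at the end contains a single port: the hash has no notion of link utilization, so a heavy flow that lands on a busy link stays there.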
IV. Solution
Based on the above analysis, this is a probability problem, so one option is to add more member ports to the aggregation group, for example:
Session 1: the peak may be 700 Mbps, and it happens to be hashed to aggregation physical port 9
Session 2: the peak may be 300 Mbps, and it happens to be hashed to aggregation physical port 10
Session 3: the peak may be 600 Mbps, and it happens to be hashed to aggregation physical port 11
That is, with more member ports the hash is more likely to spread sessions evenly, keeping each physical port below 1 Gbps
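The probability argument can be made concrete with a small enumeration. This is a sketch under a strong assumption: each session is hashed to a member link independently and uniformly at random, which a real switch hash only approximates. The session peaks are the hypothetical values from Section III, and overload_probability is a made-up helper name.

```python
from fractions import Fraction
from itertools import product

def overload_probability(peaks_mbps, n_links, capacity_mbps=1000):
    """Exact probability that at least one link is offered more than its capacity,
    assuming every session picks a link independently and uniformly."""
    overloaded = 0
    total = 0
    # Enumerate every possible session-to-link assignment.
    for assignment in product(range(n_links), repeat=len(peaks_mbps)):
        load = [0] * n_links
        for peak, link in zip(peaks_mbps, assignment):
            load[link] += peak
        total += 1
        if max(load) > capacity_mbps:
            overloaded += 1
    return Fraction(overloaded, total)

peaks = [700, 300, 600, 300]  # hypothetical session peaks from Section III
p2 = overload_probability(peaks, 2)
p3 = overload_probability(peaks, 3)
print(p2, p3)  # prints: 3/4 13/27
```

Under these assumptions, going from two to three member links lowers the chance that some link is oversubscribed from 3/4 to 13/27. The risk shrinks but does not vanish, which fits the framing of this as a probability problem.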
After the change, the distribution is clearly more even, and no physical port reaches the 1 Gbps bottleneck. Next, test the PC pinging the gateway and the external network again:
There is essentially no more packet loss. The problem is solved!