Reader question: How do you diagnose a company-wide Internet slowdown? Follow along with a classic TCP stream analysis case.
The case shared in this issue concerns a wired network problem.
1. Background Introduction
A company recently relocated and kept its existing network equipment (a router from vendor G plus switches from vendor W). Employees now widely report that the network is slow: web pages load incompletely or take a long time, the loading spinner keeps turning, and some pages fail to open until they are refreshed several times.
2. Network Topology
About 200 people use the network. The topology is as follows:
Three-tier network architecture, divided into multiple VLANs used by the business departments.
Four residential broadband lines with dynamic public IP addresses: three Telecom lines and one Mobile line.
3. Basic Analysis
For a network-wide slowdown like this, the likely causes include an abnormal or slow DNS server, unbalanced load across the uplinks, and per-line session limits. IT therefore tried the following changes on the egress router:
Try 1: Set the "multi-line strategy" to load balancing. No obvious improvement.
Try 2: Set the "multi-line strategy" to smart line selection. No obvious improvement.
Try 3: Changed the DNS handed out by DHCP on the LAN port to the carrier's (Telecom) DNS; public DNS (Baidu, Alibaba, Tencent) had been used before. No obvious improvement.
Try 4: Used policy routing to pin traffic to a specific egress. No obvious improvement.
Session limit check: Because the lines are enterprise broadband with public IPs, the operator confirmed that the number of sessions is not limited.
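The slow-DNS hypothesis above can be checked from an affected PC before and after changing the DNS setting. The following is a minimal sketch, not the method used in this case: `time_lookup` is a hypothetical helper name, and it simply times the operating system's resolver through Python's standard `socket.getaddrinfo`. Cached answers will make repeated lookups look faster than they are.

```python
import socket
import time
from typing import Callable, Optional


def time_lookup(hostname: str,
                resolve: Callable = socket.getaddrinfo) -> Optional[float]:
    """Return how long one name lookup takes, in milliseconds.

    Returns None if the lookup fails. The resolver is injectable so the
    function can be exercised without network access (an assumption made
    for this sketch, not something from the original case).
    """
    start = time.perf_counter()
    try:
        # Port 443 is only a placeholder argument required by getaddrinfo.
        resolve(hostname, 443)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None
    return (time.perf_counter() - start) * 1000.0
```

Running `time_lookup` against a handful of commonly visited sites, once with each DNS setting, gives a rough comparison. If the lookups are consistently fast while pages still stall, as in this case, DNS is unlikely to be the cause.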
4. In-depth analysis
(1) Checking remotely, we confirmed the problem: websites were abnormally slow, and content loaded late or not at all.
(2) Look directly at an abnormal TCP stream captured on a PC while it was accessing a website (stream 61 in the capture). The Seq values in the server's responses are clearly wrong: the first packet from the server to the PC after the handshake has seq=1 and length 0, so the next packet should have seq = 1 + length = 1 + 0 = 1, but it shows 3264449421 instead.
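The sequence-number rule used above (next seq = previous seq + payload length) can be sketched as a small checker. This is an illustrative sketch only: the `Segment` record and `find_seq_jumps` function are hypothetical names, the input is assumed to be the segments of one direction of a stream in capture order, and the length of the second segment is a placeholder because the original does not state it.

```python
from dataclasses import dataclass
from typing import List, Tuple

MOD = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around


@dataclass
class Segment:
    """One TCP segment from a single direction of a stream (hypothetical record)."""
    seq: int            # sequence number as shown in the capture
    length: int         # TCP payload length in bytes
    syn: bool = False
    fin: bool = False


def expected_next_seq(seg: Segment) -> int:
    # SYN and FIN each consume one sequence number in addition to the payload.
    return (seg.seq + seg.length + int(seg.syn) + int(seg.fin)) % MOD


def find_seq_jumps(segments: List[Segment]) -> List[Tuple[int, int, int]]:
    """Return (index, expected_seq, actual_seq) for each segment whose seq
    does not continue from the previous segment in the same direction."""
    jumps = []
    for i in range(1, len(segments)):
        expected = expected_next_seq(segments[i - 1])
        actual = segments[i].seq
        if actual != expected:
            jumps.append((i, expected, actual))
    return jumps


# Values from the case: the first server packet after the handshake has
# seq=1 and length 0; the next one shows seq=3264449421 instead of 1.
# (The second segment's length is a placeholder.)
server_side = [Segment(seq=1, length=0), Segment(seq=3264449421, length=0)]
print(find_seq_jumps(server_side))  # → [(1, 1, 3264449421)]
```

A checker this simple also flags ordinary retransmissions and out-of-order segments, so a hit is a prompt to inspect the stream, not proof of corruption. A jump of this size immediately after the handshake, however, is not something normal TCP behavior produces.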
Let's look at another abnormal TCP stream for accessing the web:
Because the Seq in the TCP packets returned from the server is wrong, the PC cannot accept them. There are two possible explanations:
One is that the server really is sending bad packets, which is very unlikely. A capture on the WAN port can confirm or rule this out.
The other is that the router modified the TCP packets on the way back, which produced the error.
(3) Capture the abnormal TCP flow of the WAN port accessing the Web, as follows:
During the TCP exchange between the router's WAN port and the web server, the ACK value sent from the router's WAN side is clearly wrong: it is a very large, incorrect number instead of ack = seq + len.
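The ack = seq + len rule can be checked across both directions of a stream in the same spirit. The sketch below is an assumption-laden illustration: `Packet` and `find_ack_mismatches` are hypothetical names, the sample packets use invented relative sequence numbers (the original does not list the WAN-side values), and the capture is assumed to be in order with no loss or retransmission.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

MOD = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around


@dataclass
class Packet:
    """One packet of a two-party TCP stream (hypothetical record)."""
    sender: str         # e.g. "router" or "server"; labels are illustrative
    seq: int
    ack: int
    length: int         # TCP payload length in bytes
    syn: bool = False
    fin: bool = False


def find_ack_mismatches(packets: List[Packet]) -> List[Tuple[int, int, int]]:
    """Return (index, expected_ack, actual_ack) wherever a packet's ack
    differs from the next sequence number its peer is about to use."""
    next_seq: Dict[str, int] = {}  # sender -> next sequence number it will send
    mismatches = []
    for i, p in enumerate(packets):
        # Check against the peer only once the peer has sent something;
        # this also skips the initial SYN, whose ack field is meaningless.
        for other, peer_next in next_seq.items():
            if other != p.sender and p.ack != peer_next:
                mismatches.append((i, peer_next, p.ack))
        # SYN and FIN each consume one sequence number.
        next_seq[p.sender] = (p.seq + p.length + int(p.syn) + int(p.fin)) % MOD
    return mismatches


# Invented example: a normal handshake and request, then a bogus ACK.
stream = [
    Packet("router", seq=0, ack=0, length=0, syn=True),
    Packet("server", seq=0, ack=1, length=0, syn=True),
    Packet("router", seq=1, ack=1, length=0),
    Packet("router", seq=1, ack=1, length=200),        # HTTP request
    Packet("server", seq=1, ack=201, length=1000),     # response data
    Packet("router", seq=201, ack=4000000000, length=0),  # should be 1001
]
print(find_ack_mismatches(stream))  # → [(5, 1001, 4000000000)]
```

In real traffic an ACK may legitimately lag behind the peer's latest data (delayed or cumulative ACKs), so a mismatch needs to be read in context. An ACK far beyond anything the peer has sent, as described here, cannot be explained that way.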
(4) Comparing this with a normal TCP stream makes the difference clear:
In the normal stream, the server and the router's WAN port both compute seq and ack according to standard TCP rules, so the two sides stay in agreement and the exchange proceeds without problems.
5. Cause location and solution
(1) Root cause:
The egress router is clearly mishandling some TCP streams. Toward the LAN, the Seq value of the TCP packets returned from the website is wrong, so the PC cannot accept them. Toward the WAN, the ack value in the TCP packets sent to the website is wrong, so the web server cannot accept them and the exchange breaks down.
(2) Solution:
This is most likely a product bug. Because TCP stream handling on routers is often tied to hardware acceleration, try turning off the router's hardware acceleration function and observe whether the problem goes away. Note that disabling it generally reduces throughput and forwarding performance.
(3) Follow-up:
The problem appears to be resolved, and the case has been closed.