A veteran network engineer's complete troubleshooting walkthrough, worth bookmarking!

2022.07.27

A switch is one of the most important devices in a local area network (LAN), and its working state directly affects whether client systems can get online.

In practice, however, a switch's state is easily disturbed by outside factors, which leads to all kinds of network failures on the LAN.

To keep the network running stably, we have to manage and maintain switches properly in day-to-day operation so that switch failures are avoided.

This article is reproduced from the troubleshooting experience of a veteran network engineer. While maintaining the LAN of an office building, he ran into a fault in which a floor switch could not be pinged because of an improper physical connection. Tracking it down caused him a lot of trouble.

Since this fault is fairly typical and the troubleshooting approach is worth borrowing, it is shared here.

Network troubleshooting is an essential skill for network engineers, and it is also a headache for many newcomers.

1. Fault scene

The office building I was in charge of at the time housed several companies. To make sure each company could access the Internet independently, without being affected by the others, I chose a routing (Layer 3) switch as the core switch of the building network.

On that switch, I also configured a separate virtual subnet (VLAN) for each company.

The companies are spread across different floors, and the number per floor varies: some floors have two or three, others as many as five or six.

Each company's subnet connects to the building LAN through the switch on its floor and reaches the Internet through the hardware firewall in the building network.

To manage the network efficiently, administrators usually maintain the switches over remote connections.

When I got to work that morning and checked the status of each port on the core switch, however, I found that one port was in the down state.

I checked the network documentation and found that this port connected to a Layer 2 switch on the fifth floor.

When I tried to log in to that floor switch remotely, the login failed. When I pinged the switch's IP address, the result was "Request timed out".
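The ping test described here can also be scripted. The following is a minimal Python sketch, not the tool used in the article: it assumes a system `ping` binary is available, the function names are illustrative, and `192.0.2.10` is a documentation-only address standing in for the floor switch.

```python
import platform
import subprocess
from typing import List, Optional


def build_ping_command(host: str, count: int = 4, system: Optional[str] = None) -> List[str]:
    """Build a ping command line for the given (or local) operating system."""
    system = system or platform.system()
    # Windows ping takes the echo count with -n; Linux and macOS use -c.
    count_flag = "-n" if system == "Windows" else "-c"
    return ["ping", count_flag, str(count), host]


def is_reachable(host: str, count: int = 4, timeout_s: float = 30.0) -> bool:
    """Return True if the system ping exits with status 0 for this host."""
    try:
        result = subprocess.run(
            build_ping_command(host, count),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung ping or a missing ping binary is treated as "not reachable".
        return False
    return result.returncode == 0


# Example call (hypothetical address):
#     is_reachable("192.0.2.10")
```

The exit status is only a coarse signal. On some platforms ping can report success even when the only replies are "destination unreachable" messages, so the printed output is still worth reading when a result looks suspicious.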

Just as I was wondering why no one had reported the fault, the phone rang, and sure enough, users on the fifth floor began reporting network problems one after another.

Based on these symptoms, I guessed that something had gone wrong with the floor switch itself.

So I went to the faulty switch, cut its power, and powered it back on after a short while to restart it.

After the boot operation was complete, I used the ping command to test the switch's IP address.

This time the result was normal, and remote login worked as well.

However, half an hour later the same fault recurred, and the ping test failed again.

Growing concerned, I restarted the switch several more times, but it still could not be pinged normally.

2. In-depth investigation

Since repeated restarts did not solve the problem, I judged that the cause was more complicated than a one-off glitch.

This kind of fault comes up often in network management, so I investigated further along the following lines:

Because only one floor switch on the fifth floor showed this behavior in the entire building network, my first guess was that the switch itself was at fault.

To pin down the cause, I decided to replace the faulty switch with a known-good switch and see whether the fault persisted.

At the same time, I connected the suspect switch to a separate, isolated network.

After half an hour of observation, the suspect switch was working normally in the isolated network, and its IP address could be pinged there.

The replacement switch, once connected to the building network, could not be pinged normally either.

These results all but ruled out the fifth-floor switch itself as the cause. Having eliminated the switch, I reviewed the topology and status of the whole building network.
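The reasoning behind this swap test can be written down as a small decision table. The Python sketch below only illustrates the logic; the function name and the result labels are my own and do not come from any tool.

```python
def locate_fault(suspect_ok_in_isolation: bool, replacement_ok_in_place: bool) -> str:
    """Interpret a swap test.

    suspect_ok_in_isolation: the suspect switch works on a separate test network.
    replacement_ok_in_place: a known-good switch works in the original location.
    """
    if suspect_ok_in_isolation and not replacement_ok_in_place:
        # The case in this article: the device is fine, its surroundings are not.
        return "environment"
    if not suspect_ok_in_isolation and replacement_ok_in_place:
        return "device"
    if not suspect_ok_in_isolation and not replacement_ok_in_place:
        return "device and environment, or inconclusive"
    # Both work: the fault may be intermittent or may need longer observation.
    return "inconclusive"


print(locate_fault(suspect_ok_in_isolation=True, replacement_ok_in_place=False))
```

Because this fault took about half an hour to reappear, each half of the test has to run at least that long before its result can be trusted.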

Users on the other floors could access the Internet normally; only some users on the fifth floor had reported problems.

Looking at the network documentation for the fifth floor, I saw that five companies were located there. The network administrator at the time had installed two floor switches on the fifth floor and cascaded them together.

Five virtual subnets were configured across these two switches so that each company could work independently in its own virtual subnet.

But since the corresponding port on the core switch was down, none of the companies on the fifth floor should have been able to access the Internet. So why had only some users reported faults?

As soon as working hours began, I called the companies that had not reported a fault. They said they had just noticed that network access was abnormal and were about to ask the building network administrator for help.

So none of the companies on the fifth floor could access the Internet, which meant the cause had to lie within the virtual subnets of these companies.

Having narrowed the search to the five companies on the fifth floor, I recalled that restarting the fifth-floor switch had temporarily cleared the fault.

Only after about half an hour did the same fault appear again.

Given this distinctive pattern, I suspected a broadcast storm that congested the floor switch some time after each restart and eventually blocked the corresponding port on the core switch.

To analyze the fault, I used a network monitoring tool to examine the packets on the cascade port of the fifth-floor switch.

Both the incoming and the outgoing traffic were very heavy, almost 100 times the normal value, which indicated congestion in the fifth-floor network.
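A comparison like this is just the observed rate divided by a baseline. The Python sketch below uses made-up packet rates, since the article gives no absolute figures, and the alert factor of 10 is an arbitrary assumption.

```python
def traffic_ratio(observed_pps: float, baseline_pps: float) -> float:
    """Return how many times the observed packet rate exceeds the baseline."""
    if baseline_pps <= 0:
        raise ValueError("baseline_pps must be positive")
    return observed_pps / baseline_pps


def is_abnormal(observed_pps: float, baseline_pps: float, factor: float = 10.0) -> bool:
    """Flag a port whose packet rate is at least `factor` times its baseline."""
    return traffic_ratio(observed_pps, baseline_pps) >= factor


# Hypothetical figures: a 2,000 packets/s baseline and 190,000 packets/s observed.
print(traffic_ratio(190_000, 2_000))
print(is_abnormal(190_000, 2_000))
```

The check only means something if a baseline was recorded while the network was healthy, which is one more reason to keep routine measurements for each uplink.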

So was the congestion caused by a network virus?

Or was it caused by a network loop?

I decided to watch how the counters on the faulty switch's cascade port changed, especially the outgoing broadcast packet count. If that count keeps climbing every second, then nine times out of ten there is a network loop on the fifth floor.

Following this idea, I connected directly to the faulty switch with a console cable and logged in as the system administrator.

I then used the display command to check the outgoing broadcast packet count on the cascade port once per second and compared successive results.

After repeated checks, I found that the outgoing broadcast packet count of the faulty switch was indeed increasing continuously.
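The once-per-second comparison amounts to taking differences between successive counter readings. Here is a minimal Python sketch of that idea, assuming the readings have already been copied out of the switch's display output by hand; the sample values and the threshold are invented for illustration.

```python
from typing import List, Sequence


def per_interval_deltas(samples: Sequence[int]) -> List[int]:
    """Return the increase between each pair of consecutive counter readings."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


def looks_like_storm(samples: Sequence[int], min_delta: int) -> bool:
    """True if every interval added at least `min_delta` broadcast packets.

    Counter wrap-around is ignored in this sketch.
    """
    deltas = per_interval_deltas(samples)
    return bool(deltas) and all(delta >= min_delta for delta in deltas)


# Hypothetical readings taken one second apart.
storm = [1_000, 51_000, 103_000, 156_000]
quiet = [1_000, 1_003, 1_009, 1_012]
print(per_interval_deltas(storm))
print(looks_like_storm(storm, min_delta=10_000))
print(looks_like_storm(quiet, min_delta=10_000))
```

A broadcast counter rises slowly even on a healthy port, because ordinary traffic such as ARP is broadcast too. What points to a loop is a large, sustained increase in every interval, which is why the sketch compares each delta against a threshold.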

This showed that there had to be a network loop somewhere among the five companies on the fifth floor.

Looking closely at the two switches on the fifth floor, I found that the physical connection between them was normal.

In addition, every port on these two switches is wired directly to a wall network socket in one of the rooms on the fifth floor.

In principle, as long as no room cascades its own switches at will, there should be no network loop.

Since a loop had been shown to exist on the fifth floor, someone must have added a switch to extend their network access. We only needed to find these add-on switches and check their physical connections to locate the faulty node quickly.

So I called the network administrators of the companies on the fifth floor and asked them to check every office and report any room that was using a downstream switch.

The results came back quickly: about 10 rooms were using downstream switches to extend their network access.

At this point I knew the loop was most likely in one of these 10 rooms. But which one?

Did I have to visit each room in turn and check its network connections?

After some thought, I pulled up the network documentation and looked up, one by one, the switch port numbers serving these 10 rooms.

Then I plugged a network cable directly into each of these switch ports in turn and, in each port's view mode, pinged the IP address of the faulty switch.

When I reached the sixth switch port, the ping from that port failed.
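This port-by-port elimination is a simple loop over the candidate ports. The Python sketch below separates the loop from the actual test by passing the test in as a function; the port names and the stand-in probe are hypothetical and only mirror the outcome described in the article.

```python
from typing import Callable, Iterable, List


def find_failing_ports(ports: Iterable[str], probe: Callable[[str], bool]) -> List[str]:
    """Return the ports for which the probe reports failure, in input order."""
    return [port for port in ports if not probe(port)]


# Ten hypothetical candidate ports, one per room that uses a downstream switch.
candidates = [f"Ethernet0/{n}" for n in range(1, 11)]


def fake_probe(port: str) -> bool:
    """Stand-in for the manual ping test: only the sixth port fails."""
    return port != "Ethernet0/6"


print(find_failing_ports(candidates, fake_probe))
```

Checking every candidate instead of stopping at the first failure costs a few more minutes, but it would also reveal a second bad port if there were one.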

To determine whether that port was the problem, I used the display command in the port's view mode to check its status information.

The input and output packet counts on that port were clearly abnormal, so I concluded that this port was the source of the faulty switch's abnormal behavior.

After consulting the records, I quickly identified the room served by that switch port.

When I got there, I found that the room's only two network sockets were each connected to a small hub, with several computers attached to the two hubs.

Worse, a network cable also connected the two hubs directly to each other, forming a network loop between them.
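Why this wiring forms a loop is easy to see once the cables are listed as links between devices. The Python sketch below uses a union-find structure to report any cable that joins two devices that are already connected; the device names are labels I made up for the room described above.

```python
from typing import Dict, Iterable, List, Tuple

Link = Tuple[str, str]


def find_loop_closing_links(links: Iterable[Link]) -> List[Link]:
    """Return each link whose two ends were already connected by earlier links."""
    parent: Dict[str, str] = {}

    def find(node: str) -> str:
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    closing: List[Link] = []
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            closing.append((a, b))  # this cable completes a cycle
        else:
            parent[root_a] = root_b
    return closing


room_cabling = [
    ("floor_switch", "socket_1"),
    ("floor_switch", "socket_2"),
    ("socket_1", "hub_a"),
    ("socket_2", "hub_b"),
    ("hub_a", "hub_b"),  # the extra cable between the two hubs
]
print(find_loop_closing_links(room_cabling))
```

The function reports whichever cable completes the cycle in the order the links are listed. Physically, removing any one cable on the cycle breaks the loop, but the hub-to-hub cable is the only one that serves no other purpose.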

The broadcast storm caused by the loop eventually blocked the cascade port of the faulty switch, so that the whole fifth floor was unable to access the Internet normally.

3. Resolution

After unplugging the extra network cable, I checked the switch port's status again and found that the input and output packet counts had returned to normal.

When I checked the corresponding port on the core switch again, its status had changed from "down" to "up", and I could now ping the faulty switch on the fifth floor normally.

This confirmed that the problem was caused by users in a fifth-floor room who had extended the network with unauthorized switches or hubs. After questioning the users further, I learned that the room had been cleaned the night before, and all the network cables had been unplugged during the cleaning.

Afterwards the users, who did not know much about network cabling, plugged the cables back in at random, which created the network loop.