A sensible network engineer should have learned to deal with this kind of failure long ago
Hello everyone, I am Lao Yang.
I have said many times that troubleshooting is a required course for every network engineer. Plenty of people are clumsy at the keyboard when they start out, but it is relatively rare for a slip of the hand to cause a serious outage.
Today, a fan of Lao Yang shared with me the "tragedy" that followed a mistake made at work.
After reading the whole story, Lao Yang thinks it is very instructive and can serve as a reference for novice network engineers.
He has a sentence that I agree with: "Failures caused by problems with the software or hardware of the device itself are actually relatively rare."
"Most of it is man-made."
With a setup like this, who has a more outrageous story than mine?
This is a network failure I ran into at work some time ago.
I think the most important point in troubleshooting is this: once you discover a network failure, you must first form a basic picture of the live network environment.
You have to know what the topology looks like and how the devices are configured. Only after collecting this basic information can you analyze the fault.
So I will first describe the background of the failure and give you a few clues so we can work it out together.
The background of this failure is actually a very simple configuration change.
During the noon break, my colleague made configuration changes to the aggregation switches in a certain building in the factory.
What he actually wanted was to add an access switch. The plant's headcount had grown and so had the number of terminals, so the existing access equipment could no longer meet demand and a new access layer switch was needed.
As we all know, adding an access switch to a live network is a very simple change.
Generally, you only need to mount the access switch in the rack and run two fibers to the aggregation switch.
And what configuration is usually needed to connect the access switch to the aggregation switch?
Generally speaking, you configure the VLANs the access switch needs, then configure the two interfaces (or the single interface) on the aggregation switch that face the access switch as trunks, and you are done.
Links between switches need to be configured as trunks. This is basic CCNA material, so I will not go over it here.
Then, to make the link reliable, you configure link aggregation between the access switch and the aggregation switch. This is also fairly basic material, and the configuration is very simple.
As planned, my colleague pushed the configuration and went back to his desk to rest. The fault did not show up right away.
Nobody was working during the lunch break, so nobody noticed the problem.
It was only at 2 p.m., when people came back to work and tried to get online, that they realized: "Ah? The switch on the floor is faulty!"
For end users, the most direct symptom is that the computer cannot access the Internet.
And what do you do when you can't get online? Complain.
So they called tech support.
His change really was simple: he added one new device and a few lines of configuration. Yet at that point the network failed.
If this happened to you, how would you analyze the failure?
Let's take a look at the network architecture. It is a traditional three-tier design: the core connects to aggregation, and aggregation connects to the floor access switches. Simple, right?
The only twist is that this architecture uses stacking.
Below the core, the two aggregation switches are stacked. The core switches are ignored here; what matters is that the two aggregation switches form one stack.
Although there are two aggregation switches here, logically they are one device, and the two uplinks from an access switch are bundled into a single link.
Link bundling (link aggregation) is the basic companion technology of stacking.
The logical topology is shown in the figure. Is this topology loop-free? Because of stacking, it actually is a loop-free network.
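To make the loop-free claim concrete, here is a minimal Python sketch. The switch names (agg1, agg2, access1, agg-stack) and the helper has_cycle are made up for illustration and are not taken from the real network. It treats the topology as an undirected graph and uses union-find to look for a cycle, first in the physical cabling and then in the logical view where the stack and the bundled uplinks are each collapsed into one.

```python
def has_cycle(links):
    """Return True if the undirected graph given as (a, b) links contains a cycle."""
    parent = {}

    def find(node):
        # Follow parent pointers to the representative of this node's group.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return True  # this link joins two nodes that are already connected
        parent[root_a] = root_b
    return False


# Physical cabling (hypothetical names): the stack link plus one uplink
# from the access switch to each stack member.
physical = [("agg1", "agg2"), ("access1", "agg1"), ("access1", "agg2")]

# Logical view: the stack is one switch and the two bundled uplinks are one link.
logical = [("access1", "agg-stack")]

print(has_cycle(physical))  # True
print(has_cycle(logical))   # False
```

The physical cabling forms a triangle, so it only counts as loop-free once both the stack and the link bundle are really in effect.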
With the aggregation layer acting as a single device, the access switch needs only three things: assign a VLAN to the downlink interfaces that connect to the terminals (PCs), configure link aggregation on the uplink interfaces, and set the aggregated uplink as a trunk port. That is three simple configurations.
Why do these three simple configurations cause loops in this network?
Of course, it wasn't known at the time that it was a loop.
My colleague believed at the time that this network was loop-free: it used stacking, and link aggregation was configured. A network like that has no loops...
So his first reaction was not to suspect a loop.
Now look at our first clue: a loop-free network.
And the second clue: the configuration is very simple, just three items: VLAN, trunk, and link aggregation.
Have you worked anything out yet?
Now for the third clue: the devices cannot be logged in to remotely. What does that tell us?
What is the first thing most people do when handling a network failure?
You log in to the network device remotely and check whether the configuration is correct.
Since the configuration had only just been pushed, the first step was to log in remotely to either the aggregation switch or the access switch.
But the login attempts failed. Neither device could be reached...
What if you can't log in?
When a device cannot be logged in to remotely, what is usually going on?
It can't be pinged and it can't be reached over Telnet. Some of you have run into this at work, right?
This is a common situation. If the device only fails to answer ping, you would just check that routing to it is fine. But if it can be neither pinged nor managed remotely, that usually means its CPU is maxed out.
That is because every packet addressed to the device's own IP address, a ping for example, has to be processed by the CPU.
Ping, Telnet, and all other remote management packets are handled by the CPU. If the CPU spikes to 100%, remote login and management become impossible.
At this point, my colleague finally began to wonder whether it was a network loop...
In a Layer 2 network, when the CPU hits 100%, the most likely cause is a loop. So it could basically be concluded that this was a loop.
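The chain of reasoning in this step can be written down as a small decision function. This is only a sketch of the rule of thumb described above: the function name, its parameters, and the returned strings are made up for illustration, and a real diagnosis still needs evidence from the device itself.

```python
def first_suspicion(ping_ok, login_ok, layer2_network):
    """Map the symptoms described in the text to a first suspicion."""
    if not ping_ok and not login_ok:
        # Both ping and remote management are handled by the device CPU.
        if layer2_network:
            return "suspect a Layer 2 loop saturating the CPU"
        return "suspect CPU saturation"
    if not ping_ok:
        return "check routing to the device"
    if not login_ok:
        # This combination is not discussed in the text, so no conclusion is drawn.
        return "inconclusive"
    return "log in and review the configuration"


# The situation in this incident: no ping, no remote login, Layer 2 network.
print(first_suspicion(ping_ok=False, login_ok=False, layer2_network=True))
```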
But why? Why would a loop occur? The network was clearly configured to be loop-free! Where did the loop come from?
What do you do at this point? Head to the site.
I happened to be at the scene at the time. Since this problem suddenly occurred at noon that day, my colleague also asked me to help.
So I went to the scene with them to see what the problem was.
Generally speaking, you need to bring a console cable and a laptop when you go on site.
Once you are at the device, you connect directly through the console cable and log in to view the configuration.
But as everyone knows, with the device's CPU pinned at 100%, even a console session is extremely sluggish: you type a command and wait several seconds for a response.
So when you hit a loop and the device's CPU is saturated, it is hard to inspect the configuration, and therefore hard to troubleshoot.
So I used the crudest method available: unplugging cables.
On the aggregation switch, I unplugged the ports connected to the access switches one by one.
Finally, when the link to the newly added device was pulled, the network recovered. That basically pinned the problem on the newly added switch.
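The unplugging procedure can be sketched as a simple loop. Everything here is a stand-in: the port names are hypothetical, the callbacks represent a person pulling and re-seating cables, and network_healthy represents whatever check you use to see whether the network has recovered. Reconnecting a port that turns out not to be the culprit is an assumption of this sketch; the story does not say whether that was done.

```python
def isolate_faulty_port(ports, disconnect, reconnect, network_healthy):
    """Disconnect downlinks one at a time; return the first port whose removal restores the network."""
    for port in ports:
        disconnect(port)
        if network_healthy():
            return port  # leave it disconnected; this link is the suspect
        reconnect(port)  # not the culprit, so plug it back in
    return None


# Toy stand-in for the real network: the fault persists while the faulty port is connected.
connected = {"port1", "port2", "port3"}
FAULTY = "port3"  # hypothetical: the link to the newly added access switch

suspect = isolate_faulty_port(
    ports=sorted(connected),
    disconnect=connected.discard,
    reconnect=connected.add,
    network_healthy=lambda: FAULTY not in connected,
)
print(suspect)  # port3
```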
The configuration is so simple, but there is a loop, which is outrageous.
What is the only scenario in this place where a loop might occur?
I will tell you directly: the link aggregation configuration was wrong. Link aggregation had been configured on only one end; the configuration was pushed only to the aggregation switch.
As a result, the two ports on the aggregation switch were bundled together, but the access switch below had no link aggregation configured, so its two uplinks were independent ports.
So once pulling the cable had restored the network, I logged in to the device to check the configuration.
The check left me not knowing whether to laugh or cry. It was very simple: two configuration lines were missing, the two link aggregation commands under the physical interfaces. It was that simple.
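A mistake like this can be caught before the cables go in by checking that both ends of every link agree on link aggregation. The sketch below assumes a made-up data model, a list of links and a dictionary of link aggregation membership keyed by (device, port); the device names, port names, and the helper unbundled_ends are all hypothetical and do not correspond to any vendor's configuration format.

```python
def unbundled_ends(links, lag_membership):
    """Return link ends that are not in a link aggregation group while their peer is."""
    problems = []
    for end_a, end_b in links:
        a_in_lag = lag_membership.get(end_a) is not None
        b_in_lag = lag_membership.get(end_b) is not None
        if a_in_lag and not b_in_lag:
            problems.append(end_b)
        elif b_in_lag and not a_in_lag:
            problems.append(end_a)
    return problems


# Hypothetical device and port names; each link is (aggregation end, access end).
links = [
    (("agg-stack", "port1"), ("access-new", "uplink1")),
    (("agg-stack", "port2"), ("access-new", "uplink2")),
]

# Only the aggregation side was configured, as in the incident.
lag_membership = {
    ("agg-stack", "port1"): "lag1",
    ("agg-stack", "port2"): "lag1",
}

print(unbundled_ends(links, lag_membership))
# [('access-new', 'uplink1'), ('access-new', 'uplink2')]
```

Once the two access-side uplinks are added to the membership table, the function returns an empty list.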
A problem this simple caused the network loop, and spanning tree did not help.
Spanning tree was in fact enabled, so my colleague was very puzzled: "Why is there still a loop when spanning tree is on? Is there a bug in your device, or is something wrong with it?"
So why does spanning tree not work?
Because from the aggregation switch's point of view, there is only one port;
but from the access switch's point of view, there are two independent ports.
When the access switch sends spanning tree BPDUs out of an uplink, the aggregation switch receives them on its aggregated port and does not send them back out of that same port.
Since it is a single port, nothing is sent back through it, so the devices do not conclude that there is a loop in the network.
As a result, spanning tree is useless here: even with it enabled, the loop is not detected.
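The "only one port" point can be illustrated with a toy model of the forwarding rule: a switch does not send a frame back out of the logical port it arrived on. This is not an implementation of spanning tree, just a sketch of the explanation above; the port names and the helper egress_ports are hypothetical.

```python
def egress_ports(ingress_physical, logical_port_of, all_physical_ports):
    """Physical ports a frame may leave on: every port outside the ingress logical port."""
    ingress_logical = logical_port_of[ingress_physical]
    return [p for p in all_physical_ports if logical_port_of[p] != ingress_logical]


# Aggregation switch as configured in the incident: port1 and port2 are
# bundled into lag1, while port3 is an ordinary standalone port.
agg_ports = ["port1", "port2", "port3"]
bundled = {"port1": "lag1", "port2": "lag1", "port3": "port3"}

# A frame from the access switch arrives on port1. Because port2 belongs to
# the same logical port, nothing goes back toward the access switch.
print(egress_ports("port1", bundled, agg_ports))  # ['port3']
```

In this model, nothing the access switch sends up one uplink ever comes back on the other, which matches the description that the aggregation switch sees a single port.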
Afterwards, I corrected the link aggregation configuration and the network recovered.
It really was a very simple oversight: two commands were missed when the configuration was pushed, and that took the network down.
In fact, more than 50% of the failures on live networks are man-made, mostly caused by negligence during configuration changes.
Failures caused by problems with the software or hardware of the device itself are relatively rare.
Most of them are man-made. Either the configuration is unreasonable, or the planning is unreasonable, which will lead to this or that problem.
So: care + solid fundamentals = a good network engineer.
Early in their careers, I think many network engineers go through something similar and make one mistake or another that should not have happened.
It doesn't matter, really. As long as you master the theory of networking and enter every command carefully, many problems can be solved easily.
Take this network loop: the troubleshooting steps were actually very simple.
With a loop, if you cannot log in to your network devices, the only option is the crude one: unplug the cables one by one, because the devices are too sluggish to inspect.
That is the troubleshooting experience I wanted to share today. I hope it gives you some inspiration.