Practical example: Why do some APs always fail to come online after a power outage?

The case shared in this issue concerns a fault on the wired side of the network.

1. Background introduction
A shopping mall uses AC+AP equipment from a certain brand P, with more than 50 access points in total, to provide wireless coverage. Recently, the operations staff found that every time the equipment room loses power, some APs fail to come online afterwards. These APs have to be power-cycled several times (by unplugging and replugging their PoE switch ports) before they come up.

The topology is also very simple:

Network segment: 192.168.0.0/23, usable address range: 192.168.0.1-192.168.1.254
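
As a quick sanity check on that range, here is a minimal sketch using Python's standard ipaddress module. Only the prefix 192.168.0.0/23 comes from the case; the variable names are illustrative.

```python
import ipaddress

# The site's network segment, as given in the case.
site = ipaddress.ip_network("192.168.0.0/23")

# hosts() skips the network address and the broadcast address.
hosts = list(site.hosts())
first_host, last_host = hosts[0], hosts[-1]

print(site.netmask)   # 255.255.254.0
print(first_host)     # 192.168.0.1
print(last_host)      # 192.168.1.254
print(len(hosts))     # 510
```

Note that a /23 spans both 192.168.0.x and 192.168.1.x, which matters later when an address in 192.168.1.x shows up.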
2. Troubleshooting approach
When an AP goes offline, what is the standard troubleshooting approach? Mainly the following:


Confirm the operating status and wiring of the AP;
Confirm whether the AP has obtained an IP address correctly;
Confirm that the AP is reachable on the network and that the AC can ping it;
If the AP is managed across Layer 3, check that the DHCP server is configured with the option field that tells the AP where to find the AC for unicast management.
Let's go through them one by one. It gets interesting.

3. Troubleshooting and analysis
Step 1: Confirm whether the operation status and wiring of the AP are normal

AP wiring confirmation: Using the cable labels on the PoE switch, we confirmed that the port LED is lit normally and that PoE power and link are both normal.
AP status confirmation: We located an offline AP and, on close inspection, its indicator light was steady on, which indicates that the AP has been managed.


What does that mean? The AP appears to have been managed, yet the AC shows it as offline. Strange, isn't it?

Step 2: Confirm whether the AP has obtained the IP address correctly

The site is a Layer 2 network, so we can check directly whether the core switch's ARP table has learned the IP-MAC binding of the offline AP:

The entry is there. However, ARP entries take a long time to age out, so this alone does not prove that the AP is currently on the network. So we also check the switch's MAC address table:
The MAC address entry is present, on port 15, which means the AP is still on the network. That essentially rules out a physical connectivity problem.

Step 3: Test whether the AC can ping the AP device

For the AC to manage the AP, the two must be able to reach each other, so the next step is to ping the AP from the AC's diagnostic tool. The ping fails. We then ping the AP from the core switch:
The core switch cannot ping the AP either, yet it can learn the AP's ARP entry. What is going on? Other APs that are online normally respond to ping. The remaining item on the checklist is whether the AC manages the APs across Layer 3. This network clearly does not, so the DHCP option configuration is not relevant here.

Step 4: Unraveling the mystery

So let's gather the evidence above and analyze it:
The AP is offline, but its link is normal and its indicator light is steady on (managed).
The core switch can learn the ARP entry of the offline AP, and the MAC table shows the port the AP is attached to, so the AP is physically present on the network.
The AP holds the IP 192.168.1.12, but neither the core switch at 192.168.0.1 nor the AC at 192.168.0.253 can ping it.
That leaves only one plausible explanation: the offline AP is being managed by another, unauthorized AC (a rogue AC), and the address this rogue AC assigns to the AP is presumably in 192.168.1.X/24. Working backwards, this explains all three pieces of evidence. How do we verify it? By capturing packets over the air interface.
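
The reverse deduction can be checked with a small sketch. It assumes, as the article infers, that the AP received a /24 mask, and that the core switch and the legitimate AC use the site's /23 mask; the helper name `on_link` is made up for illustration. From the /23 side the AP looks like a directly connected neighbor, so the core resolves it with ARP and learns the entry. From the AP's /24 side the core and the AC are off-link, so its replies would be sent toward whatever default gateway it was given rather than straight back, which is one plausible reason ping fails while ARP still succeeds.

```python
import ipaddress

def on_link(local_if: str, remote_ip: str) -> bool:
    """Return True if remote_ip lies inside the subnet of local_if.

    local_if is an address with prefix length, e.g. "192.168.0.1/23".
    (Hypothetical helper, not part of any vendor tool.)
    """
    iface = ipaddress.ip_interface(local_if)
    return ipaddress.ip_address(remote_ip) in iface.network

core = "192.168.0.1/23"    # core switch address from the case, /23 assumed
ac = "192.168.0.253/23"    # legitimate AC address from the case, /23 assumed
ap = "192.168.1.12/24"     # offline AP; the /24 mask is the article's inference

# The core and the AC both treat the AP as a directly connected neighbor.
core_sees_ap = on_link(core, "192.168.1.12")
ac_sees_ap = on_link(ac, "192.168.1.12")

# The AP treats neither of them as a neighbor.
ap_sees_core = on_link(ap, "192.168.0.1")
ap_sees_ac = on_link(ap, "192.168.0.253")

print(core_sees_ap, ac_sees_ap)    # True True
print(ap_sees_core, ap_sees_ac)    # False False
```

The asymmetry is the point: the two sides disagree about whether they share a subnet, even though they sit on the same Layer 2 network.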
Step 5: Verify the existence of the rogue AC by capturing packets over the air interface

Because the access PoE switches are unmanaged ("dumb") switches, we cannot mirror a port to watch the AP's traffic and confirm what it is exchanging. We can, however, capture broadcast packets over the air interface. Why does that work? Because the AP's DHCP exchange involves broadcasts. Let's see who, besides the core switch, is assigning it an address.
After restarting an AP, the capture shows that besides the core switch (192.168.0.1), a device at 192.168.1.253 also responds and assigns the AP an address. Let's take a look at the content:

OK, that confirms it: the rogue AC has been found, and its IP is exactly 192.168.1.253. We then log in to it through this IP and take a look:

Sure enough, all the offline APs are online on it. But why was this rogue AC on the network at all? Amusingly, the site had run short of switches, and someone found a "switch-like" device that could simply be plugged in and used, and it turned out to pass wired traffic just fine.
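
The check performed in Step 5 can be sketched in code: read the DHCP server identifier (option 54) from each OFFER or ACK in a capture and flag any server that is not expected. This is a minimal illustration under stated assumptions, not the tool used on site. The function names are hypothetical, the payloads are synthetic rather than captured data, and the address 192.168.0.50 offered by the core switch is invented for the demo. The field offsets and option codes follow the DHCP message layout in RFC 2131 and RFC 2132.

```python
import socket

MAGIC_COOKIE = b"\x63\x82\x53\x63"   # marks the start of the DHCP options
OPT_PAD, OPT_END = 0, 255
OPT_MSG_TYPE, OPT_SERVER_ID = 53, 54
DHCPOFFER, DHCPACK = 2, 5

def parse_dhcp(payload: bytes):
    """Return (message_type, server_id, yiaddr) from a raw DHCP UDP payload,
    or None if the payload does not look like a DHCP message."""
    if len(payload) < 240 or payload[236:240] != MAGIC_COOKIE:
        return None
    yiaddr = socket.inet_ntoa(payload[16:20])   # address handed to the client
    msg_type, server_id = None, None
    i = 240
    while i < len(payload):
        code = payload[i]
        if code == OPT_END:
            break
        if code == OPT_PAD:          # pad is a single byte with no length
            i += 1
            continue
        if i + 1 >= len(payload):
            break                    # truncated option header
        length = payload[i + 1]
        value = payload[i + 2 : i + 2 + length]
        if len(value) < length:
            break                    # truncated option value
        if code == OPT_MSG_TYPE and length == 1:
            msg_type = value[0]
        elif code == OPT_SERVER_ID and length == 4:
            server_id = socket.inet_ntoa(value)
        i += 2 + length
    return msg_type, server_id, yiaddr

def unexpected_servers(payloads, expected_servers):
    """Return sorted server IDs seen in OFFER/ACK messages that are not expected."""
    seen = set()
    for p in payloads:
        parsed = parse_dhcp(p)
        if parsed is None:
            continue
        msg_type, server_id, _ = parsed
        if msg_type in (DHCPOFFER, DHCPACK) and server_id:
            seen.add(server_id)
    return sorted(seen - set(expected_servers))

def build_offer(server_ip: str, offered_ip: str) -> bytes:
    """Build a minimal synthetic DHCPOFFER payload for the demo."""
    header = bytearray(236)
    header[0] = 2                                   # op = BOOTREPLY
    header[16:20] = socket.inet_aton(offered_ip)    # yiaddr
    options = (bytes([OPT_MSG_TYPE, 1, DHCPOFFER, OPT_SERVER_ID, 4])
               + socket.inet_aton(server_ip) + bytes([OPT_END]))
    return bytes(header) + MAGIC_COOKIE + options

offers = [
    build_offer("192.168.0.1", "192.168.0.50"),     # core switch (offered IP invented)
    build_offer("192.168.1.253", "192.168.1.12"),   # second responder, as in the case
]
unexpected = unexpected_servers(offers, expected_servers={"192.168.0.1"})
print(unexpected)   # ['192.168.1.253']
```

In practice the payloads would come from the UDP port 67/68 packets in the capture file; how they are extracted depends on the capture tool and is left out here.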
4. Principle and solution
(1) Principle of the fault

The rogue AC was connected to the network as if it were a switch. When an AP restarts, it may be assigned an address in 192.168.1.0/24 by the rogue AC. Such an AP cannot communicate with the switches and the legitimate AC in the 192.168.0.X segment, and it comes online on the rogue AC instead.

(2) Solution

Remove the rogue AC (192.168.1.253) from the network.