Routing misconfiguration during a cutover paralyzed the entire network


 

Starting at 11:20 local time on October 25, 2021, the Korean operator KT suffered a nationwide interruption of its entire wired and wireless Internet service, leaving all its customers unable to connect to the Internet or use telephone services for about an hour.

 

At the same time, the loss of KT's entire network service caused network congestion at two other Korean operators, SKT and LG U+.

 

It is fair to say that during this hour the South Korean communications industry experienced its "darkest moment."

On October 26, KT's official website issued a public letter of apology in the name of the CEO and confirmed the cause of the accident.

 

The letter stated that the outage was initially thought to be the result of an external DDoS attack, but the cause was finally confirmed to be an incorrect routing setting applied while a router was being replaced as part of a network upgrade.

 

Incorrect routing settings can send traffic in the wrong direction and overload individual nodes, which can in turn paralyze the entire network.
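To make the overload mechanism concrete, here is a minimal sketch. It is not a model of KT's network: the node names, traffic volumes, and capacities are hypothetical, chosen only to show how one wrong setting can pile all traffic onto a single node.

```python
from collections import Counter

def node_load(routes, demands):
    """Sum the traffic (in Gbps) that each next-hop node receives."""
    load = Counter()
    for dest, gbps in demands.items():
        load[routes[dest]] += gbps
    return load

def overloaded(load, capacity):
    """Return the nodes whose load exceeds their capacity."""
    return sorted(node for node, gbps in load.items() if gbps > capacity[node])

# Hypothetical figures, for illustration only.
demands = {"region-a": 40, "region-b": 40, "region-c": 40}
capacity = {"core-1": 60, "core-2": 60, "core-3": 60}

# Intended setting: each region is served by its own core node.
intended = {"region-a": "core-1", "region-b": "core-2", "region-c": "core-3"}
# Faulty setting: every destination points at the same node.
misconfigured = {dest: "core-1" for dest in intended}

print(overloaded(node_load(intended, demands), capacity))       # []
print(overloaded(node_load(misconfigured, demands), capacity))  # ['core-1']
```

In a real network the overloaded node would start dropping traffic, and the retries and rerouting that follow can spread the congestion further.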

 

Like the major network failure that hit Japan's Softbank a few days earlier, this accident appears to have been caused by a cutover.

 

Although KT has announced the cause of the failure, it has not provided a detailed explanation, which leaves the industry with a number of open questions...

 


 

1. Why is there no self-healing protection?

Telecommunication networks have always been known for their high stability and reliability. As early as the PSTN telephone network era, network lines were equipped with 1+1 protection or self-healing protection. When the main line is interrupted, traffic can usually be switched automatically within 50 ms to an alternate line, or rerouted around the ring in the opposite direction.
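The ring detour described above can be sketched in a few lines. This is a simplified illustration, assuming a single ring and at most one failed link; the function name `ring_path` and the node labels are hypothetical, and real protection switching is done in hardware and signaling, not in application code.

```python
def ring_path(nodes, src, dst, broken=None):
    """Return a path from src to dst on a ring, avoiding the broken link.

    `nodes` lists the ring in clockwise order; `broken` is a pair of
    adjacent nodes whose link has failed, or None if the ring is intact.
    """
    n = len(nodes)
    start = nodes.index(src)
    for step in (1, -1):  # try clockwise first, then the opposite direction
        path, k = [src], start
        while nodes[k] != dst:
            nxt = (k + step) % n
            if broken and {nodes[k], nodes[nxt]} == set(broken):
                break  # this direction crosses the failed link
            k = nxt
            path.append(nodes[k])
        else:
            return path  # reached dst without crossing the failed link
    return None  # no usable direction

ring = ["A", "B", "C", "D"]
print(ring_path(ring, "A", "C"))                     # ['A', 'B', 'C']
print(ring_path(ring, "A", "C", broken=("B", "C")))  # ['A', 'D', 'C']
```

A single link failure leaves the destination reachable the other way around the ring, which is why a ring topology survives one cut.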

 

In November 2018, a fire broke out in KT's telecommunications building in Seoul's central Ahyeon district, paralyzing the area's network and interrupting communications services for several days. Afterwards, people in the South Korean industry suggested that KT's network topology design was not advanced enough and that the redundancy and self-healing protection of its equipment and lines were incomplete. The argument was that although KT's backbone aggregation network had sufficient redundancy and self-healing protection, the large and expensive access network did not: its redundancy was insufficient and its ring self-healing protection was incomplete, which is why the network stayed down for so long after the fire.

 

However, this accident is much more serious than the 2018 fire. The fire affected one district, while this accident affected the whole country. Given such a wide impact, the failure point was probably not in the access layer but in the core of the network backbone. This is like a blockage of the "aorta" that restricts the flow of "blood" throughout the network.


The question is, does KT's core network also lack a complete self-healing protection mechanism? That seems highly unlikely. Is there another reason?

 

2. Is it caused by a BGP routing configuration error?

BGP routing errors can prevent data packets from reaching their intended IP addresses and servers, causing service interruption. Recalling that the recent outages of Facebook, Instagram and WhatsApp were all caused by BGP routing issues, some insiders speculate that this outage may also have been caused by a BGP configuration error.
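The following sketch shows, in very simplified form, why a missing route makes a destination unreachable. It models only a longest-prefix-match lookup, not BGP itself; the `lookup` function and peer names are hypothetical, and the prefixes are reserved documentation ranges.

```python
import ipaddress

def lookup(table, addr):
    """Longest-prefix match: return the next hop for addr, or None if no route."""
    ip = ipaddress.ip_address(addr)
    matches = [(net, hop) for net, hop in table.items() if ip in net]
    if not matches:
        return None  # no route: the packet cannot be forwarded
    return max(matches, key=lambda m: m[0].prefixlen)[1]

# Documentation prefixes and hypothetical peer names, for illustration only.
table = {
    ipaddress.ip_network("203.0.113.0/24"): "peer-a",
    ipaddress.ip_network("198.51.100.0/24"): "peer-b",
}
print(lookup(table, "203.0.113.10"))  # peer-a

# If the route is withdrawn or misconfigured away, the address is unreachable.
del table[ipaddress.ip_network("203.0.113.0/24")]
print(lookup(table, "203.0.113.10"))  # None
```

The servers behind the address may be perfectly healthy; without a route to them, traffic simply has nowhere to go.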

 

3. Why is the cutover operation performed during the day?

To avoid affecting network services, cutover operations are generally carried out in the early morning hours, which is common practice in the telecommunications industry. But this accident was attributed to replacing a router "to upgrade the network", and it occurred during the day. Was there some compelling reason the cutover had to be performed at 11 a.m.? Or was the cutover performed in the early morning, with the failure only surfacing during the day?

 

4. Is it an equipment problem or a manual error?

Routing configuration may be applied automatically or by hand. Was this a problem with the equipment, or an operator's mistake?

 

Whether the cause was an equipment problem, human error, or the lack of a backup system, some people in the South Korean industry said that it reflects KT's negligence in network and process management.