Routing misconfiguration during cutover paralyzed KT's entire network
Starting at 11:20 local time on October 25, 2021, the wired and wireless
Internet services of the Korean operator KT were interrupted nationwide,
leaving all of its customers unable to connect to the Internet or use
telephone services for about an hour.
The KT outage also caused network congestion at the two other Korean
operators, SKT and LG U+.
For that hour, the South Korean communications industry experienced its
"darkest moment."
On October 26, KT published a public letter of apology on its official
website in the name of the CEO and confirmed the cause of the accident.
The letter stated that the outage was initially thought to be the result
of an external DDoS attack, but was finally confirmed to have been caused
by an incorrect network routing setting made while a router was being
replaced to upgrade the network.
Incorrect routing settings can send traffic in the wrong direction and
overload individual nodes, which in turn can paralyze the entire
network.
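The overload mechanism described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the node names, capacities, prefixes and traffic volumes are all invented for illustration, and KT has not published its topology or the actual faulty setting.

```python
from collections import Counter

# Hypothetical per-node capacity, in arbitrary traffic units.
CAPACITY = {"A": 100, "B": 100, "C": 100}

def node_load(flows, next_hop):
    """Sum the demand that each node receives under a given routing table."""
    load = Counter()
    for prefix, demand in flows.items():
        load[next_hop[prefix]] += demand
    return load

def overloaded(load, capacity=CAPACITY):
    """Return the nodes whose load exceeds their capacity, sorted by name."""
    return sorted(n for n, used in load.items() if used > capacity[n])

# Hypothetical traffic demand per destination prefix.
flows = {"10.0.0.0/8": 60, "172.16.0.0/12": 50, "192.168.0.0/16": 40}

# Intended routing: traffic is spread over three nodes.
correct_routes = {"10.0.0.0/8": "A", "172.16.0.0/12": "B", "192.168.0.0/16": "C"}
# A single wrong setting sends every prefix to node A.
wrong_routes = {prefix: "A" for prefix in flows}

print(overloaded(node_load(flows, correct_routes)))  # []
print(overloaded(node_load(flows, wrong_routes)))    # ['A']
```

With the intended routes no node exceeds its capacity. With the wrong routes node A receives 150 units against a capacity of 100, which is the kind of local overload the letter's explanation implies.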
Like the major network failure that Japan's Softbank encountered a few
days ago, this accident appears to have been caused by a cutover.
Although KT has announced the cause of the failure, it has not provided
a detailed explanation, which leaves the industry with a number of open
questions...
1. Why is there no self-healing protection?
Telecommunication networks have always been known for their high
stability and high reliability. As early as the PSTN telephone network
era, lines were equipped with 1+1 protection or self-healing
protection. When the main line is interrupted, traffic can usually be
switched automatically, within 50 ms, to an alternate line or rerouted
from the opposite direction.
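The switchover logic described above can be sketched in a few lines. This is a simplified illustration, not a model of any real equipment: the class and method names are hypothetical, and the detection and switching times passed in are made-up values used only to check against the 50 ms figure quoted in the text.

```python
from dataclasses import dataclass

SWITCH_BUDGET_MS = 50  # switchover target quoted in the text

@dataclass
class Path:
    name: str
    up: bool = True

class ProtectionSelector:
    """Receive-side selector for a 1+1 protected link (illustrative only)."""

    def __init__(self, working: Path, protect: Path):
        self.working = working
        self.protect = protect
        self.active = working

    def on_failure(self, failed: Path, detect_ms: float, switch_ms: float) -> bool:
        """Mark a path as down, switch over if needed, and report whether
        service was kept within the 50 ms budget."""
        failed.up = False
        if failed is not self.active:
            return True  # the active path is unaffected, no switchover needed
        standby = self.protect if self.active is self.working else self.working
        if not standby.up:
            return False  # both paths are down, protection cannot help
        self.active = standby
        return detect_ms + switch_ms <= SWITCH_BUDGET_MS

selector = ProtectionSelector(Path("working"), Path("protect"))
ok = selector.on_failure(selector.working, detect_ms=10, switch_ms=25)
print(selector.active.name, ok)  # protect True
```

The sketch also shows the limit of this kind of protection: it covers the loss of one path, but it does nothing if both paths are affected at once or if the traffic is being steered incorrectly in the first place.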
In November 2018, a fire broke out in KT's telecommunications building
in the center of Seoul's Ahyeon district, which paralyzed the area's
network and interrupted communications services for several days.
Afterwards, people in the South Korean industry suggested that KT's
network topology design was not advanced enough and that the redundancy
and self-healing protection of its equipment and lines were incomplete.
Their view was that although KT's backbone aggregation network had
sufficient redundancy and self-healing protection, the large and
expensive access network did not: its redundancy was insufficient and
its ring self-healing design was incomplete, which is why the network
stayed down for so long after the fire.
However, this accident is much more serious than the 2018 fire. The
fire affected one region, while this accident affected the whole
country. Because the impact was so wide, the failure point was probably
not in the network access layer but in the core of the network
backbone. It is like a blockage of the "aorta" that restricts the flow
of "blood" throughout the network.
The question is, does KT's core network also lack a complete
self-healing protection mechanism? This is obviously impossible. Is there
another reason?
2. Is it caused by a BGP routing configuration error?
BGP routing errors can prevent data packets from reaching their
intended IP addresses and servers and cause service interruptions.
Recalling that the recent outages of Facebook, Instagram and WhatsApp
were all caused by BGP routing issues, some insiders speculate that the
KT outage may also have been caused by a BGP configuration error.
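A minimal sketch shows how a missing route makes a destination unreachable. It does not implement BGP itself, only the longest-prefix-match lookup a router performs on the routes it has learned. The prefixes are reserved documentation ranges and the next-hop names are invented; nothing here reflects KT's actual configuration.

```python
import ipaddress

def lookup(routes, destination):
    """Longest-prefix match: return the next hop of the most specific
    route that covers the destination, or None if no route does."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in routes.items() if addr in net]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0].prefixlen)[1]

# Documentation prefixes and hypothetical next-hop names.
routes = {
    ipaddress.ip_network("192.0.2.0/24"): "core-1",
    ipaddress.ip_network("198.51.100.0/24"): "core-2",
}

print(lookup(routes, "192.0.2.10"))  # core-1

# Simulate an erroneous update that withdraws the prefix.
del routes[ipaddress.ip_network("192.0.2.0/24")]
print(lookup(routes, "192.0.2.10"))  # None
```

Once the prefix is gone, the router has nowhere to send packets for that destination, even though the servers behind it are still running.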
3. Why is the cutover operation performed during the day?
To avoid affecting network services, cutover operations are generally
carried out in the early morning, which is common practice in the
telecommunications industry. But this accident was attributed to
"replace the router to upgrade the network" work, and it occurred
during the day. Was there some reason the cutover had to be performed
at 11 o'clock in the daytime? Or was the cutover performed in the early
morning, with the failure only appearing during the day?
4. Is it an equipment problem or a manual error?
Routing configuration may be automated or performed manually. Was this
an equipment problem or a manual operation error?
Whether the cause was an equipment problem, human error, or the lack of
a backup system, some people in the South Korean industry said that it
reflects KT's negligence in network and process management.