I would like to say a few more words about this communication failure...

2022.07.18
Over the past few days, everyone has been following the large-scale communication failure at the Japanese telecom operator KDDI.

The failure had a huge impact: it affected all of Japan and a total of 39.15 million users. It also lasted a long time, taking almost two days before service was considered basically restored.

Many public accounts have already written about the specific cause of the failure, so I will not repeat that analysis.

In today's article, I want to broaden the topic and discuss it in depth: it is 2022, so why do our communication networks still fail so often, and is there an ultimate solution?

Communication failure: a century-old game

Failure is an inherent property of communication networks. Just as people get sick, communication networks have been accompanied by failures since their inception. Put another way, communication networks were created in the very process of solving faults.

Figure: Bell invented the telephone only after working through countless malfunctions

For more than 100 years, countless communications engineers have fought a persistent battle of wits against faults. They have worked hard to develop all kinds of technologies and adopted all kinds of measures to combat communication failures.

From a macro perspective, the struggle has paid off. With the steady accumulation of experience and the continuous progress of technology, the probability of communication network failure keeps decreasing.

Young readers may not know that more than 20 years ago, landline telephone failures (back when few households even had a telephone) were as common as water and power outages. More than 10 years ago, it was also common for mobile phone calls to fail to connect and for Internet access to drop.

These phenomena have become much rarer in the last decade. When one does happen, everyone finds it strange. When the network goes down, many people's first reaction is that their phone is broken or their account is in arrears, so they hurry to restart the phone or top up. Isn't that so?

In today's information society, communication networks are infrastructure as important as water and electricity. Our work and life, as well as the operation of every industry, depend on them.

Given this, communication operators, as state-owned enterprises and as the builders and maintainers of the network, always put network security and stability first.

To ensure network stability, the Ministry of Industry and Information Technology sets strict assessment indicators for operators. If a network failure occurs in a province or city, the responsible leaders will certainly be held accountable, and their careers will be in jeopardy.

The pressure on operator leaders is passed on to employees, as well as to equipment vendors and outsourcing contractors.

Market competition is now so fierce that an accident means either a huge compensation payment or the loss of market share in that province. Equipment vendors and outsourcing contractors cannot afford such losses.

Therefore, the entire communication industry has every reason to take network security and stability seriously. What remains is a question of capability and execution.

Where are the weaknesses of the communication network?

First, let me explain how the security levels of communication networks are defined.

Depending on the scenario, communication network security is divided into different levels. From low to high, they are home-grade, enterprise-grade, and carrier-grade.

Figure: Security levels of communication systems

The routers we use at home are all home-grade. The reliability of this kind of equipment is very low: it can break at any time, which easily leads to network interruptions.

Enterprise-grade refers to the network equipment used in organizations. Scaled to the size of the network and the number of users, enterprise-grade devices offer higher reliability and are less prone to service interruptions.

Carrier-grade requirements are higher still. The networks of China Mobile, China Telecom, and China Unicom must serve hundreds of millions of users, so they absolutely cannot be allowed to fail easily. Generally speaking, carrier-grade reliability should reach at least five nines (99.999% availability).
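To put the "nines" in perspective, an availability figure can be converted into a yearly downtime budget. The snippet below is only a minimal sketch: the function name is made up for illustration, and it assumes that N nines means an availability of 1 - 10^-N measured over a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(nines: int) -> float:
    """Maximum yearly downtime allowed by an availability of N nines."""
    unavailability = 10.0 ** -nines  # five nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (3, 4, 5):
        minutes = downtime_minutes_per_year(n)
        print(f"{n} nines: about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, five nines leaves a budget of only about five minutes of downtime per year, which helps explain why an outage lasting almost two days is so far outside the carrier-grade target.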

The communication networks that Xiaozaojun is discussing today are the public communication networks that operators provide to the general public, including both cellular mobile networks and fixed-line broadband networks. They are all carrier-grade.

The architectures of cellular mobile communication networks and fixed-line broadband networks are actually similar, and the main difference lies in the access network part.

A cellular mobile communication network uses a wireless access network, and its access devices are base stations. A fixed-line broadband network uses a wired access network, and its access devices are PON (passive optical network) equipment, including optical modems.

Let us take the cellular mobile communication network as an example.

Public communication networks serve hundreds of millions of users. They therefore usually adopt a pyramid-shaped hierarchical architecture, with the core network as the center, the transmission network (bearer network) as the trunk, and the access network as the limbs.

Everyone can see at a glance that the biggest weakness of this architecture lies in the core network and transmission network (especially the backbone network).

The core network is the management center, the heart and brain of the network. Once it fails, the entire network fails. That is why the core network engineer (as I once was) holds the position with the greatest risk and pressure.

Figure: Core network equipment room

The transmission network (bearer network) is the blood vessels and nerves of the communication network. Damage at the extremities is tolerable, since it affects a small area at most. But what if the major blood vessels of the heart and brain are damaged? The result is likewise complete paralysis.

Figure: Optical transmission equipment

The KDDI failure this time, the DoCoMo failure in October 2021, the failure affecting the four major UK operators in 2020, and the CenturyLink failure in the United States in 2020 were all related to core routers. To put it bluntly, something went wrong in the major blood vessels of the heart and brain, and the whole body (the network) was paralyzed.

In contrast, the probability of a major problem on the access network side is very low. When an individual base station goes down, it affects at most a few hundred to a few thousand people, so the scope is small and complaints are manageable.

Figure: Base station equipment

A large-scale failure of the access network is most likely caused by a software version problem or a hardware batch problem at the equipment vendor. The probability of this is extremely low.

What do communications engineers do to prevent failures?

So what measures have communications engineers adopted to keep the communication network running safely and smoothly and to prevent failures?

(1) First, improve the top-level architecture design.

Network architecture is the foundation of network security. A good architecture takes into account performance and capacity, cost, and also security and redundancy.

Please remember one thing here: communication equipment is a complex product. No matter how it is designed or how much hardware is stacked up, it can still fail; it is only a matter of probability and time.

Rather than only guarding against possible failures, it is better to focus on what to do after a failure occurs.

Therefore, introducing a backup mechanism is the most effective way to deal with failures.

Figure: Backup mechanism

Everyone has studied probability and statistics. If the probability of one device failing is 1%, then the probability of two devices failing at the same time, assuming their failures are independent, is 1% × 1% = 0.01%.
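The same arithmetic extends to any number of redundant devices. The snippet below is a minimal sketch with a made-up function name. It assumes that device failures are statistically independent, which real deployments only approximate: a shared power feed, a shared site, or a shared software bug can take out several devices at once.

```python
def all_fail_probability(p_single: float, devices: int) -> float:
    """Probability that every one of `devices` independent devices fails.

    Assumes each device fails with the same probability `p_single`
    and that failures are independent of each other.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability between 0 and 1")
    if devices < 1:
        raise ValueError("devices must be at least 1")
    return p_single ** devices


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} device(s): {all_fail_probability(0.01, n):.6%}")
```

The independence assumption is also why backups are only as good as their separation: two devices in the same flooded room fail together, no matter what the multiplication says.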

To maximize security, network architectures are designed using the POOL (pool) networking method.

Several devices together form a pool (POOL), and each handles a share of the traffic. If one of them breaks, the others take over immediately so that service is not affected.
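The pool idea can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not how any vendor's equipment actually works: the `Pool` class, the node names, and the modulo-based user assignment are all hypothetical. Real pools use more careful load distribution so that one failure does not reshuffle users who were on healthy nodes.

```python
class Pool:
    """A toy pool of equivalent network elements that share the load."""

    def __init__(self, node_names):
        # Every node starts out healthy.
        self.healthy = {name: True for name in node_names}

    def mark_failed(self, name: str) -> None:
        """Record that a pool member has gone out of service."""
        self.healthy[name] = False

    def serving_node(self, user_id: int) -> str:
        """Pick a healthy node for a user; raise if the whole pool is down."""
        candidates = sorted(n for n, ok in self.healthy.items() if ok)
        if not candidates:
            raise RuntimeError("all pool members have failed")
        return candidates[user_id % len(candidates)]


if __name__ == "__main__":
    pool = Pool(["node-a", "node-b", "node-c"])
    print("before failure:", pool.serving_node(7))
    pool.mark_failed("node-b")
    print("after failure: ", pool.serving_node(7))
```

In this sketch, user 7 is served by `node-b` until it fails and by a surviving node afterwards. Service is lost only when every member of the pool is down.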

There are usually two or more sets of core equipment, located in different areas of the provincial capital and physically far apart.

In addition, when the network architecture is designed, important equipment and network elements are usually placed in core equipment rooms with a higher security level.

Figure: Core equipment room

For example, the HSS, the most important element in the mobile communication network, stores and manages user data (it was formerly the HLR, and holds each user's mobile phone number, authentication data, service information, and so on). It is placed in the core equipment room in the provincial capital. In addition, maintenance personnel regularly make physically isolated off-site backups of the data.

In recent years, in view of geological disasters, wars, terrorist attacks, and other factors, operators have even begun to keep backups in different provinces.

For example, during the flood in Zhengzhou last year, when the core equipment room was flooded and the HLR went out of service, the HLR in a neighboring provincial capital was urgently activated to restore service temporarily.

Figure: Different disaster recovery levels

(2) The second method is the underlying active-standby mechanism.

We have just discussed redundancy in the top-level design. Down at the level of the equipment room, rack, board, and cable, there are also active and standby designs, which can be called the underlying active-standby mechanism.

If you have ever been in an equipment room, you will have noticed that the shelves in the cabinets are filled with all kinds of boards, and that these boards basically come in pairs.

Figure: Front view of one vendor's 3G equipment

That is, there are usually two boards of each type.

The same is true for network cables and optical fibers. You will hardly ever see a single cable; they all come in pairs.

Figure: Front view of one vendor's 4G equipment

The reason is mutual backup. If one board breaks, the other continues working so that service is not affected. At the same time, the system raises an alarm to remind staff to replace the faulty board as soon as possible.

The same goes for the power supply. Every equipment cabinet in the equipment room must have at least two power inputs.

Figure: Multiple power inputs (each input is one red and one blue cable)

In addition to mains power, important equipment rooms are also fitted with emergency power equipment such as batteries, UPS units, and generators.

Figure: Battery packs in the equipment room

(3) Third, improve the management system and regulations.

Technology is never the only factor that affects network security and stability. The greatest threat to communication networks is actually people, not technology.

On this point, Xiaozaojun believes every communications engineer feels the same way.

In management processes and systems, and in engineering and technical specifications, we have learned countless painful lessons.

Why must upgrade plans be reviewed again and again? Why are engineering specifications so strict? Why build a spare parts warehouse? Why must cutover steps be double-checked or even triple-checked? Why must staff be on duty after a major operation? Why is the network frozen against changes during important holidays? ...

All of these are lessons that our predecessors learned the hard way.

Figure: Always stay in awe of network failures

In addition to internal management systems and process standards, the state has established increasingly strict laws and regulations to punish the frequent acts of deliberate sabotage against communication networks.

Acts such as digging up optical fibers during illegal construction, deliberately destroying base stations, and cutting optical fibers are all punishable by law.

Figure: A maliciously cut base station feeder cable

The deep-seated reasons behind communication failures

With a sound network architecture, complete active-standby mechanisms, and well-developed systems and specifications, why do so many failures still occur?

Next, let me talk about the deeper reasons.

The first reason, and probably the one everyone agrees with most, is the cutthroat competition ("involution") in the communication industry.

In recent years, vicious competition and winning bids with rock-bottom prices have become the norm. Equipment vendors and subcontractors must win orders while still preserving profits, so they can only cut costs desperately: product design costs, material costs, construction material costs and, above all, staff wages.

Continuous cost cutting inevitably affects product reliability and engineering quality. Wages that are too low drive away large numbers of experienced people. To get the work done, subcontractors can only recruit fresh graduates and send them to work on site after brief training (or even none).

These staff lack the necessary training and practice, and their professionalism and technical ability fall short, which has become a major risk.

It is not impossible that a very small number of unprofessional staff, pushed too hard, might even "delete the database and run".

In the past few years, to make sure that front-line employees' pay is not skimmed, some vendors have even signed contracts with subcontractors that set a floor on outsourced employees' income.

In addition to low price competition, another important factor affecting the security of network operation is the increasing technical complexity.

The more advanced the technology, the higher the complexity and the lower the reliability. As technology evolves, operators' networks grow ever larger and their topologies ever more complex, so the probability of problems rises sharply.

Communication networks show a very pronounced tidal effect. Traffic in busy hours is sometimes ten or even a hundred times that of idle hours. When an unexpected event (such as a disaster) occurs, traffic surges, and the difference is more likely a thousandfold.

Operators cannot possibly build in a thousandfold redundancy. Therefore, without a reasonable bypass design or threshold design, the probability of network congestion is extremely high. (Several major failures in recent years have led to signaling traffic congestion.)
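One simple form of threshold design is admission control: once the load in a time window reaches a configured limit, further requests are rejected quickly instead of being queued until the equipment collapses. The code below is a minimal sketch and not any operator's actual mechanism. The `AdmissionController` class, the limit of 100 requests per window, and the surge of 1,000 requests are all hypothetical.

```python
class AdmissionController:
    """Accept at most `limit` requests per time window and shed the rest."""

    def __init__(self, limit: int):
        self.limit = limit
        self.accepted_in_window = 0

    def new_window(self) -> None:
        """Reset the counter at the start of each time window."""
        self.accepted_in_window = 0

    def admit(self) -> bool:
        """Return True if the request may proceed, False if it is shed."""
        if self.accepted_in_window >= self.limit:
            return False
        self.accepted_in_window += 1
        return True


def simulate_window(controller: AdmissionController, offered: int):
    """Offer `offered` requests in one window; return (accepted, rejected)."""
    controller.new_window()
    accepted = sum(1 for _ in range(offered) if controller.admit())
    return accepted, offered - accepted


if __name__ == "__main__":
    controller = AdmissionController(limit=100)
    print("normal load:", simulate_window(controller, 40))
    print("surge load: ", simulate_window(controller, 1000))
```

The trade-off is deliberate: during a surge most requests are refused, but the equipment keeps serving the load it was sized for instead of failing completely.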

Operators' networks are now so complex that few people fully understand them. As time passes and staff move on, they become even less familiar.

Communication networks can seem like black magic, full of strange problems. Who would dare claim to have accurately accounted for every possibility?

The third potential security risk, and the one Xiaozaojun worries about most, is external cyberattacks, such as hackers, viruses, and system vulnerabilities.

Today, communication equipment is basically IP-based and cloud-based. Networks are becoming more and more open, and some are deployed directly on the public cloud. Physical isolation from the outside world keeps weakening, which makes networks more vulnerable to attack than before.

Today's attackers are far more skilled than before and use more varied methods, posing a great threat to the network.

Of course, operators and equipment vendors have also invested heavily in preventing cyberattacks.

All vendors now emphasize the concept of "security hardening". As the name implies, security hardening means closing system vulnerabilities to make the system more robust. Operators use third-party tools or hire third-party vendors to run security scans on live network equipment, find vulnerabilities, and then require equipment vendors to fix them.

Figure: All for security

This game, in which every advance in defense is met by a greater advance in attack, will continue for a long time.

However, Xiaozaojun personally believes that the defending side currently has serious problems in staff security awareness and technical capability. We will encounter more and more security incidents in the future.

I hope the relevant organizations and departments will not merely pay lip service to security, but will make a real effort to improve the quality of their staff and strengthen training. Otherwise, it will be too late to remedy the situation.

Final words

The KDDI failure in Japan was not the first, and it will certainly not be the last. Communication network failure is like a game of pass the parcel: nobody knows whether they will be next.

Vendors are now proposing to introduce AI and let artificial intelligence take over network operations in order to reduce the network failure rate. Some vendors are also building on network cloudification to carry out grayscale upgrades (that is, partial upgrades), which can greatly reduce network risk as well. These are all good trends.
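The reason a grayscale upgrade lowers risk is that a bad software version is caught while it is running on only a small part of the network. The sketch below illustrates the idea under stated assumptions: the function name, the 10% / 50% / 100% stages, and the `upgrade` and `is_healthy` callbacks are hypothetical placeholders for whatever a real operations system provides.

```python
def grayscale_upgrade(nodes, upgrade, is_healthy, stages=(0.1, 0.5, 1.0)):
    """Upgrade nodes in growing batches, stopping at the first unhealthy stage.

    Returns (success, upgraded_nodes). On failure, the caller is expected
    to roll back the nodes listed in `upgraded_nodes`.
    """
    upgraded = []
    for fraction in stages:
        # Cumulative number of nodes that should be upgraded after this stage.
        target = max(1, int(len(nodes) * fraction))
        for node in nodes[len(upgraded):target]:
            upgrade(node)
            upgraded.append(node)
        # Check everything upgraded so far before widening the rollout.
        if not all(is_healthy(node) for node in upgraded):
            return False, upgraded
    return True, upgraded


if __name__ == "__main__":
    nodes = [f"n{i}" for i in range(10)]
    ok, done = grayscale_upgrade(
        nodes,
        upgrade=lambda node: None,              # stand-in for the real upgrade
        is_healthy=lambda node: node != "n3",   # pretend n3 breaks after upgrade
    )
    print("success:", ok, "upgraded:", done)
```

In this example the fault on `n3` is detected in the second stage, so half of the nodes are never touched. An all-at-once upgrade would have exposed every node to the same faulty version.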

I feel that we still have a long way to go in the fight against communication network failures. The road ahead is long, and communications engineers will keep searching high and low.

Well, that's all for today's article.