Do you know what disaster tolerance is?
Whether it is 3G, 4G or 5G, each operator has established a
disaster recovery site for the communication business in the region. In
addition to ensuring the reliability of various types of data, the disaster
recovery site can also ensure the availability of services. This is the most
important significance of establishing a disaster recovery system.
You have no accident. If an earthquake or other natural
disaster occurs, will your account information in a certain bank be lost?
In fact, this does not require us to worry about it. Bank
user information is very important data. Therefore, before the bank is opened,
it must have a complete set of disaster tolerance solutions to solve such
problems, that is, establish a disaster tolerance system.
The so-called disaster tolerance system, literally
understood, refers to a system that tolerates disasters. With this system, even
if a disaster strikes, the user data in the bank can be safe. This disaster can
range from a natural disaster to a physical failure. In order to ensure data
security and business reliability, each bank will establish one or several
disaster recovery sites in multiple regions of the country to back up data with
each other, which is equivalent to creating several "copies".
When the primary site is damaged by a disaster, resulting in
permanent data loss, the "copy" of the disaster-tolerant site can be
used to achieve data recovery. If there is no data backup at the disaster
recovery site, it will cause immeasurable economic losses and bad social
impact.
Banks have disaster tolerance systems. Does communications
have disaster tolerance systems? The answer is yes. Each operator has developed
a complete and rigorous disaster recovery solution to protect the communication
data of all users, including your call balance!
Whether it is 3G, 4G or 5G, each operator has established a
disaster recovery site for the communication business in the region. In
addition to ensuring the reliability of various types of data, the disaster
recovery site can also ensure the availability of services. This is the most
important significance of establishing a disaster recovery system.
Take 5GC as an example. Similar to the financial system, 5GC
also adopts cross-regional disaster recovery deployment methods. Deploy data
centers (DCs, Data Centers) in two different regions. When the network
equipment in a DC is unavailable for some reason, another DC will quickly take
over the business to ensure the availability of 5G services. This is the dual
DC deployment. Dual DC deployment is a way to achieve deployment reliability.
In addition to dual DC deployment, what measures does the 5GC disaster recovery solution provide to achieve disaster recovery?
Pic
Deployment reliability
The 5GC disaster recovery solution supports dual DC
deployment, mutually exclusive deployment, NF (Network Function) domain
deployment, and network dual planes to achieve deployment reliability.
Mutually exclusive deployment refers to deploying virtual
machines on different physical machines to ensure that when a physical machine
is abnormal, other virtual machines can still provide services. In short, it is
"don't put eggs in one basket".
NF deployment by domain means that NF deployment adopts the
separation of management domain, service domain, and forwarding domain.
Network dual plane means that all logical network interfaces
of 5GC NF have at least two different physical network planes as backups for
each other. When one of the network planes fails, the other network plane can
take over all network traffic to ensure uninterrupted services.
Architecture reliability
The 5GC disaster tolerance solution supports load sharing
and stateless features to achieve architectural reliability.
Load sharing means that all running instances share the
processing business together. When some running instances go down abnormally,
the rest of the running instances share the processing business, so as to
ensure the normal operation of the business. Load sharing adopts the N+M
redundancy mode, that is, when N instances can meet the business processing of
the system capacity, M instances are provided for redundancy.
The above figure is a typical 3+1 redundancy mode. When 3
instances can meet the business processing of the system capacity, another
instance is provided for redundancy. When any one running instance fails, the
other three running instances continue to work, so as to ensure that the system
capacity and business processing are not affected.
Stateless refers to the stateless design of microservices.
3GPP defines UDSF (Unstructured Data Storage Function) for uniform storage of
NF state data (also known as unstructured data, such as mobile data, etc.).
The business logic APP can perform elasticity, expansion,
destruction, rebirth, and migration operations at any time. This is the
separation of computing and storage. Stateless design not only achieves
architecture reliability, but also guarantees data reliability.
Data reliability
5GC's NF is designed in a stateless manner. After the
current processing flow is completed, state data such as the user and session
context are stored in the UDSF, and the UDSF saves multiple copies of the data.
UDSF supports 1+1 redundancy and dual-DC deployment disaster recovery mode to
ensure data reliability.
Resource reliability
The 5GC disaster tolerance solution supports mechanisms such
as link detection and self-healing to achieve resource reliability.
Self-healing refers to: For business processing nodes that
continue to fail, the system will perform multi-level self-healing of the
nodes. According to the user's self-healing strategy configuration, the system
successively adopts the strategy of restarting the container, re-pushing the
container, restarting the virtual machine, and rebuilding the virtual machine
step by step for self-healing, so as to restore the business as soon as
possible.
Link detection means that the business node will
periodically send heartbeat keep-alive messages to the management node. If the
management node detects a node that has not sent a heartbeat for a long time,
it is judged as a faulty node, which triggers the business migration process
and migrates the business of the faulty node To other normal nodes, so as to
ensure the reliability of the business.
Well, having said that, you don't have to worry about your
balance of phone bills anymore, let alone the money stored in the bank! You
also know what disaster tolerance is.