Let's learn the life-saving principles for handling major operations and maintenance incidents

2024.04.29

What are the right measures to take when you run into a major operations and maintenance incident? This is important to understand, because choosing the wrong strategy can cost you your job. A few days ago I was chatting with a DBA who went through a very well-known failure ten years ago, and I inevitably ended up asking about that incident. I prefer hearing people talk about lessons rather than success stories: successes tend to look alike, while lessons are something money can't buy. Revisiting a painful lesson may be hard on those involved, but that kind of review is often where the real value is distilled.

After walking through the incident, he said that the team's biggest mistake at the time was stopping the third-party replication equipment on the vendor's recommendation. With a business peak under way and the equipment's performance failing, many factors were uncertain, and the team knew very little about the characteristics of the third-party equipment. That kind of operation should not have been attempted then. Instead, they should first have throttled business traffic to keep the system running, and carried out high-risk operations only after the business offices had closed, not during peak hours. Had they done so, the accident might have been avoided.

The issue he raised is the first principle I want to discuss today: among the available response strategies, choose the simplest and least risky one first; among the responsibilities you might end up bearing, choose the smallest. For example, if system performance has dropped significantly but is still within what the business can tolerate and shows no sign of getting worse, we can choose to accept responsibility for that performance failure. If we don't want to accept it and must fix the problem quickly, we should stick to optimizations and adjustments that are within our own capabilities. If the fault is beyond our ability, it is better to bear this smaller responsibility than to risk a mistake and end up bearing a much larger one.

In practice, this principle is not easy to understand or to follow. What we often see instead are small operations and maintenance failures that turn into very large ones because they were handled badly. For example, when a node in an Oracle RAC cluster fails and goes down, what should we do? Most people would probably restart it, while some would wait, watch, and do nothing.

In fact, if this is a heavily loaded core business system, we should first check the logs of the surviving nodes for anomalies and for any risk that they will go down too. Then look at the surviving nodes' active session count, total session count, load, wait events, and so on to see whether they are at risk. If they are, stabilize the system first by killing sessions. Once everything is stable, analyze the cause of the outage and assess the risk of restarting the failed instance.

If you cannot judge the risk and it happens to be a business peak, you can leave the failed node down for the time being and deal with it after the peak has passed. The worst thing you can do is restart the failed node shortly after a RAC failover, before the business has stabilized. Tragic examples of this abound.
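The triage order described above can be sketched as a small decision function. This is only an illustration of the ordering (stabilize the survivors first, then decide about the failed node), not a monitoring tool: the class and function names, the input fields, and all thresholds are hypothetical, and real limits depend on the system's own baseline.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    STABILIZE_SURVIVORS = "kill problem sessions on the surviving nodes first"
    DEFER_RESTART = "leave the failed node down until the business peak has passed"
    ANALYZE_AND_ASSESS_RESTART = "analyze the outage and assess the restart risk"


@dataclass
class SurvivorStatus:
    """Snapshot of one surviving node, filled in by hand from logs and monitoring.

    All fields are hypothetical inputs for this sketch.
    """
    log_errors: bool             # abnormal entries found in the node's logs
    active_sessions: int
    total_sessions: int
    host_load: float             # e.g. load average divided by CPU count
    abnormal_wait_events: bool   # unusual wait events dominate


# Hypothetical thresholds; tune them to the system's normal behavior.
MAX_ACTIVE_SESSIONS = 200
MAX_TOTAL_SESSIONS = 3000
MAX_HOST_LOAD = 0.8


def survivor_at_risk(s: SurvivorStatus) -> bool:
    """Return True if any indicator suggests this surviving node may also fail."""
    return (
        s.log_errors
        or s.active_sessions > MAX_ACTIVE_SESSIONS
        or s.total_sessions > MAX_TOTAL_SESSIONS
        or s.host_load > MAX_HOST_LOAD
        or s.abnormal_wait_events
    )


def next_action(survivors, in_business_peak, restart_risk_understood):
    """Pick the least risky next step after a RAC node failure."""
    # Step 1: protect what is still running.
    if any(survivor_at_risk(s) for s in survivors):
        return Action.STABILIZE_SURVIVORS
    # Step 2: if the risk cannot be judged during a peak, do not touch the failed node.
    if in_business_peak and not restart_risk_understood:
        return Action.DEFER_RESTART
    # Step 3: only now analyze the cause and weigh a restart.
    return Action.ANALYZE_AND_ASSESS_RESTART
```

For example, with one healthy surviving node during a business peak and an unclear restart risk, `next_action` returns `Action.DEFER_RESTART`. Restarting never appears as a first step, which is the point of the principle.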

The second principle is not to assume that everything is under your control. There is a great deal in a data center that a DBA does not understand, so leave yourself a margin when weighing your options. Don't simply choose the solution that looks best.

About fifteen years ago, a company's data center lost power in both of its computer rooms. The data center had two power feeds, but both feeds from the power company failed at the same time. The root cause was a cost-saving choice in the dual-feed design made when the data center was built: although the two feeds came from two different 220 kV substations, both substations were fed by the same upstream substation. When the upstream substation failed, both feeds were lost, and the power company could not give a clear timeframe for repairs.

While this was being handled, I spoke with their IT director by phone to discuss strategy. My proposal was to shut down the core business systems and storage first and leave the peripheral systems running. My reasoning was that it was midsummer: if power did not come back for three or four hours, the UPS might hold out, but the computer room would get too hot and the core systems would go down anyway, causing several hours of downtime. The IT director disagreed. He believed that if the peripheral systems were shut down, power was restored within eight hours, and his UPS held up, the core systems would be saved, which would be a great achievement. As for the temperature in the computer room, he immediately found an ice-making company and had ice delivered to the room to cool it down.

In the end, the temperature and humidity in the computer room exceeded their limits, and the core storage system suffered damage and shut itself down under its automatic protection. A large number of bad blocks appeared in the core system's database, the storage for the ADG standby also failed, and the tapes in the tape library were damaged and could not be restored from. Finally, we used BBED to help him force the database open, exported the data, rebuilt the database, and filled in the lost data. It took two days for the core system to resume internal services and one week to resume the external order lookup service, which did great damage to the company's reputation.
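The trade-off in this story, where the UPS could outlast the outage but the room could not stay cool, can be illustrated with a rough calculation. The sketch below is only an illustration: the linear temperature-rise model, the function names, and every number in the usage note are assumptions, not figures from the incident.

```python
def hours_until_thermal_limit(start_c: float, limit_c: float,
                              rise_c_per_hour: float) -> float:
    """Hours until the room reaches the equipment's temperature limit.

    Assumes a constant rate of temperature rise once cooling is lost,
    which is a simplification made for this sketch.
    """
    if rise_c_per_hour <= 0:
        return float("inf")  # room is not heating up
    return max(0.0, (limit_c - start_c) / rise_c_per_hour)


def safe_runtime_hours(ups_hours: float, start_c: float, limit_c: float,
                       rise_c_per_hour: float) -> float:
    """Usable runtime is bounded by whichever runs out first: battery or cooling."""
    thermal_hours = hours_until_thermal_limit(start_c, limit_c, rise_c_per_hour)
    return min(ups_hours, thermal_hours)
```

With made-up inputs such as `safe_runtime_hours(ups_hours=8, start_c=24, limit_c=40, rise_c_per_hour=6)`, the result is roughly 2.7 hours, far short of the 8-hour UPS figure. Under these assumptions the binding constraint is heat, not battery, which is why a plan built only around UPS capacity leaves no margin.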

When a particularly serious operations and maintenance failure occurs, DBAs should choose their measures based on their own capabilities, and first consider the methods that carry less risk and less potential harm and that they are better at handling. This is an important principle for a DBA's survival: once an incident like this turns into a major failure, someone has to take the blame, and the DBA is the most convenient scapegoat.