Online Fault Review | The RPC Thread Pool is Full, 1024 Threads Are Not Enough?

2024.10.03

1. Fault Background

Last night, just after I got home and opened the company chat group, I saw people discussing a problem: a large number of RPC requests were failing in production, and the exceptions were caused by thread pool rejection. Although the volume of exceptions was large, the affected service was not a core service; it sits on a bypass path of the system and serves a data verification task, so even with many exceptions there was no real business impact.

It turned out that someone had been backfilling data in production, which generated a large volume of binlogs. Requests to the data verification task rose sharply, filling up the thread pool. Since this was not my responsibility and I was not familiar with that part of the business, I did not pay much attention at the time.

The next day, thinking it over, I felt there were too many loose ends. The reasoning was too simple, the chain of evidence was thin, and the final conclusion was not solid. The real root cause might be something else entirely.

1.1 Doubts

  • The request volume increased significantly. What was it before and after the increase?
  • The thread pool filled up. What are its initial and maximum sizes? How long is its queue?
  • What is the thread pool's rejection policy?
  • Which interfaces were affected, and how did their latency fluctuate?
  • What were the service's CPU load and GC behavior?
  • Was the thread pool full simply because of the surge in request volume?

With these questions in mind, I could not wait to open the various monitoring dashboards as soon as I got to the office the next day and start digging. In the end, I did find the real root cause.

Because the company's monitoring pages are watermarked, I can only describe the conclusions and cannot post screenshots.

2. Troubleshooting Process

2.1 Fluctuation of request volume

  • Single-machine RPC QPS rose from 300 to 450.
  • Kafka message QPS held at about 50, with no significant fluctuation.
  • There were no other request entry points and no scheduled tasks.

Can this even be called a sharp increase? Can an extra 150 requests per second really blow up a thread pool? That explanation was just fooling the boss... At this point I was convinced the failure had a different root cause.

2.2 RPC thread pool configuration and monitoring

Not every port on the machine was overwhelmed; only RPC port 8001 was. So I looked specifically at the thread pool configuration of port 8001 (a rough JDK equivalent is sketched after the list).

  • Initial number of threads: 10
  • Maximum number of threads: 1024 (too large; the configuration is a bit arbitrary)
  • Queue length: 0
  • Rejection policy: throw an exception and reject immediately
  • From 20:11 to 20:13, the thread count climbed linearly from the initial 10 to 1024
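For readers less familiar with these knobs, here is a minimal JDK-level sketch of an equivalent pool. The real RPC framework builds its own pool, and the keep-alive value is my assumption:

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class RpcPoolSketch {
    // Core 10, max 1024, queue length 0, reject immediately with an exception.
    static ThreadPoolExecutor buildPool() {
        return new ThreadPoolExecutor(
                10,                                    // initial (core) threads
                1024,                                  // maximum threads
                60L, TimeUnit.SECONDS,                 // idle keep-alive (assumed value)
                new SynchronousQueue<>(),              // zero-capacity hand-off: queue length 0
                new ThreadPoolExecutor.AbortPolicy()); // rejects with RejectedExecutionException
    }
}
```

With a zero-capacity hand-off queue and an abort policy, every request must immediately grab a free or newly created thread; once all 1024 threads are busy, new requests are rejected on the spot.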

2.3 Thinking

Does 450 QPS really require 1024 threads? In my experience, as long as the interface latency stays under 100 ms, there is no way it needs that many threads. Something was very strange.
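A quick sanity check with Little's law makes the point, assuming each in-flight request holds exactly one worker thread for its full duration (λ is the request rate, W the per-request latency):

```latex
N \approx \lambda \cdot W = 450\ \mathrm{req/s} \times 0.1\ \mathrm{s} = 45\ \text{threads}
```

Even at a generous 100 ms per request, 450 QPS should need only around 45 concurrent threads, an order of magnitude below 1024.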

2.4 Interface time fluctuation

The average interface latency jumped from 5.7 ms to about 17,000 ms (17 seconds).

The latency increase was enormous. When I spoke with the owning team later, they had also looked at the latency monitoring. Their view was that the average latency was so high because RPC requests were queuing: the queuing inflated the processing time, so the average latency on the monitoring shot up.

That was a misunderstanding, and it contains two mistakes.

  • The thread pool of this RPC interface has a queue length of 0 and a rejection policy that throws an exception. When no thread is available, a request is rejected immediately; it never queues, so there is no queue-waiting time at all.
  • The company's monitoring is split into server-side and caller-side metrics. Server-side latency does not include connection handling or time spent waiting for the RPC thread pool; it covers only the time the RPC thread actually spends processing the request. Caller-side latency includes RPC network time, connection time, queuing time, business-logic processing time, server-side GC time, and so on.

Because they attributed the latency spike to request queuing, they overlooked the crucial clue: the actual processing phase of the interface had degraded badly and throughput had collapsed, which is why the thread pool kept growing until it was full.

Next, I started analyzing what could be causing the interface performance to deteriorate:

  • Was the CPU maxed out, degrading interface performance?
  • Was frequent GC hurting interface performance?
  • Had the latency of downstream RPC calls risen sharply?
  • Were SQL calls slow?
  • Were Redis calls slow?
  • Were other external calls slow?

2.5 Latency of other calls

I quickly went through the latency of every possible external call and found no significant fluctuation. I also checked machine load: neither CPU nor network load was high. Clearly the root cause was not in any of these directions.

  • CPU load was extremely low. During the incident, cpu.busy was at 15%, lower than the midday peak, so high CPU load was clearly not the root cause.
  • GC was healthy: no Full GC, and young GC ran about twice a minute (frequent young GC would drive up CPU load and degrade interface performance).
  • No noticeable fluctuation in downstream RPC latency. I checked the latency monitoring for every RPC interface this service calls; none showed any obvious change.
  • No noticeable fluctuation in SQL latency.
  • No noticeable fluctuation in Redis latency.
  • No noticeable fluctuation in calls to other downstream systems (Tair, ES, etc.).

2.6 Reading the code

Why didn't I look at the code first? Because this code was not my responsibility and I was not familiar with it.

When I finally opened the code, I was disgusted. It is extremely complicated: many branches, deeply nested, with long, smelly methods. It could be called the Mount Everest of the code shit mountain; one more glance and I would vomit. The interface has nearly 10 internal branches, and each branch method is a huge block of code.

The interface is an SPI interface defined by the upstream BCP verification system. It is an aggregate interface, not a single-responsibility one. After staring at it for 10 minutes I still could not find the root cause, so I changed direction and went to look at the exception traces.

2.7 Found key clues from the abnormal trace

My company's infrastructure is actually quite strong. The exception trace records the processing time of every stage, including every external call: SQL, RPC, Redis, and so on.

The trace confirmed that the problem really was in the internal processing: it showed the system pausing for more than one second between two SQL requests, and I had no idea what was executing during that second. The monitoring showed both SQL statements ran very fast, so SQL was not the problem.

The machine had no Full GC either. So what was going on?

As mentioned earlier, the faulty interface is an aggregate interface. I did not know which branch had the problem, but the exception trace pointed to the specific branch.

I started digging into that branch method... but after searching for a long time, I still could not find the cause...

2.8 Just when there seems to be no way out, a new village appears

That pile of shitty code genuinely made me sick. I meditated quietly for a minute before recovering.

  • What could block a thread when there are no external calls?
  • Is there a lock somewhere? A synchronized keyword?

So I searched for the synchronized keyword but found nothing; there was essentially no locking anywhere in the code.

It was almost noon and I was starving, just about to give up. Then, on one last quick look, I spotted a Guava rate limiter among the class field declarations.

Heart pounding, hands trembling.

```java
private static final RateLimiter RATE_LIMITER = RateLimiter.create(10, 20, TimeUnit.SECONDS);
```

A rate limiter allowing 10 permits per second, with a 20-second warmup period; acquire() blocks the calling thread until a permit is available.
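A minimal, self-contained sketch (assuming only a Guava dependency; the loop size is arbitrary) of how a limiter with this exact shape stalls its callers:

```java
import com.google.common.util.concurrent.RateLimiter;
import java.util.concurrent.TimeUnit;

public class RateLimiterBlockDemo {
    // Same shape as the limiter found in the code: 10 permits/s, 20-second warmup.
    private static final RateLimiter RATE_LIMITER = RateLimiter.create(10, 20, TimeUnit.SECONDS);

    public static void main(String[] args) {
        long start = System.nanoTime();
        for (int i = 0; i < 50; i++) {
            RATE_LIMITER.acquire(); // blocks the calling thread until a permit is available
        }
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        // Expect several seconds: during warmup the effective rate is even lower than 10/s,
        // so a burst of callers piles up exactly like the RPC worker threads did.
        System.out.println("50 acquires took " + elapsedMs + " ms");
    }
}
```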

I immediately checked where the rate limiter was used: it was exactly the spot where the exception trace showed the thread blocking.

A knowing smile appeared at the corner of my mouth.

Case solved. There is only one truth.

3. Conclusion

The threshold of the Guava rate limiter was far too low: at most 10 requests per second. Once concurrency exceeded that threshold, threads blocked on the limiter in large numbers, and the RPC thread pool kept creating new threads to handle incoming requests until it hit its maximum. Once the pool was at maximum capacity, it could no longer accept new requests, so subsequent requests were rejected by the thread pool in large numbers.

So I created a group chat, pulled in the relevant colleagues and their managers, and posted the screenshots and conclusions.

Since it was not an urgent issue, I happily went off to lunch. What came next was on them: optimizing the code.
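I don't know exactly what fix they shipped, but one plausible direction is to stop parking RPC worker threads on the limiter: probe it with tryAcquire and fail fast (or degrade) instead of calling the blocking acquire(), and raise the rate to something realistic. A minimal sketch, with an illustrative rate value:

```java
import com.google.common.util.concurrent.RateLimiter;

public class FailFastLimiting {
    // Illustrative rate; the right value depends on what the downstream can absorb.
    private static final RateLimiter RATE_LIMITER = RateLimiter.create(200);

    // Returns false instead of blocking the RPC worker thread when no permit is available.
    static boolean tryHandle(Runnable verification) {
        if (!RATE_LIMITER.tryAcquire()) { // non-blocking probe
            return false;                 // caller can degrade, retry later, or drop the task
        }
        verification.run();
        return true;
    }

    public static void main(String[] args) {
        System.out.println("handled=" + tryHandle(() -> System.out.println("verified")));
    }
}
```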

Don't panic when problems arise, and don't just watch from the sidelines munching melon seeds. Take action: this is your Conan moment.