How to troubleshoot 502 errors
When I had just started working, a colleague whose service called mine upstream once told me, "Your service is returning 502 errors, go take a look."
At the time, that service happened to have call logs that recorded the usual 200 and 4xx status codes. So I searched the service logs for "502" and found nothing, and told him, "There's no 502 in my service's logs, are you sure you have the right service?"
Looking back at that now, I'm a little embarrassed.
I suspect plenty of people have been in the same spot. In this article, let's talk about what a 502 error actually is.
Let's start with what a status code is.
HTTP status codes
The shopping and search sites we browse every day (your Taobaos and Baidus) are, at the surface, front-end web pages.
Generally speaking, the front end doesn't store much data itself; most of the time it needs to fetch data from a back-end server.
So the front end and back end establish a connection via the TCP protocol, and then transmit data on top of TCP.
TCP is a byte-stream protocol: it adds no boundaries between messages when transmitting data, so using bare TCP directly for application data runs into the "sticky packet" problem, where the receiver can't tell where one message ends and the next begins.
That's why a dedicated protocol format is needed to delimit and parse the data, and the HTTP protocol was designed on top of TCP for exactly this. For more detail, see my earlier article, "Since there is HTTP protocol, why should there be RPC".
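To make the "sticky packet" point concrete, here is a minimal Go sketch (the addresses and payloads are made up for the demo): the sender issues two separate writes, but the receiver just sees one continuous byte stream with no boundary between them.

```go
package main

import (
	"fmt"
	"io"
	"net"
)

// stickyDemo shows TCP's lack of message boundaries: the sender makes two
// Write calls, but the receiver just reads one undifferentiated byte stream.
func stickyDemo() (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer ln.Close()

	go func() {
		c, _ := net.Dial("tcp", ln.Addr().String())
		c.Write([]byte("hello")) // "message" one
		c.Write([]byte("world")) // "message" two
		c.Close()
	}()

	conn, err := ln.Accept()
	if err != nil {
		return "", err
	}
	defer conn.Close()

	buf := make([]byte, 10)
	if _, err := io.ReadFull(conn, buf); err != nil { // reads straight across both writes
		return "", err
	}
	return string(buf), nil
}

func main() {
	s, _ := stickyDemo()
	fmt.Println(s) // helloworld
}
```

The receiver gets "helloworld" in one buffer; nothing marks where the first write ended, which is why a framing protocol like HTTP is needed on top.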
For example, if I want to see a product's details, the front end sends an HTTP request carrying the product's id, and the back end returns an HTTP response carrying the product's price, store name, shipping address, and so on.
Get product details by id
So while on the surface we're just scrolling through web pages, behind the scenes a great many HTTP messages are being sent and received.
Users browse products online
But here's the problem: everything above describes the normal case. What about abnormal ones? For example, what if the data the front end sends isn't a product id at all but an image? The back-end server has no way to give a normal response to that. So a set of HTTP status codes is needed to indicate whether the HTTP request/response cycle went normally, and through them the browser's behavior can be influenced as well.
For example, if everything is normal, the server returns a 200 status code, and after receiving it the front end can use the response data with confidence. If the server finds that what the client sent is invalid, it responds with a 4xx status code, meaning this is a client error; the xx is subdivided by error type. For instance, 401 means the client lacks authorization, and 404 means the client requested a page that doesn't exist at all. Conversely, if the problem lies with the server, a 5xx status code is returned.
Difference between 4xx and 5xx
But here comes the problem.
If the problem on the server is serious, the process may crash outright, so how could it return a status code to you at all?
Right, in that case the server cannot possibly return a status code to the client. So in general, a 5xx status code is often not actually returned to the client by the server itself.
It is returned by a gateway, most commonly nginx.
The role of nginx
Returning to front-end/back-end data exchange: when there are few users, the back end handles requests with ease. But as users grow, the back-end server is limited by its resources, and CPU or memory can run seriously short. The fix is simple: stand up several more identical servers, so that front-end requests can be distributed evenly across them to increase processing capacity.
But to achieve this, the front end would have to know which servers the back end has and establish a TCP connection with each one of them.
Establish connections between the front end and multiple servers
It's not impossible, but it's a hassle.
It would be much nicer to have an intermediate layer between them: the client only needs to connect to the middle layer, and the middle layer establishes connections with the servers.
This middle layer thus becomes an agent for those servers. The client takes everything to the agent, just sending its own request, and the agent finds some server to complete the response. Throughout the process, the client only knows that its request was handled by the agent; which server the agent actually used, the client neither knows nor needs to know.
A proxy like this, which hides the specific servers, is the so-called reverse proxy.
reverse proxy
Conversely, a proxy that hides which specific clients are behind it is the so-called forward proxy.
This middle-layer role is generally played by a gateway such as nginx.
In addition, since the servers behind it may have different specs, some with 4 cores and 8 GB of RAM, some with 2 cores and 4 GB, nginx can assign them different access weights and forward more requests to the higher-weighted machines, implementing different load-balancing strategies this way.
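As a sketch, a weighted setup in nginx's configuration looks roughly like this (the upstream name, IPs, ports, and weights are all made up for illustration):

```nginx
http {
    # Two hypothetical backend servers; the beefier box gets more traffic.
    upstream backend {
        server 10.0.0.1:8080 weight=2;  # 2-core 4G machine
        server 10.0.0.2:8080 weight=5;  # 4-core 8G machine
    }

    server {
        listen 80;
        location / {
            proxy_pass http://backend;  # nginx picks a backend per the weights
        }
    }
}
```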
nginx returns 5xx status code
With nginx as a middle layer, the client goes from connecting directly to the server to connecting directly to nginx, with nginx connecting directly to the server: from one TCP connection to two.
So when the server misbehaves, the TCP connection from nginx to the server fails to get a normal response. Once nginx learns this, it returns a 5xx error code to the client. In other words, the 5xx error is actually recognized by nginx and returned to the client; the server itself has no 5xx log entries at all. That's exactly the scene at the start of this article: the upstream received a 502 from my service, but I couldn't find any such record in my service's logs.
Common causes of 502
The official explanation of the 502 error code in RFC 7231 is:
502 Bad Gateway
The 502 (Bad Gateway) status code indicates that the server, while acting as a gateway or proxy, received an invalid response from an inbound server it accessed while attempting to fulfill the request.
Read that again: does it actually explain anything?
For most programmers, it not only fails to explain the problem, it raises more question marks. For example, what exactly does the "invalid response" above refer to?
Let me explain. It means the 502 is actually issued by the gateway proxy (nginx): the gateway forwards the client's request to the server, but the server produces an invalid response, and the invalid response here generally refers to a TCP RST segment, or a FIN from the four-way close handshake.
Everyone is presumably familiar with the four-way close, so let's skip it and focus on what an RST segment is.
What is RST?
We all know TCP normally disconnects with a four-way handshake; that's the graceful way in normal times.
But in abnormal situations, the sender or receiver may not be functioning properly, and even that handshake may be impossible, so a mechanism is needed to forcibly close the connection.
RST is that mechanism, generally used for abnormal connection teardown. It is a flag bit in the TCP header; on receiving a segment with this flag set, a host closes the connection. The side that receives the RST will then see a "connection reset" or "connection refused" error at the application layer.
TCP header RST bit
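A small Go sketch can show an RST in action. One way to force one (there are others) is to set SO_LINGER to 0 before closing, which makes the kernel abort the connection with an RST instead of the graceful FIN exchange; the peer's next read then fails.

```go
package main

import (
	"fmt"
	"net"
)

// demoRST starts a TCP listener that aborts the accepted connection by
// setting SO_LINGER to 0 before Close, which makes the kernel send an RST
// instead of a FIN. The client's subsequent read then returns an error
// (typically "connection reset by peer").
func demoRST() error {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	defer ln.Close()

	go func() {
		c, _ := ln.Accept()
		tc := c.(*net.TCPConn)
		tc.SetLinger(0) // abort: RST, not the normal four-way close
		tc.Close()
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return err
	}
	defer conn.Close()

	buf := make([]byte, 1)
	_, err = conn.Read(buf) // the RST surfaces here as a read error
	return err
}

func main() {
	fmt.Println(demoRST())
}
```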
There are generally two common reasons for sending RST packets.
Server disconnected prematurely
There is a TCP connection between nginx and the server. When nginx forwards the client's request to the server, the two sides logically keep that connection open until the server returns a result normally, after which it can be released.
But if the server drops the connection early while nginx is still sending on it, nginx will receive an RST segment from the server's kernel, or a FIN from the four-way close, forcibly ending the connection on nginx's side.
There are two common reasons for premature disconnection.
The first is that the timeout configured on the server is too short. Whatever the programming language, there is generally a ready-made HTTP library, and the server side usually exposes several timeout parameters. For example, golang's HTTP service framework has a write timeout (WriteTimeout). If it is set to 2s, it means that after receiving a request, the server must finish processing and write the result to the response within 2s; if it can't make it in time, the connection is dropped.
Say your handler takes 5s and WriteTimeout is only 2s: before the response gets written, the HTTP framework actively tears down the connection. nginx may then receive a FIN from the four-way close (some frameworks may send an RST instead) and disconnect, so the client ends up with a 502 error.
If you run into this problem, just increase WriteTimeout.
The relationship between FIN and 502
The second reason, and the most common cause of the 502 status code, is that the server application process has crashed.
A crashed server means no process is listening on the server port anymore. When you then try to send data to that non-existent listener, the server's Linux kernel network stack responds with an RST segment. Likewise, nginx will then hand the client a 502.
RST and 502
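This kernel-responds-with-RST behavior is easy to reproduce in Go (the port is grabbed dynamically and then freed, purely so the demo has a port that is guaranteed to have no listener): dialing it fails with a "connection refused" style error.

```go
package main

import (
	"fmt"
	"net"
)

// dialDeadPort reserves a free port, closes the listener so nothing is
// listening on it anymore, then dials it. The kernel answers the SYN with
// an RST, which the caller sees as a connection error
// (typically "connect: connection refused").
func dialDeadPort() error {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	addr := ln.Addr().String()
	ln.Close() // now no process listens on this port

	_, err = net.Dial("tcp", addr)
	return err
}

func main() {
	fmt.Println(dialDeadPort())
}
```

This is the same thing that happens to nginx when the backend process has died: its connection attempt to the dead port gets an RST, and it translates that into a 502 for the client.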
During development, this situation is the most common.
Nowadays most deployments automatically restart a service when it dies, so we need to determine whether the service has ever crashed.
If you have CPU or memory monitoring for the server, check whether the monitoring graphs show a cliff-like sudden drop. If they do, nine times out of ten your server-side application crashed at that moment.
CPU usage drops off a cliff
In addition, you can also use the following command to see when the process was last started.
ps -o lstart {pid}
For example, if the process id I want to inspect is 13515, the command looks like this.
# ps -o lstart 13515
STARTED
Wed Aug 31 14:28:53 2022
As you can see, its last start time was August 31st. If that time differs from the deployment time you remember, it means the process probably crashed and was restarted at some point.
When you run into this, the most important thing is to find the cause of the crash. The causes are many and varied: writing through an uninitialized memory address, an out-of-bounds memory access (the array arr clearly has length 2, but the code reads arr[3]), and so on.
Cases like these are almost always code-logic bugs, and a crash usually leaves a stack trace behind. You can troubleshoot from the stack trace, and once the bug is fixed you're done. For example, the picture below shows a golang error stack; other languages look similar.
error stack
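As an illustration, here is a minimal Go sketch of the arr[3] bug described above. The recover is only there so the sketch can print the stack and keep running; an unrecovered panic would kill the process, which is precisely the crash-then-502 scenario.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// readThird reads past the end of a 2-element slice, triggering an
// "index out of range" panic -- the kind of code-logic bug that crashes
// a server process and leaves a stack trace in its logs.
func readThird() (val int, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered from: %v", r)
			debug.PrintStack() // in a real crash, this stack ends up in the log
		}
	}()
	arr := []int{1, 2}
	i := 3
	return arr[i], nil // out of bounds: len(arr) is only 2
}

func main() {
	_, err := readThird()
	fmt.Println(err)
}
```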
When no stack trace is printed
But in some cases, no stack trace is left at all.
For example, a memory leak makes the process occupy more and more memory until it exceeds the machine's memory limit, triggering OOM (out of memory), and the operating system kills the process outright.
More subtle still, the code logic itself may quietly exit the process. For example, golang's log package has a method called log.Fatalln(): after printing the log line, it also calls os.Exit() and terminates the process directly. Newcomers who haven't read the source can easily trip over this.
Printing a log, then exiting the process as a side effect
If you've confirmed that your service has never crashed, keep reading.
The gateway sent the request to a non-existent IP
nginx is configured with the multiple servers it proxies, and this configuration generally lives in /etc/nginx/nginx.conf.
Open it and you may see something like the following.
upstream xiaobaidebug.top {
server 10.14.12.19:9235 weight=2;
server 10.14.16.13:8145 weight=5;
server 10.14.12.133:9702 weight=8;
server 10.14.11.15:7035 weight=10;
}
The configuration above means: when a client accesses the xiaobaidebug.top domain name, nginx forwards the client's request to one of the four server IPs below. Each IP carries a weight: the higher the weight, the more often requests are forwarded to it.
As you can see, nginx has quite rich configuration capabilities. But note that these files are maintained by hand. That's fine when there are few servers and things don't change much.
But this is the cloud-native era. Many companies have their own cloud platforms, and services naturally go to the cloud. Generally, each release may deploy the service onto a new machine, and its IP changes with it. Manually editing nginx's configuration on every release is clearly unrealistic.
It would be much simpler if, on startup, a service could actively tell nginx its own IP, and nginx then generated the configuration and reloaded itself.
To implement this kind of service registration, many companies do secondary development on top of nginx.
But if the service-registration feature itself goes wrong, for example, after a release the new instance fails to register while the old instance has already been destroyed, nginx will keep sending requests to the old instance's IP. Since the machine hosting the old service no longer runs it, the server's kernel responds with an RST, and nginx, upon receiving the RST, replies 502 to the client.
The instance was destroyed, but its IP was never removed from the config
This problem is not hard to troubleshoot.
Check whether nginx is printing relevant logs, and whether the IP and port it forwards to match expectations.
If they don't, go find the colleagues who maintain this infrastructure component and have a friendly exchange.
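For example, you can grep nginx's access log for 502 responses. The path and log line below follow common defaults, and the sample file is fabricated purely for illustration; on a real machine you would grep /var/log/nginx/access.log directly.

```shell
# Create a small sample in a typical nginx access-log format; on a real
# machine, grep /var/log/nginx/access.log instead of this sample file.
cat > access.log.sample <<'EOF'
10.0.0.5 - - [31/Aug/2022:14:28:53 +0800] "GET /api/item?id=42 HTTP/1.1" 200 512
10.0.0.6 - - [31/Aug/2022:14:28:54 +0800] "GET /api/item?id=43 HTTP/1.1" 502 157
EOF

# Find 502 responses. nginx's error.log will additionally name the exact
# upstream address that failed, which tells you where requests were sent.
grep ' 502 ' access.log.sample
```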
Summary
HTTP status codes indicate the outcome of a response: 200 is a normal response, 4xx means a client error, and 5xx means a server error.
Putting nginx between the client and the server provides reverse proxying and load balancing: the client only requests data from nginx and doesn't care which server actually handles the request.
If the back-end server application crashes, nginx receives an RST segment when it tries to reach the server and then returns a 502 error to the client. The 502 is issued by nginx, not by the server application, so when a 502 occurs, the back-end server is likely to have no related log entries; look for the 502 in nginx's logs instead.
When you see a 502, first check monitoring to determine whether the server application crashed and restarted. If it did, look for the crash stack trace in the logs; if there is none, consider whether OOM or some other code path caused the process to exit on its own. If the process never crashed, check nginx's logs to see whether requests were forwarded to a stale IP and port.