Which network IO model should an RPC design use?

The benefit of zero copy is that it avoids unnecessary CPU copies, frees the CPU to do other work, and reduces context switching between user space and kernel space, thereby improving network communication efficiency and the overall performance of the application.

What role does network communication play in RPC calls?  RPC is a way of solving inter-process communication: an RPC call is essentially an exchange of network messages between a service consumer and a service provider.  The caller sends a request message over network IO; the provider receives and parses it, executes the relevant business logic, and sends back a response message; the caller then receives and parses the response and handles the result. Logically, that completes one RPC call.  Network communication is therefore the foundation of the entire RPC process.

1 Common network I/O model

Network communication between two machines comes down to how each machine performs network IO.

The common models are synchronous blocking IO (BIO), synchronous non-blocking IO (NIO), IO multiplexing, and asynchronous non-blocking IO (AIO).  Only AIO is truly asynchronous; the others are all synchronous.

1.1 Synchronous blocking I/O (BIO)

By default, all sockets in Linux are blocking.

After the application process initiates an IO system call, it blocks and control transfers to kernel space.  The kernel waits for the data to arrive, then copies it from the kernel into user memory, and returns to the process once the whole IO operation completes.  Finally, the application process unblocks and resumes its business logic.

The system kernel processes IO operations in two stages:

  • Waiting for data: the kernel waits for the network card to receive the data, then writes it into a kernel buffer
  • Copying data: once the kernel has the data, it copies it into the user process's address space

During both stages, the thread performing the IO operation stays blocked. In Java multi-threaded development, this means each IO operation occupies a thread until it finishes.

After the user thread initiates the read call, it blocks and gives up the CPU.  The kernel waits for the network card data to arrive, copies the data from the network card into kernel space, then copies it into user space, and finally wakes up the user thread.
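The two stages above can be sketched with plain Java sockets. This is a minimal, self-contained illustration (class name, port choice, and message are arbitrary): the reading thread blocks inside `read()` through both kernel stages, exactly as described.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Sketch of synchronous blocking IO (BIO): the server thread blocks in
// accept() and then in read() until the kernel has waited for the data
// and copied it into user space.
public class BioSketch {
    public static String demo() throws Exception {
        try (ServerSocket server = new ServerSocket(0)) { // 0 = any free port
            // A client on another thread keeps the demo self-contained.
            Thread client = new Thread(() -> {
                try (Socket s = new Socket("localhost", server.getLocalPort())) {
                    s.getOutputStream().write("hello".getBytes(StandardCharsets.UTF_8));
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            client.start();

            try (Socket conn = server.accept()) {        // blocks until a client connects
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                InputStream in = conn.getInputStream();
                byte[] buf = new byte[16];
                int n;
                while ((n = in.read(buf)) != -1) {       // read() blocks through both kernel stages
                    out.write(buf, 0, n);
                }
                client.join();
                return out.toString("UTF-8");
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo()); // prints "hello"
    }
}
```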


1.2 IO multiplexing

This is one of the most widely used IO models in high-concurrency scenarios; Java's NIO, Redis, and Nginx are all implemented on top of it. The name breaks down as:

  • Multi-channel: multiple channels, i.e. the IO of multiple network connections
  • Multiplexing: those channels are multiplexed onto a single multiplexer

The IO of multiple network connections can be registered with a multiplexer (select). When the user process calls select, the whole process blocks while the kernel "monitors" all the sockets select is responsible for; as soon as data is ready on any of them, select returns.  The user process then issues a read to copy the data from the kernel into user space.

When the user process initiates a select call, it blocks; once any of the sockets select is responsible for has data ready, select returns, and the process then initiates a read. The whole flow is more involved than blocking IO and may even look wasteful.  But its biggest advantage is that one thread can handle multiple socket IO requests at the same time: the user registers multiple sockets and then repeatedly calls select to read the activated ones.  In the synchronous blocking model, the same thing can only be achieved with multiple threads.

It is like going out to dinner with a group of friends: one person stays at the restaurant to wait for a table while everyone else goes shopping, and when the person holding the number tells us a table is ready, we all head straight to the restaurant to eat.

In essence, IO multiplexing is still synchronous blocking IO.
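The select-then-read flow above maps directly onto Java's NIO Selector. The sketch below (class name and message are arbitrary) registers a server channel and a connection with one Selector and services whichever becomes ready, all on a single thread:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;

// Sketch of IO multiplexing with java.nio: one thread blocks in select()
// and handles whichever registered channels become ready.
public class MultiplexSketch {
    public static String demo() throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(0));
        server.configureBlocking(false);                   // required before register()
        server.register(selector, SelectionKey.OP_ACCEPT);

        // A client connection from the same process keeps the demo self-contained.
        SocketChannel client = SocketChannel.open(
                new InetSocketAddress("localhost", server.socket().getLocalPort()));
        client.write(ByteBuffer.wrap("ping".getBytes(StandardCharsets.UTF_8)));

        StringBuilder received = new StringBuilder();
        while (received.length() < 4) {
            selector.select();                             // blocks until some channel is ready
            for (SelectionKey key : selector.selectedKeys()) {
                if (key.isAcceptable()) {
                    SocketChannel conn = server.accept();
                    conn.configureBlocking(false);
                    conn.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    ByteBuffer buf = ByteBuffer.allocate(16);
                    int n = ((SocketChannel) key.channel()).read(buf);
                    received.append(new String(buf.array(), 0, n, StandardCharsets.UTF_8));
                }
            }
            selector.selectedKeys().clear();
        }
        client.close();
        server.close();
        selector.close();
        return received.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(demo()); // prints "ping"
    }
}
```

On Linux, the JDK backs this Selector with epoll automatically; the Java API is the same either way.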

1.3 Why are blocking IO and IO multiplexing the most commonly used?

Applying a network IO model requires support from both the system kernel and the programming language.

Most system kernels support blocking IO, non-blocking IO, and IO multiplexing, but signal-driven IO and asynchronous IO are only supported by newer Linux kernels.

Whether in C++ or Java, high-performance network programming frameworks are based on the Reactor pattern (Netty, for example), and the Reactor pattern is based on IO multiplexing. In scenarios without high concurrency, synchronous blocking IO is the most common.

Blocking IO and IO multiplexing have the most complete kernel and language support and are the most widely used; between them they cover the vast majority of network IO scenarios.

1.4 Which network IO model should the RPC framework choose?

IO multiplexing suits high concurrency: it handles more socket IO requests with fewer processes (threads), but it is harder to use.

With blocking IO, every socket IO request blocks a process (thread), but it is easier to use.  In low-concurrency scenarios where the business logic only needs synchronous IO, blocking IO is sufficient; it avoids the select call altogether, so its overhead is lower than IO multiplexing.

Most RPC calls are high-concurrency calls, so all things considered, RPC frameworks choose IO multiplexing.  The natural framework choice is Netty, which implements the Reactor pattern.  Under Linux, epoll should also be enabled to improve system performance.

2 Zero-copy (Zero-copy)

2.1 Network IO read and write process


Each write by the application process puts the data into a user-space buffer; the CPU then copies it into the kernel buffer, and DMA copies it to the network card, which sends it out.  So the data of one write is copied twice before it leaves through the network card. A read by the user process is the reverse: the data is likewise copied twice before the application can read it.

A complete read-write cycle therefore copies data back and forth between user space and kernel space, and each copy involves a CPU context switch (from the user process into the kernel, or back). Isn't that a waste of CPU and performance?  Is there a way to reduce the copying between processes and improve the efficiency of data transfer?

That is what zero copy provides: the data copy between user space and kernel space is eliminated. Every read or write by the application process operates as if it were reading or writing kernel space directly; DMA then copies the data from the kernel to the network card, or from the network card into the kernel.

2.2 Implementation

Can user space and kernel space both write data to the same place, so that no copy is needed? The answer is virtual memory: by mapping the same physical memory into both address spaces, a kernel buffer and a user buffer can share storage.

There are two common implementations of zero copy:

mmap+write

This approach relies on virtual memory: mmap maps the kernel buffer into the address space of the user process, so user space and kernel space share the same physical pages. The application then writes through the mapped region, eliminating the CPU copy from a user-space buffer into the kernel buffer.
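Java exposes the mmap idea through `FileChannel.map`. The sketch below (class name and file contents are arbitrary) writes through a `MappedByteBuffer`, so the bytes land in the mapped page cache without an intermediate user-space buffer copy:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch of mmap-style IO via FileChannel.map: the file's pages are mapped
// into the process, and puts write straight into the mapped region.
public class MmapSketch {
    public static String demo() throws IOException {
        Path path = Files.createTempFile("mmap-demo", ".bin");
        path.toFile().deleteOnExit();
        byte[] payload = "zero copy".getBytes(StandardCharsets.UTF_8);
        try (FileChannel ch = FileChannel.open(path,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE, 0, payload.length);
            map.put(payload);  // written directly through the mapping
            map.force();       // flush the mapped pages to the file
        }
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(demo()); // prints "zero copy"
    }
}
```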

sendfile

sendfile goes a step further: the data never enters user space at all. The kernel moves the file data from the page cache toward the socket buffer directly, and DMA then sends it to the network card. This is how Nginx serves static files when its sendfile option is enabled.
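In Java, sendfile-style transfer is available as `FileChannel.transferTo`, which on Linux is backed by the sendfile system call. A minimal sketch (class name and file contents are arbitrary; here the target is another file channel rather than a socket):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch of sendfile-style zero copy via FileChannel.transferTo: the data
// is moved channel-to-channel without passing through a user-space buffer.
public class SendfileSketch {
    public static String demo() throws IOException {
        Path src = Files.createTempFile("sendfile-src", ".bin");
        Path dst = Files.createTempFile("sendfile-dst", ".bin");
        src.toFile().deleteOnExit();
        dst.toFile().deleteOnExit();
        Files.write(src, "static file body".getBytes(StandardCharsets.UTF_8));
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.WRITE)) {
            long pos = 0, size = in.size();
            while (pos < size) {                // transferTo may move fewer bytes than requested
                pos += in.transferTo(pos, size - pos, out);
            }
        }
        return new String(Files.readAllBytes(dst), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(demo()); // prints "static file body"
    }
}
```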

3 Netty zero copy

For its network communication layer, an RPC framework picks a framework built on the Reactor pattern; in Java the first choice is Netty.  Does Netty have a zero-copy mechanism?  How does Netty's zero copy differ from the zero copy discussed above?

The zero copy in the previous section is OS-level zero copy: it avoids the data copies between user space and kernel space and thereby improves CPU utilization.

Netty's zero copy is different: it operates entirely in user space (the JVM) and is oriented toward optimizing data operations there.

Why Netty does this

During transmission, RPC does not necessarily send all the binary data of the request parameters to the peer machine in one piece. It may be split into several packets along the way, or merged with packets from other requests, so messages must carry boundaries.  When a machine receives the data, it has to process the packets: splitting and merging them according to those boundaries until it recovers complete messages.

After the data arrives, is this splitting and merging of packets done in user space or in kernel space?

In user space, of course, since packet processing is the application's job. So can data copies happen here? Certainly, though not copies between user space and kernel space; these are copies within user space's own memory.  Netty's zero copy exists to solve exactly this problem: optimizing data operations in user space.
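To make the user-space splitting and merging concrete, here is a hypothetical length-prefixed framing decoder (the class and method names are illustrative, not from Netty): each message carries a 4-byte length prefix, and the decoder reassembles whole messages from arbitrary TCP chunks.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of user-space framing: accumulate incoming chunks and
// cut them at message boundaries given by a 4-byte length prefix.
public class FrameDecoderSketch {
    private final ByteBuffer acc = ByteBuffer.allocate(1024); // accumulator, in write mode

    public List<String> feed(byte[] chunk) {
        acc.put(chunk);
        acc.flip();                          // switch to read mode
        List<String> messages = new ArrayList<>();
        while (acc.remaining() >= 4) {
            acc.mark();
            int len = acc.getInt();          // read the length prefix
            if (acc.remaining() < len) {     // incomplete frame: wait for more bytes
                acc.reset();
                break;
            }
            byte[] body = new byte[len];
            acc.get(body);
            messages.add(new String(body, StandardCharsets.UTF_8));
        }
        acc.compact();                       // keep leftovers, back to write mode
        return messages;
    }

    public static void main(String[] args) {
        FrameDecoderSketch dec = new FrameDecoderSketch();
        // Encode "hi" as one frame, then deliver it split across two TCP chunks.
        byte[] bytes = ByteBuffer.allocate(6)
                .putInt(2).put("hi".getBytes(StandardCharsets.UTF_8)).array();
        System.out.println(dec.feed(new byte[]{bytes[0], bytes[1], bytes[2]})); // prints "[]"
        System.out.println(dec.feed(new byte[]{bytes[3], bytes[4], bytes[5]})); // prints "[hi]"
    }
}
```

Note that a naive decoder like this copies bytes inside user space (into the accumulator, then into `body`); that is precisely the kind of copying Netty's zero copy tries to avoid.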

So how does Netty optimize data operations?

  • Netty provides the CompositeByteBuf class, which combines multiple ByteBufs into one logical ByteBuf and avoids copying between them.
  • ByteBuf supports the slice operation, so a ByteBuf can be split into multiple ByteBufs that share the same storage, avoiding memory copies.
  • Through the wrap operation, a byte[] array, ByteBuf, or ByteBuffer can be wrapped into a Netty ByteBuf object, again avoiding a copy.

Many of Netty's internal ChannelHandler implementations use CompositeByteBuf, slice, and wrap operations to handle the packet-splitting and sticky-packet problems of TCP transmission.
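The view-not-copy idea behind slice and wrap can be demonstrated without Netty, since the JDK's own `java.nio.ByteBuffer` offers analogous `wrap()` and `slice()` operations (it has no CompositeByteBuf equivalent; that part is Netty-specific). A minimal sketch with arbitrary names:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// The JDK's ByteBuffer illustrates the view-not-copy idea behind Netty's
// slice and wrap: the views below share one backing array, so no bytes move.
public class ViewSketch {
    public static String demo() {
        byte[] data = "header|body".getBytes(StandardCharsets.UTF_8);

        // wrap: expose an existing byte[] as a buffer without copying it
        ByteBuffer whole = ByteBuffer.wrap(data);

        // slice: a sub-view starting at index 7 ("body"), still no copy
        whole.position(7);
        ByteBuffer body = whole.slice();

        // Mutating the view mutates the shared storage, proving nothing was copied.
        body.put(0, (byte) 'B');
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "header|Body"
    }
}
```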

Netty also addresses the data copy between user space and kernel space:

Netty's ByteBuf can use Direct Buffers, reading and writing sockets through off-heap direct memory; the effect is the same as what the virtual-memory approach explained earlier achieves.

Netty also wraps NIO's FileChannel.transferTo() method in FileRegion to achieve zero copy, which works on the same principle as Linux's sendfile.

4 Summary

Zero copy avoids unnecessary CPU copies, frees the CPU for other work, and reduces context switches between user space and kernel space, improving network communication efficiency and the application's overall performance.

Netty's zero copy differs from OS zero copy: it focuses on optimizing data operations in user space, which matters a great deal for handling TCP's packet-splitting and sticky-packet problems, and is also valuable when applications process request and response data.