Linux kernel bypass and performance tuning

Some nice tips on performance tuning:

TCP Bypass

The application uses its own packet definition, builds the packets on the sending side, and sends them directly over the link layer, where the receiver decodes them and processes the data. A number of products do this, such as Scali MPI Connect, Parastation, HyperSCSI (iSCSI-like storage), and Coraid ATA-over-Ethernet. The term "TCP-bypass" is used because you are "bypassing" the TCP protocol and using a different protocol right on top of the link layer.
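As a rough illustration of what "its own packet definition" can look like, here is a minimal Python sketch that builds and decodes a frame carrying a made-up protocol directly behind the Ethernet header. The 10-byte application header (sequence number, payload length, CRC32) and the function names are assumptions invented for this example; they are not the format used by any of the products named above. The EtherType 0x88B5 is one of the values IEEE reserves for local experimental use.

```python
import struct
import zlib

# Ethernet header: destination MAC, source MAC, EtherType (network byte order).
ETH_HEADER = struct.Struct("!6s6sH")
# Hypothetical application header: sequence number, payload length, CRC32.
APP_HEADER = struct.Struct("!IHI")
# IEEE 802 "Local Experimental EtherType 1", used so the frame is not mistaken for IP.
ETHERTYPE_EXPERIMENTAL = 0x88B5


def build_frame(dst_mac, src_mac, seq, payload):
    """Return the bytes of one frame: Ethernet header + app header + payload."""
    app = APP_HEADER.pack(seq, len(payload), zlib.crc32(payload))
    return ETH_HEADER.pack(dst_mac, src_mac, ETHERTYPE_EXPERIMENTAL) + app + payload


def parse_frame(frame):
    """Decode a frame built by build_frame and verify its length and checksum."""
    _dst, _src, ethertype = ETH_HEADER.unpack_from(frame, 0)
    if ethertype != ETHERTYPE_EXPERIMENTAL:
        raise ValueError("frame does not carry our protocol")
    seq, length, crc = APP_HEADER.unpack_from(frame, ETH_HEADER.size)
    start = ETH_HEADER.size + APP_HEADER.size
    payload = frame[start:start + length]
    if len(payload) != length or zlib.crc32(payload) != crc:
        raise ValueError("corrupt or truncated payload")
    return seq, payload


if __name__ == "__main__":
    frame = build_frame(b"\xff" * 6, b"\x02\x00\x00\x00\x00\x01", 7, b"hello")
    print(parse_frame(frame))
```

The sketch only builds and parses bytes. Actually putting such a frame on the wire from Linux would typically mean a raw AF_PACKET socket, which needs elevated privileges. The explicit length field is there so the receiver can ignore any padding added to short frames.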

The idea behind using a different protocol is to reduce the overhead of processing TCP packets. Theoretically, this allows you to process data packets faster (less overhead) and fit more data into a given packet size (better data bandwidth), so you can increase the bandwidth and reduce the latency. However, there are some gotchas with TCP-bypass. The code has to perform all of the checks that the TCP protocol performs, particularly the retransmission of dropped packets. In addition, because the packets sit directly on the link layer and carry no IP header, they are not routable, so you can't send them to other networks. This is not likely to be a problem for clusters that reside on a single network, but Grid applications will very likely have this problem because they need to route packets to remote systems.
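To put a number on "fit more data into a given packet size," the following Python sketch compares how much of the wire time carries application data for TCP/IP versus a leaner protocol. It assumes a standard 1500-byte Ethernet MTU, minimal 20-byte IPv4 and 20-byte TCP headers (no options), and a hypothetical 10-byte custom header chosen purely for illustration.

```python
# Bytes that every Ethernet frame costs on the wire in addition to its payload:
# preamble/SFD (8) + Ethernet header (14) + frame check sequence (4) + interframe gap (12).
WIRE_OVERHEAD = 8 + 14 + 4 + 12
MTU = 1500  # standard Ethernet payload size


def payload_efficiency(header_bytes, mtu=MTU):
    """Fraction of wire time that carries application data in a full-size frame."""
    app_payload = mtu - header_bytes
    return app_payload / (mtu + WIRE_OVERHEAD)


if __name__ == "__main__":
    tcp_ip = payload_efficiency(20 + 20)  # IPv4 + TCP headers, no options
    custom = payload_efficiency(10)       # hypothetical 10-byte custom header
    print(f"TCP/IP: {tcp_ip:.1%}  custom: {custom:.1%}")
```

Under these assumptions the difference is roughly 94.9% versus 96.9% for full-size frames, so the bandwidth gain from smaller headers alone is modest. The savings in per-packet processing are likely the bigger part of the argument.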

TCP Offload Engine

TCP Offload Engines (TOE) are a somewhat controversial concept in the Linux world. Normally, the CPU processes the TCP packets, which can require a great deal of processing power. For example, it has been reported that a single GigE connection to a node can saturate a single 2.4 GHz Pentium IV processor. The problem of processing TCP packets is worse for small packets. To process each packet, the CPU has to be interrupted from what it is doing. Consequently, if there are a large number of small packets, the CPU can end up processing just the packets and not doing any computing.
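A quick back-of-the-envelope calculation shows why small packets hurt so much. The Python sketch below computes the maximum number of frames per second on a 1 Gbit/s link, assuming standard Ethernet framing (8 bytes of preamble and a 12-byte interframe gap per frame) and, as a worst case, one interrupt per frame. The function name is made up for this example.

```python
PREAMBLE_AND_GAP = 8 + 12  # bytes on the wire per frame beyond the frame itself


def frames_per_second(frame_bytes, link_bps=10**9):
    """Maximum whole frames per second for a given frame size (header + payload + FCS)."""
    bits_on_wire = (frame_bytes + PREAMBLE_AND_GAP) * 8
    return link_bps // bits_on_wire


if __name__ == "__main__":
    for size in (64, 1518):  # minimum and maximum standard Ethernet frame sizes
        print(f"{size:5d}-byte frames: {frames_per_second(size):,} per second")
```

With minimum-size frames the link can deliver close to 1.5 million frames per second, compared with about 81 thousand for full-size frames. If each frame triggers an interrupt, the CPU has well under a microsecond per packet in the small-frame case.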

The TOE was developed to remove the TCP packet processing from the CPU and put it on a dedicated processor. In most cases, the TOE is put on a NIC. It handles the TCP packet processing and then passes the data to the kernel, most likely over the PCI bus. For small data packets, the PCI bus is not very efficient. Consequently, the TOE can collect the data from a series of small packets and then send one larger combined packet across the PCI bus. This design increases latency, but may reduce the impact on the node's processing. This feature is more likely to be appropriate for enterprise computing than HPC.

Kernel Bypass

Kernel Bypass, also called OS bypass, is a concept to improve network performance by going "around" the kernel or OS, hence the term "bypass." In a typical system, the kernel decodes the network packet, most likely TCP, and passes the data from kernel space to user space by copying it. This process means the user space process context data must be saved and the kernel context data must be loaded. This step of saving the user process information and then loading the kernel process information is known as a context switch. According to this article, application context switching constitutes about 40% of the network overhead. So, it would seem that to improve the bandwidth and latency of an interconnect, it would be good to eliminate the context switching.
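To see what the 40% figure could buy, the following Python sketch applies the usual Amdahl-style arithmetic. It assumes context switching could be removed completely and that nothing else changes. The 25% share of total run time spent on network overhead in the second calculation is a purely hypothetical number for illustration.

```python
def speedup_from_removing(fraction_removed):
    """Speedup of a cost when the given fraction of it is eliminated entirely."""
    if not 0.0 <= fraction_removed < 1.0:
        raise ValueError("fraction_removed must be in [0, 1)")
    return 1.0 / (1.0 - fraction_removed)


def overall_speedup(network_share, fraction_removed):
    """Whole-application speedup when only the network share of run time shrinks."""
    return 1.0 / (1.0 - network_share * fraction_removed)


if __name__ == "__main__":
    # 40% of network overhead is context switching (figure quoted in the text).
    print(f"network overhead speedup: {speedup_from_removing(0.40):.2f}x")
    # Hypothetical: network overhead is 25% of total application run time.
    print(f"application speedup:      {overall_speedup(0.25, 0.40):.2f}x")
```

Even in the best case, removing 40% of the overhead makes the network path at most about 1.67 times cheaper, and the effect on the whole application depends on how much of its run time is network overhead to begin with.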

In Kernel bypass, the user space application communicates with an I/O library that has been modified to communicate directly with the I/O subsystem on behalf of the application, rather than going through the kernel. This takes the kernel out of the path of communication between the user space process and the I/O subsystem that handles the network communication. This change eliminates the context switching and potentially the copy from kernel space to user space (it depends upon how the I/O library is designed). However, some people argue that the overhead in the kernel associated with a context switch has shrunk. Combined with faster processors, the impact of a context switch has lessened.

RDMA

Remote Direct Memory Access (RDMA) is a concept that allows NICs to place data directly into the memory of another system. The NICs have to be RDMA-enabled on both the send and receive ends of the communication. RDMA is useful for clusters because it allows the CPU to continue to compute while the RDMA-enabled NICs are passing data. This can help improve compute/communication overlap, which helps improve code scalability.

The process begins when the sending RDMA NIC establishes a connection with the receiving RDMA NIC. Then the data is transferred from the sending NIC to the receiving NIC. The receiving NIC then copies the data directly to the application memory, bypassing the data buffers in the OS. RDMA is most commonly used in InfiniBand implementations, but other high-speed interconnects use it as well. Recently, 10 GigE NICs started using RDMA for TCP traffic to improve performance. There have been some discussions lately that RDMA may have outlived its usefulness for MPI codes. The argument is that most messages in HPC codes are small to medium in size and that using memory copies to move the data from kernel space to user space is faster than having an RDMA NIC do it. Reducing the amount of time the kernel takes to do the copy and improving processor speeds are two of the reasons that a memory copy could be faster than RDMA.
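The "small messages favor a memory copy" argument can be sketched as a simple break-even model. The Python below is only a toy: it assumes a copy costs time proportional to the message size, while the RDMA path costs a fixed per-message setup time and no per-byte CPU time. The 2 microsecond setup cost and the 4000 bytes per microsecond (about 4 GB/s) copy bandwidth are hypothetical numbers, not measurements.

```python
def copy_time_us(nbytes, copy_bytes_per_us):
    """CPU time in microseconds to copy a message of nbytes."""
    return nbytes / copy_bytes_per_us


def break_even_bytes(rdma_setup_us, copy_bytes_per_us):
    """Message size at which the copy costs as much as the fixed RDMA setup."""
    return rdma_setup_us * copy_bytes_per_us


def cheaper_path(nbytes, rdma_setup_us, copy_bytes_per_us):
    """Return 'copy' or 'rdma', whichever costs less CPU time in this model."""
    if copy_time_us(nbytes, copy_bytes_per_us) < rdma_setup_us:
        return "copy"
    return "rdma"


if __name__ == "__main__":
    setup_us, bandwidth = 2.0, 4000.0  # hypothetical values for illustration only
    print("break-even size:", break_even_bytes(setup_us, bandwidth), "bytes")
    for size in (1024, 65536):
        print(size, "bytes ->", cheaper_path(size, setup_us, bandwidth))
```

In this model, faster memory copies push the break-even size up, which is the essence of the argument: as processors get faster, more of the typical HPC message sizes fall on the "copy" side.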

There is an RDMA Consortium that helps organize and promote RDMA efforts. They develop specifications and standards for RDMA implementations so the various NICs can communicate with each other. Their recent efforts have resulted in the development of a set of RDMA specifications for TCP/IP over Ethernet.

Zero-Copy Networking

Zero-Copy networking is a technique where the CPU does not perform the data copy from kernel space to user space (the application memory). This trick can be done for both send operations and receive operations. It can be accomplished in a number of ways, including DMA (Direct Memory Access) copying or memory mapping on a system equipped with an MMU (Memory Management Unit). Zero-copy networking has been in the Linux kernel for some time, since the 2.4 series. Here is an article that discusses how the developers went about accomplishing it. The article gives some high-level details about how one accomplishes this. It also points out that zero-copy networking requires extra memory and a fast system to perform the operations. If you would like more information, this article can give you even more detail.
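One place where an application can see the send-side idea in action is the sendfile facility, which asks the kernel to move file data to a socket without passing it through a user space buffer. The Python sketch below uses socket.sendfile(), which relies on os.sendfile() where the platform supports it and falls back to ordinary send() calls otherwise. This is a related illustration, not necessarily the mechanism the articles mentioned above describe, and the helper names are made up for this example.

```python
import os
import socket
import tempfile


def send_file_zero_copy(path, sock):
    """Send a whole file over a stream socket; returns the number of bytes sent."""
    with open(path, "rb") as f:
        # The file contents are never read into a buffer in this program.
        return sock.sendfile(f)


def demo(size=4096):
    """Send a small temporary file across a local socket pair and check it arrived."""
    payload = b"x" * size
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        name = tmp.name
    sender, receiver = socket.socketpair()
    try:
        sent = send_file_zero_copy(name, sender)
        sender.close()  # signals end-of-stream to the receiver
        received = bytearray()
        while True:
            chunk = receiver.recv(65536)
            if not chunk:
                break
            received += chunk
    finally:
        sender.close()
        receiver.close()
        os.unlink(name)
    return sent, bytes(received) == payload


if __name__ == "__main__":
    print(demo())
```

The demo keeps the file small so that it fits in the socket buffer and the sender does not block before the receiver starts reading. A real transfer would have the two ends running concurrently.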

With every new idea, there always seem to be differing opinions. Here is an argument that zero-copy may not be worth the trouble. Rather, the authors argue that a Network Processor (kind of a programmable, intelligent NIC) would be a better idea.

Interrupt Mitigation

Interrupt Mitigation, also called Interrupt Coalescence, is another trick to reduce the load on the CPU resulting from interrupts to process packets. As mentioned earlier, every time a packet gets sent to the NIC, the kernel must be interrupted to at least look at the packet header to determine if the data is destined for that NIC, and if it is, process the data. Consequently, it is very simple to create a Denial-of-Service (DoS) attack by flooding the network with a huge number of packets, forcing the CPU to process every one of them. Interrupt Mitigation is a driver-level implementation that collects packets for a certain amount of time or up to a certain total size, and then interrupts the CPU for processing. The idea is that this reduces the overall load on the CPU and allows it to at least do some computational work rather than just decode network packets. However, this can increase latency by holding data before allowing the CPU to process it.
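The trade-off between fewer interrupts and higher latency can be seen in a small simulation. The Python sketch below is a toy model, not driver code: it batches packets until either a timer (max_wait_us, started by the first packet in the batch) expires or a packet count (max_packets) is reached, then records one interrupt. The function and parameter names are hypothetical, and a packet count stands in for the "total size" limit.

```python
def coalesce(arrivals_us, max_wait_us, max_packets):
    """Simulate interrupt coalescing over sorted packet arrival times (microseconds).

    Returns (interrupt_times, delays), where delays[i] is how long packet i
    waited between arriving and the interrupt that delivered it.
    """
    interrupts, delays, batch = [], [], []

    def fire(at):
        interrupts.append(at)
        delays.extend(at - arrived for arrived in batch)
        batch.clear()

    for t in arrivals_us:
        # The timer started by the oldest waiting packet expired before this arrival.
        if batch and t >= batch[0] + max_wait_us:
            fire(batch[0] + max_wait_us)
        batch.append(t)
        # The batch is full: interrupt immediately.
        if len(batch) >= max_packets:
            fire(t)
    if batch:
        fire(batch[0] + max_wait_us)
    return interrupts, delays


if __name__ == "__main__":
    arrivals = [0, 10, 20, 30, 40, 50, 60, 70]
    for wait, count in ((50, 4), (0, 1)):
        irqs, waits = coalesce(arrivals, wait, count)
        print(f"wait={wait} count={count}: {len(irqs)} interrupts, "
              f"mean delay {sum(waits) / len(waits)} us")
```

For eight packets arriving 10 microseconds apart, batching four at a time gives two interrupts instead of eight, at the price of a mean added delay of 15 microseconds per packet in this model.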

Interrupt Coalescence has been implemented in Linux through NAPI (New API) rewrites of the network drivers. The rewrites include interrupt limiting capabilities on both the receive side (Rx) as well as the transmit side (Tx). Fortunately, the people who wrote the drivers allow the various parameters to be adjusted. For example, in this article, Doug Eadline (Head Monkey) experimented with various interrupt throttling options (interrupt mitigation). Using the stock settings with the driver the latency was 64 microseconds. After turning off interrupt throttling, the latency was reduced to 29 microseconds. Of course, we assume the CPU load was higher, but we didn't measure that.

There are two good articles that discuss "tweaking" GigE, TCP NIC drivers. The first describes some parameters and what they do for the drivers. The second describes the same thing but with more of a cluster focus. Both are useful for helping you understand what to tweak and why, and what the impacts are.