Is your network stack set up correctly?

Introduction

This is the Day 10 article of the Linux Advent Calendar.

In operations work and in research and development, there are many opportunities to measure the communication speed between machines, whether with benchmark tools or with your own applications, for software experiments or for testing and selecting equipment. On recent high-speed networks such as 10 Gbps and 40 Gbps, however, these measurement results vary greatly depending on how the application's communication API code is implemented and on kernel parameters and compile options, so accurate measurement requires configuring, and understanding, these correctly. This article aims to give you an overview of how the kernel and applications behave around the network, and of the key points involved.

Review of network programming

First of all, let's take a quick look at how today's TCP server programs are built. A server program basically waits for requests from clients, processes each request, and returns a response to the client; a client program generates a request, sends it, and waits for the server's response. Although the order differs, the structure of the two programs is almost the same. Since both the client and the server need to handle requests (responses, in the client's case) on multiple TCP connections that operate asynchronously, they use an event monitoring framework provided by the OS, such as epoll. A program using the epoll_* family looks roughly like the following; see, for example, here for the complete code.

/* Error handling omitted for brevity */
#include <strings.h>      /* bzero() */
#include <unistd.h>       /* read(), write() */
#include <netinet/in.h>
#include <sys/epoll.h>
#include <sys/socket.h>

int main(void)
{
    int lfd, epfd, newfd, nevts = 1000;
    struct epoll_event ev;
    struct epoll_event evts[1000]; /* receive up to 1000 events at once */
    struct sockaddr_in sin = {
        .sin_family = AF_INET,
        .sin_port = htons(50000),
        .sin_addr.s_addr = INADDR_ANY,
    };
    char buf[10000]; /* 10 KB receive/send buffer */

    /* socket for waiting for (listening to) new connections */
    lfd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    bind(lfd, (struct sockaddr *)&sin, sizeof(sin));
    listen(lfd, SOMAXCONN);

    /* event file descriptor for waiting on events from multiple sockets */
    epfd = epoll_create1(EPOLL_CLOEXEC);

    /* register the listen socket with the event file descriptor */
    bzero(&ev, sizeof(ev));
    ev.events = EPOLLIN;
    ev.data.fd = lfd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, ev.data.fd, &ev);

    for (;;) {
        int i, nfds;
        /* block for up to 1000 ms waiting for events */
        nfds = epoll_wait(epfd, evts, nevts, 1000);
        for (i = 0; i < nfds; i++) {
            int fd = evts[i].data.fd;
            if (fd == lfd) { /* new TCP connection */
                bzero(&ev, sizeof(ev));
                newfd = accept(fd, NULL, NULL);
                ev.events = EPOLLIN;
                ev.data.fd = newfd;
                epoll_ctl(epfd, EPOLL_CTL_ADD, newfd, &ev);
            } else { /* request on an existing TCP connection */
                ssize_t len = read(fd, buf, sizeof(buf));
                /* process the request read into buf and build the response in buf */
                write(fd, buf, len); /* send the response */
            }
        }
    }
}

Three kinds of file descriptors appear in this program: epfd for event monitoring, the socket lfd for listening for new connections, and one socket per established TCP connection. epoll_wait() fetches events from the kernel, and the program processes them one by one (the for loop that increments i). New connection requests are posted on the listen socket (lfd), while requests on existing connections are posted on those connections' sockets. For the former event, the accept() system call creates a socket for the new connection, which is then registered with the set of descriptors handled by epoll (epoll_ctl()). For the latter event, read() reads the request from the target descriptor, the program does some work on it, and write() sends back the response. To use multiple CPU cores, run this epoll_wait() event loop in a separate thread on each core.
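To make that last point concrete, here is a minimal sketch of one common pattern (my illustration, not code from the article): each thread owns its own epoll instance and its own listening socket bound to the same port with SO_REUSEPORT (assumes Linux 3.9 or later), so the kernel spreads new connections across the threads. Compile with -lpthread; error handling is omitted.

#include <pthread.h>
#include <string.h>
#include <netinet/in.h>
#include <sys/epoll.h>
#include <sys/socket.h>

#define NWORKERS 4 /* assumption: one worker thread per CPU core */

static void *worker(void *arg)
{
    int one = 1, lfd, epfd;
    struct sockaddr_in sin = {
        .sin_family = AF_INET,
        .sin_port = htons(50000),
        .sin_addr.s_addr = INADDR_ANY,
    };
    struct epoll_event ev;

    (void)arg;
    lfd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    /* SO_REUSEPORT lets every thread bind the same port; the kernel
     * distributes new connections among the listening sockets */
    setsockopt(lfd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
    bind(lfd, (struct sockaddr *)&sin, sizeof(sin));
    listen(lfd, SOMAXCONN);

    epfd = epoll_create1(EPOLL_CLOEXEC);
    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN;
    ev.data.fd = lfd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, lfd, &ev);

    for (;;) {
        struct epoll_event evts[1000];
        int nfds = epoll_wait(epfd, evts, 1000, 1000);
        (void)nfds; /* handle the events exactly as in the loop above */
        /* each thread could also be pinned to its own core, e.g. with
         * pthread_setaffinity_np() */
    }
    return NULL;
}

int main(void)
{
    pthread_t th[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&th[i], NULL, worker, NULL);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(th[i], NULL);
    return 0;
}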

Server applications and libraries such as nginx, memcached, Redis, and libuv use almost the same programming model, so it is good to keep this overall behavior in mind when running a program that handles many TCP connections.

Network stack review

Next, let's take a quick look at how the network stack in Linux works. When a packet arrives at the NIC, the NIC transfers the packet into memory and interrupts the CPU. This temporarily stops the thread currently running on that CPU and instead executes an interrupt handler. Strictly speaking, the work is divided between a hardware interrupt and a software interrupt, but here we refer to both collectively as the interrupt handler. The interrupt handler, among other things, passes the packet up the network stack (IP, TCP) for header processing. During this processing, it is determined whether the application needs to be notified about the packet (for example, an ACK for a SYN/ACK representing a new connection, or a packet containing data). If the application needs to be notified and it has registered the descriptor and is waiting in epoll_wait() (as in the code above), the descriptor and the event type (POLLIN for received data) are registered in the epoll queue (corresponding to epfd in the code above). Events registered this way are delivered in a batch when the application returns from blocking in epoll_wait().
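As an aside (my addition, not from the original article), you can watch this machinery on a live system: per-queue NIC hardware interrupt counts and per-CPU NET_RX software interrupt counts are exposed under /proc. The interface name below is the one used in the article's later examples.

root@server:~# grep enp23s0f0 /proc/interrupts # per-queue NIC interrupt counts
root@server:~# grep NET_RX /proc/softirqs # receive softirq counts per CPU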

Parameters that greatly affect performance

So far we have briefly seen what happens between a packet arriving at the NIC and the application receiving its data, and many parameters are involved along the way. This section uses experiments to explain the effects of these parameters on network performance.

The server is an Intel Xeon Silver 4110 (2.1 GHz), the client is a Xeon E5-2690v4 (2.6 GHz), and they are connected by Intel X540 10 GbE NICs. Details follow later, but unless otherwise specified, the base configuration is: Turbo Boost, hyper-threading, CPU sleep states, and netfilter all disabled; one NIC queue; and a block time of zero in epoll. As benchmark software, the server runs an experimental server program that works like the code above, and the client runs the popular HTTP benchmarking tool wrk. wrk keeps sending requests to the server and receiving responses over a specified number of TCP connections for a specified length of time. The TCP connections stay established throughout, and the sizes of the exchanged HTTP GET and HTTP OK messages, excluding TCP/IP/Ethernet headers, are 44 bytes and 151 bytes respectively. Since both fit in a single packet, NIC offload features such as TSO, LRO, and checksum offload have no effect and are disabled.

NIC interrupt delay

I mentioned earlier that the NIC interrupts the CPU, but on fast networks the packet reception rate can reach millions or even tens of millions of packets per second. Since interrupt handling preempts the currently running application, if the packet reception rate is too high, most CPU time is spent on interrupt handling, and application or kernel processing is constantly interrupted. This problem is commonly referred to as livelock. Given that processing one received packet takes tens to hundreds of nanoseconds, a back-of-the-envelope calculation shows that at these rates the cycles of a CPU running at a few GHz can be consumed entirely by interrupt processing.

Therefore, NICs have a mechanism to reduce the frequency with which they interrupt the CPU. By default, the interrupt interval is often set to around 1 us. However, simply increasing this value does not straightforwardly reduce interrupts, and since no interrupt fires for a while even when a new packet arrives, latency at low load increases. Also, due to a mechanism called NAPI, the kernel itself disables NIC interrupts according to the backlog of unprocessed received packets, so increasing this value can end up not improving throughput at all while still delaying the reception of new packets.
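Before changing anything, you can inspect the NIC's current coalescing settings (a generic ethtool invocation, my addition rather than part of the original experiments):

root@server:~# ethtool -c enp23s0f0 # show current interrupt coalescing settings (rx-usecs etc.)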

Here, let's measure how the interrupt rate affects the time it takes to issue an HTTP request and receive the response. First, establish a single TCP connection to the server, then repeatedly send HTTP GET and receive HTTP OK. Below are the command and results with the server's interrupt delay set to zero. -d 3 specifies the duration of the experiment, and -c 1 and -t 1 specify one connection and one thread, respectively.

root@client:~# wrk -d 3 -c 1 -t 1 http://192.168.11.3:60000/
Running 3s test @ http://192.168.11.3:60000/
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    24.56us    2.11us 268.00us   97.76%
    Req/Sec    38.43k   231.73    38.85k    64.52%
  118515 requests in 3.10s, 17.07MB read
Requests/sec:  38238.90
Transfer/sec:      5.51MB

The following is the result when the interrupt delay on the server side is set to 1us.

root@client:~# wrk -d 3 -c 1 -t 1 http://192.168.11.3:60000/
Running 3s test @ http://192.168.11.3:60000/
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    29.16us    2.59us 314.00us   99.03%
    Req/Sec    32.53k   193.15    32.82k    74.19%
  100352 requests in 3.10s, 14.45MB read
Requests/sec:  32379.67
Transfer/sec:      4.66MB

As you can see, the results differ considerably (24.56 us vs. 29.16 us average latency).

Next, let's try multiple parallel TCP connections. With the following command, the client opens 100 TCP connections and uses 100 threads to send requests and receive responses on each connection.

The following is the case without interrupt delay:

root@client:~# wrk -d 3 -c 100 -t 100 http://192.168.11.3:60000/
Running 3s test @ http://192.168.11.3:60000/
  100 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   245.83us   41.77us   1.22ms   84.60%
    Req/Sec     4.05k   152.00     5.45k    88.90%
  1248585 requests in 3.10s, 179.80MB read
Requests/sec: 402774.94
Transfer/sec:     58.00MB

Below is the case with a 1 us interrupt delay:

root@client:~# wrk -d 3 -c 100 -t 100 http://192.168.11.3:60000/
Running 3s test @ http://192.168.11.3:60000/
  100 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   247.22us   41.20us   1.16ms   84.70%
    Req/Sec     4.03k   137.84     5.43k    90.80%
  1241477 requests in 3.10s, 178.78MB read
Requests/sec: 400575.95
Transfer/sec:     57.68MB

Here the interrupt delay has little effect, because the interrupts end up disabled anyway: since many packets (up to 100) arrive at about the same time, the kernel (via the NAPI mechanism described earlier) temporarily disables NIC interrupts even when the configured interrupt frequency is high.

Another point to note here is that the latency is about an order of magnitude larger than without parallel connections. This is because multiple events (incoming requests) are processed one at a time within the application's event loop: imagine the server program above receiving up to 100 events at once (epoll_wait() returning an nfds of up to 100).

For the ixgbe driver (Intel X520 and X540 10 GbE NICs) and the i40e driver (Intel X710/XXV710/XL710 10/25/40 GbE NICs), the NIC interrupt rate can be set as follows:

root@server:~# ethtool -C enp23s0f0 rx-usecs 0 # zero interrupt delay

When measuring network delay, review the NIC interrupt settings.

Turbo boost

This feature raises the clock frequency of heavily loaded CPU cores while the load on the other cores is low. For performance measurement it is actually a very troublesome feature and should be disabled. The reason: when conducting a scalability experiment over the number of CPU cores, it can happen that throughput scales linearly from 1 to 3 cores out of 6 and then suddenly stops scaling. This is, of course, not because the program or the kernel is bad, but because while only 1 to 3 cores were in use, the remaining cores were idle and the active cores were clocked up by Turbo Boost. Such cases are common.

To disable Turbo Boost, you can either disable it in the BIOS settings or do the following:

root@server:~# sh -c "echo 1 >> /sys/devices/system/cpu/intel_pstate/no_turbo"
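Reading the same file back confirms the setting (1 means Turbo Boost is disabled):

root@server:~# cat /sys/devices/system/cpu/intel_pstate/no_turbo
1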

If you have problems with multi-core scalability, you should also review your turbo boost settings.

Hyper-threading

Just turn it off, no need to think about it. Sibling hyper-threads share the execution resources of a physical core, which makes measurement results hard to interpret.
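If you prefer not to reboot into the BIOS, recent kernels also expose a runtime switch (my addition; requires a kernel with SMT control support):

root@server:~# echo off > /sys/devices/system/cpu/smt/control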

CPU sleep function

To save power, modern CPUs enter multiple levels of sleep states depending on load, and the transitions between sleep states can make software performance unstable. As an example of a seemingly mysterious phenomenon caused by this: with code like the server program above, handling multiple parallel connections raises the load just enough that the CPU never goes to sleep, so the latency observed by the client can become shorter than when handling a single connection. For example, in Figure 2 of this paper, the latency for 5 connections is slightly smaller than for 1 connection; this is because I carelessly forgot to disable CPU sleep in that experiment.

To prevent the CPU from going to sleep, it is a good idea to set intel_idle.max_cstate=0 processor.max_cstate=1 in the kernel boot parameters (specified in files such as grub.cfg or pxelinux.cfg/default). Below is an excerpt from my pxelinux.cfg/default:

APPEND  ip=::::::dhcp console=ttyS0,57600n8 intel_idle.max_cstate=0 processor.max_cstate=1
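After booting, you can check which cpuidle driver, if any, is active (my addition; with the parameters above, intel_idle should no longer be in charge):

root@server:~# cat /sys/devices/system/cpu/cpuidle/current_driver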

Number of NIC queues

Modern NICs have multiple packet buffer queues, and each queue can interrupt a different CPU. The number of NIC queues therefore has a large effect on interrupt handling in the kernel and should be reviewed carefully.

root@c307:~# wrk -d 3 -c 1 -t 1 http://192.168.11.3:60000/
Running 3s test @ http://192.168.11.3:60000/
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    25.22us    2.03us 259.00us   97.68%
    Req/Sec    37.50k   271.83    38.40k    74.19%
  115595 requests in 3.10s, 16.65MB read
Requests/sec:  37299.11
Transfer/sec:      5.37MB

The above is the same experiment as the zero-interrupt-delay case in the [NIC interrupt delay](#NIC interrupt delay) section, but with the number of queues on the server NIC set to 8, and the number of threads and cores also set to 8. Picture the server program's epoll_wait() event loop running in a separate thread on each core. The reason throughput does not improve is that this experiment uses only one connection: the NIC determines the interrupt destination queue/CPU from a hash of the connection's ports and addresses, so all processing happens on the same CPU and thread. Also, latency is slightly higher than in the one-thread, one-queue experiment in the [interrupt delay section](#NIC interrupt delay) (24.56 -> 25.22 us), which shows that enabling multiple queues adds a small overhead. Depending on the experiment, it may therefore be better to reduce the number of queues to one.
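To see how connections map to queues on your own NIC (my addition, not part of the original measurements), ethtool can show the RSS hash indirection table and the queue counts:

root@server:~# ethtool -x enp23s0f0 # show the RSS hash indirection table
root@server:~# ethtool -l enp23s0f0 # show current and maximum queue counts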

As you can see below, with 100 connections, packets belonging to different connections are distributed across multiple queues, resulting in throughput that scales, if not perfectly.

root@client:~# wrk -d 3 -c 100 -t 100 http://192.168.11.3:60000/
Running 3s test @ http://192.168.11.3:60000/
  100 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    65.68us  120.17us  17.26ms   99.53%
    Req/Sec    15.52k     1.56k   28.00k    81.43%
  4780372 requests in 3.10s, 688.40MB read
Requests/sec: 1542345.05
Transfer/sec:    222.11MB

The number of NIC queues can be set as follows:

root@server:~# ethtool -L enp23s0f0 combined 8 # increase the number of NIC queues to 8

In summary, the number of NIC queues should basically match the number of cores the application uses, but since multiple queues can themselves add overhead in some cases, adjust the number as needed. And of course, be aware that the application must also be programmed or configured to run on multiple cores.

Firewall hooks (netfilter)

The Linux kernel provides the netfilter mechanism for hooking packets at various points in the network stack. These hooks are enabled dynamically by tools such as iptables, but the mechanism itself often affects performance regardless of whether individual hooks are enabled. If you want to measure performance accurately, disable CONFIG_NETFILTER, and also CONFIG_RETPOLINE (which makes the indirect function calls used by such hooks more expensive), in your kernel configuration unless you need them.
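To check whether the running kernel was built with these options (my addition; the config file location varies by distribution):

root@server:~# grep -E 'CONFIG_NETFILTER=|CONFIG_RETPOLINE=' /boot/config-$(uname -r)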

Application blocking

As we saw in the server program above, an application normally blocks (sleeps) in epoll_wait() while waiting for events; the value passed as the last argument of epoll_wait() specifies the blocking time in ms. However, as mentioned above, the interrupt handler runs by borrowing the context of the currently running thread, so if no thread is running when the CPU receives the NIC interrupt, the sleeping thread must first be woken up, and this wakeup involves considerable overhead. The following is the result when the application blocks in epoll_wait() for up to 1000 ms (NIC interrupt delay is zero).

root@client:~# wrk -d 3 -c 1 -t 1 http://192.168.11.3:60000/
Running 3s test @ http://192.168.11.3:60000/
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    33.56us    3.28us 282.00us   95.50%
    Req/Sec    28.42k   192.11    28.76k    54.84%
  87598 requests in 3.10s, 12.61MB read
Requests/sec:  28257.46
Transfer/sec:      4.07MB

In the experiment in the [NIC interrupt delay](#NIC interrupt delay) section the average latency was 24.56 us, so the latency here has increased by nearly 10 us. To ensure that a thread is running whenever the CPU takes an interrupt, you can pass a blocking time of zero to epoll_wait().
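Concretely, only the timeout argument changes. A minimal sketch of the busy-polling variant of the loop above (my illustration; note that it burns 100% of a CPU core even when idle):

for (;;) {
    /* timeout of 0: epoll_wait() returns immediately instead of
     * sleeping, so this thread is always running when a NIC
     * interrupt arrives and no wakeup is needed */
    int i, nfds = epoll_wait(epfd, evts, nevts, 0);
    for (i = 0; i < nfds; i++) {
        /* handle the event exactly as in the original loop */
    }
}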

Summary

In this article, we have introduced various parameters that affect network performance. We hope it will be useful for experiments by server administrators, application developers, and those who are researching (or planning to research) network stacks and (library) OSes.
