Last time, we briefly analyzed the issue of network packet loss on Linux, and one of the problems was related to drivers. Today, we will examine in detail how to resolve network packet loss caused by driver issues.
What does packet loss caused by a network card driver look like?
Check the ifconfig output for eth0, eth1, and the other interfaces:
1. RX errors:
Indicates the total number of receive errors, including too-long-frame errors, Ring Buffer overflow errors, CRC check errors, frame synchronization errors, FIFO overruns, and missed packets.
2. RX dropped:
Indicates that the packet has entered the Ring Buffer but was discarded during the copy to memory process due to system reasons such as insufficient memory. Starting from the 2.6.37 kernel, the contents of RX dropped statistics have been adjusted. Currently, the following situations will be counted in RX dropped:
● Softnet backlog full
● Bad/Unintended VLAN tags
● Unknown/Unregistered protocols
● IPv6 frames
The backlog part will be discussed in detail below.
3. RX overruns:
Overruns indicate that packets were dropped by the NIC hardware before they ever reached the Ring Buffer. When the driver cannot keep up with the NIC’s receive rate and fails to allocate buffers in time, the NIC cannot write incoming packets into skbs; data piles up in the NIC’s internal FIFO, and once that FIFO is full the NIC starts dropping packets. These drops are recorded as rx_fifo_errors, visible as growth in the fifo field of /proc/net/dev and in the overruns counter of ifconfig (see the ethtool sketch after this list). Typical causes of Ring Buffer overflow are CPUs that cannot service interrupts in time, for example because interrupts are unevenly distributed, or a Ring Buffer that is simply too small.
4. RX frame: Indicates misaligned frames.
5. For the transmit side, increases in the corresponding counters are mainly caused by aborted transmissions, carrier errors, FIFO errors, heartbeat errors, and window errors; the collisions counter records transmissions interrupted by CSMA/CD collisions.
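To see exactly which of these counters is growing, and to enlarge the Ring Buffer when overruns/rx_fifo_errors increase, ethtool can be used. A minimal sketch, assuming the interface is eth0 (counter names vary between drivers):
# Driver-level statistics; look for error/drop/fifo/miss counters
ethtool -S eth0 | grep -iE 'error|drop|fifo|miss'
# Current and maximum Ring Buffer sizes
ethtool -g eth0
# Enlarge the RX ring, if the hardware maximum allows it
ethtool -G eth0 rx 4096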
How to fix packet loss caused by driver overruns?
netdev_max_backlog sets the maximum length of the per-CPU backlog queue, which buffers packets received from the NIC before they are handed to the protocol stack (IP, TCP, etc.) for processing. Each CPU core has its own backlog queue, similar in role to the Ring Buffer. When packets arrive faster than the kernel protocol stack can process them, the CPU’s backlog queue keeps growing; once it reaches the configured netdev_max_backlog value, newly arriving packets are discarded.
To determine if a netdev backlog queue overflow has occurred, you can check /proc/net/softnet_stat.
Each line holds the statistics for one CPU core, starting from CPU0 and going down in order. The first column is the total number of packets received by the interrupt handler; the second column is the total number of packets discarded because the netdev_max_backlog queue overflowed. If the second column stays at zero, the server is not losing packets due to netdev_max_backlog.
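The counters in /proc/net/softnet_stat are hexadecimal, one line per CPU. A minimal bash sketch to print the per-CPU totals and backlog drops in decimal:
# Column 1 = packets processed, column 2 = dropped because the backlog queue was full
cpu=0
while read -r total dropped rest; do
    echo "CPU$cpu processed=$((16#$total)) dropped=$((16#$dropped))"
    cpu=$((cpu + 1))
done < /proc/net/softnet_stat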
netdev_max_backlog (receive) corresponds to txqueuelen (transmit).
Solution:
The default value of netdev_max_backlog is 1000. On high-speed links, the second column of softnet_stat may become non-zero. This can be resolved by increasing the kernel parameter net.core.netdev_max_backlog:
sysctl -w net.core.netdev_max_backlog=2000
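sysctl -w only changes the running kernel. To keep the value across reboots it can also be written to a sysctl configuration file, for example (assuming a distribution that reads /etc/sysctl.d; the file name here is arbitrary):
echo 'net.core.netdev_max_backlog = 2000' > /etc/sysctl.d/90-netdev.conf
sysctl -p /etc/sysctl.d/90-netdev.conf    # reload the file without rebooting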
How to fix packet loss caused by frequent network card IRQs?
When the driver enters NAPI, it disables the current network card’s IRQ, preventing new hardware interrupts until all data is polled.
Next, let’s explain how NAPI achieves IRQ coalescing. It mainly allows the NIC driver to register a poll function, enabling the NAPI subsystem to batch-pull received data from the Ring Buffer through the poll function. The main events and their sequence are as follows:
1. When initializing the network card driver, it registers a poll function with the Kernel to pull received data from the Ring Buffer later.
2. The driver registers and enables NAPI, which is disabled by default and is only enabled by drivers that support NAPI.
3. After receiving data, the network card uses DMA to store the data in memory.
4. The network card triggers an IRQ, causing the CPU to execute the driver’s registered Interrupt Handler.
5. The driver’s Interrupt Handler uses the napi_schedule function to trigger a softirq (NET_RX_SOFTIRQ), which wakes up the NAPI subsystem. The handler for NET_RX_SOFTIRQ, net_rx_action, is executed in another thread and calls the driver’s registered poll function to retrieve the received packets.
6. The driver disables the current network card’s IRQ, preventing new IRQs until all data is polled.
7. Once all tasks are completed, the NAPI subsystem is disabled, and the network card’s IRQ is re-enabled.
8. Return to step 3.
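To see steps 5–7 in action, the per-CPU NET_RX softirq counters can be watched; the CPU whose NET_RX count grows fastest is the one doing the NAPI polling. For example:
# NET_RX softirq counts per CPU, refreshed every second; -d highlights the changes
watch -d -n 1 'grep -E "CPU|NET_RX" /proc/softirqs'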
In the driver’s poll function, packets are read in a loop, with the loop controlled by a budget to prevent the CPU from looping indefinitely when there are many packets, allowing other tasks to be executed as well. The budget affects the time the CPU spends executing the poll function.
A larger budget can increase CPU utilization and reduce packet delay when there are many packets, but if too much CPU time is spent here, it can impact the execution of other tasks.
A smaller budget may lead to frequent exits from the NAPI subsystem and frequent triggering of network card IRQs.
To determine if packet loss is caused by frequent network card IRQs, you can check /proc/net/softnet_stat:
If the third column is continuously increasing, it indicates that the soft IRQ is not getting enough CPU time to process a sufficient number of network packets. In this case, you need to increase the netdev_budget value.
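The third column of softnet_stat is the time_squeeze counter: how many times net_rx_action stopped polling early because the budget (or its time limit) ran out while packets were still pending. Before raising the budget, the current limits can be checked (net.core.netdev_budget_usecs only exists on newer kernels, roughly 4.12 and later):
sysctl net.core.netdev_budget
sysctl net.core.netdev_budget_usecs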
Solution:
The default budget is 300, but it can be adjusted.
sysctl -w net.core.netdev_budget=600
How to fix packet loss caused by high single-core load?
High single-core CPU soft interrupt usage can cause applications to have insufficient opportunities to send or receive packets or result in slower packet reception. Even if you adjust the netdev_max_backlog queue size, packet loss may still occur after a period if the processing speed cannot keep up with the network card’s receiving speed.
Check: mpstat -P ALL 1
If mpstat shows a single core with %soft pinned at or near 100%, the kernel cannot keep up with incoming packets on that core: applications get too few chances to send or receive, packet reception slows down, and packets are dropped.
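Another way to confirm a saturated core is to watch the per-CPU ksoftirqd threads: if one ksoftirqd/N sits near 100% CPU, that core is spending almost all of its time in softirq processing. A minimal sketch, assuming sysstat’s pidstat is available:
# CPU usage of each ksoftirqd thread, sampled every second
pidstat -p "$(pgrep -d, ksoftirqd)" 1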
Solution:
1. Adjust the network card RSS queue configuration:
Check: ethtool -x ethx
Adjust: ethtool -X ethx xxxx
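A concrete sketch, assuming the interface is eth0 and the driver supports changing the channel count and the RSS indirection table (not all drivers do):
# Current RSS indirection table and hash key
ethtool -x eth0
# Configured and maximum number of hardware queues (channels)
ethtool -l eth0
# Use 8 combined queues and spread the indirection table evenly over them
ethtool -L eth0 combined 8
ethtool -X eth0 equal 8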
2. Check if the network card interrupt configuration is balanced: cat /proc/interrupts
Adjust using irqbalance:
service irqbalance start
service irqbalance status
service irqbalance stop
Bind interrupts to CPU cores: echo mask > /proc/irq/xxx/smp_affinity
echo 6 > /proc/irq/41/smp_affinity
The value 6 (binary 0110) selects CPU1 and CPU2. The mask for CPU0 is 0x1 (0001), for CPU1 0x2 (0010), for CPU2 0x4 (0100), for CPU3 0x8 (1000), and so on.
Additionally, if you set smp_affinity manually, you should either stop irqbalance or run irqbalance with a --banirq list that excludes those IRQs; otherwise irqbalance will disregard the affinity you configured and rebalance the IRQs itself.
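A minimal sketch for binding a single IRQ to CPU2, assuming the IRQ number 41 comes from /proc/interrupts; the mask is a hexadecimal bitmap with bit N set for CPU N:
# Build and write the mask for CPU2: 1 << 2 = 0x4
cpu=2
printf '%x' $((1 << cpu)) > /proc/irq/41/smp_affinity
# Verify both representations
cat /proc/irq/41/smp_affinity /proc/irq/41/smp_affinity_list
# Then either stop irqbalance or restart it with --banirq=41 so it leaves this IRQ alone
# (how irqbalance options are passed depends on the distribution's service configuration)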
3. Adjust the network card’s multi-queue and RPS (Receive Packet Steering) configuration according to the number of CPUs and network card queues.
If the number of CPUs is greater than the number of network card queues:
Check the network card queues: ethtool -x ethx
Enable RPS in the protocol stack and configure it for each RX queue (a loop over all queues is sketched after this list):
echo $mask > /sys/class/net/$eth/queues/rx-$i/rps_cpus (CPU mask)
echo 4096 > /sys/class/net/$eth/queues/rx-$i/rps_flow_cnt (per-queue RFS flow count)
If the number of CPUs is less than the number of network card queues, you can bind interrupts accordingly. You might also try disabling RPS to see the effect.
echo 0 > /sys/class/net/<dev>/queues/rx-<n>/rps_cpus
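A sketch that enables RPS on every RX queue of a NIC, assuming the interface is eth0 and the chosen CPU mask is ffff; rps_flow_cnt only matters for RFS, which additionally needs the global rps_sock_flow_entries to be non-zero:
eth=eth0
mask=ffff                                        # CPUs allowed to run RPS processing
sysctl -w net.core.rps_sock_flow_entries=32768   # global RFS flow table size
for q in /sys/class/net/"$eth"/queues/rx-*; do
    echo "$mask" > "$q"/rps_cpus
    echo 4096   > "$q"/rps_flow_cnt              # per-queue RFS flow count
done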
4. NUMA CPU adjustment: aligning interrupt and RPS CPUs with the NUMA node the NIC is attached to speeds up kernel processing and leaves more CPU time for applications to receive packets, reducing the probability of packet loss.
Check the network card NUMA location:
ethtool -i eth1 | grep bus-info
lspci -s <bus-info> -vv | grep node
In the settings for interrupts and RPS above, the mask needs to be reconfigured according to the NUMA CPU allocation.
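A sketch for locating the NIC’s NUMA node and the CPUs attached to it, so the affinity and RPS masks above can be built from local CPUs only (assuming the interface is eth1):
# NUMA node of the NIC (-1 means the device reports no NUMA affinity)
cat /sys/class/net/eth1/device/numa_node
# Same information via the PCI address
bus=$(ethtool -i eth1 | awk '/bus-info/ {print $2}')
lspci -s "$bus" -vv | grep -i 'numa node'
# CPU list of each NUMA node
lscpu | grep -i 'numa node'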
5. Adjust NIC interrupt coalescing so that a single interrupt handles more packets:
Check:
ethtool -c ens5f0
Adjust:
ethtool -C ethx adaptive-rx on
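If the driver does not support adaptive mode, static coalescing values can be set instead; larger values mean fewer interrupts at the cost of slightly higher latency. A hedged sketch (supported parameters vary by driver):
# Current coalescing settings
ethtool -c eth0
# Raise an RX interrupt after 100 us or 64 frames, whichever comes first
ethtool -C eth0 rx-usecs 100 rx-frames 64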