How to Troubleshoot Linux KVM Packet Loss Issues?

Recently, I’ve been troubleshooting some network issues, such as connect timeouts, read timeouts, and packet loss, and I wanted to organize what I learned to share with the team and developers.

Let’s first look at how a Linux system receives a data packet:

1. The network card receives the data packet.

2. The packet is transferred from the network card’s hardware buffer into server memory (via DMA).

3. A hardware interrupt notifies the kernel, which then processes the packet.

4. The packet is processed layer by layer by the TCP/IP protocol stack.

5. The application reads the data from the socket buffer using read().
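
Packet loss can happen at any of these stages, and each stage exposes its own counters. As a rough map (the exact counter names vary by driver and kernel version, so treat this as a sketch rather than a recipe), you can inspect each layer like this:

# 1-2. NIC / driver level: hardware drops and FIFO errors
ethtool -S eth0 | grep -Ei 'drop|fifo|miss'
# 3. Kernel per-CPU backlog (explained in the softnet_stat section below)
cat /proc/net/softnet_stat
# 4. Protocol stack: TCP/IP level drops and overflows
netstat -s | grep -Ei 'drop|overflow|prune'
# 5. Socket buffers: per-connection receive/send queues
ss -ntmp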

How to check packet loss in Linux?

You can check it via the output of ifconfig:

# ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.5.224.27  netmask 255.255.255.0  broadcast 10.5.224.255
        inet6 fe80::5054:ff:fea4:44ae  prefixlen 64  scopeid 0x20<link>
        ether 52:54:00:a4:44:ae  txqueuelen 1000  (Ethernet)
        RX packets 9525661556  bytes 10963926751740 (9.9 TiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 8801210220  bytes 12331600148587 (11.2 TiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

RX (receive) represents received packets, while TX (transmit) represents transmitted packets.

RX errors: The total number of receive errors, including too-long frames, Ring Buffer overflow, CRC checksum errors, frame synchronization errors, FIFO overruns, and missed packets.

RX dropped: Indicates that packets have entered the Ring Buffer but were dropped during the copying process to memory due to system reasons such as insufficient memory.

RX overruns: FIFO overruns. These occur when packets arrive faster than the kernel can drain the Ring Buffer (also known as the Driver Queue), the buffer that holds packets before an IRQ is raised. An increasing overruns counter means packets are being dropped by the network card hardware before they ever reach the Ring Buffer. One common cause of Ring Buffer overflow is uneven interrupt distribution (for example, all interrupts landing on core 0), which was exactly the case on the problematic machine and led to packet loss.

RX frame: Represents misaligned frames.

For TX (transmit), increases in the above counters are mainly caused by aborted transmissions, carrier errors, FIFO errors, heartbeat errors, and window errors. The collisions counter indicates transmissions that were interrupted by CSMA/CD collision detection.

Difference between dropped and overruns

Dropped: Indicates that the packet has entered the network card’s receive FIFO queue and has begun to be processed by the system interrupt for copying (from the network card’s FIFO queue to system memory). However, due to system reasons (e.g., insufficient memory), the packet is dropped, meaning it is discarded by the Linux system.

Overruns: Indicates that the packet was dropped before entering the network card’s receive FIFO queue, meaning the FIFO was full. The FIFO may be full because the system is too busy to respond to network card interrupts in time, causing packets in the network card to not be copied to system memory promptly. A full FIFO prevents subsequent packets from being accepted, resulting in the packet being discarded by the network card hardware. Therefore, if overruns are non-zero, it is advisable to check the CPU load and interrupt handling.
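
To see which of these counters is actually increasing on a live machine, it helps to sample them over time. A minimal sketch (the interface name eth0 is an assumption, adjust it to your environment):

# Sample the interface counters once per second; whichever counter keeps
# growing (errors, dropped, overruns) tells you where the loss is occurring
while sleep 1; do
    date
    ip -s link show eth0
done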

netstat -i will also provide information on the number of packets sent and received, as well as packet loss for each network interface.

# netstat -i
Kernel Interface table
Iface      MTU    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0      1500 9528312730      0      0 0      8803615650      0      0      0 BMRU

Common Causes of Network Card Packet Loss in Linux

Based on my experience, common causes of packet loss in Linux network cards include Ring Buffer overflow, netdev_max_backlog overflow, Socket Buffer overflow, PAWS, etc.

Ring Buffer Overflow

If there are no issues with the hardware or driver, packet loss on the network card is generally due to a buffer (ring buffer) that is set too small. When the rate at which network data packets arrive (produce) is faster than the rate at which the kernel processes (consumes) them, the Ring Buffer will quickly fill up, and new incoming packets will be dropped.

You can view the number of packets dropped because the Ring Buffer was full using ethtool or /proc/net/dev; these drops show up in the fifo counters:

# ethtool -S eth0|grep rx_fifo
rx_fifo_errors: 0
# cat /proc/net/dev
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth0: 10967216557060 9528860597    0    0    0     0          0         0 12336087749362 8804108661    0    0    0     0       0          0

If you notice that the fifo count for a particular network card keeps increasing, first check whether CPU interrupts are evenly distributed (see the sketch below). You can also try increasing the size of the Ring Buffer; ethtool can show the maximum Ring Buffer size supported by the device and modify the current setting.
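
For the interrupt-distribution check, something like the following works; the interrupt names (eth0-*, virtio*, etc.) and the IRQ number 52 are illustrative and depend on the driver:

# Per-CPU interrupt counts for the NIC's receive/transmit queues
grep -E 'eth0|virtio' /proc/interrupts
# CPU affinity mask of one of those IRQs (replace 52 with a real IRQ number)
cat /proc/irq/52/smp_affinity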

# View the maximum value and current settings of the Ring Buffer for the eth0 network card
$ ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX:     4096   
RX Mini:    0
RX Jumbo:   0
TX:     4096   
Current hardware settings:
RX:     1024   
RX Mini:    0
RX Jumbo:   0
TX:     1024   
# Modify the receive and transmit hardware buffer size for the eth0 network card
$ ethtool -G eth0 rx 4096 tx 4096
# Verify the new settings
$ ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX:     4096
RX Mini:    0
RX Jumbo:   0
TX:     4096
Current hardware settings:
RX:     4096
RX Mini:    0
RX Jumbo:   0
TX:     4096
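
Note that changes made with ethtool -G do not survive a reboot; how to persist them is distribution-specific (udev rules, NetworkManager dispatcher scripts, or the interface configuration files). A minimal sketch using a udev rule, where the file name and the eth0 match are illustrative:

# /etc/udev/rules.d/50-ringbuffer.rules (illustrative)
ACTION=="add", SUBSYSTEM=="net", KERNEL=="eth0", RUN+="/usr/sbin/ethtool -G eth0 rx 4096 tx 4096"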

netdev_max_backlog Overflow

netdev_max_backlog is the buffer queue where packets are held after being received from the NIC and before being processed by the protocol stack (e.g., IP, TCP). Each CPU core has a backlog queue. Similar to the Ring Buffer, when the rate of receiving packets exceeds the rate at which the protocol stack processes them, the CPU’s backlog queue continues to grow. When it reaches the set netdev_max_backlog value, packets will be dropped.

# cat /proc/net/softnet_stat 
2e8f1058 00000000 000000ef 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
0db6297e 00000000 00000035 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
09d4a634 00000000 00000010 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
0773e4f1 00000000 00000005 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000

Each row shows the statistics for one CPU core, starting from CPU0 and going down sequentially. The first column is the total number of packets handled by the interrupt handler; the second column is the number of packets dropped because the netdev_max_backlog queue overflowed; the third column (time_squeeze) counts how often the softirq handler ran out of its processing budget. In the output above the drop column is actually zero, while the non-zero third column shows that the CPUs have at times been unable to keep up with the incoming rate, so these statistics are still worth monitoring.
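
A quick way to read the per-CPU drop column is a one-liner like the following (the values in softnet_stat are hexadecimal; strtonum() requires gawk):

# Print the per-CPU backlog drop count in decimal
awk '{printf "CPU%-3d dropped: %d\n", NR-1, strtonum("0x" $2)}' /proc/net/softnet_stat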

The default value for netdev_max_backlog is 1000. On high-speed links, the second column in the statistics may show non-zero values. This can be resolved by modifying the kernel parameter net.core.netdev_max_backlog:

sysctl -w net.core.netdev_max_backlog=2000
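
sysctl -w only changes the running kernel; to keep the value across reboots you would normally also persist it, for example:

# Persist the setting (standard location assumed)
echo 'net.core.netdev_max_backlog = 2000' >> /etc/sysctl.conf
sysctl -p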

Socket Buffer Overflow

Sockets can mask the differences between various protocols in the Linux kernel, providing a unified access interface for applications. Each socket has a read and write buffer.

The read buffer caches data sent from the remote end. If the read buffer is full, no new data can be received.

The write buffer caches data that is to be sent out. If the write buffer is full, the write operations of the application will be blocked.
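
To check whether connections are actually hitting their socket buffer limits, you can look at per-socket queue sizes and the TCP buffer sysctls. A sketch (the exact wording of the netstat -s counters differs slightly between kernel versions):

# Per-connection receive/send queues and socket memory usage
ss -ntmp
# System-wide signs of receive-buffer pressure
netstat -s | grep -Ei 'prune|collapse|overflow'
# Current TCP buffer limits: min, default, max (bytes)
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem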

A related case is the TCP half-connection (SYN) queue: when it is full, SYN packets are not simply discarded if the syncookie mechanism is enabled. Instead, a SYN+ACK carrying a syncookie is sent back, which is designed to prevent a SYN flood attack from making normal requests unserviceable.
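
A sketch for checking the relevant settings and whether the SYN/listen queues have overflowed (again, the netstat -s wording varies by kernel):

# Is the syncookie mechanism enabled? (1 = yes)
sysctl net.ipv4.tcp_syncookies
# Maximum length of the SYN (half-connection) queue
sysctl net.ipv4.tcp_max_syn_backlog
# Evidence of SYN or listen queue overflows
netstat -s | grep -Ei 'listen|SYN'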

PAWS

The full name of PAWS is Protect Against Wrapped Sequence numbers, and its purpose is to address the issue where TCP sequence numbers might be reused within a single session under high bandwidth conditions.

Consider a packet A1 sent by the client with sequence number A that gets “lost” in the network for some reason and does not reach the server in time. The client times out and retransmits a packet A2 with the same sequence number A. Assume the bandwidth is high enough that the sequence number space is exhausted and A is reused. The server is now waiting for a packet A3 with sequence number A when, coincidentally, the previously “lost” packet A1 finally arrives. If the server judged a packet’s validity solely by its sequence number A, it would pass incorrect data to the user-space program and cause anomalies.

PAWS solves this problem by relying on the TCP timestamp mechanism. The theoretical basis is that, in a normal TCP flow, the timestamps of received TCP packets should be monotonically non-decreasing, so any packet whose timestamp is smaller than the highest timestamp already processed on that flow can be treated as a delayed duplicate and discarded. In the example above, by the time the delayed packet A1 arrives, the server has already processed packets with larger timestamps, so A1’s timestamp is necessarily smaller; the server therefore discards the stale A1 and keeps waiting for the correct packet.

The key to implementing PAWS is that the kernel keeps the most recent timestamp seen on each connection. With a small extension, the same idea can be used to rapidly recycle connections in the server’s TIME_WAIT state.

The TIME_WAIT state is the final state entered by the party that initiates the connection close, after the four-way handshake that tears down a TCP connection. The state normally has to be held for 2*MSL (twice the Maximum Segment Lifetime). It serves two purposes:

1. Reliably implementing the closure of a TCP full-duplex connection: During the four-way handshake process of closing the connection, the final ACK is sent by the party that initiates the connection closure (referred to as A). If this ACK is lost, the other party (referred to as B) will retransmit the FIN. If A does not maintain the TIME_WAIT state and directly enters the CLOSED state, it cannot retransmit the ACK, and B’s connection cannot be reliably released in a timely manner.

2. Waiting for “lost” duplicate packets in the network to expire: If one of A’s packets is “lost” and does not reach B in time, A retransmits it. After A and B finish transmitting and disconnect, if A does not hold the TIME_WAIT state for 2*MSL, a new connection using the same source and destination ports could be established between A and B, and the “lost” packet from the previous connection might then arrive and be processed by B, causing an exception. Holding the state for 2*MSL ensures that packets from the previous connection have disappeared from the network.

Connections in the TIME_WAIT state need to consume server memory resources to be maintained. The Linux kernel provides a parameter, tcp_tw_recycle, to control the rapid recycling of the TIME_WAIT state. The theoretical basis is:
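
To get a feel for how many connections a server is currently holding in TIME_WAIT:

# Count sockets currently in TIME_WAIT (the first line of output is a header)
ss -tan state time-wait | wc -l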

Based on the theory of PAWS, if the kernel retains the most recent timestamp per host, and timestamps are compared when receiving packets, the second issue that TIME_WAIT aims to resolve—treating packets from a previous connection as valid in a new connection—can be avoided. Thus, there is no need to maintain the TIME_WAIT state for 2*MSL, but only to wait for sufficient RTO (Retransmission Timeout) to handle ACK retransmissions, achieving rapid recycling of TIME_WAIT state connections.

However, this theory presents new problems when multiple clients access the server using NAT: The timestamps of multiple clients behind the same NAT are difficult to keep consistent (the timestamp mechanism uses the system boot relative time). For the server, connections established by two clients each appear as two connections from the same peer IP. According to the per-host most recent timestamp record, it will be updated to the larger timestamp of the two client hosts, causing packets from the client with the relatively smaller timestamp to be discarded as expired duplicates by the server.

Using netstat, you can obtain statistics on packets discarded due to PAWS mechanism timestamp validation:

# netstat -s |grep -e "passive connections rejected because of time stamp" -e "packets rejects in established connections because of timestamp"
387158 passive connections rejected because of time stamp
825313 packets rejects in established connections because of timestamp

Use sysctl to check whether tcp_tw_recycle and tcp_timestamps are enabled:

$ sysctl net.ipv4.tcp_tw_recycle
net.ipv4.tcp_tw_recycle = 1
$ sysctl net.ipv4.tcp_timestamps
net.ipv4.tcp_timestamps = 1

If the machine is serving as a server and it is known that clients will connect through NAT, or if a Layer 7 forwarding device replaces the client source IP, tcp_tw_recycle must not be enabled (the option was removed entirely in Linux 4.12). Timestamps, however, are relied on by mechanisms other than tcp_tw_recycle, so it is recommended to keep them enabled:

sysctl -w net.ipv4.tcp_tw_recycle=0
sysctl -w net.ipv4.tcp_timestamps=1

Where were the packets dropped?

1. Dropwatch

# dropwatch -l kas
Initalizing kallsyms db
dropwatch> start
Enabling monitoring...
Kernel monitoring activated.
Issue Ctrl-C to stop monitoring
1 drops at sk_stream_kill_queues+50 (0xffffffff81687860)
1 drops at tcp_v4_rcv+147 (0xffffffff8170b737)
1 drops at __brk_limit+1de1308c (0xffffffffa052308c)
1 drops at ip_rcv_finish+1b8 (0xffffffff816e3348)
1 drops at skb_queue_purge+17 (0xffffffff816809e7)
3 drops at sk_stream_kill_queues+50 (0xffffffff81687860)
2 drops at unix_stream_connect+2bc (0xffffffff8175a05c)
2 drops at sk_stream_kill_queues+50 (0xffffffff81687860)
1 drops at tcp_v4_rcv+147 (0xffffffff8170b737)
2 drops at sk_stream_kill_queues+50 (0xffffffff81687860)

2. Monitoring kfree_skb events with perf

# perf record -g -a -e skb:kfree_skb
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.212 MB perf.data (388 samples) ]
# perf script
containerd 93829 [031] 951470.340275: skb:kfree_skb: skbaddr=0xffff8827bfced700 protocol=0 location=0xffffffff8175a05c
            7fff8168279b kfree_skb ([kernel.kallsyms])
            7fff8175c05c unix_stream_connect ([kernel.kallsyms])
            7fff8167650f SYSC_connect ([kernel.kallsyms])
            7fff8167818e sys_connect ([kernel.kallsyms])
            7fff81005959 do_syscall_64 ([kernel.kallsyms])
            7fff81802081 entry_SYSCALL_64_after_hwframe ([kernel.kallsyms])
                   f908d __GI___libc_connect (/usr/lib64/libc-2.17.so)
                  13077d __nscd_get_mapping (/usr/lib64/libc-2.17.so)
                  130c7c __nscd_get_map_ref (/usr/lib64/libc-2.17.so)
                       0 [unknown] ([unknown])
containerd 93829 [031] 951470.340306: skb:kfree_skb: skbaddr=0xffff8827bfcec500 protocol=0 location=0xffffffff8175a05c
            7fff8168279b kfree_skb ([kernel.kallsyms])
            7fff8175c05c unix_stream_connect ([kernel.kallsyms])
            7fff8167650f SYSC_connect ([kernel.kallsyms])
            7fff8167818e sys_connect ([kernel.kallsyms])
            7fff81005959 do_syscall_64 ([kernel.kallsyms])
            7fff81802081 entry_SYSCALL_64_after_hwframe ([kernel.kallsyms])
                   f908d __GI___libc_connect (/usr/lib64/libc-2.17.so)
                  130ebe __nscd_open_socket (/usr/lib64/libc-2.17.so)

3. tcpdrop
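
tcpdrop is one of the bcc/eBPF tools; it traces the kernel’s TCP drop path and prints the socket details plus the kernel stack for every dropped TCP packet. The install path varies by distribution, so the one below is an assumption:

# /usr/share/bcc/tools/tcpdrop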

TIME     PID    IP SADDR:SPORT          > DADDR:DPORT          STATE (FLAGS)
05:46:07 82093  4  10.74.40.245:50010   > 10.74.40.245:58484   ESTABLISHED (ACK)
    tcp_drop+0x1
    tcp_rcv_established+0x1d5
    tcp_v4_do_rcv+0x141
    tcp_v4_rcv+0x9b8
    ip_local_deliver_finish+0x9b
    ip_local_deliver+0x6f
    ip_rcv_finish+0x124
    ip_rcv+0x291
    __netif_receive_skb_core+0x554
    __netif_receive_skb+0x18
    process_backlog+0xba
    net_rx_action+0x265
    __softirqentry_text_start+0xf2
    irq_exit+0xb6
    xen_evtchn_do_upcall+0x30
    xen_hvm_callback_vector+0x1af
05:46:07 85153  4  10.74.40.245:50010   > 10.74.40.245:58446   ESTABLISHED (ACK)
    tcp_drop+0x1
    tcp_rcv_established+0x1d5
    tcp_v4_do_rcv+0x141
    tcp_v4_rcv+0x9b8
    ip_local_deliver_finish+0x9b
    ip_local_deliver+0x6f
    ip_rcv_finish+0x124
    ip_rcv+0x291
    __netif_receive_skb_core+0x554
    __netif_receive_skb+0x18
    process_backlog+0xba
    net_rx_action+0x265
    __softirqentry_text_start+0xf2
    irq_exit+0xb6
    xen_evtchn_do_upcall+0x30
    xen_hvm_callback_vector+0x1af

