Recently I’ve been troubleshooting a number of network issues, such as connection timeouts, read timeouts, and packet loss, and I wanted to organize what I learned to share with the team and other developers.
Let’s first look at the process of receiving data packets in a Linux system:
1. The network card receives the data packet.
2. The data packet is transferred from the network card hardware buffer to the server memory.
3. The kernel is notified to process it.
4. The packet is processed layer by layer through the TCP/IP protocol.
5. The application reads from the socket buffer using read().
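Each of these stages can be observed from user space. Here is a rough mapping from stage to tool (the stage boundaries are approximate, and the exact counters are explained in the sections below):

$ ethtool -S eth0                # stages 1-2: NIC hardware and ring buffer counters
$ cat /proc/interrupts           # stage 3: how the NIC interrupts are distributed across CPU cores
$ cat /proc/net/softnet_stat     # between stages 3 and 4: per-CPU packets processed/dropped before the protocol stack
$ ss -ntmp                       # stage 5: per-socket receive/send buffer usage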
How to check for packet loss in Linux?
You can check it via the output of ifconfig:
# ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.5.224.27  netmask 255.255.255.0  broadcast 10.5.224.255
        inet6 fe80::5054:ff:fea4:44ae  prefixlen 64  scopeid 0x20<link>
        ether 52:54:00:a4:44:ae  txqueuelen 1000  (Ethernet)
        RX packets 9525661556  bytes 10963926751740 (9.9 TiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 8801210220  bytes 12331600148587 (11.2 TiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
RX (receive) represents received packets, while TX (transmit) represents transmitted packets.
RX errors: The total number of receive errors, including too-long-frame errors, Ring Buffer overflows, CRC checksum errors, frame synchronization errors, FIFO overruns, and missed packets.
RX dropped: Indicates that packets have entered the Ring Buffer but were dropped during the copying process to memory due to system reasons such as insufficient memory.
RX overruns: Indicates FIFO overruns, which happen when the Ring Buffer (also known as the Driver Queue) receives more I/O than the kernel can consume; the Ring Buffer here refers to the buffer that is filled before an IRQ is raised. An increase in overruns means packets are being dropped at the network card’s physical layer before they ever reach the Ring Buffer, and a CPU that cannot service the NIC interrupts in time is one cause of the Ring Buffer filling up. That was the case on the problematic machine I was troubleshooting: the interrupts were all concentrated on core0, which led to the packet loss.
RX frame: Represents misaligned frames.
For TX (transmit), the increase in the above counters is mainly due to aborted transmissions, errors caused by carrier issues, FIFO errors, heartbeat errors, and window errors. Collisions indicate transmission interruptions caused by CSMA/CD.
Difference between dropped and overruns
Dropped: Indicates that the packet has entered the network card’s receive FIFO queue and has begun to be processed by the system interrupt for copying (from the network card’s FIFO queue to system memory). However, due to system reasons (e.g., insufficient memory), the packet is dropped, meaning it is discarded by the Linux system.
Overruns: Indicates that the packet was dropped before entering the network card’s receive FIFO queue, meaning the FIFO was full. The FIFO may be full because the system is too busy to respond to network card interrupts in time, causing packets in the network card to not be copied to system memory promptly. A full FIFO prevents subsequent packets from being accepted, resulting in the packet being discarded by the network card hardware. Therefore, if overruns are non-zero, it is advisable to check the CPU load and interrupt handling.
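Since both counters ultimately point back at interrupt handling, the first things to look at are how the NIC interrupts are spread across cores and whether any core is saturated. A minimal sketch (the interrupt names depend on the driver, mpstat comes from the sysstat package, and <IRQ_NUMBER> is a placeholder rather than a value from this machine):

$ cat /proc/interrupts | grep eth0    # are the NIC queue interrupts spread across CPUs?
$ mpstat -P ALL 1 3                   # per-core utilization, including %soft (softirq time)
# To pin an IRQ to a particular core by hand, write a CPU bitmask to its smp_affinity file
# (mask 2 = CPU1; replace <IRQ_NUMBER> with the IRQ shown in /proc/interrupts):
$ echo 2 > /proc/irq/<IRQ_NUMBER>/smp_affinity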
netstat -i will also provide information on the number of packets sent and received, as well as packet loss for each network interface.
# netstat -i
Kernel Interface table
Iface      MTU    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0      1500 9528312730      0      0 0      8803615650      0      0      0 BMRU
Common Causes of Network Card Packet Loss in Linux
Based on my experience, common causes of packet loss in Linux network cards include Ring Buffer overflow, netdev_max_backlog overflow, Socket Buffer overflow, PAWS, etc.
Ring Buffer Overflow
If there are no issues with the hardware or driver, packet loss on the network card is generally due to a buffer (ring buffer) that is set too small. When the rate at which network data packets arrive (produce) is faster than the rate at which the kernel processes (consumes) them, the Ring Buffer will quickly fill up, and new incoming packets will be dropped.
You can view statistics on packets dropped because the Ring Buffer was full using ethtool or /proc/net/dev; these statistics are marked with fifo:

# ethtool -S eth0 | grep rx_fifo
rx_fifo_errors: 0
# cat /proc/net/dev
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth0: 10967216557060 9528860597    0    0    0     0          0         0 12336087749362 8804108661    0    0    0     0       0          0
If you notice that the fifo count for a particular network card on the server is continuously increasing, you should check if CPU interrupts are evenly distributed. You can also try increasing the size of the Ring Buffer. ethtool can be used to view the maximum value of the Ring Buffer for the network card device and to modify the current Ring Buffer settings.
# View the maximum values and current settings of the Ring Buffer for eth0
$ ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Current hardware settings:
RX:             1024
RX Mini:        0
RX Jumbo:       0
TX:             1024
# Modify the receive and transmit hardware buffer sizes for eth0
$ ethtool -G eth0 rx 4096 tx 4096
Pre-set maximums:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Current hardware settings:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
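After resizing the Ring Buffer, it is worth confirming that the fifo/drop counters have actually stopped climbing. A simple sketch for watching them (counter names vary from driver to driver):

$ watch -d -n 1 'ethtool -S eth0 | grep -E "fifo|drop"'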
netdev_max_backlog Overflow
netdev_max_backlog sets the length of the queue where packets are held after being taken off the NIC and before being processed by the protocol stack (IP, TCP, and so on). Each CPU core has its own backlog queue. As with the Ring Buffer, when packets arrive faster than the protocol stack can process them, the CPU’s backlog queue grows, and once it reaches the configured netdev_max_backlog value, new packets are dropped.
# cat /proc/net/softnet_stat
2e8f1058 00000000 000000ef 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
0db6297e 00000000 00000035 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
09d4a634 00000000 00000010 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
0773e4f1 00000000 00000005 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Each row holds the statistics for one CPU core, starting from CPU0 and going down in order, and each column is a different counter: the first column is the total number of packets received by the interrupt handler, and the second column is the total number of packets dropped because the netdev_max_backlog queue overflowed. In the output above the second column is zero on every core, so this particular server has not dropped packets due to backlog overflow; the non-zero third column counts time_squeeze events (the softirq budget ran out), not drops.
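Note that the values in /proc/net/softnet_stat are hexadecimal. A small sketch for printing the second column as per-CPU drop counts (assumes GNU awk for strtonum(); the column layout can differ slightly between kernel versions):

$ awk '{printf "CPU%-3d dropped: %d\n", NR-1, strtonum("0x" $2)}' /proc/net/softnet_stat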
The default value for netdev_max_backlog is 1000. On high-speed links, the second column in the statistics may show non-zero values. This can be resolved by modifying the kernel parameter net.core.netdev_max_backlog:
sysctl -w net.core.netdev_max_backlog=2000
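sysctl -w only changes the running kernel. To keep the setting across reboots it also has to go into the persistent configuration (a sketch assuming /etc/sysctl.conf is used; a drop-in file under /etc/sysctl.d/ works the same way):

echo "net.core.netdev_max_backlog = 2000" >> /etc/sysctl.conf
sysctl -p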
Socket Buffer Overflow
In the Linux kernel, sockets hide the differences between the various underlying protocols and give applications a unified access interface. Each socket has a read (receive) buffer and a write (send) buffer.
The read buffer caches data sent from the remote end. If the read buffer is full, no new data can be received.
The write buffer caches data that is to be sent out. If the write buffer is full, the write operations of the application will be blocked.
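The relevant limits, and the counters that hint at socket buffer pressure, can be checked directly (a sketch; the exact wording of the netstat -s lines varies by kernel version):

$ sysctl net.core.rmem_max net.core.wmem_max              # global receive/send buffer ceilings (bytes)
$ sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem              # TCP buffer sizes: min / default / max
$ netstat -s | grep -i -e prune -e collapse -e overflow   # drops and trims caused by full buffers/queues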
For TCP listening sockets there is also the half-open (SYN) connection queue. When it is full, if the syncookie mechanism is enabled, SYN packets are not simply discarded; instead, a SYN+ACK carrying a syncookie is sent back, which is designed to keep the service available to normal requests during a SYN flood attack.
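Whether syncookies are enabled and how large the half-open queue is can both be checked with sysctl, and the related drop counters appear in netstat -s (a sketch; counter wording varies by kernel version):

$ sysctl net.ipv4.tcp_syncookies            # 1 = syncookies enabled
$ sysctl net.ipv4.tcp_max_syn_backlog       # size of the half-open (SYN) connection queue
$ netstat -s | grep -i -e "SYNs to LISTEN" -e "listen queue"   # SYN/accept queue drop counters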
PAWS
PAWS stands for Protect Against Wrapped Sequence numbers. Its purpose is to deal with the fact that, at high bandwidth, TCP sequence numbers can wrap around and be reused within a single session.
Consider this scenario: a packet A1 sent by the client with sequence number A gets “lost” in the network and does not reach the server in time. The client times out and retransmits a packet A2 with the same sequence number A. Suppose the bandwidth is high enough that the sequence number space is exhausted and the number A wraps around and is reused. The server is now waiting for a new packet A3 carrying sequence number A when, coincidentally, the long-delayed packet A1 finally arrives. If the server judged a packet’s validity by the sequence number A alone, it would pass the wrong data to the user-space program and cause it to misbehave.
PAWS solves this problem by relying on the TCP timestamp option. The underlying observation is that, in a normal TCP flow, the timestamps of received TCP packets should be monotonically non-decreasing, so any packet whose timestamp is smaller than the highest timestamp already processed for that flow can be treated as a delayed duplicate and discarded. In the example above, by the time the delayed packet A1 arrives, the server has already processed later packets whose timestamps are larger, so A1’s timestamp will necessarily be smaller; the server therefore discards the late A1 and keeps waiting for the correct packet.
The key to implementing the PAWS mechanism is that the kernel keeps the most recent timestamp seen on each connection. With a small extension, the same idea can be used to speed up recycling of the server’s TIME_WAIT connections.
The TIME_WAIT state is the final state entered by the side that initiates the close during the four-way handshake that terminates a TCP connection. It normally has to be held for 2*MSL (twice the Maximum Segment Lifetime) and serves two purposes:
1. Reliably implementing the closure of a TCP full-duplex connection: During the four-way handshake process of closing the connection, the final ACK is sent by the party that initiates the connection closure (referred to as A). If this ACK is lost, the other party (referred to as B) will retransmit the FIN. If A does not maintain the TIME_WAIT state and directly enters the CLOSED state, it cannot retransmit the ACK, and B’s connection cannot be reliably released in a timely manner.
2. Waiting for “lost” duplicate packets in the network to expire: If a packet from A is “lost” and does not reach B in time, A retransmits it. After A and B finish transmitting and disconnect, if A does not stay in TIME_WAIT for 2*MSL, a new connection using the same source and destination ports could be established between A and B, and the “lost” packet from the previous connection might then arrive and be processed by B, causing an error. Holding the state for 2*MSL ensures that packets from the previous connection have disappeared from the network.
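To get a feel for how many connections are currently parked in TIME_WAIT on a machine, a quick sketch:

$ ss -s                                  # socket summary, including the timewait count
$ ss -tan state time-wait | wc -l        # TCP connections in TIME_WAIT (plus one header line)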
Connections in the TIME_WAIT state need to consume server memory resources to be maintained. The Linux kernel provides a parameter, tcp_tw_recycle, to control the rapid recycling of the TIME_WAIT state. The theoretical basis is:
Based on the theory of PAWS, if the kernel retains the most recent timestamp per host, and timestamps are compared when receiving packets, the second issue that TIME_WAIT aims to resolve—treating packets from a previous connection as valid in a new connection—can be avoided. Thus, there is no need to maintain the TIME_WAIT state for 2*MSL, but only to wait for sufficient RTO (Retransmission Timeout) to handle ACK retransmissions, achieving rapid recycling of TIME_WAIT state connections.
However, this theory presents new problems when multiple clients access the server using NAT: The timestamps of multiple clients behind the same NAT are difficult to keep consistent (the timestamp mechanism uses the system boot relative time). For the server, connections established by two clients each appear as two connections from the same peer IP. According to the per-host most recent timestamp record, it will be updated to the larger timestamp of the two client hosts, causing packets from the client with the relatively smaller timestamp to be discarded as expired duplicates by the server.
Using netstat, you can obtain statistics on packets discarded due to PAWS mechanism timestamp validation:
# netstat -s | grep -e "passive connections rejected because of time stamp" -e "packets rejects in established connections because of timestamp"
    387158 passive connections rejected because of time stamp
    825313 packets rejects in established connections because of timestamp
Use sysctl to check whether tcp_tw_recycle and tcp_timestamps are enabled:
$ sysctl net.ipv4.tcp_tw_recycle
net.ipv4.tcp_tw_recycle = 1
$ sysctl net.ipv4.tcp_timestamps
net.ipv4.tcp_timestamps = 1
If the machine is serving as a server and clients are known to access it through NAT, or if there is a Layer 7 forwarding device that rewrites the client source IP, tcp_tw_recycle should not be enabled. Timestamps, however, are relied on by mechanisms other than tcp_tw_recycle, so it is recommended to keep them enabled:
sysctl -w net.ipv4.tcp_tw_recycle=0
sysctl -w net.ipv4.tcp_timestamps=1
Where were the packets dropped?
1. Dropwatch
# dropwatch -l kas
Initalizing kallsyms db
dropwatch> start
Enabling monitoring...
Kernel monitoring activated.
Issue Ctrl-C to stop monitoring
1 drops at sk_stream_kill_queues+50 (0xffffffff81687860)
1 drops at tcp_v4_rcv+147 (0xffffffff8170b737)
1 drops at __brk_limit+1de1308c (0xffffffffa052308c)
1 drops at ip_rcv_finish+1b8 (0xffffffff816e3348)
1 drops at skb_queue_purge+17 (0xffffffff816809e7)
3 drops at sk_stream_kill_queues+50 (0xffffffff81687860)
2 drops at unix_stream_connect+2bc (0xffffffff8175a05c)
2 drops at sk_stream_kill_queues+50 (0xffffffff81687860)
1 drops at tcp_v4_rcv+147 (0xffffffff8170b737)
2 drops at sk_stream_kill_queues+50 (0xffffffff81687860)
2. Monitoring kfree_skb events with perf
# perf record -g -a -e skb:kfree_skb
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.212 MB perf.data (388 samples) ]
# perf script
containerd 93829 [031] 951470.340275: skb:kfree_skb: skbaddr=0xffff8827bfced700 protocol=0 location=0xffffffff8175a05c
            7fff8168279b kfree_skb ([kernel.kallsyms])
            7fff8175c05c unix_stream_connect ([kernel.kallsyms])
            7fff8167650f SYSC_connect ([kernel.kallsyms])
            7fff8167818e sys_connect ([kernel.kallsyms])
            7fff81005959 do_syscall_64 ([kernel.kallsyms])
            7fff81802081 entry_SYSCALL_64_after_hwframe ([kernel.kallsyms])
                   f908d __GI___libc_connect (/usr/lib64/libc-2.17.so)
                  13077d __nscd_get_mapping (/usr/lib64/libc-2.17.so)
                  130c7c __nscd_get_map_ref (/usr/lib64/libc-2.17.so)
                       0 [unknown] ([unknown])

containerd 93829 [031] 951470.340306: skb:kfree_skb: skbaddr=0xffff8827bfcec500 protocol=0 location=0xffffffff8175a05c
            7fff8168279b kfree_skb ([kernel.kallsyms])
            7fff8175c05c unix_stream_connect ([kernel.kallsyms])
            7fff8167650f SYSC_connect ([kernel.kallsyms])
            7fff8167818e sys_connect ([kernel.kallsyms])
            7fff81005959 do_syscall_64 ([kernel.kallsyms])
            7fff81802081 entry_SYSCALL_64_after_hwframe ([kernel.kallsyms])
                   f908d __GI___libc_connect (/usr/lib64/libc-2.17.so)
                  130ebe __nscd_open_socket (/usr/lib64/libc-2.17.so)
3. tcpdrop
# tcpdrop
TIME     PID    IP SADDR:SPORT            > DADDR:DPORT            STATE (FLAGS)
05:46:07 82093  4  10.74.40.245:50010     > 10.74.40.245:58484     ESTABLISHED (ACK)
        tcp_drop+0x1
        tcp_rcv_established+0x1d5
        tcp_v4_do_rcv+0x141
        tcp_v4_rcv+0x9b8
        ip_local_deliver_finish+0x9b
        ip_local_deliver+0x6f
        ip_rcv_finish+0x124
        ip_rcv+0x291
        __netif_receive_skb_core+0x554
        __netif_receive_skb+0x18
        process_backlog+0xba
        net_rx_action+0x265
        __softirqentry_text_start+0xf2
        irq_exit+0xb6
        xen_evtchn_do_upcall+0x30
        xen_hvm_callback_vector+0x1af

05:46:07 85153  4  10.74.40.245:50010     > 10.74.40.245:58446     ESTABLISHED (ACK)
        tcp_drop+0x1
        tcp_rcv_established+0x1d5
        tcp_v4_do_rcv+0x141
        tcp_v4_rcv+0x9b8
        ip_local_deliver_finish+0x9b
        ip_local_deliver+0x6f
        ip_rcv_finish+0x124
        ip_rcv+0x291
        __netif_receive_skb_core+0x554
        __netif_receive_skb+0x18
        process_backlog+0xba
        net_rx_action+0x265
        __softirqentry_text_start+0xf2
        irq_exit+0xb6
        xen_evtchn_do_upcall+0x30
        xen_hvm_callback_vector+0x1af