

Searching for the cause of dropped packets on Linux.

Hi all, just wanted to share a recent experience I had investigating a packet drop issue on a Linux system. Hope it's useful to anyone facing similar issues.

Recently I was alerted by our monitoring system that it detected network packet loss. The loss was not excessive, only a couple of hundred packets, nothing compared to the billions of packets flowing through our systems. There was no observable degradation of the functionality of the system. The number of lost packets was low and they were TCP packets, so TCP retransmission would overcome the loss. However, packet loss is often an early sign that something is wrong with the systems, so I took the alert seriously and started searching for the cause.

There can be various reasons for packet loss. The network transport can be unreliable so that packet loss is natural, the network link can be congested, or the applications cannot handle the offered load. In this case the network link was reliable fibre and far from congested. The monitoring system indicated that the packet loss was on a Linux box running several server applications that process data sent by our trading machines for analysis by our traders. The packet drops were on incoming traffic. The drop count is shown by the ifconfig command.

It seemed most likely that the applications received more traffic than they could handle. Given that on this box one application was running close to its maximum capacity, I expected that this application could not handle the load during some periods of the day, something we had seen before. So I analyzed the performance metrics of this application. CPU load was not very high and the incoming traffic was not out of the ordinary. All metrics indicated that the application was running normally. Nothing indicated that it was overloaded.

This is a single-threaded application handling approximately 7000 TCP connections. The CPU load showed some spikes, but nothing excessive. Some of the packet loss periods coincided with those spikes, but not all. Also, some periods of high load did not show any packet loss. The same held for the network traffic measured at the NIC level: the periods with dropped packets did not coincide with the periods having the highest packet rate. I also checked the other applications on the box and, as expected, their load was much lower and they did not show any signs of overload.

The network seemed fine and the applications did not seem overloaded, yet packets were dropped. We captured network traffic with tcpdump. Could the monitoring system be wrong? If the monitoring system showed packet loss but the TCP streams in the capture did not, then the monitoring system could be wrong. It could still be that the monitoring system was right, as the packets could be dropped on the machine after they were captured. The capture did show gaps in the TCP sequence numbers. tcpdump may not be able to capture all traffic and can therefore create such gaps by itself. In this case the gaps were real, because later in time the retransmissions for those missing packets could be seen, i.e. the clients sending those packets did not receive acknowledgements and retransmitted the packets. The capture confirmed the alerts from the monitoring system.
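The sequence number check described above (look for gaps, then see whether a later retransmission fills them) can be sketched in a few lines of Python. This is only an illustration, not the tooling I actually used: it assumes the (sequence number, payload length) pairs for one direction of a single TCP stream have already been extracted from the capture in capture order, the names `Gap` and `find_gaps` are made up for this example, and sequence number wraparound is ignored.

```python
from dataclasses import dataclass


@dataclass
class Gap:
    start: int                  # first missing sequence number
    end: int                    # one past the last missing sequence number
    filled_later: bool = False  # True if a later segment covered (part of) the gap


def find_gaps(segments):
    """Find sequence gaps in one direction of one TCP stream.

    segments: iterable of (seq, payload_len) tuples in capture order.
    Simplifications (assumptions of this sketch): no sequence number
    wraparound, and a gap counts as filled as soon as any later segment
    overlaps it, even partially.
    """
    gaps = []
    expected = None  # next sequence number we expect to see
    for seq, length in segments:
        if length == 0:
            continue  # pure ACKs carry no data
        if expected is None:
            expected = seq + length
            continue
        if seq > expected:
            # Data between expected and seq never showed up in the capture (yet).
            gaps.append(Gap(expected, seq))
        elif seq < expected:
            # Older data seen again: a retransmission or a reordered segment.
            for gap in gaps:
                if seq < gap.end and seq + length > gap.start:
                    gap.filled_later = True
        expected = max(expected, seq + length)
    return gaps


# The segment starting at 1200 is missing at first and shows up again at the end.
print(find_gaps([(1000, 100), (1100, 100), (1300, 100), (1400, 100), (1200, 100)]))
# → [Gap(start=1200, end=1300, filled_later=True)]
```

A gap that is filled later suggests the segment really was missing at the capture point and was sent again. A gap that is never filled is more likely to be a packet that the capture itself missed.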

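To follow the interface drop counter mentioned earlier over time instead of rerunning ifconfig by hand, the per-interface counters can also be read from `/proc/net/dev`. The sketch below assumes the usual Linux layout of that file (two header lines, then one line per interface with eight receive fields followed by eight transmit fields); the function names are made up for this example.

```python
def parse_drop_counters(proc_net_dev_text):
    """Return {interface: {"rx_dropped": n, "tx_dropped": n}} from /proc/net/dev text."""
    counters = {}
    # The first two lines are column headers.
    for line in proc_net_dev_text.splitlines()[2:]:
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue  # unexpected layout, skip rather than guess
        counters[name.strip()] = {
            "rx_dropped": int(fields[3]),   # receive: bytes packets errs drop ...
            "tx_dropped": int(fields[11]),  # transmit: bytes packets errs drop ...
        }
    return counters


def read_drop_counters(path="/proc/net/dev"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as f:
        return parse_drop_counters(f.read())
```

Calling `read_drop_counters()` periodically and logging the difference between consecutive `rx_dropped` values gives a timeline of drops that can be lined up with CPU load and packet rate.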