katekarin's comments

katekarin · on Nov 22, 2022

My team had a similar issue with the ARP cache on AWS when we used Amazon Linux as the OS for cluster nodes and Debian for the database host. When new tasks started, some of them hit random timeouts connecting to the database.

It turned out that the Debian database host had bad ARP entries (an IP address pointing to a nonexistent MAC address), caused by frequent reuse of the same IP addresses.
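For illustration, here is a minimal Python sketch of what checking for that kind of stale entry could look like. It assumes the cache is read in the /proc/net/arp text layout on Linux, and that you have some known-good IP-to-MAC mapping to compare against (for example, pulled from your cloud inventory). The sample addresses, the `EXPECTED` mapping, and both function names are made up for the example; this is not the tooling we actually used.

```python
def parse_arp_table(text):
    """Return {ip: mac} from text in the /proc/net/arp layout."""
    entries = {}
    for line in text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        ip, _hw_type, _flags, mac, _mask, _device = fields[:6]
        entries[ip] = mac.lower()
    return entries


def find_stale_entries(cached, expected):
    """Return {ip: (cached_mac, expected_mac)} for entries that disagree."""
    stale = {}
    for ip, mac in cached.items():
        want = expected.get(ip)
        if want is not None and mac != want.lower():
            stale[ip] = (mac, want.lower())
    return stale


# Invented sample data: 10.0.1.27 was reused and the cache still holds the old MAC.
SAMPLE = """IP address       HW type     Flags       HW address            Mask     Device
10.0.1.15        0x1         0x2         0a:11:22:33:44:55     *        eth0
10.0.1.27        0x1         0x2         0a:de:ad:be:ef:01     *        eth0
"""

# Hypothetical known-good mapping (in practice this would come from elsewhere).
EXPECTED = {
    "10.0.1.15": "0a:11:22:33:44:55",
    "10.0.1.27": "0a:66:77:88:99:aa",
}

if __name__ == "__main__":
    for ip, (cached_mac, real_mac) in find_stale_entries(
        parse_arp_table(SAMPLE), EXPECTED
    ).items():
        print(f"{ip}: cache has {cached_mac}, expected {real_mac}")
```

On a real host you would pass in the contents of /proc/net/arp instead of `SAMPLE`.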

Debian has a default ARP cache size that's larger than Amazon Linux's (I think the cache is entirely disabled on AL?).

As for the tooling we used to track it down, it was tcpdump. We saw SYNs being sent but no ACKs coming back. A few more tcpdump flags (-e shows the hardware addresses) and we discovered mismatched MAC addresses.
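If you want to automate the "spot the mismatched MAC" step, something like the following Python sketch might help. It assumes text captured from tcpdump with -e and -n, in the common one-line layout `time src_mac > dst_mac, ethertype ..., length N: src_ip.port > dst_ip.port: ...`; that layout can differ between tcpdump versions and options, so treat the regex as an assumption to adjust. The sample lines and the function name are invented for the example. It also only makes sense when both hosts are on the same subnet, since routed traffic shows the gateway's MAC.

```python
import re
from collections import defaultdict

MAC = r"(?:[0-9a-f]{2}:){5}[0-9a-f]{2}"
IPV4 = r"\d+\.\d+\.\d+\.\d+"

# Assumed layout of one `tcpdump -e -n` line for an IPv4 TCP/UDP packet.
FRAME = re.compile(
    rf"(?P<src_mac>{MAC}) > (?P<dst_mac>{MAC}),.*?: "
    rf"(?P<src_ip>{IPV4})\.\d+ > (?P<dst_ip>{IPV4})\.\d+:"
)


def ips_with_multiple_macs(lines):
    """Return {ip: sorted MACs} for IPs paired with more than one MAC."""
    macs_by_ip = defaultdict(set)
    for line in lines:
        m = FRAME.search(line.lower())
        if not m:
            continue  # not an IPv4 line in the assumed layout
        macs_by_ip[m["src_ip"]].add(m["src_mac"])
        macs_by_ip[m["dst_ip"]].add(m["dst_mac"])
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}


# Invented capture: the SYN arrives from the task's real MAC, but the reply
# to the same IP is addressed to a different (stale) MAC.
SAMPLE_LINES = [
    "12:00:00.000001 0a:66:77:88:99:aa > 0a:00:00:00:00:db, ethertype IPv4 "
    "(0x0800), length 74: 10.0.1.27.43210 > 10.0.1.5.5432: Flags [S], seq 1, length 0",
    "12:00:00.000050 0a:00:00:00:00:db > 0a:de:ad:be:ef:01, ethertype IPv4 "
    "(0x0800), length 74: 10.0.1.5.5432 > 10.0.1.27.43210: Flags [S.], seq 1, ack 2, length 0",
]

if __name__ == "__main__":
    for ip, macs in ips_with_multiple_macs(SAMPLE_LINES).items():
        print(f"{ip} seen with multiple MACs: {', '.join(macs)}")
```

Any IP it reports is a candidate for a stale ARP entry on the capturing host, not proof of one.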