[En-Nut-Discussion] ICMP echo request response failure on ARP cache timeout

Harald Kipp harald.kipp at egnite.de
Fri May 22 17:37:41 CEST 2009


Hi all,

This long post deals with inner details of Nut/OS. If you are not
interested or not able to follow, you can simply ignore it.

The issue I'm describing here has existed at least since the ARP thread
was removed from Nut/OS (Feb. 2005). It is actually no big deal. By
default, an ARP entry times out after 10 minutes; this time is adjustable
in the Configurator. When an ARP cache entry times out, sending an ICMP
(Ping) response fails.

ICMP responses are somewhat special in Nut/OS. Let's look at the normal
sequence when pinging Nut/OS from a PC:

1. Ping application is started on the PC with the IP address of a Nut/OS
node.

2. The PC broadcasts an ARP request to find out the Ethernet MAC address
for the specified IP address.

3. The receiver thread in the Nut/OS NIC driver receives the ARP packet
and calls
3.1 NutEtherInput
3.2 NutArpInput
3.3 NutArpCacheUpdate to add the PC's IP/MAC pair to the ARP cache.
3.4 NutArpOutput
3.5 Send routine of the NIC driver

4. PC receives the ARP response and stores the MAC/IP relation in its
ARP cache.

5. PC sends out the ICMP echo request. The receiver thread in the Nut/OS
NIC driver receives this IP packet and calls
5.1 NutEtherInput
5.2 NutIpInput
5.3 NutIcmpInput
5.4 NutIcmpReflect
5.5 NutIcmpOutput
5.6 NutIpOutput
5.7 NutArpCacheQuery to retrieve the ARP entry cached in step 3.3.
5.8 Send routine of the NIC driver

6. PC receives the response and continues at step 5.
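To make step 5 concrete, here is a minimal C sketch in which hypothetical stubs stand in for NutEtherInput, NutIpInput and the other functions listed in steps 5.1 to 5.8. The stub names, their signatures and the `rx_busy` flag are assumptions for illustration, not Nut/OS code. The sketch only shows that the NIC send routine is reached while the receiver thread is still inside the handler for the incoming packet.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stubs mirroring steps 5.1 to 5.8; they only record the
   order of calls and do not reflect the real Nut/OS signatures. */
static char trace[128];
static int rx_busy;          /* 1 while the receiver thread is inside a handler */
static int sent_while_busy;  /* sends issued from the receiver's own context */

static void hop(const char *name)
{
    if (trace[0] != '\0')
        strcat(trace, ">");
    strcat(trace, name);
}

static void nic_send(void)     { hop("send"); if (rx_busy) sent_while_busy++; } /* 5.8 */
static void arp_lookup(void)   { hop("arpq"); }                                 /* 5.7 */
static void ip_output(void)    { hop("ipout"); arp_lookup(); nic_send(); }      /* 5.6 */
static void icmp_output(void)  { hop("icmpout"); ip_output(); }                 /* 5.5 */
static void icmp_reflect(void) { hop("reflect"); icmp_output(); }               /* 5.4 */
static void icmp_input(void)   { hop("icmpin"); icmp_reflect(); }               /* 5.3 */
static void ip_input(void)     { hop("ipin"); icmp_input(); }                   /* 5.2 */
static void ether_input(void)  { hop("ether"); ip_input(); }                    /* 5.1 */

/* One iteration of the NIC receiver loop handling an ICMP echo request. */
void rx_handle_echo(void)
{
    rx_busy = 1;
    ether_input();   /* the echo reply is transmitted before this returns */
    rx_busy = 0;
}
```

After one call to `rx_handle_echo()`, the trace reads `ether>ipin>icmpin>reflect>icmpout>ipout>arpq>send`, and the single send was issued while the receiver was still busy with the request.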

The interesting step is 3.3, where Nut/OS caches the MAC/IP relation of
the PC. Note that this is done on incoming ARP requests, because a remote
host sending an ARP request most likely intends to talk to us. By caching
the sender's address from such requests, we do not need to send out an
ARP query ourselves when responding to IP requests.
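A minimal sketch of this caching, assuming a tiny fixed-size table: the sender of an incoming ARP request is stored with `arp_cache_update()`, and a later lookup with `arp_cache_query()` fails once the entry is older than the 10-minute default. Both functions, the table layout and the lazy aging on lookup are hypothetical simplifications, not the Nut/OS implementation, which may age entries differently.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified ARP cache; not the Nut/OS data structures. */
#define ARP_CACHE_SIZE 4
#define ARP_MAX_AGE_S  600u   /* 10 minutes, the default mentioned above */

typedef struct {
    uint32_t ip;        /* 0 marks an unused slot */
    uint8_t  mac[6];
    uint32_t stamp;     /* time of the last update, in seconds */
} arp_entry_t;

typedef struct {
    arp_entry_t entry[ARP_CACHE_SIZE];
} arp_cache_t;

/* Called for incoming ARP requests: remember the sender's IP/MAC pair so
   that a later reply to this host needs no ARP query of its own. */
void arp_cache_update(arp_cache_t *c, uint32_t ip, const uint8_t mac[6], uint32_t now)
{
    arp_entry_t *slot = NULL;

    assert(ip != 0);
    for (int i = 0; i < ARP_CACHE_SIZE; i++) {
        if (c->entry[i].ip == ip) {           /* refresh an existing entry */
            slot = &c->entry[i];
            break;
        }
        if (slot == NULL && c->entry[i].ip == 0)
            slot = &c->entry[i];              /* remember the first free slot */
    }
    if (slot == NULL) {                       /* cache full: evict the oldest */
        slot = &c->entry[0];
        for (int i = 1; i < ARP_CACHE_SIZE; i++)
            if (c->entry[i].stamp < slot->stamp)
                slot = &c->entry[i];
    }
    slot->ip = ip;
    memcpy(slot->mac, mac, 6);
    slot->stamp = now;
}

/* Returns 0 and fills mac if a fresh entry exists. An entry that reached
   ARP_MAX_AGE_S is removed here, which is where step 5.7 starts to fail. */
int arp_cache_query(arp_cache_t *c, uint32_t ip, uint8_t mac[6], uint32_t now)
{
    for (int i = 0; i < ARP_CACHE_SIZE; i++) {
        arp_entry_t *e = &c->entry[i];
        if (ip == 0 || e->ip != ip)
            continue;
        if (now - e->stamp >= ARP_MAX_AGE_S) { /* aged out */
            e->ip = 0;
            return -1;
        }
        memcpy(mac, e->mac, 6);
        return 0;
    }
    return -1;
}
```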

However, entries in the ARP cache must be removed when they reach a
specific age; with Nut/OS this happens after 10 minutes by default, as
mentioned above. Once the PC's entry is gone, step 5.7 fails to get the
MAC address from the cache. Instead, Nut/OS sends out an ARP request to
renew the PC's IP/MAC relation. It then waits for the response, but
nothing happens. Why?

Simply because the receiver thread in the NIC driver is itself blocked
in NutArpCacheQuery, so it cannot process the PC's ARP response. After
500 ms (default), NutArpCacheQuery gives up and returns an error, and the
ICMP echo reply is not sent. Back in the NIC receiver loop, the thread
now receives the ARP response from the PC and adds it to the ARP cache.
The next ICMP echo request will be answered again without any problem.
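The lockup can be reproduced with a small single-threaded model of the receiver loop. Everything here is hypothetical: a flag stands in for the PC's ARP cache entry (cleared, as if it had just aged out), a bounded poll loop stands in for the 500 ms wait in NutArpCacheQuery, and the PC's ARP reply is simply appended to the receive queue. Because the waiting code runs inside the loop that drains this queue, the reply can only be handled after the wait has timed out.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical single-threaded model of the NIC receiver loop.
   Names and structure are illustrative only, not Nut/OS code. */
enum pkt { PKT_ICMP_ECHO, PKT_ARP_REPLY };

#define RXQ_LEN 8

typedef struct {
    enum pkt q[RXQ_LEN];
    int head, tail;      /* packets waiting in the receive queue */
    int mac_cached;      /* 1 if the PC's MAC is in the ARP cache */
    int replies_sent;    /* ICMP echo replies actually transmitted */
    int replies_lost;    /* echo replies dropped after the ARP timeout */
} node_t;

static void rx_push(node_t *n, enum pkt p)
{
    assert(n->tail < RXQ_LEN);
    n->q[n->tail++] = p;
}

/* Blocking lookup: on a miss it "sends" an ARP request (the PC's answer
   lands in the receive queue) and then waits. Nothing drains the queue
   during the wait, because only the loop that called us does that. */
static int arp_query_blocking(node_t *n, int max_polls)
{
    if (n->mac_cached)
        return 0;
    rx_push(n, PKT_ARP_REPLY);              /* PC answers our ARP request */
    for (int i = 0; i < max_polls; i++)     /* stands in for the 500 ms wait */
        if (n->mac_cached)
            return 0;
    return -1;                              /* timeout: echo reply is lost */
}

/* The receiver loop: handles every queued packet in its own context. */
void rx_loop(node_t *n)
{
    while (n->head < n->tail) {
        enum pkt p = n->q[n->head++];
        if (p == PKT_ARP_REPLY)
            n->mac_cached = 1;              /* processed only after the timeout */
        else if (arp_query_blocking(n, 5) == 0)
            n->replies_sent++;
        else
            n->replies_lost++;
    }
}
```

Feeding one echo request into a node whose cache entry has expired loses that reply but leaves the PC's address cached afterwards; an echo request pushed later is answered normally, which matches the "one lost ping every 10 minutes" symptom.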

Why is this specific to ICMP? Well, sending out responses internally
(without application intervention) is only done by ICMP and TCP. The TCP
state machine runs in its own thread: all incoming packets are stored in
a queue, and the TCP state machine thread processes this queue and calls
the NIC send routine, if required. For the other protocols, the
application thread calls the send routine. Only with ICMP does the NIC
receiver thread itself call the send routine, and while the send routine
is blocked, no incoming packets are processed.

How to solve this? I'm not sure. I can think of the following:

A. Re-implementing the ARP thread is certainly not the best idea, because
threads consume a lot of RAM for their stacks.

B. In many other implementations, all outgoing packets waiting for ARP
responses are queued and sent out as soon as the related ARP response
arrives. In certain situations this may consume even more RAM than the
threaded solution. Furthermore, the queue must be checked at regular
intervals to remove packets for which no response was received within a
given time.
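Option B could look roughly like the following sketch. It is a hypothetical table of parked packets, not an existing Nut/OS API, and the 500 ms timeout is an assumption borrowed from the default wait mentioned above. `pending_expire()` has to be called periodically, which is the extra housekeeping described in B, and every parked packet would keep its buffer allocated until it is sent or dropped.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical queue of outgoing packets parked while an ARP request is
   outstanding; a sketch of option B, not an existing Nut/OS API. */
#define PENDING_MAX        4
#define PENDING_TIMEOUT_MS 500u   /* assumed, mirrors the 500 ms default wait */

typedef struct {
    int      used;
    uint32_t dst_ip;
    uint32_t queued_at;   /* ms timestamp when the packet was parked */
    /* a real implementation would also hold the network buffer here */
} pending_t;

static pending_t pending[PENDING_MAX];

/* Park a packet instead of blocking the caller. Returns -1 if the table
   is full, which is where this approach starts to cost RAM. */
int pending_add(uint32_t dst_ip, uint32_t now)
{
    for (int i = 0; i < PENDING_MAX; i++) {
        if (!pending[i].used) {
            pending[i].used = 1;
            pending[i].dst_ip = dst_ip;
            pending[i].queued_at = now;
            return 0;
        }
    }
    return -1;
}

/* Called from ARP input when a reply arrives: release everything waiting
   for this address. Returns the number of packets released. */
int pending_on_arp_reply(uint32_t ip)
{
    int sent = 0;
    for (int i = 0; i < PENDING_MAX; i++) {
        if (pending[i].used && pending[i].dst_ip == ip) {
            /* the NIC send routine would be called here */
            pending[i].used = 0;
            sent++;
        }
    }
    return sent;
}

/* Must run at regular intervals: drop packets whose ARP reply never came. */
int pending_expire(uint32_t now)
{
    int dropped = 0;
    for (int i = 0; i < PENDING_MAX; i++) {
        if (pending[i].used && now - pending[i].queued_at >= PENDING_TIMEOUT_MS) {
            pending[i].used = 0;
            dropped++;
        }
    }
    return dropped;
}
```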

To state it once again: losing a single ICMP packet should not be
considered an error. If, for example, an ARP packet gets lost, the same
will happen in any implementation that follows the specifications. By
definition, the upper layer is responsible for retries, not ARP. In the
initial implementation, Nut/OS did ARP retries, which was miserably wrong.

However, losing a ping reply every 10 minutes doesn't look good.

Harald
