[En-Nut-Discussion] ICMP echo request response failure on ARP cache timeout

Fri May 22 19:59:21 CEST 2009

Hey Harald,

It's interesting you ran across this now. I've had issues with the same
architecture of the IP code. With Ethernet it's almost unnoticeable that the
rx thread is sending something, even with an ARP request and reply in the
middle of the mix, but with PPP on AVR it becomes more noticeable (the rx and
tx buffers of the ahdlc device are small and the pipe is much easier to
saturate). While PPP doesn't use ARP, you're still blocking the reception of
more than 256 bytes after the reception of the ping request.
I have another serially accessible network interface which behaves much worse
under this type of activity (rx thread sending).

I toyed with the idea of introducing tx threads per network device (didn't
this exist a long time ago?), but then threads are unlimited in the speed
with which they may add to the tx queue. To get around this I thought about
adding an event to netbufs, which the thread that generated the packet could
(but wouldn't have to) wait on until the tx thread signaled it. This event
would be broadcast posted before a netbuf was freed. This method would have
required a lot of changes to the way large portions of the networking code
and drivers work and to the programming interfaces. It would also introduce
the possibility of holding a NETBUF* without knowing whether it had been
freed or not (btw, the "who frees a netbuf" question within nut/net is very
confusing).
I liked this idea for its ability to keep any and all network interfaces
sending at top speed as long as there was data to send. That would, however,
need the ARP code reworked so that an IP packet could be tied to a list of
outstanding ARP requests and either moved to a ready-to-send queue when the
request was filled or dropped if it wasn't, so that the eth_tx thread
wouldn't have to wait for ARP.
I think that there are other subtle implementation issues with this as well.
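
A rough sketch of the ARP part of that idea, in plain C and not tied to the
real Nut/OS code: packets that need an unresolved MAC address are parked in a
small table, released when the matching ARP reply arrives, and dropped when
the wait expires. All names here (pending_park, pending_resolved,
pending_expire, PENDING_MAX) are made up for illustration, pkt_id merely
stands in for a NETBUF pointer, and the 500 ms wait is borrowed from the
default mentioned in the quoted post below.

```c
#include <stdint.h>
#include <stddef.h>

#define PENDING_MAX 4     /* hypothetical limit on parked packets */
#define ARP_WAIT_MS 500   /* assumed wait before a parked packet is dropped */

typedef struct {
    uint32_t dest_ip;      /* IP address whose MAC we are waiting for */
    uint32_t deadline_ms;  /* time at which the packet is given up */
    int      pkt_id;       /* stands in for a NETBUF pointer */
    int      in_use;
} pending_t;

static pending_t pending[PENDING_MAX];
int sent_count;      /* packets handed on to the (imaginary) driver */
int dropped_count;   /* packets dropped because no ARP reply came */

/* Park a packet instead of blocking. Returns 0 on success, -1 if full. */
int pending_park(uint32_t dest_ip, int pkt_id, uint32_t now_ms)
{
    for (size_t i = 0; i < PENDING_MAX; i++) {
        if (!pending[i].in_use) {
            pending[i].dest_ip = dest_ip;
            pending[i].deadline_ms = now_ms + ARP_WAIT_MS;
            pending[i].pkt_id = pkt_id;
            pending[i].in_use = 1;
            return 0;
        }
    }
    return -1;
}

/* Call when an ARP reply for ip arrives. Returns the number released. */
int pending_resolved(uint32_t ip)
{
    int n = 0;
    for (size_t i = 0; i < PENDING_MAX; i++) {
        if (pending[i].in_use && pending[i].dest_ip == ip) {
            pending[i].in_use = 0;   /* a real version would send here */
            sent_count++;
            n++;
        }
    }
    return n;
}

/* Call periodically. Drops packets whose deadline has passed. */
int pending_expire(uint32_t now_ms)
{
    int n = 0;
    for (size_t i = 0; i < PENDING_MAX; i++) {
        /* signed difference keeps the comparison valid across wraparound */
        if (pending[i].in_use &&
            (int32_t)(now_ms - pending[i].deadline_ms) >= 0) {
            pending[i].in_use = 0;   /* a real version would free the buffer */
            dropped_count++;
            n++;
        }
    }
    return n;
}
```

The fixed table size is what bounds the RAM cost. The sketch says nothing
about who owns the buffer while it is parked, which is exactly the "who frees
a netbuf" problem.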

Another idea was for general purpose helper threads which could be generated
on demand and do the work needed when blocking the current thread is
unacceptable. The interface to this got tricky, and I never did it. Also I
was worried about how to throttle this while not overly hindering its
usefulness.
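
One possible shape for the throttling, again only a sketch with invented
names (job_defer, job_run_one, JOBQ_SIZE) and no real threads, locking or
wakeup events: the thread that must not block puts a job into a fixed-size
ring and returns immediately, and a helper thread would call job_run_one in
its loop. When the ring is full the caller simply gets an error and drops
the work, so the queue length is the throttle.

```c
#include <stddef.h>

#define JOBQ_SIZE 4   /* hypothetical bound; this is the throttle */

typedef void (*job_fn)(void *arg);

typedef struct {
    job_fn fn;
    void  *arg;
} job_t;

static job_t  jobq[JOBQ_SIZE];
static size_t jobq_head;    /* index of the oldest queued job */
static size_t jobq_count;   /* number of queued jobs */

/* Called by the thread that must not block (e.g. the rx thread).
 * Returns 0 if the job was queued, -1 if the queue is full. */
int job_defer(job_fn fn, void *arg)
{
    if (jobq_count == JOBQ_SIZE) {
        return -1;                       /* throttled: caller drops the work */
    }
    size_t tail = (jobq_head + jobq_count) % JOBQ_SIZE;
    jobq[tail].fn = fn;
    jobq[tail].arg = arg;
    jobq_count++;
    return 0;
}

/* One iteration of the helper thread's loop.
 * Returns 1 if a job was run, 0 if the queue was empty. */
int job_run_one(void)
{
    if (jobq_count == 0) {
        return 0;
    }
    job_t j = jobq[jobq_head];
    jobq_head = (jobq_head + 1) % JOBQ_SIZE;
    jobq_count--;
    j.fn(j.arg);                         /* this call is allowed to block */
    return 1;
}

/* Example job: stands in for "send the echo reply, waiting for ARP if
 * needed". Here it only counts how often it ran. */
void example_reply_job(void *arg)
{
    (*(int *)arg)++;
}
```

Dropping on a full queue matches what happens today when the reply cannot be
sent. The difference is that the receiver keeps running in the meantime.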

So far the need to ping has not been great enough to put that much effort
into it.

Networking code is hard.

Nathan

On Fri, May 22, 2009 at 11:37 AM, Harald Kipp <harald.kipp at egnite.de> wrote:

> Hi all,
>
> This long post deals with inner details of Nut/OS. If you are not
> interested or not able to follow, you can simply ignore it.
>
> The issue I'm describing here has existed at least since the ARP thread
> was removed from Nut/OS (Feb. 2005). It is actually no big deal. By
> default an ARP entry times out after 10 minutes. This time is adjustable
> in the Configurator. When an ARP cache entry times out, sending an ICMP
> (Ping) response fails.
>
> ICMP responses are somewhat special in Nut/OS. Let's look at the normal
> sequence when pinging Nut/OS from a PC:
>
> 1. Ping application is started on the PC with the IP address of a Nut/OS
> node.
>
> 2. The PC broadcasts an ARP request to find out the Ethernet MAC address
> for the specified IP address.
>
> 3. The receiver thread in the Nut/OS NIC driver receives the ARP packet
> and calls
> 3.1 NutEtherInput
> 3.2 NutArpInput
> 3.3 NutArpCacheUpdate to add the PC's IP to the ARP cache.
> 3.4 NutArpOutput
> 3.5 Send routine of the NIC driver
>
> 4. PC receives the ARP response and stores the MAC/IP relation in its
> ARP cache.
>
> 5. PC sends out the ICMP echo request.
>
> 6. The receiver thread in the Nut/OS NIC driver receives the ICMP echo
> request and calls
> 6.1 NutEtherInput
> 6.2 NutIpInput
> 6.3 NutIcmpInput
> 6.4 NutIcmpReflect
> 6.5 NutIcmpOutput
> 6.6 NutIpOutput
> 6.7 NutArpCacheQuery to retrieve the ARP entry cached in step 3.3.
> 6.8 Send routine of the NIC driver
>
> 7. PC receives the response and continues at step 5.
>
> The interesting step is 3.3, where Nut/OS caches the MAC/IP relation of
> the PC. Note that this is done on ARP requests, because it is most
> likely that a remote host sending an ARP request intends to talk to
> us. By caching requests we do not need to send out an ARP query ourselves
> when responding to IP requests.
>
> However, entries in the ARP cache must be removed when they reach a
> specific age, with Nut/OS after 10 minutes by default, as mentioned
> above. In this case step 6.7 fails to get the MAC address from the
> cache. Instead Nut/OS sends out an ARP request to renew the PC's IP/MAC
> relation. It will then wait for the response, but nothing happens. Why?
>
> Simply because the receiver thread in the NIC driver itself is blocked
> in NutArpCacheQuery. After 500ms (default) NutArpCacheQuery gives up and
> returns an error, being unable to send out the ICMP echo. Back in the
> NIC receiver loop it now receives the ARP response from the PC, adding
> it to the ARP cache. The next ICMP echo request will be answered again
> without any problem.
>
> Why is this ICMP specific? Well, sending out responses internally
> (without application intervention) is only done by ICMP and TCP. The TCP
> state machine runs its own thread. All incoming packets are stored in a
> queue. The TCP state machine thread processes this queue and calls the
> NIC send routine, if required. The same is true for other protocols,
> where the application thread actually calls the send routine. Only in
> ICMP the NIC receiver itself calls the send routine. If the send routine
> is blocked, no incoming packets are processed.
>
> How to solve this? Not sure. I can think of the following:
>
> A. Re-implementing the ARP thread is for sure not the best idea, because
> threads consume a lot of RAM for their stack.
>
> B. In many other implementations all outgoing packets waiting for ARP
> responses are queued. They are sent out as soon as the related ARP
> response arrives. In certain situations this may consume even more RAM
> than the threaded solution. Furthermore, the queue must be checked in
> regular time intervals to remove packets, for which no response was
> received within a specific time.
>
> To state it once again: Losing a single ICMP packet should not be
> considered an error. If, for example, an ARP packet gets lost, the same
> will happen in any implementation that follows the specifications.  By
> definition, the upper layer is responsible for retries, not ARP. In the
> initial implementation Nut/OS did ARP retries, which was miserably wrong.
>
> However, losing a ping reply every 10 minutes doesn't look good.
>
> Harald