[En-Nut-Discussion] tcpsm crash on socket timeouts

Thu Oct 21 11:35:48 CEST 2010

Hi Marti!

Good investigation!

For keeping an eye on the network stack stability I have my EIR running 24/7. Now with the latest changes in the ethernet code over the last few month the need to reset the EIR has decreased from once every 2..3 days to about once every 2..3 weeks. That is a very good improvement. But I'd like to see that there is no reboot needed at all.

As the EIR is connected to a DSL provider that forces a reconnect every 24hours it is likely that there are coming in broken packages at that moment that might cause this race condition you discovered. What is a bit astonishing is the fact, that I expect the ARM7 to enter bus-fault or hard-fault if adressing unsupported memory addresses. But if I remember my last development on ARM7 this CPU is very tolerant. I think if I finished my stack for CortexM3 we would see the Hard-Faults if you try to access a NULL-Pointer.

If you discover more of these or if you find an elegant solution for what you discovered, please send a report here or provide a diff to the tracker at sourceforge.

Best regards
Ulrich

-----Original Message-----
From: Marti Raudsepp <marti at voicecom.ee>
To: Ethernut User Chat (English) <en-nut-discussion at egnite.de>
Cc: Maidu Raudsepp <maidu at voicecom.ee>
Sent: Wed, Oct 20, 2010 12:27 pm
Subject: [En-Nut-Discussion] tcpsm crash on socket timeouts

Hi list,

It seems that there is a race condition in the tcpsm thread when
handling transmit timeouts (and possibly in other connection abort
cases). This problem is only triggered if the TCP-using thread has
higher priority than the 'tcpsm' thread, so a workaround is lowering
your threar priority.

It all starts in NutTcpStateRetranTimeout(), which after
TCP_RETRIES_MAX calls NutTcpAbortSocket()

NutTcpAbortSocket does the following:
    else
        sock->so_state = TCPS_CLOSED;
    NutTcpDiscardBuffers(sock);
    NutEventBroadcast(&sock->so_rx_tq);
    NutEventBroadcast(&sock->so_tx_tq);
    NutEventBroadcast(&sock->so_pc_tq);
    NutEventBroadcast(&sock->so_ac_tq);

Now, if the socket thread has higher priority than tcpsm, the first
NutEventBroadcast will immediately switch away from tcpsm to the
socket thread.

The socket thread receives an error code from NutTcpSend/Receive, so
the most likely action is to call NutTcpCloseSocket()

NutTcpCloseSocket() calls into NutTcpStateCloseEvent() which does:
    switch (sock->so_state) {
    case TCPS_LISTEN:
    case TCPS_SYN_SENT:
    case TCPS_CLOSED:      // so_state was changed to TCPS_CLOSED above
        /*
         * No connection yet, immediately destroy the socket.
         */
        NutTcpDestroySocket(sock); // frees the memory used by sock

Eventually control passes back to tcpsm which is still in
NutTcpAbortSocket() and continues to use the recently freed socket
pointer. Since NutEventBroadcast does indirect function calls, this
can result in executing arbitrary pieces of memory. During our
testing, once it even corrupted the on-chip flash memory and wouldn't
boot anymore!

There are also other code paths that call NutTcpAbortSocket which
might be suspect. Simply changing NutEventBroadcast to
NutEventBroadcastAsync probaly won't work because the callees are
using NutThreadYield.

The right solution seems to be to assign the socket a different state
and let tcpsm free the socket asynchronously.

On the other hand, it may be that there are other cases where sockets
in TCPS_LISTEN/TCPS_SYN_SENT/TCPS_CLOSED are still held by the tcpsm
thread. Maybe freeing memory directly from NutTcpCloseSocket() is the
wrong thing in the first place? This should be verified.

Regards,
Marti Raudsepp
Voicecom OÜ
_______________________________________________
http://lists.egnite.de/mailman/listinfo/en-nut-discussion