[En-Nut-Discussion] Race condition in TCP statemachine

Sun Jan 12 23:27:48 CET 2014

Hi Harald,

On 12.01.2014 13:21, Harald Kipp wrote:
> Your post looks to me, as if Nut/Net is highly unstable "all over the
> place". :-) My experience shows, that it is rock solid, running nodes
> 7d/24h without any interruption.

;-) Sorry, that was not my intention. In fact it works really reliably.
I only found the bug while doing heavy-duty tests with millions of
connects between two devices. The race condition resulted in an
exception only once every 400,000 to 750,000 connections, and only in
situations where I used short-lived connections.

> Why my experience does not fit with your findings?

I think the reason is that under normal conditions the race condition
does not lead to a bug. It only happens if one tries to destroy exactly
the socket that the TCP state machine loop is currently iterating
over.

Furthermore, it only happens if an external thread is waiting to run
while the TCP state machine is doing its work.
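
To make these two conditions concrete, here is a small single-threaded
simulation. It is only a sketch and not Nut/Net code: sock_t,
other_thread_t and all function names are made up for illustration,
the "other thread" is modelled as a callback at a scheduling point, and
a destroyed socket is only flagged instead of freed so that the demo
stays well defined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a socket; not the Nut/Net TCPSOCKET type. */
typedef struct sock {
    struct sock *next;
    bool alive;          /* cleared instead of calling free() */
} sock_t;

/* Models the external thread that closes a socket. */
typedef struct {
    sock_t **head;       /* head of the socket list */
    sock_t *victim;      /* socket that thread wants to destroy */
    int fire_at;         /* loop iteration at which that thread gets the CPU */
    int step;            /* iterations seen so far */
} other_thread_t;

/* Unlink the victim and mark it dead, as a destroy routine would. */
static void destroy_socket(sock_t **head, sock_t *victim)
{
    assert(head != NULL && victim != NULL);
    for (sock_t **pp = head; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == victim) {
            *pp = victim->next;  /* victim->next stays intact in this demo */
            victim->alive = false;
            return;
        }
    }
}

/* Scheduling point in the loop: the other thread runs only at fire_at. */
static void scheduling_point(other_thread_t *t)
{
    if (t->step++ == t->fire_at)
        destroy_socket(t->head, t->victim);
}

/* State-machine-like loop; counts how often it touched a dead socket. */
static int walk_sockets(sock_t *head, other_thread_t *t)
{
    int stale = 0;
    for (sock_t *s = head; s != NULL; s = s->next) {
        scheduling_point(t);     /* CPU may go to another thread here */
        if (!s->alive)
            stale++;             /* real code would now use freed memory */
    }
    return stale;
}

/* Build the list a -> b -> c and destroy b at the given iteration. */
static int simulate(int fire_at)
{
    sock_t c = { NULL, true }, b = { &c, true }, a = { &b, true };
    sock_t *head = &a;
    other_thread_t t = { &head, &b, fire_at, 0 };
    return walk_sockets(head, &t);
}
```

In this model the loop only touches a dead socket when the destroy hits
the socket it is currently holding (simulate(1)). Destroying the same
socket while the loop is on one of its neighbours (simulate(0) or
simulate(2)) is harmless, which matches how rarely the problem shows up.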

>> NutTcpDestroySocket() is called from different places within the TCP
>> statemachine, as well as indirectly from NutTcpCloseSocket() in some
>> situation.
> 
> As long as it is called in the state machine only, this would not cause
> any problem, right?

Yes, I think you are right. I checked the code again. It looks like the
sock variable is no longer used anywhere in the state machine after any
of the sub-functions has called NutTcpDestroySocket().

> However, I can see one instance, where a socket is destroyed from within
> another context via NutTcpCloseSocket -> NutTcpStateCloseEvent.
> 
> In this case the socket is destroyed only, if no connection had been
> established on this socket.

Yes.

> Now it fits again: In my applications sockets are rarely closed before
> having established at least one connection. In such situations it is
> very unlikely, that the state machine handles existing connections. That
> makes Nut/Net _looking_ highly reliable, although it contained a severe
> race condition bug since its initial release in 2001.

Exactly. There have been one or more places in the state machine where
a scheduling point can give the CPU to a thread that will then close a
socket that has not yet been opened.

I found it when I introduced the TCP connect timeout.

In case of a timeout during connect, the socket will be closed before a
connection was established.

But even then it does not necessarily lead to a bug, as this only
happens if the same socket is currently being processed by the state
machine.

So, as mentioned above, the real bug happened only once every few
hundred thousand connects.

Hopefully the "garbage collector" implementation now fixes this.
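
For readers who do not know the idea: a deferred destroy of this kind
could look roughly like the sketch below. This is not the actual
Nut/Net implementation; gc_sock_t and the gc_sock_*() functions are
hypothetical names, and the sketch assumes cooperative scheduling, so
that setting the flag and running the collector cannot interrupt each
other. Other threads only mark a socket as dead, and the state machine
frees marked sockets at a point where it holds no socket pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical socket type for this sketch only. */
typedef struct gc_sock {
    struct gc_sock *next;
    bool dead;                   /* set by a destroy request, freed later */
} gc_sock_t;

static gc_sock_t *gc_sock_list;  /* hypothetical global socket list */

/* Allocate a socket and link it at the head of the list. */
gc_sock_t *gc_sock_create(void)
{
    gc_sock_t *s = calloc(1, sizeof(*s));
    if (s != NULL) {
        s->next = gc_sock_list;
        gc_sock_list = s;
    }
    return s;
}

/* May be called from any thread: only flags the socket, never frees it. */
void gc_sock_destroy(gc_sock_t *s)
{
    assert(s != NULL);
    s->dead = true;
}

/*
 * Called by the state machine thread only, at a point where it holds no
 * socket pointer (for example before a new pass over the list).
 * Returns the number of sockets released.
 */
int gc_sock_collect(void)
{
    int freed = 0;
    gc_sock_t **pp = &gc_sock_list;

    while (*pp != NULL) {
        gc_sock_t *s = *pp;
        if (s->dead) {
            *pp = s->next;       /* unlink first, then release */
            free(s);
            freed++;
        } else {
            pp = &s->next;
        }
    }
    return freed;
}

/* Number of sockets still linked, including flagged ones. */
int gc_sock_count(void)
{
    int n = 0;
    for (gc_sock_t *s = gc_sock_list; s != NULL; s = s->next)
        n++;
    return n;
}
```

The point of the design is that memory is released in exactly one
place, so a pointer held by the state machine loop stays valid until
the loop itself decides to collect.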

Best regards,

Ole

-- 
kernel concepts GmbH            Tel: +49-271-771091-14
Sieghuetter Hauptweg 48         Mob: +49-177-7420433
D-57072 Siegen
http://www.embedded-it.de
http://www.kernelconcepts.de