[En-Nut-Discussion] Race condition in TCP statemachine

Ole Reinhardt ole.reinhardt at embedded-it.de
Fri Jan 10 18:26:43 CET 2014


Hi all,

while implementing a TCP connect timeout, I found a severe problem
(racecondition) in the TCP statemachine.

I hopefully fixed it in SVN trunk rev. 5541.

What is the problem:

NutTcpDestroySocket() is called from different places within the TCP
statemachine, as well as indirectly from NutTcpCloseSocket() in some
situation.

NutTcpDestroySocket() first removed the given socket from
tcpSocketList and then frees the allocated memory.

Problem:
========

tcpSocketList is a linked list of the currently existing TCP sockets. A
call to NutTcpDestroySocket() removes the given socket from this list by
re-linking the so_next-> pointer. The socket which is removed can stay
at any place in the list. So it is not necessary the first or the last
entry.

The problem is, that tcpSocketList is used at several places in Nut/OS
without any kind of locking. Mainly it is used in THREAD(NutTcpSm, arg)
to iterate over its entries.

Race condition:
===============

While iterating over the socket list, the socket list itself can be
modified, which means that in some situations, it could happen that the
socket pointer is freed and unlinked by a call to NutTcpDestroySocket(),
while it is used later in the loop again.

Especially the main interation loop in tcpsm.c line 1728 (now) contains
several scheduling points, where other threads could call
NutTcpCloseSocket() which in succession may call NutTcpDestroySocket(),
depending on the socket state. Further more NutTcpDestroySocket() is
called from within the main iteration loop itself.

Solution:
=========

* Instead of calling NutTcpDestroySocket() from several places within
  the statemachine (or from NutTcpCloseSocket()), a new socket state is
  introduced: TCPS_DESTROY. If the socket shall be destroyed, the socket
  state will be changed to TCPS_DESTROY and the flag tcp_run_gc is set.

* The statemachine was modified accordingly to handle this new
  state correctly and reject any communication in this state.

* A new function NutTcpGarbadgeCollect() is added, which is called by
  the TCP statemachine, while it is idle, provided tcp_run_gc is set.
  This function itereates over all sockets and unlinks / frees
  ressources of all sockets in TCPS_DESTROY state.

* The garbage collector runs outside of any loop iterating over the
  socket list and does not have any scheduling point.

NutTcpDestroySocket() was modified to just free the allocated memory of
the socket, but does not unlink the socket from the socket list any more.

Further unclean implementations:
================================

There is one more function in the TCP statemachine, which modifies the
tcpSocketList.

NutTcpCreateSocket() allocates memory for a new socket and adds the new
socket to the root of the socket list.

As there are still several scheduling points in the TCP statemachine
mainloop, there is a chance, that NutTcpCreateSocket is called by
another thread, while looping over the socket list. So again, the socket
list is modified, while the TCP statemachine is iterating over it.

Fortunately this does not harm, as the new socket is always added to the
root of the list, while the iteration is done towards the end of the
list. The next iteration will then start with the newly added entry.



I implemented several reliability tests. A little tool which creates and
closes connections to a PC in a very fast loop. I used it to stress test
the TCP statemachine and did not find any further problems.

But I'd like to ask anybody to review my last changes and take some time
to think about the correctness of my garbadge collector implementation
and possible further race condition sitiations...

Best regards,

Ole Reinhardt

-- 
kernel concepts GmbH            Tel: +49-271-771091-14
Sieghuetter Hauptweg 48         Mob: +49-177-7420433
D-57072 Siegen
http://www.embedded-it.de
http://www.kernelconcepts.de


More information about the En-Nut-Discussion mailing list