[En-Nut-Discussion] Stuck TCP socket in Nut/OS 4.8.7
Coleman Brumley
cbrumley at polarsoft.biz
Tue May 5 04:02:12 CEST 2015
Sorry, if this is TL;DR, but I really need a TCP state machine guru to chime
in on this. If you'd like to skip to the good parts, I have a TCP socket
that has been stuck in the "CLOSING" state for nearly 20 minutes. I believe
this may be the root cause of my leaks, so I'm wondering if there's an
option I can set to make the stuck socket simply close after X amount of
time without changes. I have both SNDTIMEO and RCVTIMEO set to 1000.
Well, even after solving the "Creating stream device failed" issue (which
has been issues for years!), I'm still leaking memory.
I added some code to periodically display the TCP socket states and here's
what I think I'm seeing. By the way, both the SNDTIMEO and RCVTIMEO are set
to 1000. This seems to have helped, but I'm still seeing a "stuck" TCP
socket state. Nearly 8 minutes after closing IE, there is still a "hanging"
TCP socket in the CLOSING state.
I've run my HTTP test using both Chrome and IE, and it looks more like IE is
the culprit. The way the web interface works is via JavaScript/AJAX and the
JS makes periodic calls to get small amounts of data from the controller.
At startup, I display all of the TCP socket states. And, they're what I
expect:
19:00:05 12-31-1969:
Heap Available=22984
00209DDC 0.0.0.0:80
0.0.0.0:0 -->LISTEN
00209D24 0.0.0.0:80
0.0.0.0:0 -->LISTEN
00209C6C 0.0.0.0:80
0.0.0.0:0 -->LISTEN
00209BB4 0.0.0.0:80
0.0.0.0:0 -->LISTEN
00209AFC 0.0.0.0:80
0.0.0.0:0 -->LISTEN
00209A44 0.0.0.0:80
0.0.0.0:0 -->LISTEN
I think open a new Chrome window and connect to the controller:
21:35:00 05-04-2015:
Heap Available=22744
00209AEC 0.0.0.0:80
0.0.0.0:0 -->LISTEN
0020A664 192.168.3.76:80
192.168.3.10:60644 -->TIMEWAIT
0020B400 0.0.0.0:80
0.0.0.0:0 -->LISTEN
00209E34 0.0.0.0:80
0.0.0.0:0 -->LISTEN
00209D7C 0.0.0.0:80
0.0.0.0:0 -->LISTEN
0020B500 0.0.0.0:80
0.0.0.0:0 -->LISTEN
0020AE04 0.0.0.0:80
0.0.0.0:0 -->LISTEN
Then, after 40 seconds, I open a second Chrome window and connect to the
controller:
21:35:40 05-04-2015:
Heap Available=21628
0020AE90 0.0.0.0:80
0.0.0.0:0 -->LISTEN
0020B1C4 0.0.0.0:80
0.0.0.0:0 -->LISTEN
00209A18 0.0.0.0:80
0.0.0.0:0 -->LISTEN
0020A7DC 0.0.0.0:80
0.0.0.0:0 -->LISTEN
0020AFA4 0.0.0.0:80
0.0.0.0:0 -->LISTEN
0020A640 192.168.3.76:80
192.168.3.10:60716 -->TIMEWAIT
00209B4C 192.168.3.76:80
192.168.3.10:60717 -->TIMEWAIT
0020A4D0 192.168.3.76:80
192.168.3.10:60714 -->TIMEWAIT
0020ADB0 192.168.3.76:80
192.168.3.10:60715 -->TIMEWAIT
00209CF8 192.168.3.76:80
192.168.3.10:60718 -->TIMEWAIT
0020B10C 192.168.3.76:80
192.168.3.10:60719 -->TIMEWAIT
00209E28 192.168.3.76:80
192.168.3.10:60720 -->TIMEWAIT
0020A588 0.0.0.0:80
0.0.0.0:0 -->LISTEN
Then, after one minute, I close the first Chrome window:
21:36:40 05-04-2015:
Heap Available=22904
0020A6C4 0.0.0.0:80
0.0.0.0:0 -->LISTEN
0020B15C 0.0.0.0:80
0.0.0.0:0 -->LISTEN
00209B88 0.0.0.0:80
0.0.0.0:0 -->LISTEN
0020AC4C 0.0.0.0:80
0.0.0.0:0 -->LISTEN
00209D28 0.0.0.0:80
0.0.0.0:0 -->LISTEN
00209E14 0.0.0.0:80
0.0.0.0:0 -->LISTEN
Then, after another minute, I close the second Chrome window:
21:37:40 05-04-2015:
Heap Available=22988
0020AA18 0.0.0.0:80
0.0.0.0:0 -->LISTEN
0020A894 0.0.0.0:80
0.0.0.0:0 -->LISTEN
0020A7DC 0.0.0.0:80
0.0.0.0:0 -->LISTEN
00209CB8 0.0.0.0:80
0.0.0.0:0 -->LISTEN
0020A960 0.0.0.0:80
0.0.0.0:0 -->LISTEN
00209E14 0.0.0.0:80
0.0.0.0:0 -->LISTEN
As you can see, at this point, the heap is back where we started (22988
available) and all of the sockets have closed.
However, with IE (IE 11, BTW), I get subtly different behavior.
Here are the socket states after running with 2 IE, but after closing the
first:
21:39:00 05-04-2015:
Heap Available=21968
0020A588 0.0.0.0:80
0.0.0.0:0 -->LISTEN
0020A7DC 0.0.0.0:80
0.0.0.0:0 -->LISTEN
0020AAC4 0.0.0.0:80
0.0.0.0:0 -->LISTEN
0020AC30 0.0.0.0:80
0.0.0.0:0 -->LISTEN
00209BB4 0.0.0.0:80
0.0.0.0:0 -->LISTEN
00209AFC 192.168.3.76:80
192.168.3.10:61012 -->TIMEWAIT
0020AD18 192.168.3.76:80
192.168.3.10:61013 -->CLOSING
0020A69C 192.168.3.76:80
192.168.3.10:61006 -->TIMEWAIT
0020A4D0 192.168.3.76:80
192.168.3.10:60984 -->CLOSING
00209A18 0.0.0.0:80
0.0.0.0:0 -->LISTEN
The highlighted entry is the troublesome one.
Here's the entry after closing the second IE window:
21:39:20 05-04-2015:
Heap Available=22652
00209B88 0.0.0.0:80
0.0.0.0:0 -->LISTEN
00209C40 0.0.0.0:80
0.0.0.0:0 -->LISTEN
0020A640 0.0.0.0:80
0.0.0.0:0 -->LISTEN
00209AD0 0.0.0.0:80
0.0.0.0:0 -->LISTEN
0020AA98 0.0.0.0:80
0.0.0.0:0 -->LISTEN
0020A4D0 192.168.3.76:80
192.168.3.10:60984 -->CLOSING
00209A18 0.0.0.0:80
0.0.0.0:0 -->LISTEN
As you can see 0020A4D0 is still "CLOSING".
And, here I am, nearly 20 minutes later:
21:57:22 05-04-2015:
Heap Available=22640
0020AA38 0.0.0.0:80
0.0.0.0:0 -->LISTEN
00209E10 0.0.0.0:80
0.0.0.0:0 -->LISTEN
0020A8D0 0.0.0.0:80
0.0.0.0:0 -->LISTEN
00209AD0 0.0.0.0:80
0.0.0.0:0 -->LISTEN
0020AB7C 0.0.0.0:80
0.0.0.0:0 -->LISTEN
0020A4D0 192.168.3.76:80
192.168.3.10:60984 -->CLOSING
00209A18 0.0.0.0:80
0.0.0.0:0 -->LISTEN
I don't have a Wireshark capture of what IE does when it closes. My
assumption is that it correctly closes the sockets on its side (I could be
wrong, or course). But, it seems odd that after 20 minutes, that Nut/OS has
not destroyed this connection.
Is there another option I should set to "lower the bar" for this behavior?
Coleman
From: Coleman Brumley [mailto:cbrumley at polarsoft.biz]
Sent: Monday, May 04, 2015 7:39 PM
To: en-nut-discussion at egnite.de
Subject: RE: File descriptor leak in 4.8.7
Ok, sorry for the triple post, but it turns out the addition of a mutex to
the code below solved the "Creating stream device failed" error.
The new code has now been running for around 45 minutes with 3 browsers
connected and no error indications. I'm waiting for the next 30 minute
interval report to see if the heap continues to fall.
Coleman
From: Coleman Brumley [mailto:cbrumley at polarsoft.biz]
Sent: Monday, May 04, 2015 5:48 PM
To: en-nut-discussion at egnite.de <mailto:en-nut-discussion at egnite.de>
Subject: RE: File descriptor leak in 4.8.7
After doing some arithmetic, it appears that each occurrence of "Creating
stream device failed" causes a 1271 byte leak from heap space. Does that
number ring a bell?
I've re-verified that this does not happen unless a browser is making HTTP
requests of the controller. If the controller is just "sitting there", there
is no leak. I verify this via a printout of available heap space every 30
minutes. After running for 4 hours, there were no leaks.
While there is no set time between each occurrence, this does seem to happen
about 5 times in 30 minutes.
The errno value 23 means that too many files are open, correct? FOPEN_MAX is
defined as 8, so is having six concurrent HTTP threads running too many? Is
it possible that the use of multiple HTTP threads without using a mutex is
causing the file open handles to corrupt somehow and therefore not be closed
correctly?
Coleman
From: Coleman Brumley [mailto:cbrumley at polarsoft.biz]
Sent: Monday, May 04, 2015 1:19 PM
To: en-nut-discussion at egnite.de <mailto:en-nut-discussion at egnite.de>
Subject: File descriptor leak in 4.8.7
Everyone,
I'm working with Nut/OS v4.8.7 (for many, many reasons, I CANNOT upgrade the
core OS). The processor is an Atmel SAM7X256.
I'm encountering a problem with the use of file descriptors in HTTP, and I
believe the core effect is a resource leak.
Here is my HTTP service code:
THREAD(Service, arg)
{
TCPSOCKET *sock;
FILE *stream;
u_char id = (u_char) ((uptr_t) arg);
u_char timeout=0;
time_t t;
struct _tm tmx ;
uint16_t lowheapcounter=1000;
NutThreadSetPriority(72);
/*
* Now loop endless for connections.
*/
for (;;) {
/*
* Create a socket.
*/
if ((sock = NutTcpCreateSocket()) == 0)
{
printf("[%u] Creating socket failed\n", id);
NutSleep(5000);
continue;
}
/*
* Listen on port 80. This call will block until we get a connection
* from a client.
*/
NutTcpAccept(sock, 80);
/*
* Detect low heap and if it happens for too long, reset.
*
*/
while (NutHeapAvailable() < 4096) {
t=time(0l);
localtime_r(&t,&tmx);
printf("%02d:%02d:%02d %02d-%02d-%04d -- ",
tmx.tm_hour,
tmx.tm_min,
tmx.tm_sec,
tmx.tm_mon+1,
tmx.tm_mday,
tmx.tm_year+1900);
printf("[%u] Low mem (%d)\n", id,
(int)NutHeapAvailable());
NutSleep(1000);
lowheapcounter--;
if(lowheapcounter<=0) Hreset();
}
/*
* Associate a stream with the socket so we can use standard I/O
calls.
*/
while (((stream = _fdopen((int) ((uptr_t) sock), "r+b")) ==
0)&&(++timeout<0xFF))
{
NutSleep(10);
}
if(!stream)
{
printf("\n[%u] Creating stream device failed w/
errno:%d\n", id,errno);
NutTcpCloseSocket(sock);
continue;
}
else
{
NutHttpProcessRequest(stream);
fclose(stream);
NutTcpCloseSocket(sock);
}
}
}
The console output shows all is normal initially, but after a while (approx.
765 seconds, initially after boot time) I start seeing"
[3] Creating stream device failed w/ errno:23
Where the number in braces is the thread id. I'm running 6 HTTP threads,
started as follows:
for (i = 1; i <= 6; i++)
{
char thname[] = "httpd0";
thname[5] = '0' + i;
NutThreadCreate(thname, Service, (void *) (uptr_t)
i,HTTPD_SERVICE_STACK);
}
The stack size is defined as 1024.
I noticed this question was asked some time ago on this list, but I could
not determine if there was ever a solution posted. There are other I/O
processes running in the controller, but it ONLY runs low on resources when
a browser is connected to it.
Does anyone have any suggestions as to what I could look into as to what
might be the problem?
While the browser(s) are connected, the system appears to lose about 4.5k of
heap space per 30 minutes!
Thanks in advance.
Best Regards,
Coleman
More information about the En-Nut-Discussion
mailing list