[En-Nut-Discussion] Thread stops executing after some time.
Erik Lindstein
erik at ledutveckling.com
Tue Mar 25 12:20:52 CET 2008
Ernst, thank you very much for taking the time to answer.
I'll write my comments below.
I understand this is a sporadic problem, so it takes a long time to run
into the "error" situation, but anyway:
1) ... only 1 out of 4 clients were still connecting ...
When you experience this situation more than once, is it always the same
Ethernut that can still connect, or is that also "random"? And what looks
"random" in the first place: is it really?
--------------------------------
Well, I don't think there is much in software that is truly
random :-) so of course this isn't either.
But here it can be any one of the clients that stops executing the
thread. Usually I can see that after some time (~6-7 h) one or two have
stopped connecting, and one may keep running for up to 24 h
(perhaps longer). But in the end all of them stop trying to connect,
and the thread's timeout shows "None" in the thread list.
--------------------------------
2) .. (all of them use the same software, just different MACs and IPs) ..
Even if all of them use the same SW, are they operating under the
same, similar or different conditions? (I mean ".. exchanges some XML data, ..":
where does this data come from, i.e. how is it generated and how different
can it be between the Ethernuts?)
--------------------------------
When the server software on the PC is running and the PC is connected,
the client socket gets connected, the client sends some values
read from the A/D converter plus some status variables, and then the PC
responds with a command that tells the client to do "something".
Usually the PC just sends a "reset watchdog" command to the client.
In this case everything works fine as long as the server software is
running and the PC is connected.
When I then shut down the server software, the client gets a command
that tells it to start resetting the WDT locally.
But when the problem occurs, the software in the client can't get to
the point where it actually exchanges the data, because the socket
shouldn't be able to connect:
" if(NutTcpConnect(socket, rip, TCPSERVERPORT) == 0) "
If the server (rip:TCPSERVERPORT) isn't connected to the LAN, it can't
return 0, correct?
And even if it did for some reason, the socket read timeouts should
occur, so the code should get past the data exchange routines and
then start all over again.
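A minimal sketch of that "get past the data exchange" step, assuming a receive timeout on the socket stream shows up to stdio as a failed read: checking the fgets() result lets the loop fall through to closing the socket and retrying, instead of handling whatever was left in the buffer. read_reply() is a hypothetical helper, and a plain FILE * stands in for the stream returned by _fdopen().

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Nut/OS): read one reply line from the
 * server stream into buf.
 * Returns 0 on success, -1 if nothing could be read. On a socket stream
 * with SO_RCVTIMEO set, a receive timeout is assumed to surface here as a
 * failed read, just like a closed connection.
 */
int read_reply(FILE *stream, char *buf, size_t len)
{
    if (len < 2)                 /* need room for at least one char + NUL */
        return -1;
    if (fgets(buf, (int) len, stream) == NULL) {
        buf[0] = '\0';           /* do not leave stale data behind */
        return -1;
    }
    buf[strcspn(buf, "\r\n")] = '\0';   /* strip a trailing CR/LF */
    return 0;
}
```

In the client thread, a -1 would simply skip the XML handling; the socket is then closed and the loop sleeps and reconnects as before.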
--------------------------------
3) How about buffer overflows due to "special" tx/rx data conditions
(length)?
--------------------------------
In this case it only happens (at least I think so) when I don't read or
send any data other than what the tcpsm sends out while trying to
connect the socket.
My code doesn't do any rx/tx until the socket and stream are OK.
--------------------------------
4) Try looking "into" the TCP sockets. My bank switch test program at
http://www.es-business.com/Firma/eng/edocs.htm may help. Include cli.c and
dump.c in your main program and create a thread as indicated in the source.
It contains a Telnet-based CLI with a "lists" command that walks through
and displays all lists known to Nut/OS (TCP sockets are one of these). One word
of caution: because Nut/OS (i.e. other threads) keeps executing while this
command follows the pointers from one list entry to the next, the command may
loop if a list is updated (by or on behalf of an app thread) right when that
pointer is being used by the "lists" command itself.
The dump command may help you peek around in RAM.
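One defensive measure for a diagnostic walker like the "lists" command is to bound the number of steps, so a list that changes mid-walk (or ends up with a cycle) produces an error instead of an endless loop. A minimal sketch with a hypothetical node type standing in for a Nut/OS list entry; it does not make the snapshot consistent, it only guarantees that the walk terminates.

```c
#include <stddef.h>

/* Hypothetical list entry (stand-in for a thread or socket descriptor). */
struct node {
    int id;
    struct node *next;
};

/*
 * Walk a singly linked list but give up after max_steps entries.
 * Returns the number of entries visited, or -1 if the limit was reached,
 * which suggests the list changed during the walk or contains a cycle.
 */
int walk_bounded(const struct node *head, int max_steps,
                 void (*visit)(const struct node *))
{
    int count = 0;
    for (const struct node *n = head; n != NULL; n = n->next) {
        if (count == max_steps)
            return -1;
        if (visit != NULL)
            visit(n);
        count++;
    }
    return count;
}
```

The limit should be comfortably above the largest list expected on the target (e.g. the number of sockets or threads that can exist at once).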
--------------------------------
I'll look into that, thanks.
--------------------------------
5) Is it possible to have Wireshark monitor the TCP/IP link up
until the PC gets disconnected? This way, you could find out what was
exchanged immediately before the disconnect happened, and maybe this gives
more info about the internal status of the Ethernuts and the TCP/IP
connection states.
--------------------------------
The PC only gets disconnected when I remove the LAN cable, but I could
monitor the data until that point.
But I can disconnect and reconnect the cable many (unlimited?) times
without any problems. The clients always connect again as long as I don't
leave the PC disconnected for a longer period of time (> ~5-6 h).
One possibility might be to set up the switch I use to mirror all
traffic to another port and monitor the traffic there with
Wireshark. That way I might be able to see what happens before it stops
working.
But if I have the PC connected in the "normal" way, the problem doesn't occur.
--------------------------------
6) Do you log the state of the TCP/IP connection between the Ethernuts and
the PC within the PC? Maybe such a log (record length / contents) could
provide some more info.
--------------------------------
Because the problem only occurs when the PC isn't connected, this
is hard to do.
I can log the traffic when everything works fine, but I'm not sure it
would reveal the problem; perhaps someone with more knowledge of TCP/IP can
see something there.
--------------------------------
7) The most important question is:
Is the problem caused by behaviour in Nut/OS or Nut/Net (IP stack, timers,
events, etc.),
or
is the problem caused by some behaviour in the application threads?
Is there any chance to "strip down" the application threads to try to
minimize their possible impact on the situation?
--------------------------------
I minimized the software to include only the client socket thread
and one tcpserver thread, but this still happens. I'll try to remove
some more code from the client thread and see if it changes anything.
I did get the feeling that this version took longer before it
stopped trying to connect, but that's not 100% verified. Anyway, it
still stops.
--------------------------------
8) I have an example here of a test app, which produces the following
threads list:
CLI>threads
Name     Status  Prio  Stack  Memory  Timeout  INFO-addr  Bank
CMDLINE  Run     64    891    OK      None     36C9       -1
XHTST    Sleep   64    357    OK      6        3203       9
XHTST    Ready   64    357    OK      None     2F3D       8
XHTST    Sleep   64    357    OK      None     2C77       7
XHTST    Ready   64    357    OK      None     29B1       6
XHTST    Sleep   64    357    OK      24       26EB       5
XHTST    Ready   64    357    OK      None     2425       4
XHTST    Sleep   64    357    OK      13       215F       3
XHTST    Ready   64    357    OK      None     1E99       2
XHTST    Sleep   64    357    OK      6        1BD3       1
tcpsm    Sleep   32    468    OK      102      1925       -1
XHTST    Sleep   64    357    OK      None     16C1       0
rxi5     Sleep   9     603    OK      699      145F       -1
main     Sleep   200   705    OK      940      1041       -1
idle     Ready   254   356    OK      None     D21        -1
The XHTST threads are looping apps that work in memory, display info via TCP/IP
to a telnet client and sometimes sleep for a random time.
Note that some threads are sleeping without a timeout associated
with them (maybe while they are waiting for the telnet output to complete?).
--------------------------------
I have no idea, but perhaps when
" NutTcpConnect(socket, rip, TCPSERVERPORT) " executes, it leaves the
thread with timeout "None" and then waits for some event from the tcpsm
that never occurs.
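To check that idea on several units, a small scanner over the thread listing can flag every thread that is in "Sleep" state but shows "None" as its timeout, which presumably means it is blocked waiting for an event without a timer rather than sitting in NutSleep(). A sketch in plain C, assuming the whitespace-separated column order of the listings in this thread (Name, State, Prio, Stack, Memory, Timeout, ...); find_untimed_sleepers() is a hypothetical name.

```c
#include <stdio.h>
#include <string.h>

/*
 * Scan a thread listing (one thread per line) and report every thread whose
 * state is "Sleep" and whose timeout column is "None".
 * Lines that do not parse (headers, status lines) or are longer than 127
 * characters are skipped. Returns the number of such threads found; if out
 * is not NULL, their names are printed there.
 */
int find_untimed_sleepers(const char *listing, FILE *out)
{
    int found = 0;
    const char *line = listing;

    while (*line != '\0') {
        const char *end = strchr(line, '\n');
        size_t len = end ? (size_t) (end - line) : strlen(line);
        char buf[128], name[32], state[16], mem[16], tmo[16];
        int prio, stack;

        if (len < sizeof buf) {
            memcpy(buf, line, len);
            buf[len] = '\0';
            if (sscanf(buf, "%31s %15s %d %d %15s %15s",
                       name, state, &prio, &stack, mem, tmo) == 6 &&
                strcmp(state, "Sleep") == 0 && strcmp(tmo, "None") == 0) {
                if (out != NULL)
                    fprintf(out, "%s sleeps without a timeout\n", name);
                found++;
            }
        }
        if (end == NULL)
            break;
        line = end + 1;
    }
    return found;
}
```

Fed the "threads" output captured over Telnet, it would list the inetd thread from the failing unit; on a healthy unit that thread should not appear.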
--------------------------------
I am quite sure you have thought about some (if not all) of this already,
but maybe it "kicks" off some more thoughts.
Good luck
Regards
Ernst
-----Original Message-----
From: en-nut-discussion-bounces at egnite.de
[mailto:en-nut-discussion-bounces at egnite.de] On behalf of Erik Lindstein
Sent: Monday, 24 March 2008 17:13
To: en-nut-discussion at egnite.de
Subject: [En-Nut-Discussion] Thread stops executing after some time.
Guys, please help me out.
I'm on a wild goose chase trying to figure out what is happening with
a thread that handles communications with a PC through a TCP/IP socket.
The setup is:
Ethernut V2.1
Software 4.4.0
The software is built up from a couple of threads, each handling some
functions (LCD, push buttons, user functions, etc.).
Then I have one thread that communicates with server software on my PC.
The communication is pretty simple.
It's a client socket that connects to my server PC, exchanges
some XML data, disconnects, sleeps for 300 ms and then starts all over
again.
This works fine for weeks without any problems if I have the server PC
up and running and connected to the same LAN my Ethernut is connected
to.
Then one day I disconnected the server PC from the LAN and left a
couple of the Ethernut clients running over the weekend. On
Monday I connected my PC again and started up the server software, but
I noticed that only 1 out of 4 clients were still connecting (all of
them use the same software, just different MACs and IPs).
I looked at the incoming traffic with wireshark and could not see any
sign of life at all from the 3 clients not connecting.
I tried to ping them; they all answer pings, and all the other
threads that handle the LCD and push buttons are still up and running,
so the software is not dead.
I tested to deactivate/activate the network connection on my PC to see
if anyone of the clients woke up. No luck.
I then added another thread to the software (I took the sample code
for tcps in the apps directory and created a thread to run that code).
When everything is OK, I see the "inetd" thread's timeout counting
all the time and the thread executes as expected.
When the inetd thread stops executing, I can still connect to the unit, and I
get the output seen below:
----------------------------------------------------------------------------
220 List of threads with name,state,prio,stack,mem,timeout follows
tcpsm    Sleep   32    461    OK  27
TcpS     Run     64    2546   OK  None
inetd    Sleep   64    2381   OK  None
rxi5     Sleep   9     603    OK  1392
wdt      Sleep   40    255    OK  8
SmuTh    Sleep   64    65     OK  71
PcuTh    Sleep   64    805    OK  1
HvpsTh   Sleep   64    605    OK  24
IppsTh   Sleep   64    965    OK  4
TaTh     Sleep   64    65     OK  35
LcdTh    Sleep   64    929    OK  34
main     Sleep   64    733    OK  451
idle     Ready   254   356    OK  None
----------------------------------------------------------------------------
For some reason the thread (inetd) just gets its timeout set to "None"
instead of the NutSleep value.
I only have one place in the code that puts the thread to sleep, and it
has a fixed value of 300 (NutSleep(300)).
So there must be somewhere else in the code where the thread gets put into
some wait state, but I have no idea how to figure out where and why
this happens.
Can it be something that happens when the socket tries to connect to an
IP/server that doesn't exist on the LAN?
If it happened all the time it would be easier to figure out what's
wrong, but this can run for days without happening.
Also, if memory were low, I guess the tcps thread wouldn't answer the
incoming connection attempts.
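Since the stall can take days to appear, it may help to record exactly when it happens. A minimal sketch, assuming the client thread stores a 32-bit millisecond timestamp (for example from NutGetMillis(), whose u_long result is 32 bits on the AVR) at the top of each loop iteration, and another thread periodically compares it with the current time; heartbeat_stalled() is a hypothetical helper. The unsigned subtraction keeps the elapsed time correct across the counter wrap, which for a 32-bit millisecond counter happens after about 49.7 days, i.e. within a multi-week run.

```c
#include <stdint.h>

/*
 * Hypothetical liveness check: last_beat_ms is the timestamp the worker
 * thread stored at the start of its latest loop iteration, now_ms the
 * current millisecond count. Unsigned 32-bit subtraction yields the
 * elapsed time modulo 2^32, so the result stays correct across a counter
 * wrap as long as the real gap is shorter than one full wrap.
 * Returns 1 if more than limit_ms have passed, else 0.
 */
int heartbeat_stalled(uint32_t last_beat_ms, uint32_t now_ms,
                      uint32_t limit_ms)
{
    uint32_t elapsed = now_ms - last_beat_ms;   /* modulo 2^32 */
    return elapsed > limit_ms;
}
```

A limit well above the longest expected iteration (NutSleep(300) plus the connect and I/O timeouts) avoids false alarms; the first time the check fires, the monitor could put that time on the LCD for comparison with a Wireshark capture.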
The thread code is below:
----------------------------------------------------------------------------
THREAD(InetdThread, arg)
{
    TCPSOCKET *socket;
    FILE *stream = 0;
    u_long rip = inet_addr("192.168.0.115");
    u_long tmo = 500;
    int socket_error = 0;
    uint8_t *start = 0, *stop = 0;
    uint8_t unit[20], cmd[40], value[40];
    char data_exchange_buffer[100] = "0";   /* fgets() expects a char buffer */

    for (;;)
    {
        if ((socket = NutTcpCreateSocket()) != 0)
        {
            /* 500 ms receive and send timeouts */
            NutTcpSetSockOpt(socket, SO_RCVTIMEO, &tmo, sizeof(tmo));
            NutTcpSetSockOpt(socket, SO_SNDTIMEO, &tmo, sizeof(tmo));

            /* Blocks until the connection is established or fails */
            if (NutTcpConnect(socket, rip, TCPSERVERPORT) == 0)
            {
                stream = _fdopen((int) ((uptr_t) socket), "r+b");
                if (stream != 0)
                {
                    fprintf_P(stream, info_P, INFO_P_ARGS); // Send some XML data
                    fflush(stream);
                    // Get some XML data; NULL means nothing could be read
                    if (fgets(data_exchange_buffer,
                              sizeof(data_exchange_buffer), stream) != NULL)
                    {
                        // Handle XML data
                    }
                    fclose(stream);
                    /*
                     * info_text is an extern variable that another
                     * thread prints on the LCD as debug output.
                     */
                    sprintf(info_text, "COK\n%lu", (u_long) NutGetMillis());
                }
            }
            else
            {
                socket_error = NutTcpError(socket);
                sprintf(info_text, "CE:%d \n%lu",
                        socket_error, (u_long) NutGetMillis());
            }
            NutTcpCloseSocket(socket);
        }
        NutSleep(300);
    }
}
----------------------------------------------------------------------------
--
/Erik