[En-Nut-Discussion] Strange networking bug

Curtis Maloney cmaloney at cardgate.net
Wed Nov 29 09:28:14 CET 2006


Sadly, I don't have much information on the specific cause, but I think 
people should be aware of this instance.

We had two X-Nuts running Nut/OS 4.1.9.  One unit had happily been 
ticking away on the network for some time; the second was added only 
recently.

Suddenly, as we were preparing final tests before a major go-live, they 
both stopped talking on the network.  But not quite, and they had not 
stopped working.

Basically, the core thread of the application was still churning away, 
polling devices on the serial port.  So, at least that much was ok.

If I telnetted to them, I'd either get immediately disconnected, or get 
one line of feedback, and then dropped.
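
For anyone who wants to check their own units for the same symptom, the 
following is a minimal sketch of a host-side probe in Python.  It is an 
illustration only, not something taken from the affected setup: the 
helper name classify_session, the address 192.168.1.50 and port 23 are 
all placeholders.  It opens one TCP connection and reports whether the 
peer closed it before sending anything, closed it after sending some 
data, or kept it open.

```python
import socket


def classify_session(host, port, timeout=2.0):
    """Open one TCP connection and report how the peer treats it.

    Hypothetical helper for illustration. Returns one of:
      "no-connect"          the connection could not be established
      "dropped-immediately" the peer closed before sending any data
      "dropped-after-data"  the peer sent some data, then closed
      "open"                the peer was still connected after `timeout`
                            seconds without new data
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "no-connect"

    received = 0
    with sock:
        sock.settimeout(timeout)
        while True:
            try:
                chunk = sock.recv(1024)
            except socket.timeout:
                # No close seen within the timeout: session is still alive.
                return "open"
            except OSError:
                # A reset from the peer is treated the same as a close.
                chunk = b""
            if not chunk:
                if received:
                    return "dropped-after-data"
                return "dropped-immediately"
            received += len(chunk)


if __name__ == "__main__":
    # Placeholder address and port; substitute the unit under test.
    print(classify_session("192.168.1.50", 23))
```

A healthy unit would be expected to report "open"; the behaviour 
described above would show up as "dropped-immediately" or 
"dropped-after-data".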

Our current theory is that some "poison packet" caused the TCP task to 
go into a strange state.  Our reasoning is as follows:

1) The two units are geographically separate.  They reside in different 
buildings approx 100m apart.  So it wasn't likely to be a local noise 
glitch.
2) One unit had been reset earlier that day, whilst the other had been 
on for several days.  So it wasn't a slow leak or time-based issue.
3) Both units failed at the same time, and exhibited EXACTLY the same 
behavior.

I haven't, as yet, had any time to devote to tracing my way through the 
networking stack to find what could have gone wrong. 
Hopefully, though, this message might remind someone of something they 
previously thought was harmless or unlikely.

Of course, if there have been significant changes to the TCP code 
between 4.1.9 and 4.2, maybe the fault is already gone...


--
Curtis Maloney
cmaloney at cardgate.net



