[En-Nut-Discussion] Strange networking bug
Curtis Maloney
cmaloney at cardgate.net
Wed Nov 29 09:28:14 CET 2006
Sadly, I don't have much information on the specific cause, but I think
people should be aware of this instance.
We had two X-Nuts running Nut/OS 4.1.9. One unit had been happily
ticking away on the network for some time; the second was added only
recently.
Suddenly, as we were preparing final tests before a major go-live, they
both stopped talking on the network. Not entirely, though, and they had
not stopped working.
Basically, the core thread of the application was still churning away,
polling devices on the serial port. So, at least that much was ok.
If I telnetted to them, I'd either get immediately disconnected, or get
one line of feedback, and then dropped.
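For anyone who wants to check their own units for the same symptom, here
is a rough sketch of a probe that tells the two cases apart. It is only
an illustration: the function names, the default port 23 and the
5 second timeout are my assumptions, not anything taken from Nut/OS.

```python
import socket


def classify(data, closed):
    """Turn what was received into one of the symptoms described above."""
    if not closed:
        return "still connected"
    if not data:
        return "dropped immediately"
    # Anything up to one newline counts as "one line of feedback".
    # Telnet option negotiation bytes, if any, are counted as data too.
    lines = data.count(b"\n")
    if lines <= 1:
        return "dropped after one line"
    return "dropped after %d lines" % lines


def read_until_closed(sock, timeout):
    """Read until the peer closes or nothing arrives for `timeout` seconds.

    Returns (data, closed), where closed is True if the peer hung up.
    """
    sock.settimeout(timeout)
    data = b""
    while True:
        try:
            chunk = sock.recv(1024)
        except socket.timeout:
            return data, False
        except ConnectionResetError:
            return data, True
        if not chunk:
            return data, True
        data += chunk


def probe(host, port=23, timeout=5.0):
    """Connect to host:port and report how the connection behaves.

    Port 23 and the timeout are assumed defaults; adjust for your setup.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        return "connect failed: %s" % exc
    with sock:
        data, closed = read_until_closed(sock, timeout)
    return classify(data, closed)
```

Calling probe() with the address of a unit returns a short string such
as "dropped immediately" or "dropped after one line", which is easy to
log from a cron job while waiting for the fault to show up again.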
Our current theory is that some "poison packet" caused the TCP task to
go into a strange state. Our reasoning is as follows:
1) The two units are geographically separate. They reside in different
buildings approximately 100 m apart, so a local noise glitch is
unlikely.
2) One unit had been reset earlier that day, whilst the other had been
on for several days. So it wasn't a slow leak or a time-based issue.
3) Both units went at the same time, and exhibited EXACTLY the same
behavior.
I haven't, as yet, had any time to devote to tracing my way through the
networking stack to find what could have possibly gone wrong.
Hopefully, though, this message might remind someone of something they
thought previously was harmless, or unlikely.
Of course, if there have been significant changes to the TCP code
between 4.1.9 and 4.2, maybe the fault is already gone...
--
Curtis Maloney
cmaloney at cardgate.net