[En-Nut-Discussion] NutThreadRemoveQueue clears runQueue to NULL

Thu Aug 22 15:41:50 CEST 2013

Hi Philipp,

Although I'm the author of a large part of this code, I have the same
problem reconstructing the original intention. That code is quite old.
The last relevant changes were made in r1513, more than 7 years ago.

This essential kernel code is used by all applications, and it is most
unlikely that a bug survived for such a long time. However, assuming
that neither your application code nor other parts of the OS overwrite
that queue, I can think of two possible causes:

1. You are using Nut/OS in an uncommon way.

For example, are you using the idle thread as part of your application?
I'm aware that this possibility was recently introduced via
NutRegisterIdleCallback(), but it isn't as well tested and documented as
the rest of the thread API. Using less common features may expose
problems or missing explanations.

2. The compiler creates different code.

Those old parts of the OS in particular were written at a time when
optimizers were less aggressive. Code ran well even if you didn't tell
the compiler the full truth. The stack manipulation in the 8-bit AVR
implementation of NutEnterCritical()/NutExitCritical() is an example.
Until now, everything has worked fine, so no one wants to touch this
inner code. But we have to be aware that the code is not as robust as it
could be. Changing the compiler may expose problems.
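
As a generic illustration of not telling the compiler the full truth
(a sketch, not the actual Nut/OS code and not the AVR stack issue
itself): a critical-section macro written as inline assembly without a
"memory" clobber leaves the optimizer free to move memory accesses
across it. The macro names below are hypothetical, the empty assembly
string stands in for the real interrupt disable/enable instructions,
and the syntax assumes GCC or a compatible compiler.

```c
/* Hypothetical macros, not the Nut/OS ones. On a real target the asm
 * string would contain the interrupt disable/enable instructions. */

/* Incomplete: the compiler is not told that memory may be affected,
 * so it may reorder loads and stores around this statement. */
#define ENTER_CRITICAL_WEAK()  __asm__ __volatile__("")

/* Complete: the "memory" clobber acts as a compiler barrier. */
#define ENTER_CRITICAL()       __asm__ __volatile__("" ::: "memory")
#define EXIT_CRITICAL()        __asm__ __volatile__("" ::: "memory")

static unsigned int shared_counter;

/* Increment the shared counter inside the protected region. */
static unsigned int counter_increment(void)
{
    unsigned int value;

    ENTER_CRITICAL();
    value = ++shared_counter;
    EXIT_CRITICAL();
    return value;
}
```

With an older or less aggressive optimizer both variants tend to produce
the same machine code, which is why such omissions can go unnoticed for
years.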

On 19.08.2013 17:44, Philipp Burch wrote:
> But interestingly, the bug shows up in debug
> as well as in release (optimized) code in the same way.

If debug and release mean without and with optimization, then this may
be a first hint that the trouble is not caused by the compiler.

In general, trying to find Nut/OS kernel bugs in a complex application
is not a good idea. If any specific part of Nut/OS raises suspicion, it
helps a lot to write a minimal test application that reproduces the bug.

OK, for the rest of this reply keep in mind that I, same as you, need
to find out what's going on in this ancient code. My statements may be
wrong.

> Anyway, by using conditional breakpoint statements directly in the code,
> I was able to find out that a call to NutThreadRemove caused the
> runQueue pointer of the scheduler to be set to NULL. The offending code
> is the marked line in this function, approx. line 160 in os/thread.c:

I don't think that Nut/OS anticipates the runQueue becoming NULL. That
would mean that no thread is ready to run. That's where the idle thread
comes in. If all other threads are waiting, the idle thread runs until
another thread becomes ready to run again. This implies that idle
thread callbacks must not call blocking functions, because a blocked
idle thread would leave the runQueue empty.
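
To make this concrete, here is a minimal sketch of how a singly linked,
priority-ordered thread queue of this kind can behave. It is not the
code from os/thread.c: the type and the functions queue_add() and
queue_remove() are hypothetical, only the names td_qnxt and tqpp are
borrowed from this discussion, and the sketch assumes that a lower
priority value means higher priority.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a thread descriptor. */
typedef struct thread_info {
    struct thread_info *td_qnxt;  /* next thread in the queue, NULL at the tail */
    int td_priority;              /* assumption: lower value = higher priority */
} thread_info;

/* Insert td into the queue rooted at *tqpp, keeping priority order. */
static void queue_add(thread_info *td, thread_info **tqpp)
{
    while (*tqpp != NULL && (*tqpp)->td_priority <= td->td_priority) {
        tqpp = &(*tqpp)->td_qnxt;
    }
    td->td_qnxt = *tqpp;
    *tqpp = td;
}

/* Unlink td from the queue rooted at *tqpp, if it is linked there. */
static void queue_remove(thread_info *td, thread_info **tqpp)
{
    while (*tqpp != NULL) {
        if (*tqpp == td) {
            /* If td was the only entry, the root pointer becomes NULL. */
            *tqpp = td->td_qnxt;
            td->td_qnxt = NULL;
            return;
        }
        tqpp = &(*tqpp)->td_qnxt;
    }
}
```

In this model, as long as the idle thread stays linked in the run
queue, removing any other thread leaves the root pointer non-NULL. Only
removing the idle thread itself, for example because a callback blocks,
empties the queue.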

> Until here, I was able to figure out what is going wrong. But now, I
> have the problem that I do not really know how those queues are supposed
> to work. Is it legal for a thread to have td_qnxt pointing to NULL? Is

In general, yes. In the specific case of the runQueue: No.

> it legal for NutThreadRemoveQueue() to set *tqpp to NULL? If so, why can
> this happen to the runQueue as well?

Ditto.

Regards,

Harald