[En-Nut-Discussion] NutThreadRemoveQueue clears runQueue to NULL

Fri Aug 23 09:47:44 CEST 2013

Hi Harald!

On 08/22/2013 03:41 PM, Harald Kipp wrote:
> Hi Philipp,
>
> Although I'm the author of a large part here, I have the same problem
> following the initial intention. That code is quite old. Last relevant
> changes had been done in r1513, more than 7 years ago.
>
> This essential kernel code is used by all applications and it is most
> unlikely, that a bug survived for such a long time. However, assuming

Please don't get me wrong, I'm not blaming you or Nut/OS for this
bug. I just can't continue debugging in a sensible way as long as I
don't know how these things are supposed to work.
It is not particularly likely that my code overwrites exactly this
single word in the NUTTHREADINFO structure, but it is not impossible
either.

> that neither your application code nor other parts of the OS overwrite
> that queue, I can think of two possible causes:
>
> 1. You are using Nut/OS in an uncommon way.
>
> For example, are you using the idle thread as part of your application?
> I'm aware, that this possibility has been recently introduced via
> NutRegisterIdleCallback(), but it isn't as well tested and documented as
> the rest of the thread API. Using less common coding features may
> disclose problems or missing explanations.
>

I don't think I'm using it in such an "uncommon" way in my application. 
The idle thread is just there, untouched.

>
> 2. The compiler creates different code.
>
> Specifically those old parts of the OS had been created at times, when
> optimizers had been less aggressive. Code runs well, even if you didn't
> tell the compiler the full truth. The stack manipulation of the 8-bit
> AVR implementation NutEnter/ExitCritical() is an example. Until now,
> everything worked fine, so no one wants to touch this inner code. But we
> have to be aware, that the code is not as robust as it could be.
> Changing the compiler may disclose problems.
>
> On 19.08.2013 17:44, Philipp Burch wrote:
>> But interestingly, the bug shows up in debug
>> as well as in release (optimized) code in the same way.
>
> If debug and release means without and with optimization, then this may
> be the first hint, that the trouble is not caused by the compiler.

Correct, turning optimizations on or off does not seem to have a big
impact on this behaviour.

>
> In general, trying to find Nut/OS kernel bugs in a complex application
> is not a good idea. If any specific part of Nut/OS raises suspicion, it
> helps a lot to write a minimalist test application to reproduce the bug.
>

This is correct, but in this case very hard to do. As I noted, even a
slight change in the code (such as printing a few characters fewer to
the UART) makes the bug disappear, so it is quite hard to reproduce it
with a simple test application. But I suppose there's no other way than
to incrementally remove functionality and check the behaviour after each
change.

>
>
> OK, for the rest of this reply keep in mind, that I, same as you, need
> to find out, what's going on in this ancient code. My statements may be
> wrong.
>
>> Anyway, by using conditional breakpoint statements directly in the code,
>> I was able to find out that a call to NutThreadRemove caused the
>> runQueue pointer of the scheduler to be set to NULL. The offending code
>> is the marked line in this function, approx. line 160 in os/thread.c:
>
> I don't think that Nut/OS anticipates the runQueue becoming NULL. That
> would mean, that no thread is ready to run. That's where the idle thread
> jumps in. If all other threads are waiting, the idle thread is running
> until another thread becomes ready to run again. This implies, that idle
> thread callbacks cannot call blocking functions.
>

Ok, this sounds interesting. But what do you mean by "no thread is ready 
to run"? I suppose it is very common that there are moments in which all 
application threads are either waiting for input or for a timeout. Would 
it then be reasonable for the runQueue to become NULL or not? I can't 
completely follow you in this paragraph.

>
>> Until here, I was able to figure out what is going wrong. But now, I
>> have the problem that I do not really know how those queues are supposed
>> to work. Is it legal for a thread to have td_qnxt pointing to NULL? Is
>
> In general, yes. In the specific case of the runQueue: No.
>
>> it legal for NutThreadRemoveQueue() to set *tqpp to NULL? If so, why can
>> this happen to the runQueue as well?
>
> ditto.

So could I assume that the idle thread always needs to be ready and 
therefore the runQueue needs to point there whenever all other threads 
are waiting?
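
If that assumption holds, a debug check along the following lines could
catch the moment the invariant breaks. This is only a sketch: QueueNode
is a hypothetical stand-in for NUTTHREADINFO that models nothing but the
td_qnxt link, and run_queue_contains_idle() and its max_threads
parameter are made-up names, not part of the Nut/OS API.

```c
#include <stddef.h>

/* Hypothetical stand-in for NUTTHREADINFO; only the queue link is modelled. */
typedef struct QueueNode {
    struct QueueNode *td_qnxt;  /* next thread in the queue, NULL at the end */
} QueueNode;

/*
 * Returns 1 if the idle thread is reachable from run_queue, 0 otherwise.
 * max_threads bounds the walk, so a corrupted (cyclic) list cannot hang
 * the check.
 */
int run_queue_contains_idle(const QueueNode *run_queue, const QueueNode *idle,
                            unsigned max_threads)
{
    while (run_queue != NULL && max_threads-- > 0) {
        if (run_queue == idle) {
            return 1;
        }
        run_queue = run_queue->td_qnxt;
    }
    return 0;
}
```

Calling such a check right after every removal from the run queue would
show which call leaves the queue without the idle thread.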

Would you mind posting a short comment about how the scheduler of Nut/OS 
works? Or is there even a document about this topic?

Looking at the code, I see the following:

The runQueue always points to the thread which is running at the moment
(but what is runningThread, then?). If this thread wants to block,
NutThreadRemoveQueue(td, runQueue) is called to remove it from the
runQueue, which then points to the next thread that is ready. Which
thread that is, is stored in the td_qnxt field of the running
thread. So what should be there if no other thread wants to run? Should
it point to the idle thread? If so, who is supposed to ensure this?
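
To make this reading concrete, here is a minimal model of the removal
step in C. It is a sketch under assumptions, not the Nut/OS source:
ThreadModel, model_add_queue() and model_remove_queue() are hypothetical
names, only the td_qnxt link and a priority are modelled, critical
sections are left out, and the ordering rule (lower value runs first) is
assumed.

```c
#include <stddef.h>

/* Hypothetical stand-in for NUTTHREADINFO. */
typedef struct ThreadModel {
    struct ThreadModel *td_qnxt;  /* next thread in the queue, or NULL */
    int td_priority;              /* assumption: lower value runs first */
} ThreadModel;

/* Insert td into the queue rooted at *tqpp, keeping it sorted by priority. */
void model_add_queue(ThreadModel *td, ThreadModel **tqpp)
{
    while (*tqpp != NULL && (*tqpp)->td_priority <= td->td_priority) {
        tqpp = &(*tqpp)->td_qnxt;
    }
    td->td_qnxt = *tqpp;
    *tqpp = td;
}

/* Unlink td from the queue rooted at *tqpp; does nothing if td is absent. */
void model_remove_queue(ThreadModel *td, ThreadModel **tqpp)
{
    while (*tqpp != NULL) {
        if (*tqpp == td) {
            /* If td was the only entry, the root pointer becomes NULL here. */
            *tqpp = td->td_qnxt;
            td->td_qnxt = NULL;
            return;
        }
        tqpp = &(*tqpp)->td_qnxt;
    }
}
```

In this model the root pointer stays non-NULL only as long as some
lowest-priority thread (the idle thread) is never removed. Removing the
last remaining entry sets the root to NULL, which is the situation I
observe with runQueue.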

Thanks for your help!

Regards,
Philipp