[En-Nut-Discussion] NutThreadRemoveQueue clears runQueue to NULL

Harald Kipp harald.kipp at egnite.de
Fri Aug 23 18:14:05 CEST 2013


Hi Philipp,

On 23.08.2013 09:47, Philipp Burch wrote:
> On 08/22/2013 03:41 PM, Harald Kipp wrote:

>> This essential kernel code is used by all applications and it is most
>> unlikely, that a bug survived for such a long time. However, assuming
> 
> Please don't understand me wrong, I'm not blaming you or Nut/OS for this

I fully understood your request.


>> For example, are you using the idle thread as part of your application?
> 
> I don't think I'm using it in such an "uncommon" way in my application.
> The idle thread is just there, untouched.

The idle thread was my main hope to explain the problem. OK, busted.


>> In general, trying to find Nut/OS kernel bugs in a complex application
>> is not a good idea. If any specific part of Nut/OS raise suspicion, it
>> helps a lot to write a minimalist test application to reproduce the bug.
>>
> 
> This is correct, but in this case very hard to do. As I noted, already a
> slight change in the code (such as printing a few less characters to the
> UART) makes the bug disappear, so it's quite hard to reproduce it with a

Definitely. However, if a specific API call is suspicious, a simple
stress test may help to trigger the problem.


> simple test application. But I suppose there's no other way than to
> incrementally remove functionality and check the behaviour after each
> change.

Just from my personal experience: Step by step reduction and testing is
quite time consuming. I'd still recommend a specially crafted stress test.


>> I don't think that Nut/OS anticipates the runQueue becoming NULL. That
>> would mean, that no thread is ready to run. That's were the idle thread
>> jumps in. If all other threads are waiting, the idle thread is running
>> until another thread becomes ready to run again. This implies, that idle
>> thread callbacks cannot call blocking functions.
>>
> 
> Ok, this sounds interesting. But what do you mean by "no thread is ready
> to run"? I suppose it is very common that there are moments in which all
> application threads are either waiting for input or for a timeout. Would
> it then be reasonable for the runQueue to become NULL or not? I can't
> completely follow you in this paragraph.

Sorry, my explanation was indeed vague. The runQueue contains all
threads that are ready to run. This queue is always sorted by thread
priority. The idle thread never calls any blocking function and is
therefore always ready to run and always kept in the runQueue. It does
call NutThreadYield(), which is not blocking, but which will do a
context switch if another thread with higher priority becomes ready to
run.
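
To make the "sorted by thread priority" part concrete, here is a minimal
sketch of such a queue. It is not the actual Nut/OS code: the type
demo_thread_t and the functions demo_queue_add() and demo_queue_remove()
are made-up names for illustration, and the rule that equal-priority
threads are queued behind existing ones is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified thread record; not the real Nut/OS structure. */
typedef struct demo_thread {
    const char *name;
    unsigned char priority;      /* lower value means higher priority */
    struct demo_thread *next;    /* next entry in the queue */
} demo_thread_t;

/*
 * Insert t so that the queue stays sorted by ascending priority value.
 * Assumption: a thread is placed behind entries of equal priority.
 */
static void demo_queue_add(demo_thread_t **queue, demo_thread_t *t)
{
    assert(queue != NULL && t != NULL);
    while (*queue != NULL && (*queue)->priority <= t->priority) {
        queue = &(*queue)->next;
    }
    t->next = *queue;
    *queue = t;
}

/* Unlink t from the queue; does nothing if t is not queued. */
static void demo_queue_remove(demo_thread_t **queue, demo_thread_t *t)
{
    assert(queue != NULL && t != NULL);
    while (*queue != NULL) {
        if (*queue == t) {
            *queue = t->next;
            t->next = NULL;
            return;
        }
        queue = &(*queue)->next;
    }
}
```

With an idle thread at priority 254 and an application thread at, say,
64, the application thread ends up at the head and the idle thread at
the tail. Removing the application thread (because it blocks) leaves
the idle thread as the only entry, so the queue never becomes empty.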

Having said that, I have a new idea, which would explain your problem...

Explanation 1: The idle thread is running at lowest possible priority.
Typically all other threads are running at higher priorities.

(You can skip this. The confusing part is that this had been 255 in
early releases, which didn't support thread termination. Thread
termination was then first implemented by setting the thread's priority
to 255 and redefining the lowest possible priority for a running thread
as 254. This was a simple hack and was changed back later. But, if I
remember correctly, it was not fully reversed, because some applications
had already been written that killed threads by setting their priority
to 255. OK, this part explains the magic about thread priorities 254 and
255. Btw., everyone is probably aware that lower priority values mean
higher priorities.)

Explanation 2: If two threads are running at the same priority, each
direct or indirect call to NutThreadYield() will result in a context
switch as well. This guarantees, that threads are switched, even if they
are not waiting for any event.

Now combining these two explanations: Normally all other threads are
running at priorities higher than 254, that is, with priority values
below 254. Therefore the idle thread is the last entry in the runQueue,
_always_. However, if another thread is also running at priority 254,
then the idle thread may lose its position at the end of the queue. That
may cause trouble if the other thread is blocked. While it is generally
allowed to run threads at priority 254, those threads are not expected
to call any blocking function.

Are you running any thread at priority 254? If yes, we found the
problem. (Otherwise: Grrr... busted again. In that case: Are you
changing priorities often?)
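
If it helps to check this in a stress test, a diagnostic along the
following lines might be useful. This is only a sketch on a simplified
stand-in for the queue: demo_thread_t, demo_idle_is_last() and
demo_idle_rivals() are hypothetical names and not part of Nut/OS, and
it assumes you can obtain a pointer to the idle thread's entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified queue entry; not the real Nut/OS structure. */
typedef struct demo_thread {
    unsigned char priority;      /* lower value means higher priority */
    struct demo_thread *next;
} demo_thread_t;

/* True if idle is the final entry of the queue; false for an empty queue. */
static bool demo_idle_is_last(const demo_thread_t *queue,
                              const demo_thread_t *idle)
{
    assert(idle != NULL);
    if (queue == NULL) {
        return false;            /* the idle thread is missing entirely */
    }
    while (queue->next != NULL) {
        queue = queue->next;
    }
    return queue == idle;
}

/* Number of queued threads, other than idle, whose priority value is
 * not below the idle thread's, i.e. which do not outrank it. */
static size_t demo_idle_rivals(const demo_thread_t *queue,
                               const demo_thread_t *idle)
{
    size_t n = 0;

    assert(idle != NULL);
    for (; queue != NULL; queue = queue->next) {
        if (queue != idle && queue->priority >= idle->priority) {
            n++;
        }
    }
    return n;
}
```

A non-zero rival count, or the idle thread not being last, would point
to the situation described above.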


> Would you mind posting a short comment about how the scheduler of Nut/OS
> works? Or is there even a document about this topic?

There is an old document from 2002. Not sure how much has changed
since then.

http://www.ethernut.de/pdf/entet100.pdf

A few additional things are explained in

http://www.ethernut.de/pdf/enswm28e.pdf

and

http://www.ethernut.de/pdf/enmem21e.pdf

> Looking at the code, I see the following:
> 
> The runQueue always points to the thread which is running at the moment
> (but what is runningThread, then?). If this thread wants to block, a

Not really. The runQueue is a linked list of all threads that are
ready-to-run, with the highest priority thread on top and, typically,
the idle thread at the bottom.

The runningThread points to the thread that is _actually_ running.
Because scheduling is cooperative, a higher priority thread may be
ready-to-run, sitting on top of the runQueue, without being the thread
that is running. Only when the runningThread gives up control is the
higher priority thread able to take over.
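
The difference between the two pointers can be sketched like this. It
is a toy model, not the kernel code: demo_kernel_t, its two fields and
the functions demo_switch_pending() and demo_yield() are hypothetical
names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified thread record; not the real Nut/OS structure. */
typedef struct demo_thread {
    unsigned char priority;      /* lower value means higher priority */
    struct demo_thread *next;
} demo_thread_t;

/* Toy scheduler state: head of the ready list and the thread on the CPU. */
typedef struct {
    demo_thread_t *run_queue;    /* ready threads, highest priority first */
    demo_thread_t *running;      /* thread that is actually executing */
} demo_kernel_t;

/* True if the head of the ready list is not the thread that is executing. */
static bool demo_switch_pending(const demo_kernel_t *k)
{
    assert(k != NULL);
    return k->run_queue != k->running;
}

/* Cooperative yield: only here does the head of the list take over. */
static void demo_yield(demo_kernel_t *k)
{
    assert(k != NULL);
    if (k->run_queue != NULL) {
        k->running = k->run_queue;
    }
}
```

If a higher priority thread is linked in at the head while a lower
priority thread is executing, demo_switch_pending() reports true, yet
running stays unchanged until demo_yield() is called.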

Let me know, if you need further explanation.

Regards,

Harald



