[En-Nut-Discussion] NutThreadRemoveQueue clears runQueue to NULL

Philipp Burch phip at hb9etc.ch
Mon Aug 19 17:44:23 CEST 2013


Hi everyone,

I've been tracking down a very annoying bug in an Ethernut project over 
the past few days. The bug caused the processor (Cortex-M3, LM3S9D90) 
to raise a memory fault because it tried to execute data from a region 
marked as XN (execute never). The debugger showed a completely trashed 
stack pointer, so I had no idea where the fault was actually triggered. 
Single-stepping does not help much because the bug is very "fragile": 
halt the program anywhere in the critical code and everything runs 
fine. Interestingly, though, the bug shows up in the same way in both 
debug and release (optimized) builds.

Anyway, by placing conditional breakpoint statements directly in the 
code, I was able to find out that a call to NutThreadRemoveQueue() 
caused the runQueue pointer of the scheduler to be set to NULL. The 
offending code is the marked line in this function, at approx. line 160 
in os/thread.c:

--------- 8< ----------- 8< --------------

if (tqp != SIGNALED) {
     while (tqp) {
         if (tqp == td) {
             NutEnterCritical();
             *tqpp = td->td_qnxt;  /* td->td_qnxt may be NULL. WHY? */
             if (td->td_qpec) {
                 if (td->td_qnxt) {
                     td->td_qnxt->td_qpec = td->td_qpec;
                 }
                 td->td_qpec = 0;
             }
             NutExitCritical();

             td->td_qnxt = 0;
             td->td_queue = 0;
             break;
         }
         tqpp = &tqp->td_qnxt;
         tqp = tqp->td_qnxt;
     }
}

--------- 8< ----------- 8< --------------
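
To make the effect of the marked line concrete, here is a stand-alone 
sketch of the same pointer-to-pointer unlink. It is only a model: 
thread_t, queue_remove() and removing_last_entry_empties_queue() are 
hypothetical names, and the critical section and td_qpec handling of 
the real function are left out.

```c
#include <stddef.h>

/* Simplified stand-in for the kernel's thread structure (assumption:
 * only the queue link matters for this illustration). */
typedef struct thread {
    struct thread *td_qnxt;   /* next entry in the queue, NULL at the tail */
} thread_t;

/* Unlink td from the singly linked queue whose root is *tqpp.
 * Same walk as in the excerpt above, minus locking and td_qpec. */
static void queue_remove(thread_t *td, thread_t **tqpp)
{
    thread_t *tqp = *tqpp;

    while (tqp) {
        if (tqp == td) {
            /* If td is the head and has no successor, this stores
             * NULL into the queue root itself. */
            *tqpp = td->td_qnxt;
            td->td_qnxt = NULL;
            break;
        }
        tqpp = &tqp->td_qnxt;
        tqp = tqp->td_qnxt;
    }
}

/* Returns 1 if removing the only entry leaves the root NULL. */
static int removing_last_entry_empties_queue(void)
{
    thread_t idle = { NULL };
    thread_t *run_queue = &idle;   /* queue with a single entry */

    queue_remove(&idle, &run_queue);
    return run_queue == NULL;
}
```

In this model, removing the only entry of a queue necessarily leaves 
its root NULL, so the marked line behaves as written; the open question 
is why the run queue held nothing but that one thread at that moment.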

The last time I hit the bug, td pointed to the idle thread, which had 
td_qnxt set to NULL. I'm not sure, however, whether it is always the 
idle thread that triggers the problem; maybe others can do this too. 
tqpp evidently pointed to the global runQueue at that moment, so the 
marked assignment cleared it. The call stack shows a call to fputs() on 
the standard UART output as the root. The following functions are then 
called:

fputs() -> _write() -> UsartWrite() -> UsartPut() -> UsartFlushOutput() 
-> NutEventWait() -> NutThreadRemoveQueue()

Up to this point, I was able to figure out what is going wrong. But now 
I have the problem that I do not really know how those queues are 
supposed to work. Is it legal for a thread to have td_qnxt set to NULL? 
Is it legal for NutThreadRemoveQueue() to set *tqpp to NULL? If so, why 
can this happen to the runQueue as well?

Please give me some advice on where to look for the real cause of the 
problem. Could it be a stray pointer somewhere that clears td_qnxt? Or 
perhaps a buffer overrun? Or is there a bug somewhere in the kernel 
(which I hope is not the case)?
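
One way to narrow this down might be to trap at the instant a queue 
root is about to be cleared, so that the call stack is still intact. 
The following is only a sketch under assumptions: TRAP_IF, DEBUG_TRAP, 
trap_count, struct td and check_unlink() are hypothetical names and not 
part of Nut/OS, and the bkpt path assumes a GCC-style compiler 
targeting the Cortex-M3.

```c
#include <stddef.h>

/* Hypothetical debug aid, not part of Nut/OS. */
static volatile unsigned trap_count;   /* counts hits, also on a host build */

#if defined(__ARM_ARCH_7M__)
/* Halts in an attached debugger. Without a debugger the core
 * escalates the breakpoint to a fault, so use in debug sessions only. */
#define DEBUG_TRAP() __asm__ volatile ("bkpt #0")
#else
#define DEBUG_TRAP() ((void)0)
#endif

#define TRAP_IF(cond) \
    do { if (cond) { trap_count++; DEBUG_TRAP(); } } while (0)

/* Stand-in for the kernel's thread structure (assumption: link only). */
struct td {
    struct td *td_qnxt;
};

/* Intended to run just before "*tqpp = td->td_qnxt": traps if tqpp
 * is the watched root and td has no successor, i.e. the assignment
 * would store NULL into that root. */
static void check_unlink(struct td **root, struct td **tqpp,
                         const struct td *td)
{
    TRAP_IF(tqpp == root && td->td_qnxt == NULL);
}
```

With the address of runQueue passed as the root, the trap would fire 
only for the case described above and leave ordinary removals alone.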

Thanks,
Philipp

