[En-Nut-Discussion] Debugging Watchdogs via OS Thread Switching

Tue Aug 25 00:00:55 CEST 2009

Hi Timothy!

Overall not a bad idea, but two things:

How would a high-priority task ever let the watchdog strike? I mean, yes, 
most of the system crashes in the development phase will be caused by 
driver problems, and these crashes will stop any code execution. So that 
high-prio task will get killed too.

But if the software is 'finished' (I know, software is never finished) 
everything runs smoothly. Once in a while a task will hang, but it's only 
one of the four controlling a big motor in a complex machine... And your 
high-prio task is stealing exactly those 2 microseconds the other task 
would have needed not to hang. Only a simple race condition is needed for 
that. And your watchdog is reset and reset and reset, because that 
high-prio task is still alive :)

<Example>
We investigated a problem on one of our systems, where after some time a 
radio connection gets lost. All tasks are still running except one. But 
that one is not killing the system, it just yields.
The problem was caused by concurrent access to one SPI device by two 
different callers. One task was writing through a buffer, the other task 
accessed the device directly. The driver wasn't aware of this, and after 
sending out the second task's data it never finished the first one's, so 
that task simply kept yielding.
A watchdog producing a freezeframe of the system might help to find such a 
problem, because debug output slows down the system and the tasks shift in 
time. By adding more and more debug code, the concurrent accesses to the 
SPI device get fewer and fewer. So we had to spread debug code carefully 
and bit by bit.
A watchdog reset put into each thread, one at a time, is not that 
intrusive and could help. But the watchdog would never have struck if it 
had been reset in a high-prio task.

Now we have another one: The system hangs about every four weeks... Yep. 
I think I'll have another dozen grey hairs by the time I've found that one.
</Example>
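
The "one thread at a time" idea from the example can be sketched as 
follows. This is only a minimal host-side illustration: WDT_OWNER, 
WdtKickIfOwner(), WdtKickCount() and the stub WdtHardwareRestart() are 
hypothetical names, not part of Nut/OS. On a target, the stub would be 
replaced by whatever call restarts the hardware watchdog on that platform.

```c
/* Hypothetical: ID of the single thread currently allowed to reset the
 * watchdog. Change it and rebuild to put the next thread under watch. */
#ifndef WDT_OWNER
#define WDT_OWNER 2
#endif

/* Counts resets; stands in for the hardware watchdog on the host. */
static unsigned long wdt_kicks;

/* Stub (assumption): on a target this would restart the real watchdog. */
static void WdtHardwareRestart(void)
{
    wdt_kicks++;
}

/* Call this from every thread's main loop, passing that thread's own ID.
 * Only the selected thread actually resets the watchdog, so the watchdog
 * strikes as soon as that particular thread stops making progress. */
void WdtKickIfOwner(int thread_id)
{
    if (thread_id == WDT_OWNER) {
        WdtHardwareRestart();
    }
}

/* Number of resets seen so far (for inspection on the host only). */
unsigned long WdtKickCount(void)
{
    return wdt_kicks;
}
```

Every thread keeps the same call in its loop, so the timing of the system 
changes very little when the owner is moved from one thread to the next.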

In my understanding there are two options for where a watchdog should be 
reset before it triggers:
1) You have a very, very important task that needs to be executed within 
a certain time, again and again.
2) The watchdog is reset in the idle task. The idle task has the lowest 
priority, and resetting the WD here ensures that every task is giving 
time to every other task in the chain, and therefore the idle task gets 
some system time now and then too.
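
To make the difference between the two options concrete, here is a small 
host-side model. It is not Nut/OS code: the three fixed tasks, the tick 
granularity and the name SimulateWatchdog() are assumptions made only for 
illustration. Each loop iteration is one scheduling step in which the 
highest-priority ready task runs; the worker can be made permanently ready 
to model a task that never gives lower priorities a chance.

```c
#include <stdbool.h>

enum feeder { FEED_FROM_HIPRIO, FEED_FROM_IDLE };

/* Returns the tick at which the watchdog expires, or -1 if it does not
 * expire within 'ticks' scheduling steps. 'timeout' is in ticks. */
int SimulateWatchdog(enum feeder who, bool worker_hogs, int ticks, int timeout)
{
    int since_feed = 0;

    for (int t = 0; t < ticks; t++) {
        /* Strict priority order: hiprio > worker > idle. */
        bool hiprio_ready = (t % 10) == 0;             /* wakes every 10th tick */
        bool worker_ready = worker_hogs || (t % 3) == 0;

        if (hiprio_ready) {
            if (who == FEED_FROM_HIPRIO) {
                since_feed = 0;                        /* option 1 */
            }
        } else if (worker_ready) {
            /* Worker runs; it never touches the watchdog. */
        } else {
            /* Nothing else is ready, so the idle task runs. */
            if (who == FEED_FROM_IDLE) {
                since_feed = 0;                        /* option 2 */
            }
        }

        if (++since_feed > timeout) {
            return t;                                  /* watchdog strikes */
        }
    }
    return -1;
}
```

With a hogging worker, SimulateWatchdog(FEED_FROM_HIPRIO, true, 1000, 50) 
never reports a strike, while SimulateWatchdog(FEED_FROM_IDLE, true, 1000, 
50) does. Note that this model covers starvation only: a task that merely 
blocks and yields, as in the SPI example, leaves the idle task running, so 
neither option would catch it.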

Oh, so much text, but here comes the second comment:
Investigating something that can save freezeframes of the system when it 
crashes is something I am interested in too. But I'd like to see it as 
an option in nutconf. Have a look at one of the first folders there; 
you'll find Nut/OS Debugging. It should appear there as a configurable 
option to switch it on or off as needed.
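
A rough sketch of what such a switchable freezeframe could look like is 
given below. Everything in it is an assumption for illustration: the macro 
NUTDEBUG_FREEZEFRAME, the FREEZEFRAME layout and the functions 
FreezeFrameSave() and FreezeFrameLoad() do not exist in Nut/OS. On a 
target the record would have to live in RAM that the startup code does not 
clear, which is toolchain-specific and left out here; the magic value and 
checksum are there so that random RAM contents after a cold start are not 
mistaken for a saved frame.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical switch; in a real tree the configurator would generate it. */
#ifndef NUTDEBUG_FREEZEFRAME
#define NUTDEBUG_FREEZEFRAME 1
#endif

#define FREEZEFRAME_MAGIC 0x46524D45UL   /* arbitrary marker value */

typedef struct {
    uint32_t magic;            /* marks the record as written */
    char     next_thread[9];   /* name of the thread switched to (assumed size) */
    uint32_t heap_free;        /* free heap at the time of the switch */
    uint32_t checksum;         /* guards against stale or random RAM */
} FREEZEFRAME;

/* On a target this would be placed in non-initialized RAM. */
static FREEZEFRAME frame;

/* Simple rotate-and-xor checksum over the payload fields. */
static uint32_t FrameSum(const FREEZEFRAME *f)
{
    uint32_t sum = f->magic ^ f->heap_free;
    for (size_t i = 0; i < sizeof(f->next_thread); i++) {
        sum = ((sum << 1) | (sum >> 31)) ^ (uint8_t)f->next_thread[i];
    }
    return sum;
}

/* Would be called from the context switch when the option is enabled. */
void FreezeFrameSave(const char *next_name, uint32_t heap_free)
{
#if NUTDEBUG_FREEZEFRAME
    memset(frame.next_thread, 0, sizeof(frame.next_thread));
    strncpy(frame.next_thread, next_name, sizeof(frame.next_thread) - 1);
    frame.heap_free = heap_free;
    frame.magic = FREEZEFRAME_MAGIC;
    frame.checksum = FrameSum(&frame);
#else
    (void)next_name;
    (void)heap_free;
#endif
}

/* Called early after a watchdog reset; returns NULL if no valid frame. */
const FREEZEFRAME *FreezeFrameLoad(void)
{
    if (frame.magic != FREEZEFRAME_MAGIC || frame.checksum != FrameSum(&frame)) {
        return NULL;
    }
    return &frame;
}
```

With the switch set to 0 the save call compiles to nothing, so a release 
build pays no cost in the context switch.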

Best regards,
Ulrich

Timothy M. De Baillie wrote:
> I was thinking today about adding some debug code to the function 
> NutThreadSwitch in context_icc.c.  Let me explain my thinking here.
> 
> I have a watchdog timer set for some time (say 5 seconds).
> My watchdog is kept at bay by a high priority (low in number) thread 
> that sleeps for at least 1 second at a time (via a NutSleep(1000)).  
> (this is the highest priority thread in the software)
> I need to debug which thread is causing a watchdog in a very complex 
> multi-threaded system.
> Therefore, it would be "nice" if the OS saved the state of the last 
> context switch.
> 
> Looking through the OS code, all of the "yielding" functions ultimately 
> call NutThreadResume, which then calls NutThreadSwitch.  From my 
> understanding of the code (by inspection), I could save the thread I am 
> going to switch to, the total amount of RAM free, and the watchdog timer 
> information in a protected piece of RAM for later inspection from the 
> NutThreadSwitch function.
> 
> So my implementation would include modifying the OS Ram size to give me 
> some bytes of RAM to play with above the Heap.  Every NutThreadSwitch 
> would save the information I listed above.  Additionally I was wondering 
> if there was an easy way to get the specific line of code (code space 
> address) of the return path of the yield in both the "last thread" and 
> "next thread".  This would help narrow down the problem to where in a 
> thread the problem might occur.  On a watchdog, the first thing my 
> software would do, would be to read the contents of the RAM above the 
> Heap and report it. 
> 
> I understand that these extra calls will slow down the context switch, 
> and therefore I would only add it in with a DEBUG compiler definition. 
> 
> Does anyone have any thoughts or suggestions on the implementation of this?
> 
> Thanks in advance,
> 
> Tim
> _______________________________________________
> http://lists.egnite.de/mailman/listinfo/en-nut-discussion