[En-Nut-Discussion] NutThreadRemoveQueue clears runQueue to NULL

Philipp Burch phip at hb9etc.ch
Thu Aug 22 09:48:19 CEST 2013


Hi Ole!

On 08/20/2013 10:32 PM, Ole Reinhardt wrote:
> Hi Philip,
>
>> I've been tracking down a very annoying bug in some Ethernut project
>> during the past days. The bug caused the processor (Cortex-M3, LM3S9D90)
>> to generate a memory fault because it tried to execute data which is
>> marked as XN (execute never).
>
> Have you tried out my enhanced CM3 fault handlers that print out the
> register dump using the debug macro? I introduced them in trunk r5254.

Ok, I've integrated the debug macro for the LM3S now. It's available in 
r5268 in the devnut_lm3s branch.

> For the LMxxx CPUs you would have to add a configurator option and a
> debug macro. See my changes in r5254 as an example.
>
> It would be very helpful to see this dump.
>
> Very likely the stack pointer won't show the real stack location, and
> neither will the other registers. But the dump of the mentioned
> exception handler will show further registers (fault reason, fault
> address etc.) which may help to find the real location that triggered
> the bug.
>
> I'm quite sure the real problem is not located in
> NutThreadRemoveQueue() but is an overwritten stack.

Well, it's not that easy, unfortunately. When entering the Memfault
handler, the SP already points into the flash memory region (0x19d90 in
the last try; the SRAM starts at 0x20000000). The debug macro is
therefore not able to print anything: its first function call cannot
use that stack, so it never returns and instead escalates to a hard
fault.
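
A fault handler could guard against this by checking the stack pointer
before calling any C code. The following is only a minimal sketch of
such a check, not something from the Nut/OS sources: the function name
is made up, and the SRAM size must be supplied by the caller (only the
base address 0x20000000 is taken from above).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns 1 if a full-descending stack at `sp`
 * lies inside the RAM region [base, base + size] and has at least
 * `needed` bytes of room below it, otherwise 0.
 */
int sp_usable(uint32_t sp, uint32_t base, uint32_t size, uint32_t needed)
{
    /* Guard against wrap-around when computing the top of RAM. */
    assert(size <= UINT32_MAX - base);

    uint32_t top = base + size;     /* one past the last RAM byte */

    if (sp < base || sp > top) {
        return 0;                   /* e.g. 0x19d90 points into flash */
    }
    return (sp - base) >= needed;   /* pushes move SP downwards */
}
```

With base 0x20000000, the SP value 0x19d90 from the failing run is
rejected, while the manually chosen 0x20001000 is accepted.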

This is what the debugger says when I query the registers right after it 
enters the Memfault handler:

-------- 8< ---------- 8< ------------

(gdb) info reg
r0             0x0	0
r1             0x200004d0	536872144
r2             0x1	1
r3             0x0	0
r4             0xf3efbf0c	4092575500
r5             0xf3ef8008	4092559368
r6             0xf04f8009	4031741961
r7             0xf0030105	4026728709
r8             0xbf00bfff	3204497407
r9             0x4603b084	1174646916
r10            0xf88d9100	4170027264
r11            0xf99d3007	4187828231
r12            0x1010101	16843009
sp             0x19d90	0x19d90
lr             0xfffffff9	4294967289
pc             0x19d5c	0x19d5c <IntMemfaultHandler>
xpsr           0x9000004	150994948
MSP            0x19d90	105872
PSP            0x2000183c	536877116
PRIMASK        0x0	0
BASEPRI        0x0	0
FAULTMASK      0x0	0
CONTROL        0x0	0

-------- 8< ---------- 8< ------------

When I now manually adjust the stack pointer to point to some valid SRAM 
location, the debug macro works.

-------- 8< ---------- 8< ------------

(gdb) set $sp=0x20001000
Cannot access memory at address 0xf99d3007
Cannot access memory at address 0xf99d3007
(gdb) info reg
r0             0x0	0
r1             0x200004d0	536872144
r2             0x1	1
r3             0x0	0
r4             0xf3efbf0c	4092575500
r5             0xf3ef8008	4092559368
r6             0xf04f8009	4031741961
r7             0xf0030105	4026728709
r8             0xbf00bfff	3204497407
r9             0x4603b084	1174646916
r10            0xf88d9100	4170027264
r11            0xf99d3007	4187828231
r12            0x1010101	16843009
sp             0x20001000	0x20001000
lr             0xfffffff9	4294967289
pc             0x19d5c	0x19d5c <IntMemfaultHandler>
xpsr           0x9000004	150994948
MSP            0x20001000	536875008
PSP            0x2000183c	536877116
PRIMASK        0x0	0
BASEPRI        0x0	0
FAULTMASK      0x0	0
CONTROL        0x0	0
(gdb) cont

-------- 8< ---------- 8< ------------

Output on the serial console:

-------- 8< ---------- 8< ------------

---------------------------------------------------
[Mem Fault handler - all numbers in hex]

R0        = 0x20000ff4
R1        = 0x20000ffc
R2        = 0x20000ffc
R3        = 0x20001004
R12       = 0x20001004
LR [R14]  = 0x2000100c
PC [R15]  = 0x2000100c
PSR       = 0x20001014
BFAR      = 0xe000ed38
CFSR      = 0x00000001
HFSR      = 0x00000000
DFSR      = 0x00000001
AFSR      = 0x00000000
SCB_SHCSR = 0x00070001
MMADDR    = 0xe000ed34
---------------------------------------------------

-------- 8< ---------- 8< ------------

I've added a line for printing the MMADDR register as well, as this
/may/ provide additional information about where the fault occurred.

Interpreting the values, I see the following:
- It is a memory management fault (obviously)
- The fault was triggered by an instruction fetch from an XN location
- The MMADDR value does NOT point to the offending instruction

The last point is stated in the datasheet ("This fault occurs on any
access to an XN region, even when the MPU is disabled or not present.
When this bit is set, the PC value stacked for the exception return
points to the faulting instruction and the address of the attempted
access is not written to the MMADDR register."). The printed MMADDR
value would be useless anyway: 0xe000ed34 lies in the peripheral
address space and does not contain any code. It is in fact the address
of the MMADDR register itself, so the handler apparently prints the
register's address rather than its contents. The mentioned "stacked
value" is of course not accessible either, as there was no valid stack
to store it on.
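
The CFSR value from the dump can also be decoded mechanically. This is
a minimal sketch, assuming the ARMv7-M bit names for the low byte of
CFSR (the LM3S datasheet uses its own names for the same bits). The
struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Memory management fault flags, low byte of CFSR (ARMv7-M names). */
typedef struct {
    int iaccviol;   /* bit 0: instruction access violation (XN fetch) */
    int daccviol;   /* bit 1: data access violation */
    int munstkerr;  /* bit 3: fault while unstacking on exception return */
    int mstkerr;    /* bit 4: fault while stacking on exception entry */
    int mmarvalid;  /* bit 7: fault address register holds a valid address */
} mmfsr_flags;

/* Hypothetical helper: split a raw CFSR value into its MemManage flags. */
void mmfsr_decode(uint32_t cfsr, mmfsr_flags *out)
{
    assert(out != NULL);
    out->iaccviol  = (cfsr >> 0) & 1u;
    out->daccviol  = (cfsr >> 1) & 1u;
    out->munstkerr = (cfsr >> 3) & 1u;
    out->mstkerr   = (cfsr >> 4) & 1u;
    out->mmarvalid = (cfsr >> 7) & 1u;
}
```

For CFSR = 0x00000001 this yields iaccviol = 1 and mmarvalid = 0, which
matches the reading above: an instruction fetch violation without a
valid fault address.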


But this is all relatively useless, as I already figured out how the 
memory fault was triggered. Please have a look at my first post:

> Anyway, by using conditional breakpoint statements directly in the
> code, I was able to find out that a call to NutThreadRemoveQueue caused
> the runQueue pointer of the scheduler to be set to NULL. The offending
> code is the marked line in this function, approx. line 160 in
> os/thread.c:

The runQueue is trashed, so the scheduler will read random values for 
the SP and therefore branch to a completely wrong location when 
switching to the next task.

So my question really is: Is it valid for a thread to have td_qnxt set
to NULL? If so, why is it possible that NutThreadRemoveQueue uses this
value to clear the runQueue pointer?
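
To illustrate the question, here is a minimal sketch of removing an
entry from a NULL-terminated, singly linked queue. It is not the actual
os/thread.c code: the struct, the field qnxt (standing in for td_qnxt)
and the function name are simplified stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a thread descriptor. */
typedef struct thread {
    struct thread *qnxt;    /* plays the role of td_qnxt */
    int id;
} thread_t;

/*
 * Unlink `td` from the NULL-terminated list whose head pointer is
 * *headp. Returns 1 if the entry was found and removed, 0 otherwise.
 */
int queue_remove(thread_t *td, thread_t **headp)
{
    assert(td != NULL && headp != NULL);

    for (thread_t **pp = headp; *pp != NULL; pp = &(*pp)->qnxt) {
        if (*pp == td) {
            /* If td is the only entry, this writes NULL to the head. */
            *pp = td->qnxt;
            td->qnxt = NULL;
            return 1;
        }
    }
    return 0;
}
```

In such a list the last entry always has a NULL next pointer, so
removing the only entry of a queue necessarily leaves the head pointer
NULL. Whether the run queue may ever be in that state is exactly what I
am asking.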

Thanks,
Philipp

