[En-Nut-Discussion] NutThreadRemoveQueue clears runQueue to NULL
Philipp Burch
phip at hb9etc.ch
Thu Aug 22 09:48:19 CEST 2013
Hi Ole!
On 08/20/2013 10:32 PM, Ole Reinhardt wrote:
> Hi Philip,
>
>> I've been tracking down a very annoying bug in some Ethernut project
>> during the past days. The bug caused the processor (Cortex-M3, LM3S9D90)
>> to generate a memory fault because it tried to execute data which is
>> marked as XN (execute never).
>
> Have you tried out my enhanced CM3 fault handlers that print out the
> register dump using the debug macro? I introduced them in trunk r5254.
Ok, I've integrated the debug macro for the LM3S now. It's available in
r5268 in the devnut_lm3s branch.
> For the LMxxx CPUs you would have to add a configurator option and a
> debug macro. See my changes in r5254 as an example.
>
> It would be very helpful to see this dump.
>
> Very likely the stack pointer won't show the real stack location and
> other registers won't either. But the dump of the mentioned exception
> handler will show further registers (fault reason and fault address
> etc.) which may help to find the real location, that triggered the bug.
>
> I'm quite sure the real problem is not located in
> NutThreadRemoveQueue() but is an overwritten stack.
Well, it's not that easy, unfortunately. On entry to the Memfault
handler, the SP already points into the flash memory region (0x19d90 in
the last try, whereas the SRAM starts at 0x20000000). The debug macro
is therefore not able to print anything, because the first function
call never returns and instead escalates to a hard fault.
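A handler could guard against this by checking the stack pointer before
it calls any C function. The following is only a minimal sketch: the
SRAM base is the value mentioned above, while SRAM_SIZE and the helper
name sp_is_usable() are assumptions made for illustration and are not
part of Nut/OS.

```c
#include <stdbool.h>
#include <stdint.h>

/* SRAM window. The base address is taken from the text above; the size
 * is a placeholder (assumption) and must be replaced with the value
 * from the data sheet of the actual part. */
#define SRAM_BASE 0x20000000u
#define SRAM_SIZE 0x00018000u

/* Hypothetical helper: returns true if sp is word aligned, lies inside
 * the SRAM window and leaves at least `need` bytes below it (the stack
 * grows downwards). */
bool sp_is_usable(uint32_t sp, uint32_t need)
{
    if ((sp & 0x3u) != 0u) {
        return false;               /* not word aligned */
    }
    if (sp < SRAM_BASE || sp > SRAM_BASE + SRAM_SIZE) {
        return false;               /* outside SRAM, e.g. in flash */
    }
    return (sp - SRAM_BASE) >= need;
}
```

With the values from this post, sp_is_usable(0x19d90, 64) is false and
sp_is_usable(0x20001000, 64) is true, which matches what happens when
the SP is adjusted by hand in the debugger.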
This is what the debugger says when I query the registers right after it
enters the Memfault handler:
-------- 8< ---------- 8< ------------
(gdb) info reg
r0 0x0 0
r1 0x200004d0 536872144
r2 0x1 1
r3 0x0 0
r4 0xf3efbf0c 4092575500
r5 0xf3ef8008 4092559368
r6 0xf04f8009 4031741961
r7 0xf0030105 4026728709
r8 0xbf00bfff 3204497407
r9 0x4603b084 1174646916
r10 0xf88d9100 4170027264
r11 0xf99d3007 4187828231
r12 0x1010101 16843009
sp 0x19d90 0x19d90
lr 0xfffffff9 4294967289
pc 0x19d5c 0x19d5c <IntMemfaultHandler>
xpsr 0x9000004 150994948
MSP 0x19d90 105872
PSP 0x2000183c 536877116
PRIMASK 0x0 0
BASEPRI 0x0 0
FAULTMASK 0x0 0
CONTROL 0x0 0
-------- 8< ---------- 8< ------------
When I now manually adjust the stack pointer to point to some valid SRAM
location, the debug macro works.
-------- 8< ---------- 8< ------------
(gdb) set $sp=0x20001000
Cannot access memory at address 0xf99d3007
Cannot access memory at address 0xf99d3007
(gdb) info reg
r0 0x0 0
r1 0x200004d0 536872144
r2 0x1 1
r3 0x0 0
r4 0xf3efbf0c 4092575500
r5 0xf3ef8008 4092559368
r6 0xf04f8009 4031741961
r7 0xf0030105 4026728709
r8 0xbf00bfff 3204497407
r9 0x4603b084 1174646916
r10 0xf88d9100 4170027264
r11 0xf99d3007 4187828231
r12 0x1010101 16843009
sp 0x20001000 0x20001000
lr 0xfffffff9 4294967289
pc 0x19d5c 0x19d5c <IntMemfaultHandler>
xpsr 0x9000004 150994948
MSP 0x20001000 536875008
PSP 0x2000183c 536877116
PRIMASK 0x0 0
BASEPRI 0x0 0
FAULTMASK 0x0 0
CONTROL 0x0 0
(gdb) cont
-------- 8< ---------- 8< ------------
Output on the serial console:
-------- 8< ---------- 8< ------------
---------------------------------------------------
[Mem Fault handler - all numbers in hex]
R0 = 0x20000ff4
R1 = 0x20000ffc
R2 = 0x20000ffc
R3 = 0x20001004
R12 = 0x20001004
LR [R14] = 0x2000100c
PC [R15] = 0x2000100c
PSR = 0x20001014
BFAR = 0xe000ed38
CFSR = 0x00000001
HFSR = 0x00000000
DFSR = 0x00000001
AFSR = 0x00000000
SCB_SHCSR = 0x00070001
MMADDR = 0xe000ed34
---------------------------------------------------
-------- 8< ---------- 8< ------------
I've added a line for printing the MMADDR register as well, as this
/may/ provide additional information about where the fault occurred.
Interpreting the values, I see the following:
- It is a memory management fault (obviously)
- The fault was triggered by an instruction fetch from an XN location
- The MMADDR value does NOT point to the offending instruction
The last point is stated by the datasheet ("This fault occurs on any
access to an XN region, even when the MPU is disabled or not present.
When this bit is set, the PC value stacked for the exception return
points to the faulting instruction and the address of the attempted
access is not written to the MMADDR register.") and the MMADDR value
would be useless anyway, as 0xe000ed34 does not contain any code
(peripheral address space). The mentioned "stacked value" is not
accessible of course, as there was no valid stack to store it onto.
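The interpretation of CFSR = 0x00000001 can be reproduced with a few
bit tests. This is a sketch only: the bit names follow the ARMv7-M
naming (vendor data sheets may use different names for the same bits),
and the function names are made up for this example.

```c
#include <stdint.h>
#include <stdio.h>

/* MemManage status bits in the low byte of CFSR (ARMv7-M names). */
#define CFSR_IACCVIOL  (1u << 0)  /* instruction access violation */
#define CFSR_DACCVIOL  (1u << 1)  /* data access violation */
#define CFSR_MUNSTKERR (1u << 3)  /* fault while unstacking */
#define CFSR_MSTKERR   (1u << 4)  /* fault while stacking */
#define CFSR_MMARVALID (1u << 7)  /* fault address register is valid */

/* Hypothetical helpers, not part of Nut/OS. */
int is_ifetch_violation(uint32_t cfsr)
{
    return (cfsr & CFSR_IACCVIOL) != 0u;
}

int mmar_is_valid(uint32_t cfsr)
{
    return (cfsr & CFSR_MMARVALID) != 0u;
}

/* Prints a short description of the MemManage part of CFSR. */
void describe_memfault(uint32_t cfsr)
{
    if (cfsr & CFSR_IACCVIOL)  puts("instruction access violation");
    if (cfsr & CFSR_DACCVIOL)  puts("data access violation");
    if (cfsr & CFSR_MUNSTKERR) puts("fault on exception return (unstacking)");
    if (cfsr & CFSR_MSTKERR)   puts("fault on exception entry (stacking)");
    puts((cfsr & CFSR_MMARVALID) ? "fault address valid"
                                 : "fault address NOT valid");
}
```

For the value above, only the instruction access bit is set and the
address-valid bit is clear, so the fault address register cannot be
trusted.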
But this is all relatively useless, as I already figured out how the
memory fault was triggered. Please have a look at my first post:
> Anyway, by using conditional breakpoint statements directly in the code, I was able to find out that a call to NutThreadRemove caused the runQueue pointer of the scheduler to be set to NULL. The offending code is the marked line in this function, approx. line 160 in os/thread.c:
The runQueue is trashed, so the scheduler will read random values for
the SP and therefore branch to a completely wrong location when
switching to the next task.
So my question really is: Is it valid for a thread to have td_qnxt set
to NULL? If so, how can NutThreadRemoveQueue be allowed to use this
value to clear the runQueue pointer?
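To make the question concrete, here is a generic singly linked queue
with the usual unlink operation. It is a simplified model and NOT the
Nut/OS source; node_t, next and queue_remove() are hypothetical names
standing in for the thread descriptor, td_qnxt and the remove function.

```c
#include <stddef.h>

/* Hypothetical stand-in for a thread descriptor. */
typedef struct node {
    struct node *next;              /* NULL marks the last element */
} node_t;

/* Unlinks `item` from the list whose head pointer is *headp.
 * Returns 1 if the item was found, 0 otherwise. */
int queue_remove(node_t *item, node_t **headp)
{
    node_t **linkp = headp;

    while (*linkp != NULL) {
        if (*linkp == item) {
            /* If item is the first AND the last element, this
             * assignment writes NULL into the head pointer. */
            *linkp = item->next;
            item->next = NULL;
            return 1;
        }
        linkp = &(*linkp)->next;
    }
    return 0;
}
```

In such a model a NULL next pointer is perfectly normal for the last
element, and the head only becomes NULL when the removed element was
the only one left. If the run queue is expected to always hold at least
one thread (an idle thread, for example), a NULL head would mean that
the last runnable thread was removed from it.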
Thanks,
Philipp