[En-Nut-Discussion] ARM Alignment for Threads

Thu Oct 10 16:37:34 CEST 2013

All,

The best solution I can think of right now is to:

0. Keep the current HEAPNODE structure (don't add anything).
1. Initialize the heap at an address whose last hex digit is '4' or 'c'. With a 4-byte size field in front, this puts the hn_next member(s) on a boundary of '0' or '8'.
2. Constrain 'need' to multiples of 8 from a base of 4. That is, allocated memory can be 4, 12, 20, 28, etc.

This will add up to 7 unused bytes, but will result in the returned "&node->hn_next" being on a boundary of '0' or '8'.
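
A minimal sketch of the arithmetic behind steps 1 and 2, not the actual Nut/OS heap code. It assumes a 32-bit target where the size field in front of hn_next is 4 bytes; HDR_SIZE, round_need() and all_aligned() are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32-bit target, HEAPNODE starts with a 4-byte size field
 * followed directly by hn_next, whose address is returned to the caller. */
#define HDR_SIZE 4u

/* Hypothetical helper: round 'need' up to the next value of the form
 * 8*k + 4 (4, 12, 20, 28, ...). At most 7 bytes are added. */
static size_t round_need(size_t need)
{
    if (need <= 4)
        return 4;
    return (((need - 4) + 7) & ~(size_t)7) + 4;
}

/* Simulate n allocations carved one after another from a heap starting
 * at 'base'. Returns 1 if every address handed out is a multiple of 8. */
static int all_aligned(uintptr_t base, const size_t *needs, size_t n)
{
    uintptr_t node = base;
    size_t i;

    for (i = 0; i < n; i++) {
        uintptr_t user = node + HDR_SIZE;   /* plays the role of &node->hn_next */
        if (user % 8 != 0)
            return 0;
        node = user + round_need(needs[i]); /* start of the following node */
    }
    return 1;
}
```

Because the header plus the rounded 'need' is always a multiple of 8, every following node starts on '4' or 'c' again, as long as the first one does.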

Best,

Bob Wirka
Realtime Control Works

On Thursday, October 10, 2013 8:32 AM, Harald Kipp <harald.kipp at egnite.de> wrote:

Hi Ole,

On 10.10.2013 13:46, Ole Reinhardt wrote:
>>> My ugly hack was to set NUTMEM_ALIGNMENT to 8, and then to add 4 bytes of padding in HEAPNODE between 'size' and 'next'. The result is that threads get a stack that's aligned to 8 bytes, and va_arg() seems to work.

> Do we need it on AT91 (and other ARMs) as well?

I haven't been involved, but as far as I followed the discussion, it is
required for all ARM CPUs.

>> If we think about more elegant solutions, we should check whether it
>> still makes sense to allocate the thread stack from heap. For example,
>> the application may provide a pointer to the stack when creating the
>> thread. Furthermore, I recently did some research on shared stacks.
> 
> Do you have any findings or ideas on how to improve stack allocation
> that you can share with us? Things would get easier with an MMU, but a

So far I have mainly read the papers that you can find on this topic
via Google, just to get a feeling for what might be possible.

An MMU would make this easier, but in that case there are even more
advanced possibilities available like dynamic stack growth.

> well-working shared stack could be an interesting option as well, even
> though it is not easy to implement and perhaps even more prone to
> buffer overflows.

May or may not, it depends on the implementation. Consider a very simple
one.

We allocate a large stack to be shared among several threads. On context
switch, the currently used part of the stack of the thread being
suspended is saved somewhere else, and the stack is refilled with the
previously saved stack contents of the woken thread. In this case, all
threads share the same large stack, which is much safer than small
individual stacks.

Of course, such a simple algorithm would significantly increase context
switching times. But some applications may still benefit from such a
simple version. Note that the size of the saved stack is typically much
smaller than what currently needs to be reserved for the worst case of
each thread. The required memory copy would not harm too much as long
as interrupts are enabled during the transfers.
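
To illustrate the simple scheme described above, here is a rough sketch of the save and refill steps, modelled with plain buffers. It is an assumption-laden simulation, not Nut/OS code: a real implementation would run inside the context switch and work with the actual stack pointer. All names (shared_stack, thread_ctx, stack_save, stack_restore) and both sizes are hypothetical, and the stack is assumed to grow downward from the top of the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SHARED_STACK_SIZE 1024  /* hypothetical size of the shared stack */
#define SAVE_AREA_SIZE     128  /* hypothetical per-thread save area */

static unsigned char shared_stack[SHARED_STACK_SIZE];

typedef struct {
    unsigned char saved[SAVE_AREA_SIZE]; /* copy of the used stack part */
    size_t used;                         /* number of bytes saved */
} thread_ctx;

/* Save the 'used' bytes at the top of the shared stack for the thread
 * being suspended. Returns 0 on success, -1 if they do not fit. */
static int stack_save(thread_ctx *t, size_t used)
{
    if (used > SAVE_AREA_SIZE || used > SHARED_STACK_SIZE)
        return -1;
    memcpy(t->saved, shared_stack + SHARED_STACK_SIZE - used, used);
    t->used = used;
    return 0;
}

/* Refill the shared stack with the saved contents of the woken thread.
 * Returns the number of stack bytes now in use. */
static size_t stack_restore(const thread_ctx *t)
{
    memcpy(shared_stack + SHARED_STACK_SIZE - t->used, t->saved, t->used);
    return t->used;
}
```

The save area only has to hold what a thread has on the stack at the moment it blocks, which is why it can be much smaller than a worst-case private stack.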

Of course, a number of refinements are possible, if you introduce
specific limitations. In any case, this is no solution for all threads.
But you may define thread groups, where members of a group share the
same stack.

Then you can, again for example, limit the level of blocking calls. Many
threads, like the DHCP thread, will work sufficiently well if blocking
calls are limited to the main thread routine.

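void thread(void)
{
  for (;;) {
    blocking_receive();   /* may block; stack is (almost) empty here */
    processing();         /* must not contain blocking calls */
    blocking_transmit();  /* may block; stack is (almost) empty here */
  }
}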

If processing does not contain any blocking calls, the stack will
always be (almost) empty on context switches. Actually it is the same as
re-running and terminating a thread without the endless loop, but you
save the need for thread creation and termination.

Hope you could follow my lousy explanations.

Regards,

Harald

_______________________________________________
http://lists.egnite.de/mailman/listinfo/en-nut-discussion