[En-Nut-Discussion] ARM GCC 4.4 Alignment Problems

Bernd Walter enut at cicely.de
Fri Apr 2 21:01:42 CEST 2010


On Fri, Apr 02, 2010 at 05:04:04PM +0200, Harald Kipp wrote:
> Hi Bernd,
> 
> It's always a pleasure to see you jumping in when things become a bit
> complicated.
> 
> On 01.04.2010 20:06, Bernd Walter wrote:
> > I'm not much of a fan of misaligned data.
> 
> Who is? ;-)

That's a good question.

> > In almost every case it is avoidable without much trouble.
> > One of the most annoying points with network data is the ethernet
> > header, whose length usually guarantees misaligned IP headers, but
> > usually this is avoided by setting RX buffers at a 2 byte offset or
> > copying the data before parsing if the hardware requires 32-bit aligned
> > DMA RX buffers - AT91SAM7X should be happy with a 2 byte offset,
> > AT91RM9200 is not and requires copying.
> 
> Using a 2 byte offset is indeed an option worth evaluating.
> Copying, however, is something that really consumes CPU power.

So does code that accesses potentially misaligned data.
You either load an aligned 32-bit value into a register, or load two
partial words into two registers, then shift and merge them:
 1 instruction, 1 memory access
or
 5 instructions, 2 memory accesses and one scratch register.
Writing even requires read-modify-write cycles.
Some architectures can access misaligned data directly, but they still
have to do the additional memory cycles and fix things up in hardware.
On systems with a data cache, copying is cheap compared to code bloat,
because it touches data within the same cache line and uses burst
access to memory.
Copying is often the cheaper option.
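
To make the instruction count above a bit more concrete, here is a minimal
sketch in C of what a software access to a possibly misaligned 32-bit value
boils down to. It assumes little-endian byte order and a buffer that is
itself an array of aligned 32-bit words; load32_at is a hypothetical helper
name used only for illustration, not something from Nut/OS or GCC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Read a 32-bit value that starts at an arbitrary byte offset inside an
 * array of aligned 32-bit words (little-endian byte order assumed).
 * nwords is the number of words in the array.
 */
static uint32_t load32_at(const uint32_t *words, size_t nwords, size_t byte_off)
{
    size_t idx = byte_off / 4;                      /* word holding the first byte */
    unsigned shift = (unsigned)(byte_off % 4) * 8;  /* bit offset inside that word */

    assert(idx < nwords);
    if (shift == 0)
        return words[idx];                          /* aligned: a single load */

    /* misaligned: two loads, two shifts and one OR */
    assert(idx + 1 < nwords);
    return (words[idx] >> shift) | (words[idx + 1] << (32 - shift));
}
```

The aligned path is one load; the misaligned path needs a second load, a
temporary for the second word and the shift/merge step, which roughly
matches the count given above.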

> > Packed on ARM has different reasons - older ARM CPUs required bytes
> > and 16-bit words to be on the same alignment because they had to
> > mask bytes out from 32-bit memory operations, and therefore structs
> > containing 3 bytes are 4 bytes long, so that an array of those structs
> > always starts the same members at the same 32-bit offset.
> 
> Not a real problem, but of course additional instructions are required
> compared to 32-bit aligned elements.
> 
> 
> > Usually this isn't a problem and it is also Ok with C-standards, but
> > parsing network data with structs can be a problem.
> > Since we don't need ABI compatibility with older ARM systems it
> > shouldn't be a problem to use -mstructure-size-boundary=8 (bits)
> > and don't use packed.
> 
> This option will reduce the total size of a structure, but it will not
> pack its members.
> 
>   typedef struct __attribute__((packed)) ether_header {
>     uint8_t  ether_dhost[ETHER_ADDR_LEN];
>     uint8_t  ether_shost[ETHER_ADDR_LEN];
>     uint16_t ether_type;
>   } ETHERHDR;
> 
>   struct __attribute__((packed)) frame {
>     ETHERHDR hdr;
>     uint32_t data[8];
>   };
> 
> ETHERHDR is 14 bytes and struct frame is 8 * 4 + 14 = 46 bytes.

Yes - because it is packed.
This is on a FreeBSD arm system:
#include <inttypes.h>
#include <stdio.h>

int main()
{
        struct {
                uint8_t ether1[6];
                uint8_t ether2[6];
                uint16_t ethertype;
        } testvar;

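        /* sizeof yields size_t, so cast it for %i; take offsets as byte pointer differences */
        printf("sizeof(testvar): %i\n", (int)sizeof(testvar));
        printf("offset ether1: %i\n", (int)((char *)&testvar.ether1 - (char *)&testvar));
        printf("offset ether2: %i\n", (int)((char *)&testvar.ether2 - (char *)&testvar));
        printf("offset ethertype: %i\n", (int)((char *)&testvar.ethertype - (char *)&testvar));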
        return 0;
}

[63]chipmunk.cicely.de# gcc -o test test.c
2.000u 0.000s 0:04.29 81.5%     31417+5528k 0+0io 0pf+0w
[64]chipmunk.cicely.de# ./test
sizeof(testvar): 16
offset ether1: 0
offset ether2: 6
offset ethertype: 12

ether1 is 6 bytes.
ether2 is 6 bytes as well; since it contains bytes it has an alignment
of 1 and starts directly after ether1 at an offset of 6.
ethertype starts at an offset of 12 (6 + 6) because it has an alignment
of 2, which fits.
The complete size, however, isn't 14, because the whole struct is padded
up to a multiple of 4 bytes, so that every element of an array of such
structs starts 4-byte aligned.
This is special to ARM and is the case because of old processors, which
couldn't natively address bytes and 16-bit words - they masked them out
of 32-bit accesses, and the masking code needed to know the exact offsets.
Early Alpha systems had the same restriction but offered special mask
instructions to avoid this problem, so the padding is unique to ARM.

Modern ARMs don't have this restriction, so it is possible to avoid it:
[65]chipmunk.cicely.de# gcc -mstructure-size-boundary=8 -o test test.c
2.000u 0.000s 0:04.21 82.6%     30521+5412k 0+0io 0pf+0w
[66]chipmunk.cicely.de# ./test
sizeof(testvar): 14
offset ether1: 0
offset ether2: 6
offset ethertype: 12
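
The same layout can also be checked without running anything, using
offsetof and a C11 static_assert. This is only a sketch: the struct name
ether_test and the helper ether_test_tail_padding are made up for this
example, and the expected padding values are the ones from the two runs
above, not something the C standard promises.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ether_test {             /* hypothetical name for the test struct */
    uint8_t  ether1[6];
    uint8_t  ether2[6];
    uint16_t ethertype;
};

/* Member offsets are the same with and without -mstructure-size-boundary=8. */
static_assert(offsetof(struct ether_test, ether1) == 0, "ether1 offset");
static_assert(offsetof(struct ether_test, ether2) == 6, "ether2 offset");
static_assert(offsetof(struct ether_test, ethertype) == 12, "ethertype offset");

/* Bytes of padding after the last member: 2 in the first run above, 0 in the second. */
static size_t ether_test_tail_padding(void)
{
    return sizeof(struct ether_test)
         - (offsetof(struct ether_test, ethertype) + sizeof(uint16_t));
}
```

Only the tail padding changes with the compiler option, the member offsets
stay where they are.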

About your second struct.
Let's extend our test program:
#include <inttypes.h>
#include <stdio.h>

int main()
{
        struct tv {
                uint8_t ether1[6];
                uint8_t ether2[6];
                uint16_t ethertype;
        };
        struct tv testvar;

        printf("sizeof(testvar): %i\n", (int)sizeof(testvar));
        printf("offset ether1: %i\n", (int)((char *)&testvar.ether1 - (char *)&testvar));
        printf("offset ether2: %i\n", (int)((char *)&testvar.ether2 - (char *)&testvar));
        printf("offset ethertype: %i\n", (int)((char *)&testvar.ethertype - (char *)&testvar));

        struct {
                struct tv hdr;
                uint32_t data[8];
        } testvar2;

        printf("sizeof(testvar2): %i\n", (int)sizeof(testvar2));
        printf("offset hdr: %i\n", (int)((char *)&testvar2.hdr - (char *)&testvar2));
        printf("offset data: %i\n", (int)((char *)&testvar2.data - (char *)&testvar2));

        return 0;
}

[82]chipmunk.cicely.de# gcc -o test test.c
2.000u 0.000s 0:04.37 80.0%     31021+5474k 0+0io 0pf+0w
[83]chipmunk.cicely.de# ./test
sizeof(testvar): 16
offset ether1: 0
offset ether2: 6
offset ethertype: 12
sizeof(testvar2): 48
offset hdr: 0
offset data: 16

[84]chipmunk.cicely.de# gcc -mstructure-size-boundary=8 -o test test.c
2.000u 0.000s 0:04.25 82.5%     30833+5453k 0+0io 0pf+0w
[85]chipmunk.cicely.de# ./test
sizeof(testvar): 14
offset ether1: 0
offset ether2: 6
offset ethertype: 12
sizeof(testvar2): 48
offset hdr: 0
offset data: 16

If sizeof(ether_header) is 16 instead of 14, it occupies 16 bytes
within your struct.
Then you add an array of 32-bit values.
With a 16-byte header there is no problem, because data starts at the
naturally aligned position for 32-bit values.
With a 14-byte header the 32-bit values require 2 bytes of padding for
alignment.
The size is the same, although the way to get there is different.
The second case will also happen on other architectures with alignment
requirements.
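
The arithmetic behind both cases can be written down as a small sketch.
align_up and frame_size are hypothetical helper names, and the fixed
4-byte alignment for the 32-bit array is an assumption that matches the
output above.

```c
#include <assert.h>
#include <stddef.h>

/* Round offset up to the next multiple of align (align must be a power of two). */
static size_t align_up(size_t offset, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (offset + align - 1) & ~(align - 1);
}

/* Size of a struct { header; uint32_t data[8]; } for a given header size. */
static size_t frame_size(size_t hdr_size)
{
    size_t data_off = align_up(hdr_size, 4);  /* data[] needs 4-byte alignment */
    return align_up(data_off + 8 * 4, 4);     /* total size is a multiple of 4 */
}
```

frame_size(16) and frame_size(14) both give 48: in the first case the
padding sits inside the header, in the second case in front of data.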

So what are you doing with packed?
You tell the compiler that the structure has no padding at all.
Both kinds of padding are dropped and data is misaligned.
Every access to data needs extra code to deal with it.
Code size increases and speed drops, all for a saving of 2 bytes of memory.
It is a different matter if you need to parse data handed over by
other systems, but in this case you also need to deal with byte order.
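
For that parsing case, one common way to handle alignment and byte order
at the same time is to read the fields byte by byte instead of overlaying
a packed struct. The following is only a sketch; get_be16 and
frame_ethertype are hypothetical names, and the offset 12 and minimum
length 14 come from the ethernet header layout discussed in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 16-bit big-endian (network order) value from any address. */
static uint16_t get_be16(const uint8_t *p)
{
    return (uint16_t)(((unsigned)p[0] << 8) | p[1]);
}

/* Return the ether type of a raw frame, or 0 if the frame is too short. */
static uint16_t frame_ethertype(const uint8_t *frame, size_t len)
{
    assert(frame != NULL);
    if (len < 14)
        return 0;
    return get_be16(frame + 12);  /* two 6-byte addresses precede the type */
}
```

Byte accesses are never misaligned, and the result is the same on
little-endian and big-endian hosts.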

What happens with padding if we reorder hdr and data:
[90]chipmunk.cicely.de# ./test
sizeof(testvar): 14
offset ether1: 0
offset ether2: 6
offset ethertype: 12
sizeof(testvar2): 48
offset hdr: 32
offset data: 0

Same size, but this time not because of ARM alignment requirements,
but because uint32_t requires sizeof to be 4*n, so
-mstructure-size-boundary=8 won't help.
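
The reason for the 4*n rule is the array case: the distance between two
neighbouring elements is sizeof, so sizeof has to keep data aligned in
every element. A minimal sketch, where struct reordered and
reordered_stride are made-up names and uint32_t is assumed to need
4-byte alignment as on ARM:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct tv {
    uint8_t  ether1[6];
    uint8_t  ether2[6];
    uint16_t ethertype;
};

struct reordered {              /* hypothetical name for the reordered struct */
    uint32_t  data[8];
    struct tv hdr;
};

/* The struct size must be a multiple of the strictest member alignment. */
static_assert(sizeof(struct reordered) % _Alignof(uint32_t) == 0,
              "size is not a multiple of the uint32_t alignment");

/* Byte distance between data[] of two neighbouring array elements. */
static size_t reordered_stride(void)
{
    struct reordered arr[2];    /* only addresses are taken, nothing is read */
    return (size_t)((char *)&arr[1].data - (char *)&arr[0].data);
}
```

Without the trailing padding, data in the second array element would
start at offset 46 and be misaligned.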

In other cases it might help.
E.g.:
struct xxx {
	uint16_t foo1; // requires 2 byte alignment and 2*n sizeof
	uint8_t foo2; // no special requirement
	uint16_t foo3; // requires 1 byte of padding in front for 2 byte alignment
	uint8_t foo4; // no special requirement
	// 1 byte of padding for the 2*n sizeof requirement of the uint16_t's
}; // sizeof = 8
and
struct xxx {
	uint16_t foo1; // requires 2 byte alignment and 2*n sizeof
	uint16_t foo3; // requires 2 byte alignment and 2*n sizeof
	uint8_t foo2; // no special requirement
	uint8_t foo4; // no special requirement
}; // sizeof = 6 with -mstructure-size-boundary=8 or 8 without
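
A small sketch to compare the two orderings on whatever compiler is at
hand. The names xxx_mixed, xxx_sorted and padding_bytes are made up for
this example, and the offsets assume that uint16_t needs 2-byte
alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* First ordering: byte members between the 16-bit members. */
struct xxx_mixed  { uint16_t foo1; uint8_t foo2; uint16_t foo3; uint8_t foo4; };

/* Second ordering: largest alignment first. */
struct xxx_sorted { uint16_t foo1; uint16_t foo3; uint8_t foo2; uint8_t foo4; };

/* Padding in a struct whose members occupy 6 bytes in total. */
static size_t padding_bytes(size_t struct_size)
{
    assert(struct_size >= 6);
    return struct_size - 6;
}
```

padding_bytes(sizeof(struct xxx_sorted)) is never larger than
padding_bytes(sizeof(struct xxx_mixed)); whether it reaches 0 depends on
the structure size boundary in effect.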

> When removing packed and instead compiling with
> -mstructure-size-boundary=8, then ETHERHDR is still 14 bytes only, but
> struct frame will grow by 2 bytes, because data[] will become aligned.

Yes - but why don't you want it to be aligned?
It is a 32-bit variable after all and not 4 chars.

> The problem is not the size of structures, but the alignment of their
> members.

Yes it is, but then again, why don't you want them to be aligned?
See my first statement: I'm not a fan of misaligned data.

> Please correct me if I'm wrong, I'm just evaluating this stuff.

You are right, but there is a reason for the defaults.

-- 
B.Walter <bernd at bwct.de> http://www.bwct.de
Modbus/TCP Ethernet I/O Baugruppen, ARM basierte FreeBSD Rechner uvm.


