[En-Nut-Discussion] Using the FPU on Cortex-M4

Fri Oct 24 09:30:33 CEST 2014

Hi everyone,

As noted earlier, I'm currently porting Nut/OS to the TIVA TM4C
architecture (see branches/devnut_tiva). It works well so far, as I can
take a lot of code from the LM3S branch. Right now, the operating system
itself, the GPIO API and the UART devices are working.

When trying out the UART example, I noticed that it also uses
floating-point values (if STDIO_FLOATING_POINT is enabled). This works
fine too, but it looks like the FPU of the Cortex-M4 is only used
half-heartedly. After modifying the part at the end of main() to use a
float literal instead of a double literal

#ifdef STDIO_FLOATING_POINT
        dval += 1.0125f;  // dval is a float, actually
        fprintf(uart, "FP %f\n", dval);
#endif

the executable shrinks by a few bytes, but it still weighs in at almost
25 kB. In the disassembly, the end of main() (generated from the code
cited above) looks like this:

; ...
     172:       eddf 7a11       vldr    s15, [pc, #68]  ; 1b8 <main+0x104>
     176:       ee38 8a27       vadd.f32        s16, s16, s15
     17a:       ee18 0a10       vmov    r0, s16
     17e:       f002 fb99       bl      28b4 <__aeabi_f2d>
     182:       4602            mov     r2, r0
     184:       460b            mov     r3, r1
     186:       4620            mov     r0, r4
     188:       490c            ldr     r1, [pc, #48]   ; (1bc <main+0x108>)
     18a:       f000 f912       bl      3b2 <fprintf>
     18e:       e7d1            b.n     134 <main+0x80>
; ...
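
A side note on the bl __aeabi_f2d in this listing: as far as I
understand, this is what C's default argument promotions require.
fprintf() is variadic, so a float argument is converted to double before
the call, and "%f" always consumes a double. The host-side sketch below
only illustrates that rule; sum_promoted() is a hypothetical helper and
not part of Nut/OS.

```c
#include <stdarg.h>

/*
 * Hypothetical helper (not part of Nut/OS): sums `count` variadic
 * floating-point arguments. Callers may pass floats, but the default
 * argument promotions turn them into doubles before the call, so they
 * have to be read back with va_arg(ap, double).
 */
double sum_promoted(int count, ...)
{
    va_list ap;
    double sum = 0.0;
    int i;

    va_start(ap, count);
    for (i = 0; i < count; i++) {
        sum += va_arg(ap, double);  /* float arguments arrive as double */
    }
    va_end(ap);
    return sum;
}
```

Calling sum_promoted(2, 1.5f, 2.25f) returns 3.75. In the Cortex-M4
build, each float argument is presumably widened by a call to
__aeabi_f2d, just like dval in the listing.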

Judging by the vldr, vadd.f32 and vmov instructions, the compiler is
aware of the FPU and makes use of it, which is fine. However, when I
list the defined symbols along with their sizes, I see that many large
functions are pulled in for floating-point calculations:

$ arm-none-eabi-nm --print-size --size-sort --radix=d uart.elf
# ...
00020844 00000464 T __aeabi_ddiv
00020844 00000464 T __divdf3
536870912 00000512 b g_pfnRAMVectors
00020248 00000596 T __aeabi_dmul
00020248 00000596 T __muldf3
00007116 00000606 T UsartIOCtl
00009720 00000630 T __adddf3
00009720 00000630 T __aeabi_dadd
00009716 00000634 T __aeabi_dsub
00009716 00000634 T __subdf3
536872796 00001032 D __malloc_av_
536871728 00001064 d impure_data
00001116 00001276 T _putf
00018244 00001378 T _malloc_r
00010888 00003918 T _dtoa_r
# ...

Let's take __adddf3 as an example. I don't know exactly how this
function works, but going by the libgcc naming scheme ("df" for double
float) it adds two doubles. I would expect this to be fairly easy with
an FPU, yet the function takes up 630 bytes, and in its disassembly I
can't find a single floating-point instruction. Why is that the case,
and how can I change it? I have checked the "Enable FPU support" option
in the configurator under Architecture->CM3->FPU support (Cortex M4),
and the compiler flags also look reasonable to me:

# Compilation of fprintf.c
08:30:56: arm-none-eabi-gcc -c
-I/home/phip/phipsfiles/developing/ethernut/nutbld-fpm_01b/include
-I/home/phip/phipsfiles/developing/ethernut/devnut_tiva/nut/include
-I/home/phip/phipsfiles/developing/ethernut/devnut_tiva/nut/include/contrib
 -DFPM_01B  -MD -MP -mcpu=cortex-m4 -mthumb -D__CORTEX__
-mfloat-abi=hard -mfpu=fpv4-sp-d16 -ffunction-sections -fdata-sections
-fomit-frame-pointer  -Os -Wall -Wstrict-prototypes -Werror
-Wa,-a=fprintf.lst   -o fprintf.o
/home/phip/phipsfiles/developing/ethernut/devnut_tiva/nut/crt/fprintf.c

# Archiving of libnutcrt.a
08:30:57: arm-none-eabi-ar rsc libnutcrt.a close.o clrerr.o ioctl.o
open.o select.o getf.o read.o putf.o write.o fclose.o fcloseall.o
fdopen.o feof.o ferror.o fflush.o filelength.o fileno.o flushall.o
fmode.o fopen.o fpurge.o freopen.o fseek.o ftell.o funopen.o seek.o
tell.o fgetc.o fgets.o fread.o fscanf.o getc.o getchar.o gets.o kbhit.o
scanf.o ungetc.o vfscanf.o fprintf.o fputc.o fputs.o fwrite.o printf.o
putc.o putchar.o puts.o vfprintf.o asprintf.o sprintf.o snprintf.o
sscanf.o vasprintf.o vsprintf.o vsnprintf.o vsscanf.o vis.o unvis.o
gmtime.o localtime.o asctime.o mktime.o time.o timeofday.o tzset.o
errno.o calloc.o calloc_dbg.o malloc.o malloc_dbg.o realloc.o
realloc_dbg.o strdup.o strdup_dbg.o sbrk.o getenv.o putenv.o setenv.o
environ.o

# Compilation and linking of the application (uart.c)
arm-none-eabi-gcc -c
-I/home/phip/phipsfiles/developing/ethernut/nutbld-fpm_01b/include
-I/home/phip/phipsfiles/developing/ethernut/devnut_tiva/nut/include
-I/home/phip/phipsfiles/developing/ethernut/devnut_tiva/nut/include/contrib
 -DFPM_01B  -MD -MP -mcpu=cortex-m4 -mthumb -D__CORTEX__
-mfloat-abi=hard -mfpu=fpv4-sp-d16 -ffunction-sections -fdata-sections
-fomit-frame-pointer  -Os -Wall -Wstrict-prototypes -Wa,-a=uart.lst   -o
uart.o uart.c
arm-none-eabi-gcc uart.o -mcpu=cortex-m4 -mthumb -D__CORTEX__
-mfloat-abi=hard -mfpu=fpv4-sp-d16 -nostartfiles
-L/home/phip/phipsfiles/developing/ethernut/devnut_tiva/nut/arch/cm3/ldscripts
-Ttm4c1294ncpdt_flash.ld -Wl,-Map=uart.map,--cref,--gc-sections
-L/home/phip/phipsfiles/developing/ethernut/nutinstall-fpm_01b
-Wl,--start-group
/home/phip/phipsfiles/developing/ethernut/nutinstall-fpm_01b/nutinit.o
-lnutcrt -lnutarch -lnutdev -lnutos -lnutdev -lnutarch  -Wl,--end-group
-o uart.elf
arm-none-eabi-objcopy  -O ihex uart.elf uart.hex
arm-none-eabi-objcopy  -O binary uart.elf uart.bin
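
For what it's worth, the "sp" in -mfpu=fpv4-sp-d16 stands for single
precision, so one thing that could be checked is which precisions the
compiler believes the FPU handles in hardware. The sketch below assumes
the ACLE predefined macro __ARM_FP (a bit mask: 0x02 half, 0x04 single,
0x08 double precision), which older GCC releases may not define. The
function names are hypothetical and not part of Nut/OS.

```c
/*
 * Sketch only: decodes the ACLE feature macro __ARM_FP.
 * Assumption: the toolchain predefines __ARM_FP for ARM targets with
 * hardware floating point. Where it is undefined (other hosts, older
 * compilers), compiler_fp_mask() reports 0, i.e. "no hardware FP known".
 */
#define FP_HW_HALF    0x02u  /* 16-bit half precision   */
#define FP_HW_SINGLE  0x04u  /* 32-bit single precision */
#define FP_HW_DOUBLE  0x08u  /* 64-bit double precision */

/* Returns the precision mask the compiler was configured for. */
unsigned int compiler_fp_mask(void)
{
#ifdef __ARM_FP
    return (unsigned int)__ARM_FP;
#else
    return 0u;
#endif
}

/* Nonzero if single-precision arithmetic is done in hardware. */
int fp_has_hw_single(unsigned int mask)
{
    return (mask & FP_HW_SINGLE) != 0;
}

/* Nonzero if double-precision arithmetic is done in hardware. */
int fp_has_hw_double(unsigned int mask)
{
    return (mask & FP_HW_DOUBLE) != 0;
}
```

With a mask of 0x06 (half and single precision, which is what I would
expect for these flags, though I have not verified it),
fp_has_hw_double() returns 0. Double arithmetic would then have to go
through library routines instead of FPU instructions. The predefined
macros can also be listed by adding -dM -E - < /dev/null to the compiler
flags above.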

Can anyone tell me what to do about this? It's not a big problem at the
moment, since program size is not an issue (1024 KiB flash), but I
suppose that calling such software routines is also quite slow. Maybe it
only happens in fprintf(), where it doesn't really matter, but I have
the feeling that the compiler could do better.
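
On the speed concern: outside of fprintf(), I would expect the double
routines to be called only where the source actually computes in double,
and an unsuffixed literal such as 1.0125 is enough to cause that. The
sketch below contrasts the two forms. The function names are made up for
illustration, and I have not measured the difference on the TM4C.

```c
/*
 * Illustration only; both function names are hypothetical.
 *
 * scale_via_double(): the literal 1.0125 has type double, so x is
 * converted to double, multiplied in double precision, and the result
 * is converted back to float on return.
 */
float scale_via_double(float x)
{
    return x * 1.0125;
}

/*
 * scale_in_float(): the f suffix makes the literal a float, so the
 * whole expression stays in single precision.
 */
float scale_in_float(float x)
{
    return x * 1.0125f;
}
```

GCC's -Wdouble-promotion warning (which is not part of -Wall) should
flag implicit float-to-double conversions like the one in
scale_via_double().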

Thanks!

Best regards,
Philipp