[En-Nut-Discussion] Using the FPU on Cortex-M4
Philipp Burch
phip at hb9etc.ch
Fri Oct 24 09:30:33 CEST 2014
Hi everyone,
as noted earlier, I'm currently porting Nut/OS to the TIVA TM4C
architecture (see branches/devnut_tiva). It works well so far, as I can
take a lot of code from the LM3S branch. Right now, the operating system
itself, the GPIO API and the UART devices are working.
When trying out the UART example, I noticed that this code also uses
floats (if enabled). This works fine too, but it looks like the FPU of
the Cortex-M4 is only half-heartedly used. After modifying the part at
the end of main() to use a float instead of a double literal:
#ifdef STDIO_FLOATING_POINT
dval += 1.0125f; // dval is a float, actually
fprintf(uart, "FP %f\n", dval);
#endif
the executable size decreases by a few bytes, but it still weighs almost
25 kB. In the disassembly, the end of main() (generated from the code
cited above) looks like this:
; ...
172: eddf 7a11 vldr s15, [pc, #68] ; 1b8 <main+0x104>
176: ee38 8a27 vadd.f32 s16, s16, s15
17a: ee18 0a10 vmov r0, s16
17e: f002 fb99 bl 28b4 <__aeabi_f2d>
182: 4602 mov r2, r0
184: 460b mov r3, r1
186: 4620 mov r0, r4
188: 490c ldr r1, [pc, #48] ; (1bc <main+0x108>)
18a: f000 f912 bl 3b2 <fprintf>
18e: e7d1 b.n 134 <main+0x80>
; ...
Judging by the vldr, vadd and vmov instructions, the compiler is aware
of the FPU and makes use of it. This is fine. However, when I list the
defined symbols along with their sizes, I see that many big functions
are generated for floating-point calculations:
$ arm-none-eabi-nm --print-size --size-sort --radix=d uart.elf
# ...
00020844 00000464 T __aeabi_ddiv
00020844 00000464 T __divdf3
536870912 00000512 b g_pfnRAMVectors
00020248 00000596 T __aeabi_dmul
00020248 00000596 T __muldf3
00007116 00000606 T UsartIOCtl
00009720 00000630 T __adddf3
00009720 00000630 T __aeabi_dadd
00009716 00000634 T __aeabi_dsub
00009716 00000634 T __subdf3
536872796 00001032 D __malloc_av_
536871728 00001064 d impure_data
00001116 00001276 T _putf
00018244 00001378 T _malloc_r
00010888 00003918 T _dtoa_r
# ...
Let's take __adddf3 as an example. I don't know exactly what this
function does, but judging by its name ("df" is GCC's mode name for
double precision) I suppose it adds two doubles. This should be fairly
easy with an FPU, but it takes up 630 bytes. Looking at the disassembly
of this function, I can't see a single floating-point instruction. Why
is that the case, and how can I change it? I have checked the "Enable
FPU support" option in the configurator under Architecture->CM3->FPU
support (Cortex M4), and the compiler flags also look reasonable to me:
# Compilation of fprintf.c
08:30:56: arm-none-eabi-gcc -c
-I/home/phip/phipsfiles/developing/ethernut/nutbld-fpm_01b/include
-I/home/phip/phipsfiles/developing/ethernut/devnut_tiva/nut/include
-I/home/phip/phipsfiles/developing/ethernut/devnut_tiva/nut/include/contrib
-DFPM_01B -MD -MP -mcpu=cortex-m4 -mthumb -D__CORTEX__
-mfloat-abi=hard -mfpu=fpv4-sp-d16 -ffunction-sections -fdata-sections
-fomit-frame-pointer -Os -Wall -Wstrict-prototypes -Werror
-Wa,-a=fprintf.lst -o fprintf.o
/home/phip/phipsfiles/developing/ethernut/devnut_tiva/nut/crt/fprintf.c
# Archiving of libnutcrt.a
08:30:57: arm-none-eabi-ar rsc libnutcrt.a close.o clrerr.o ioctl.o
open.o select.o getf.o read.o putf.o write.o fclose.o fcloseall.o
fdopen.o feof.o ferror.o fflush.o filelength.o fileno.o flushall.o
fmode.o fopen.o fpurge.o freopen.o fseek.o ftell.o funopen.o seek.o
tell.o fgetc.o fgets.o fread.o fscanf.o getc.o getchar.o gets.o kbhit.o
scanf.o ungetc.o vfscanf.o fprintf.o fputc.o fputs.o fwrite.o printf.o
putc.o putchar.o puts.o vfprintf.o asprintf.o sprintf.o snprintf.o
sscanf.o vasprintf.o vsprintf.o vsnprintf.o vsscanf.o vis.o unvis.o
gmtime.o localtime.o asctime.o mktime.o time.o timeofday.o tzset.o
errno.o calloc.o calloc_dbg.o malloc.o malloc_dbg.o realloc.o
realloc_dbg.o strdup.o strdup_dbg.o sbrk.o getenv.o putenv.o setenv.o
environ.o
# Compilation and linking of the application (uart.c)
arm-none-eabi-gcc -c
-I/home/phip/phipsfiles/developing/ethernut/nutbld-fpm_01b/include
-I/home/phip/phipsfiles/developing/ethernut/devnut_tiva/nut/include
-I/home/phip/phipsfiles/developing/ethernut/devnut_tiva/nut/include/contrib
-DFPM_01B -MD -MP -mcpu=cortex-m4 -mthumb -D__CORTEX__
-mfloat-abi=hard -mfpu=fpv4-sp-d16 -ffunction-sections -fdata-sections
-fomit-frame-pointer -Os -Wall -Wstrict-prototypes -Wa,-a=uart.lst -o
uart.o uart.c
arm-none-eabi-gcc uart.o -mcpu=cortex-m4 -mthumb -D__CORTEX__
-mfloat-abi=hard -mfpu=fpv4-sp-d16 -nostartfiles
-L/home/phip/phipsfiles/developing/ethernut/devnut_tiva/nut/arch/cm3/ldscripts
-Ttm4c1294ncpdt_flash.ld -Wl,-Map=uart.map,--cref,--gc-sections
-L/home/phip/phipsfiles/developing/ethernut/nutinstall-fpm_01b
-Wl,--start-group
/home/phip/phipsfiles/developing/ethernut/nutinstall-fpm_01b/nutinit.o
-lnutcrt -lnutarch -lnutdev -lnutos -lnutdev -lnutarch -Wl,--end-group
-o uart.elf
arm-none-eabi-objcopy -O ihex uart.elf uart.hex
arm-none-eabi-objcopy -O binary uart.elf uart.bin
Can anyone tell me what to do about this? It's not a big problem at the
moment, since program size is not an issue (1024 KiB of flash), but I
suppose that calling such functions is also quite slow. Maybe it only
happens in fprintf(), where it doesn't really matter, but I have the
feeling that the compiler could do better.
Thanks!
Best regards,
Philipp