I've only just started testing on Coldfire, so it is too early for a complete analysis.
I might be overlooking something, but to me the Coldfire code generation in GCC 4 looks badly broken.
Here is an example of a very simple loop that copies 16 bytes per iteration.
GCC 4 generates terrible code for it on Coldfire.
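The C source is not included in this post. For readers who want to reproduce the listings, a loop of roughly this shape would fit them. This is only a sketch: the argument order (dst, src, size) and the types are guesses based on the stack offsets 8, 12 and 16 and on the lsr.l #4 in the generated code.

```c
#include <stdint.h>

/* Hypothetical reconstruction, not the original source.
   dst  = 8(%fp), src = 12(%fp), size in bytes = 16(%fp) (all assumed). */
void copy_32x4C(uint32_t *dst, const uint32_t *src, uint32_t size)
{
    uint32_t blocks = size >> 4;   /* number of 16-byte blocks (lsr.l #4) */

    while (blocks != 0) {
        /* Load four longwords first, then store them, as the listings do. */
        uint32_t a = src[0];
        uint32_t b = src[1];
        uint32_t c = src[2];
        uint32_t d = src[3];
        src += 4;

        dst[0] = a;
        dst[1] = b;
        dst[2] = c;
        dst[3] = d;
        dst += 4;

        blocks--;
    }
}
```

Any trailing bytes beyond a multiple of 16 are ignored here, which matches what the assembly does.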
Example GCC4:
Code:
copy_32x4C:
01 link.w %fp,#-16
02 movm.l #3084,(%sp)
03 move.l 8(%fp),%d3
04 move.l 16(%fp),%d0
05 lsr.l #4,%d0
06 tst.l %d0
07 jbeq .L148
08 move.l %d3,%a3
09 move.l 12(%fp),%a2
10 move.l %d0,%d2
.L150:
11 move.l (%a2),%d0
12 move.l 4(%a2),%d1
13 move.l 8(%a2),%a0
14 move.l 12(%a2),%a1
15 add.l #16,%a2
16 move.l %d0,(%a3)
17 move.l %d1,4(%a3)
18 move.l %a0,8(%a3)
19 move.l %a1,12(%a3)
20 add.l #16,%a3
21 subq.l #1,%d2
22 jbne .L150
.L148:
23 movm.l (%sp),#3084
24 unlk %fp
25 rts
Several problems are visible:
GCC loads values into certain registers only to copy them to other registers later.
Lines 08 and 10 are useless.
Line 06: the tst instruction is not needed, because the lsr in line 05 already sets the condition codes.
Lines 11-21: the copy loop is very badly written.
If the loop used the (An)+ addressing mode, it would be 24 bytes shorter and would not need the two add instructions.
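The 24 bytes can be checked by adding up instruction sizes. A small sketch of that arithmetic, assuming the usual encodings (2 bytes for a move.l through (An) or (An)+, 4 bytes with a 16-bit displacement, 6 bytes for add.l with a 32-bit immediate to an address register); the function names are made up for this illustration, and subq plus the branch are left out because both loops have them:

```c
/* Assumed instruction sizes in bytes (see the note above). */
enum {
    MOVE_INDIRECT = 2,  /* move.l (An),Rn  or  move.l (An)+,Rn */
    MOVE_DISP16   = 4,  /* move.l d16(An),Rn */
    ADD_IMM32     = 6   /* add.l #16,An */
};

/* GCC 4 loop body: on each side one (An) move, three d16(An) moves, one add. */
int loop_body_gcc4(void)
{
    int per_side = MOVE_INDIRECT + 3 * MOVE_DISP16 + ADD_IMM32;
    return 2 * per_side;
}

/* Post-increment loop body: eight moves through (An)+, no adds. */
int loop_body_postinc(void)
{
    return 8 * MOVE_INDIRECT;
}
```

Under these assumptions the GCC 4 loop body is 40 bytes and the (An)+ version is 16 bytes.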
What GCC 4 was actually trying to do, and what GCC 2 did, is this:
Code:
copy_32x4C:
link.w %fp,#-16
movm.l #3084,(%sp)
move.l 8(%fp),%a3
move.l 16(%fp),%d2
lsr.l #4,%d2
jbeq .L148
move.l 12(%fp),%a2
.L150:
move.l (%a2)+,%d0
move.l (%a2)+,%d1
move.l (%a2)+,%a0
move.l (%a2)+,%a1
move.l %d0,(%a3)+
move.l %d1,(%a3)+
move.l %a0,(%a3)+
move.l %a1,(%a3)+
subq.l #1,%d2
jbne .L150
.L148:
movm.l (%sp),#3084
unlk %fp
rts
And what I would write is this:
Code:
copy_32x4C:
link.w %fp,#-16
movm.l #3084,(%sp)
move.l 8(%fp),%a3
move.l 16(%fp),%d2
lsr.l #4,%d2
jbeq .L148
move.l 12(%fp),%a2
.L150:
movem.l (%a2),%d0/%d1/%a0/%a1
lea (16,%a2),%a2
movem.l %d0/%d1/%a0/%a1,(%a3)
lea (16,%a3),%a3
subq.l #1,%d2
jbne .L150
.L148:
movm.l (%sp),#3084
unlk %fp
rts
Using movem.l is shorter and faster.
The routine generated by GCC 4 is very bad.
It is clear that GCC has major problems with register allocation.
The middle routine is a lot better: it is half as long and faster :-/
And the last routine is much shorter again and twice as fast as the GCC 4 version.
I wonder how much better the Coldfire would look in embedded benchmarks if the compiler were less stupid.