Quote:
Gunnar, you don't use that register in your next instructions, there is a 4-5 instruction gap between updating a2 or a3 and the next use of a2 or a3 (depending on where you start counting), so how is this an optimization? :D
I was providing information on general advantages of using LEA over ADDI. Do you want me to provide additional information, yes or no?
Quote:
I hardly think this makes a lot of difference to the performance of the function. You saved 4 bytes..
It's 4 bytes in 2 instructions.
There are many more bytes wasted in the other rows.
Maybe looking at the ASM instructions alone does not make this clear enough, so I'll include the hex dump:
GCC Compile example:
C-source
Code:
#include <stddef.h>   /* size_t */

void * copy_32x4(void *destparam, const void *srcparam, size_t size)
{
    int d1, d2, d3, d4;
    int *dest = destparam;
    const int *src = srcparam;
    int size32;

    size32 = size / 16;   /* number of 16-byte blocks */
    for (; size32; size32--) {
        d1 = *src++;
        d2 = *src++;
        d3 = *src++;
        d4 = *src++;
        *dest++ = d1;
        *dest++ = d2;
        *dest++ = d3;
        *dest++ = d4;
    }
}
m68k-linux-gnu-gcc -mcpu=54455 -o out -O2 example.c
<copy_32x4>:
01 4e56 fff4 _____ _____ linkw %fp,#-12
02 48d7 0c04 _____ _____ moveml %d2/%a2-%a3,%sp@
03 202e 0010 _____ _____ movel %fp@(16),%d0
04 e888 _____ _____ _____ lsrl #4,%d0
05 4a80 _____ _____ _____ tstl %d0 # not needed!
06 6736 _____ _____ _____ beqs 22
07 266e 0008 _____ _____ moveal %fp@(8),%a3
08 246e 000c _____ _____ moveal %fp@(12),%a2
09 2400 _____ _____ _____ movel %d0,%d2 # not needed!
10 2012 _____ _____ _____ movel %a2@,%d0
11 222a 0004 _____ _____ movel %a2@(4),%d1
12 206a 0008 _____ _____ moveal %a2@(8),%a0
13 226a 000c _____ _____ moveal %a2@(12),%a1
14 d5fc 0000 0010 _____ addal #16,%a2
15 2680 _____ _____ _____ movel %d0,%a3@
16 2741 0004 _____ _____ movel %d1,%a3@(4)
17 2748 0008 _____ _____ movel %a0,%a3@(8)
18 2749 000c _____ _____ movel %a1,%a3@(12)
19 d7fc 0000 0010 _____ addal #16,%a3
20 5382 _____ _____ _____ subql #1,%d2
21 66d4 _____ _____ _____ bnes 10
22 4cd7 0c04 _____ _____ moveml %sp@,%d2/%a2-%a3
23 4e5e _____ _____ _____ unlk %fp
24 4e75 _____ _____ _____ rts
I've marked in red the deficiencies in the generated code. The generated code is slower and 28 bytes longer than needed.
I think there are three problems with the above code:
1st) line 05: 4a80 tstl %d0
Why does GCC generate the tst instruction? The lsrl in line 04 already sets the condition codes.
2nd) line 09: 2400 movel %d0,%d2
There seems to be a major bug in the way GCC allocates registers. In many places I saw GCC load a value into D0 only to move it to another register later.
3rd) lines 11-19
The copy does not use the (An)+ addressing mode. Why not?
An (An)+ move is 2 bytes shorter per move, and the extra add instructions would not be needed.
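To make the third point concrete, here is a small C model of the two forms. It is only a sketch: the function names are made up for illustration, and it assumes a 4-byte int so that advancing by 4 elements corresponds to the 16-byte add in the listing. It shows that post-incrementing on every access leaves the data and the pointers in the same state as fixed offsets plus one explicit add per pointer.

```c
#include <string.h>

/* Displacement form, as in lines 10-19 of the listing: fixed offsets
   from the base pointers, then one explicit add per pointer.
   Hypothetical helper name, for illustration only. */
static void copy16_offsets(int **dest, const int **src)
{
    const int *s = *src;
    int *d = *dest;
    int d1 = s[0];          /* movel  %a2@,%d0     */
    int d2 = s[1];          /* movel  %a2@(4),%d1  */
    int d3 = s[2];          /* moveal %a2@(8),%a0  */
    int d4 = s[3];          /* moveal %a2@(12),%a1 */
    s += 4;                 /* addal  #16,%a2 (assuming 4-byte int) */
    d[0] = d1;
    d[1] = d2;
    d[2] = d3;
    d[3] = d4;
    d += 4;                 /* addal  #16,%a3 */
    *src = s;
    *dest = d;
}

/* Post-increment form: each access advances its own pointer, which is
   what an (An)+ operand does, so no separate add is required. */
static void copy16_postinc(int **dest, const int **src)
{
    const int *s = *src;
    int *d = *dest;
    int d1 = *s++;
    int d2 = *s++;
    int d3 = *s++;
    int d4 = *s++;
    *d++ = d1;
    *d++ = d2;
    *d++ = d3;
    *d++ = d4;
    *src = s;
    *dest = d;
}

/* Returns 1 if both forms copy the same data and leave the pointers
   at the same position, 0 otherwise. */
int forms_agree(void)
{
    const int in[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    int out_a[8] = { 0 }, out_b[8] = { 0 };
    const int *sa = in, *sb = in;
    int *da = out_a, *db = out_b;
    int i;

    for (i = 0; i < 2; i++) {       /* two 16-byte blocks */
        copy16_offsets(&da, &sa);
        copy16_postinc(&db, &sb);
    }
    return memcmp(out_a, out_b, sizeof out_a) == 0
        && memcmp(out_a, in, sizeof in) == 0
        && sa == sb
        && (da - out_a) == (db - out_b);
}
```

This says nothing about which form GCC will actually pick; it only illustrates that the two are interchangeable for this loop.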
The Coldfire is an embedded CPU. A big advantage of the Coldfire is its compact code. Unfortunately GCC seems to nullify this completely.
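The 28-byte figure can be reproduced from the hex dump. The breakdown below is one possible reading, assuming the two marked instructions are dropped, the six displacement moves become 2-byte (An)+ moves, and both addal instructions disappear. The struct and function names are made up for this sketch.

```c
#include <stddef.h>

/* One row per deficiency: bytes saved per occurrence and how often it
   occurs, read off the hex dump in the listing. */
struct saving {
    const char *what;
    int bytes_each;
    int count;
};

static const struct saving savings[] = {
    { "line 05: tstl %d0 removed (4a80)",                     2, 1 },
    { "line 09: movel %d0,%d2 removed (2400)",                2, 1 },
    { "lines 11-13: loads, 4-byte d16(An) -> 2-byte (An)+",   2, 3 },
    { "lines 16-18: stores, 4-byte d16(An) -> 2-byte (An)+",  2, 3 },
    { "lines 14, 19: addal #16 removed (6 bytes each)",       6, 2 },
};

/* Sums bytes_each * count over all rows. */
int total_bytes_saved(void)
{
    int total = 0;
    size_t i;

    for (i = 0; i < sizeof savings / sizeof savings[0]; i++)
        total += savings[i].bytes_each * savings[i].count;
    return total;
}
```

With these assumptions the rows add up to 2 + 2 + 6 + 6 + 12 = 28 bytes, of which 24 bytes sit inside the loop body.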
Am I just overlooking something here?
Maybe there is a way to get GCC to produce better code.
Can someone help out?