All times are UTC-06:00




Posted: Sat Apr 12, 2008 7:38 am
Site Admin

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1589
Location: Austin, TX
Quote:
Matt,

you are wrong in some technical facts here.
I'll give some explanation.
Quote:
Why do you need to use lea to add 16 to a value in a register? Isn't addi.l simply better here
You would always use LEA here, as LEA is much smaller.
Both LEA and ADDI.L produce the same result.
LEA is 4 Bytes long. ADDI.L is 6 Bytes long.
LEA saves you 2 bytes.
I hardly think this makes a lot of difference to the performance of the function. You saved 4 bytes.. did your project become a 4k demo intro?

It also doesn't show how you determine this would be better just by specifying dst += 16 in C. Really, how do you imagine that adding 16 to a value is converted to "load effective address", conceptually, without having prior knowledge of the purpose of the function as a whole? A C backend cannot know this.
Quote:
In addition to this, LEA also makes its result available to subsequent instructions. This means you can use the register affected by LEA in the next instruction without taking a latency hit in your pipeline.
Gunnar, you don't use that register in your next instructions, there is a 4-5 instruction gap between updating a2 or a3 and the next use of a2 or a3 (depending on where you start counting), so how is this an optimization? :D
Quote:
Quote:
movem.l will move these registers in *reverse* order.
WRONG!
The order of the registers is from data register 0 to data register 7, then from address register 0 to address register 7.
Okay you got me there. My head is stuck in the clouds with real 680x0 and decrement modes which do not exist on ColdFire.. that stack example I made was a bad one :D
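
The ordering rule can be illustrated with a small decoder for the MOVEM register mask. This is only a rough sketch, assuming the mask layout used by the (An) and d16(An) forms, where bit 0 is D0 and bit 15 is A7 (the predecrement form on classic 680x0 reverses the mask). The name `movem_decode` is a hypothetical helper for illustration, not part of any toolchain.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Writes the registers selected by a MOVEM mask into out, in transfer
 * order (d0..d7, then a0..a7), separated by spaces.
 * Assumed layout: bit 0 = D0 ... bit 7 = D7, bit 8 = A0 ... bit 15 = A7.
 * Returns the number of registers written.
 */
int movem_decode(uint16_t mask, char *out, size_t outsz)
{
    int count = 0;
    size_t used = 0;

    assert(out != NULL && outsz > 0);
    out[0] = '\0';
    for (int bit = 0; bit < 16; bit++) {
        if (!(mask & (1u << bit)))
            continue;               /* register not selected */
        int n = snprintf(out + used, outsz - used, "%s%c%d",
                         count ? " " : "", bit < 8 ? 'd' : 'a', bit & 7);
        if (n < 0 || (size_t)n >= outsz - used) {
            out[used] = '\0';       /* drop the partial name and stop */
            break;
        }
        used += (size_t)n;
        count++;
    }
    return count;
}
```

For the mask word 0x0c04 (the second word of a `moveml %d2/%a2-%a3,%sp@` encoded as 48d7 0c04), this gives "d2 a2 a3": data registers first, then address registers, lowest number first.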

_________________
Matt Sealey


Posted: Sat Apr 12, 2008 8:35 am

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
Quote:
Gunnar, you don't use that register in your next instructions, there is a 4-5 instruction gap between updating a2 or a3 and the next use of a2 or a3 (depending on where you start counting), so how is this an optimization? :D
I was providing information on general advantages of using LEA over ADDI. Do you want me to provide additional information, yes or no?

Quote:
I hardly think this makes a lot of difference to the performance of the function. You saved 4 bytes..
It's 4 bytes in 2 instructions.
There are many more bytes wasted in the other rows.
Maybe just looking at the ASM instructions doesn't make this clear enough. I'll include the hex dump to make it clearer:

GCC Compile example:

C-source
Code:
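#include <stddef.h>  /* size_t */

/* Copies size / 16 blocks of four 32-bit words from srcparam to destparam. */
void * copy_32x4(void *destparam, const void *srcparam, size_t size)
{
    int d1, d2, d3, d4;
    int *dest = destparam;
    const int *src = srcparam;
    int size32;

    size32 = size / 16;             /* number of 16-byte blocks */
    for (; size32; size32--) {
        /* load four words, then store four words */
        d1 = *src++;
        d2 = *src++;
        d3 = *src++;
        d4 = *src++;
        *dest++ = d1;
        *dest++ = d2;
        *dest++ = d3;
        *dest++ = d4;
    }
    /* Note: no value is returned although the return type is void *;
       callers must not use the result. */
}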

m68k-linux-gnu-gcc -mcpu=54455 -o out -O2 example.c


<copy_32x4>:

01  4e56 fff4       linkw %fp,#-12
02  48d7 0c04       moveml %d2/%a2-%a3,%sp@
03  202e 0010       movel %fp@(16),%d0
04  e888            lsrl #4,%d0
05  4a80            tstl %d0              # not needed!
06  6736            beqs 22
07  266e 0008       moveal %fp@(8),%a3
08  246e 000c       moveal %fp@(12),%a2
09  2400            movel %d0,%d2         # not needed!
10  2012            movel %a2@,%d0
11  222a 0004       movel %a2@(4),%d1
12  206a 0008       moveal %a2@(8),%a0
13  226a 000c       moveal %a2@(12),%a1
14  d5fc 0000 0010  addal #16,%a2
15  2680            movel %d0,%a3@
16  2741 0004       movel %d1,%a3@(4)
17  2748 0008       movel %a0,%a3@(8)
18  2749 000c       movel %a1,%a3@(12)
19  d7fc 0000 0010  addal #16,%a3
20  5382            subql #1,%d2
21  66d4            bnes 10
22  4cd7 0c04       moveml %sp@,%d2/%a2-%a3
23  4e5e            unlk %fp
24  4e75            rts


I've marked the deficiencies in the generated code ("# not needed!"). The generated code is slower and 28 bytes longer than needed.

I think there are three problems with the above code:

1st)
line 05 4a80
tstl %d0
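Why does GCC generate the tst instruction? The lsrl on line 04 already sets the condition codes for the result in %d0, so the beqs could branch on them directly.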

2nd)
line 09 2400 movel %d0,%d2
There seems to be a major bug in the way GCC allocates registers. In many places I saw GCC load a value into D0 only to move it to another register later.

3rd)
lines 11-19
The copy does not use the (An)+ addressing mode.
Why not?
A (An)+ move is 2 bytes shorter per move.
And the extra add instructions would not be needed.
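
For what it's worth, the 28-byte figure can be reproduced with a quick tally. This is only a sketch: the breakdown into four rows is an assumed reading of the listing (the post does not spell it out), and the names `struct saving` and `total_bytes_saved` are made up for illustration. It relies on the instruction sizes visible in the hex dump: a move.l through (An) is 2 bytes, a move.l through d16(An) is 4 bytes, and addal with a 32-bit immediate is 6 bytes. It also assumes a move.l through (An)+ is 2 bytes, like the plain (An) form.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the tally: how many instructions change, and bytes saved for each. */
struct saving {
    const char *what;
    int count;
    int bytes_each;
};

/* Assumed breakdown, derived from the hex dump in the listing. */
static const struct saving savings[] = {
    { "tstl %d0 removed (line 05)",                          1, 2 },
    { "movel %d0,%d2 removed (line 09)",                     1, 2 },
    { "d16(An) moves become (An)+ (lines 11-13 and 16-18)",  6, 2 },
    { "addal #16,An removed (lines 14 and 19)",              2, 6 },
};

/* Sums count * bytes_each over all rows. */
int total_bytes_saved(void)
{
    int total = 0;
    for (size_t i = 0; i < sizeof savings / sizeof savings[0]; i++) {
        assert(savings[i].count > 0 && savings[i].bytes_each > 0);
        total += savings[i].count * savings[i].bytes_each;
    }
    return total;
}
```

Under these assumptions the rows add up as 2 + 2 + 12 + 12, which matches the 28 bytes claimed.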


The Coldfire is an embedded CPU. A big advantage of the Coldfire is its compact code. Unfortunately GCC seems to nullify this completely.


Am I just overlooking something here?
Maybe there is a way to get GCC to produce better code.
Can someone help out?


Posted: Sat Apr 12, 2008 9:37 am

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
Hi
Quote:
Quote:
I might be overlooking something but to me it looks like the Coldfire code generation in GCC 4 is badly broken.

Here is an example of a very simple loop copying 16 bytes.
Please, please provide the code you used to create the resulting assembler. We want results that we can reproduce!
I've added an example for you.
BTW, the problems with extra TST instructions and with loading a variable into a Dx register first only to move it to another register later are consistent across all the generated code that I saw.

Quote:
You might want to try out the different optimize options from gcc (-O2 vs. -O3 vs. -Os). This can make a huge difference!
Yes I did.
I tried different CPU settings, ranging from -mcpu=68040 and -mcpu=68060 to various CF CPUs. I tried -O2, -O3 and -Os.

Quote:
I'm not saying gcc is good/bad. coldfire is a niche architecture and you cannot expect the same code optimizations as you see on x86.
I dare to disagree with you here.

If the problems would be specific to an instruction of a new Coldfire model, then I would agree - but they are not.
The problems are of general nature.
GCC emits lots of unneeded TST instructions. Why does it do this?
GCC seems to have problems allocating registers.
And GCC no longer uses the 68k addressing modes efficiently.

GCC used to create much better 68k code before.

68k and Coldfire are not new or rare CPUs.
The Coldfire family is 15 years old now and the 68k family much older.

I think there is some room for improvement.
But maybe I just got a bad compiler sub-version, or I'm misusing it?

