 Post subject: Coldfire GCC performance
PostPosted: Fri Apr 04, 2008 1:08 pm 

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
I'm at the beginning of my tests on ColdFire, so it's too early for a complete analysis.

I might be overlooking something, but to me it looks like the ColdFire code generation in GCC 4 is badly broken.

Here is an example of a very simple loop copying 16 bytes per iteration.

GCC 4 creates terrible code for Coldfire.
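For reference, the C loop I am talking about looks roughly like this (a sketch of the shape of the test code, not necessarily the exact source I compiled):
Code:
/* Sketch only: copy 'size' bytes in 16-byte blocks, four longwords at a time. */
void copy_32x4C(unsigned long *dst, const unsigned long *src, unsigned long size)
{
    unsigned long blocks = size >> 4;   /* number of 16-byte blocks */

    while (blocks--) {
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
    }
}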

Example GCC4:
Code:
copy_32x4C:
01 link.w %fp,#-16
02 movm.l #3084,(%sp)
03 move.l 8(%fp),%d3
04 move.l 16(%fp),%d0
05 lsr.l #4,%d0
06 tst.l %d0
07 jbeq .L148
08 move.l %d3,%a3
09 move.l 12(%fp),%a2
10 move.l %d0,%d2
.L150:
11 move.l (%a2),%d0
12 move.l 4(%a2),%d1
13 move.l 8(%a2),%a0
14 move.l 12(%a2),%a1
15 add.l #16,%a2
16 move.l %d0,(%a3)
17 move.l %d1,4(%a3)
18 move.l %a0,8(%a3)
19 move.l %a1,12(%a3)
20 add.l #16,%a3
21 subq.l #1,%d2
22 jbne .L150
.L148:
23 movm.l (%sp),#3084
24 unlk %fp
25 rts
Several problems are visible:
GCC loads variables into certain registers only to copy them to other registers later.
Lines 08 and 10 are useless.
Line 06: the tst instruction is not needed.
Lines 11-21: the copy loop is very badly written.
If the loop used the (an)+ addressing mode, it would be 24 bytes shorter and would not need the two add instructions.



What GCC 4 actually wanted to do, and what GCC 2 used to produce, is this:
Code:
copy_32x4C:
link.w %fp,#-16
movm.l #3084,(%sp)
move.l 8(%fp),%a3
move.l 16(%fp),%d2
lsr.l #4,%d2
jbeq .L148
move.l 12(%fp),%a2
.L150:
move.l (%a2)+,%d0
move.l (%a2)+,%d1
move.l (%a2)+,%a0
move.l (%a2)+,%a1
move.l %d0,(%a3)+
move.l %d1,(%a3)+
move.l %a0,(%a3)+
move.l %a1,(%a3)+
subq.l #1,%d2
jbne .L150
.L148:
movm.l (%sp),#3084
unlk %fp
rts


And what I would write is this:
Code:
copy_32x4C:
link.w %fp,#-16
movm.l #3084,(%sp)
move.l 8(%fp),%a3
move.l 16(%fp),%d2
lsr.l #4,%d2
jbeq .L148
move.l 12(%fp),%a2
.L150:
movem.l (%a2),%d0/%d1/%a0/%a1
lea (16,%a2),%a2
movem.l %d0/%d1/%a0/%a1,(%a3)
lea (16,%a3),%a3
subq.l #1,%d2
jbne .L150
.L148:
movm.l (%sp),#3084
unlk %fp
rts
Using movem.l is shorter and faster.


The routine created by GCC 4 is very bad.
It is obvious that GCC has major problems with register allocation.

The middle routine is a lot better: it is half the size and faster :-/

And the lowest routine is again much shorter and twice as fast as the GCC 4 version.


I'm wondering how much better the ColdFire would look in embedded benchmarks if the compiler were less stupid.


PostPosted: Sat Apr 05, 2008 5:48 am 
Site Admin

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1589
Location: Austin, TX
Quote:
I'm at the beginning of my tests on ColdFire, so it's too early for a complete analysis.
O U C H. But we all knew GCC 4 sucked anyway. It completely fails to produce manageable PowerPC code too, with the same problems.
Quote:
I'm wondering how much better the ColdFire would look in embedded benchmarks if the compiler were less stupid.
Maybe we should look into getting a better compiler. Embedded systems development rarely focusses on GCC. I've always had to use Wind River's (DIAB) Compiler which has traditionally produced much better code than even GCC 2.x and is used a lot in industry (especially with VxWorks) - the other choice might be CodeWarrior, which obviously has the Freescale guys working closely on it. You're going to be stuck in Windows or x86 Linux, though, for the best code generation.

If either of these compilers performs noticeably better, I would say trying to fix gcc is not worth the hassle. I do not think it should be a requirement of a platform that the development tools must be free/libre, if it means spending man-years of development bringing them up to scratch. It would be cheaper to buy a working, commercial compiler.

Another benefit of CodeWarrior and Wind River CC over GCC is also obvious; they have direct support for debugging the exact board you're working on, in a nicely usable GUI.

They also have usable, automatic profile-guided optimization on the development host. GCC 4 got this too, but I happen to think it's absolutely terrible (the performance improvements were minimal for the hassle involved, where I could actually think of something to test on Windows). If GCC 4 were any good, why would Mozilla release builds be generated with Microsoft Visual C++? :)

_________________
Matt Sealey


PostPosted: Sat Apr 05, 2008 6:06 am 

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348
Quote:
If either of these compilers performs noticeably better, I would say trying to fix gcc is not worth the hassle. I do not think it should be a requirement of a platform that the development tools must be free/libre, if it means spending man-years of development bringing them up to scratch. It would be cheaper to buy a working, commercial compiler.
I completely disagree. Whether you like it or not, Linux is picking up more and more speed; it's starting to infiltrate platforms and target markets that were previously the realm of VxWorks and other RTOSes. GCC is the #1 compiler to use everywhere; perhaps it's not the most efficient, but it's very close. Instead of dumping it for a closed-source, non-portable compiler, I'd first get Freescale or any of the CPU players to contribute to GCC (exactly like Intel and AMD do). There's no point in complaining about code quality on gcc's part if IBM/Freescale don't do something about it.

And for the record, I've yet to see a GCC 2.x benchmark beating GCC 4. In every case I've run, GCC 4 produces much better code than GCC 2, at least on the G4, where I've done extensive testing over the past months. Plus, gcc 2.x has been obsolete for the past 3-4 years; why even bother?

And finally, I think it's dumb to wait for a compiler to cater for the most time-consuming functions. It's never done where it matters. As both Gunnar and I have shown many times, here and in other places, nothing beats hand-written asm - and with highly tuned C you *can* actually force the compiler into producing more efficient code.
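As a rough illustration of the kind of source-level tuning I mean (a sketch only - the restrict qualifier and the manual unrolling are just an example, this is not Gunnar's code):
Code:
/* Tell the compiler that src and dst never overlap (C99 'restrict') so it
   does not have to assume aliasing between the loads and stores, and unroll
   the 16-byte block by hand instead of hoping the compiler does it. */
void copy_blocks(unsigned long *restrict dst,
                 const unsigned long *restrict src,
                 unsigned long nbytes)
{
    unsigned long n = nbytes >> 4;      /* number of 16-byte blocks */

    while (n--) {
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
        dst[3] = src[3];
        dst += 4;
        src += 4;
    }
}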


 Post subject:
PostPosted: Sat Apr 05, 2008 7:07 am 
Genesi

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1422
Gunnar, we will see if we can't get you a copy of CodeWarrior, but the regular gcc toolchain may be better for what you want to really do. We had a good meeting with the ColdFire Team yesterday and we expect them to be joining the discussion here soon. They will have some meaningful tips and tricks to suggest.

R&B :)

_________________
http://bbrv.blogspot.com


PostPosted: Sat Apr 05, 2008 9:02 am 
Site Admin

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1589
Location: Austin, TX
Quote:
I completely disagree. Whether you like it or not, Linux is picking up more and more speed; it's starting to infiltrate platforms and target markets that were previously the realm of VxWorks and other RTOSes.
Linux does not always imply gcc :)
Quote:
Instead of dumping it for a closed-source, non-portable compiler, I'd first get Freescale or any of the CPU players to contribute to GCC (exactly like Intel and AMD do). There's no point in complaining about code quality on gcc's part if IBM/Freescale don't do something about it.
Freescale and IBM don't do something about it because they have compiler businesses to maintain! CodeWarrior and IBM xcc are big business for big code the same way Intel's compiler is for Intel.

Freescale has, however, contracted out GCC development for the e500 and ColdFire cores - their highly embedded stuff - as I've pointed out to you already on IRC. Even in my own embedded-systems experience, when DIAB and GCC failed, CodeSourcery was put on the table back in 2001 as a project that could be funded to get GCC up to scratch on ColdFire. It's a shame the company I was working for spiralled into hell pretty soon after (they would rather have funded a logo on an F1 car, and a private jet for the executives, than pay for development).

I would seriously look at the time and money to be spent on improving GCC, and whether the code generation improvements are worth it, compared to simply using a known working compiler with specifically engineered processor support, like DIAB or CodeWarrior. How far down the line do you want to see the results of improvements to GCC? A year? Two years? How much code will be compiled with the broken GCC up to that point, and how much of it will have to be *recompiled* later?

I'll paraphrase a popular expression: GCC is only free if your time is worthless. I would rather Gunnar got on with his project than spent his time developing GCC. I'd rather you were optimizing Freevec, Mesa etc. than tweaking GCC.
Quote:
And for the record, I've yet to see a GCC 2.x benchmark beating GCC 4. (snip) Plus, gcc 2.x has been obsolete for the past 3-4 years; why even bother?
I have yet to see GCC 3 or GCC 4 perform consistently as well as gcc 2.95.3 - or any commercial compiler, bar none.
Quote:
And finally, I think it's dumb to wait for a compiler to cater for the most time-consuming functions.
Well, this was my point. Where it matters you would hand-code the assembly anyway, and where it doesn't, you would hope that your compiler is just not as braindead as it could be. In my experience, though, later versions of GCC 4 have improved in TECHNOLOGY (SSA trees and all that nonsense), which may have sped up the resource allocation/register usage and given the actual compile process a speed boost, but the code generation has not gotten better on ANY platform. Autovectorisation is a joke, and more effort has been put into doing FP math in SSE units than into actually improving core code generation. It may even be that it is bordering on impossible to express, from SSA/RTL, the combining of register moves so that many move.l instructions can be transformed into groups of movem.l efficiently - the evidence of poor translation from virtualised register description languages to real-life code is exactly those "useless lines" - or it may be possible, but the actual benefit, and the number of times it would ever be achieved, is minimal anyway.

I would put more effort into optimizing the system - take a holistic approach. Don't spend too much time trying to fix one tiny thing but spread it around so that gradual, small improvements can improve the whole system. Once the application code is compiled, fixing the compiler is redundant. But optimizing a runtime library or the kernel to do something better and faster, means that same application code can load and use them and gain performance without trying.

For example, Gunnar's project is about running m68k code on ColdFire, so we will assume he cannot recompile the m68k code (if he could, why not just compile everything as ColdFire-native?) - at this point, compiler technology is redundant :D

_________________
Matt Sealey


Last edited by Neko on Mon Apr 07, 2008 1:31 pm, edited 1 time in total.

 Post subject:
PostPosted: Mon Apr 07, 2008 9:01 am 

Joined: Thu Mar 20, 2008 11:26 am
Posts: 5
OK, my first post to this forum. I was really excited to see all the traffic so far, and had to think hard about where I wanted to post first.


Thanks to everybody who is helping to poke holes in the GCC ColdFire implementation. I want to give you some more help and see if you can point me in the right direction.

FSL has funded CodeSourcery for the last couple of years to bring ColdFire on par with the other architectures in the GCC toolchain. My understanding is that we have been making progress, but that we are of course still behind Diab and GHS. That is to be expected; it will take a few more years to get GCC truly competitive with some of the other compilers. Our belief is that the latest releases of GCC are catching up with, and may even have surpassed, our CodeWarrior compiler on some benchmarks and code examples. My goal is to keep pushing both platforms. So yes, we at FSL are actively pushing both toolchains, and for good reason. CodeWarrior allows us to enable lots of customers, from our very low-end V1 ColdFires to our PowerA (PowerPC) products. We also see a need for low-cost tools, and GCC fills that need both as a bare-metal (ELF) compiler and as our Linux compilers for both the MMU and non-MMU versions of the kernel.

Now...

I could use a few data points. What specific compiler release are you using?

The latest ColdFire and PowerA drops should be available on CodeSourcery's site. Also, we include in the M54455EVB box a DVD that has the latest GCC Linux compiler. I can't remember if we include the ELF one; I think you need to grab it from CodeSourcery.

Latest ColdFire GCC toolchains:

http://www.codesourcery.com/gnu_toolcha ... nload.html

Also, an item of note: a new release is due out very soon. We have scheduled Spring and Fall releases. The download link I've just posted is for the Fall 2007 release.

I'll also commit that if you find areas for improvement, I'll make sure summary posts get passed on and reviewed, to see if we can get them worked into the next release.

-JWW


 Post subject:
PostPosted: Mon Apr 07, 2008 10:03 am 

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
Hi JW,

Thanks for your feedback.
Quote:
I could use a few data points. What specific compiler release are you using?

The latest ColdFire and PowerA drops should be available on CodeSourcery's site. Also, we include in the M54455EVB box a DVD that has the latest GCC Linux compiler. I can't remember if we include the ELF one; I think you need to grab it from CodeSourcery.

Latest ColdFire GCC toolchains:

http://www.codesourcery.com/gnu_toolcha ... nload.html

Also, an item of note: a new release is due out very soon. We have scheduled Spring and Fall releases. The download link I've just posted is for the Fall 2007 release.

I'll also commit that if you find areas for improvement, I'll make sure summary posts get passed on and reviewed, to see if we can get them worked into the next release.

-JWW
I use the following GCC from CodeSourcery:

[gvb@lc4eb2326035573 68k]$ m68k-linux-gnu-gcc -v
Using built-in specs.
Target: m68k-linux-gnu
Configured with: /scratch/shinwell/cf-fall-linux-lite/src/gcc-4.2/configure --build=i686-pc-linux-gnu --host=i686-pc-linux-gnu --target=m68k-linux-gnu --enable-threads --disable-libmudflap --disable-libssp --disable-libgomp --disable-libstdcxx-pch --with-arch=cf --with-gnu-as --with-gnu-ld --enable-languages=c,c++ --enable-shared --enable-symvers=gnu --enable-__cxa_atexit --with-pkgversion=Sourcery G++ Lite 4.2-47 --with-bugurl=https://support.codesourcery.com/GNUToolchain/ --disable-nls --prefix=/opt/freescale/usr/local/gcc-4.2.47-eglibc-2.5.47/m68k-linux --with-sysroot=/opt/freescale/usr/local/gcc-4.2.47-eglibc-2.5.47/m68k-linux/m68k-linux-gnu/libc --with-build-sysroot=/scratch/shinwell/cf-fall-linux-lite/install/m68k-linux-gnu/libc --enable-poison-system-directories --with-build-time-tools=/scratch/shinwell/cf-fall-linux-lite/install/m68k-linux-gnu/bin --with-build-time-tools=/scratch/shinwell/cf-fall-linux-lite/install/m68k-linux-gnu/bin
Thread model: posix
gcc version 4.2.1 (Sourcery G++ Lite 4.2-47)


Which compiler flags would you recommend for code generation for the 54455?

Cheers
Gunnar


 Post subject:
PostPosted: Mon Apr 07, 2008 10:27 am 

Joined: Thu Mar 20, 2008 11:26 am
Posts: 5
Gunnar,

Did you use the prebuilt or did you build one from the source files?

I'm checking to see if I can find any recommendations.

-JWW


 Post subject:
PostPosted: Tue Apr 08, 2008 9:11 am 

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
Quote:
Gunnar,

Did you use the prebuilt or did you build one from the source files?

I'm checking to see if I can find any recommendations.

-JWW
Hi John,

Yes, I used the prebuilt package.
This is the compiler installer I used:
freescale-coldfire-4.2-47-m68k-linux-gnu-i686-pc-linux-gnu.tar.bz2


 Post subject:
PostPosted: Tue Apr 08, 2008 12:46 pm 

Joined: Mon Apr 07, 2008 3:16 pm
Posts: 1
My first post here, so hello all!

Gunnar, what are your thoughts on this?

AustexSoftware Coldfire Tools

BTW, did you get hold of Mathias? He was looking for other "Amiga asm ColdFire people" and looks like a possible asset for your project.

regards


 Post subject:
PostPosted: Tue Apr 08, 2008 3:38 pm 
Site Admin

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1589
Location: Austin, TX
Quote:
My first post here, so hello all!
Hi Dan, welcome :)
Quote:
BTW, did you get hold of Mathias? He was looking for other "Amiga asm ColdFire people" and looks like a possible asset for your project.
Speaking of Amiga stuff, why not give VBCC's ColdFire backend a whirl?

I'd actually be very interested if any version of VBCC could be packaged for Linux distributions, as it seems to produce nicer code than gcc does, and compiles faster too. On MorphOS it's almost a de facto standard, although personally I wouldn't like to use it in production/commercial environments like a nuclear power station without a huge amount of testing :)

The OpenFirmware target by Frank Wille would be extremely fun to bring up to an acceptable standard.

_________________
Matt Sealey


 Post subject:
PostPosted: Thu Apr 10, 2008 12:34 pm 

Joined: Tue Nov 02, 2004 6:17 am
Posts: 28
Hi Dan!
Your link to austexsoftware is interesting not only because of the compiler - this ColdFire A1200-form-factor board amazes me!

Getting back to well-readable 68k/ColdFire asm is exciting! I will also follow Matt's advice and take a look at the output of vbcc.

Regards,
Peter


PostPosted: Sat Apr 12, 2008 3:00 am 

Joined: Tue Feb 14, 2006 2:01 pm
Posts: 75
Location: Germany
Quote:
I might be overlooking something, but to me it looks like the ColdFire code generation in GCC 4 is badly broken.

Here is an example of a very simple loop copying 16 bytes per iteration.
Please, please provide the code you used to create the resulting assembler. We want results that we can reproduce!

You might want to try out the different optimization options in gcc (-O2 vs. -O3 vs. -Os). This can make a huge difference!

I'm not saying gcc is good or bad. ColdFire is a niche architecture, and you cannot expect the same code optimizations as you see on x86.


PostPosted: Sat Apr 12, 2008 4:12 am 
Site Admin

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1589
Location: Austin, TX
Quote:
Please, please provide the code you used to create the resulting assembler. We want results that we can reproduce!
I agree, code would be good.
Quote:
I'm not saying gcc is good or bad. ColdFire is a niche architecture, and you cannot expect the same code optimizations as you see on x86.
I think at the end of the day the code is going to be as good as the mapping between literal C code and the architecture of the processor.

If you look at Gunnar's third code snippet, his optimization, you will notice that movem.l will move these registers in *reverse* order. The only reason his optimization works is because the movem.l that stores them back has the same caveat.

I could pick at Gunnar's code for the sheer unreadability of it - the register ranges are written in the same order as GCC generated them, d0/d1/a0/a1. Why not specify a1/a0/d1/d0, in the real order movem.l moves them, for greater readability? Both assemble fine and produce the same effect, but you can far more easily track the data movement without needing full knowledge of the manual :)

The same can come from the use of the lea instruction. Why do you need to use lea to add 16 to a value in a register? Isn't addi.l simply better here (GCC does not seem to mind generating it, so let's assume it's perfectly valid unless you are running on an original 68000)? Or why not use move.l to load two longwords and use addq.l interleaved?

(as a theoretical study and to further prove my point, explain how GCC would determine that adding 16 to a value in a register ( dst += 16; ) would be converted to "load effective address" to produce Gunnar's code)

Honestly, none of these are better than each other :D

The code generated by GCC2 is perfectly valid. VBCC produces almost exactly the same (and compiles it faster!).

The code generated by GCC4 has some weirdness in it. I think this is reason enough to move back to GCC2 or pick up CodeWarrior or DIAB. There is no doubt that Gunnar's optimization is faster, if you move to hand-made assembler.

But not generating movem.l here (or lea, for that matter) is not a bug or a flaw in GCC, and I cannot think of many situations where you would manually copy more than, say, 16 bytes rather than use memcpy() or so on, that would warrant GCC needing to optimize this.

Rather than hack around in GCC, may I suggest something Apple did: a header called ppc_intrinsics.h, which is used everywhere in Apple examples. It gives you a very simple way to use individual instructions directly from C code, just like you would use AltiVec or MMX intrinsics, without manually worrying about hand assembly.

(this URL probably won't highlight as it has weird characters in it..)

http://gcc.gnu.org/viewcvs/*checkout*/b ... ion=108090

Check out the "astrcmp" function near the bottom of the file. Why not have a memory copy intrinsic for ColdFire which used movem.l and use it where you need to?

I actually think this header (it's GPL after all) should/would make its way into Freevec, too, so we don't need a Mac to optimize our code :)

_________________
Matt Sealey


PostPosted: Sat Apr 12, 2008 7:11 am 

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
Quote:
Quote:
Please, please provide the code you used to create the resulting assembler. We want results that we can reproduce!
I agree, code would be good.
Quote:
I'm not saying gcc is good or bad. ColdFire is a niche architecture, and you cannot expect the same code optimizations as you see on x86.
I think at the end of the day the code is going to be as good as the mapping between literal C code and the architecture of the processor.

If you look at Gunnar's third code snippet, his optimization, you will notice that movem.l will move these registers in *reverse* order. The only reason his optimization works is because the movem.l that stores them back has the same caveat.

I could pick at Gunnar's code for the sheer unreadability of it - the register ranges are written in the same order as GCC generated them, d0/d1/a0/a1. Why not specify a1/a0/d1/d0, in the real order movem.l moves them, for greater readability? Both assemble fine and produce the same effect, but you can far more easily track the data movement without needing full knowledge of the manual :)

The same can come from the use of the lea instruction. Why do you need to use lea to add 16 to a value in a register? Isn't addi.l simply better here (GCC does not seem to mind generating it, so let's assume it's perfectly valid unless you are running on an original 68000)? Or why not use move.l to load two longwords and use addq.l interleaved?

(as a theoretical study and to further prove my point, explain how GCC would determine that adding 16 to a value in a register ( dst += 16; ) would be converted to "load effective address" to produce Gunnar's code)

Honestly, none of these are better than each other :D
Matt,

you are wrong about some technical facts here.
I'll give some explanation.
Quote:
Why do you need to use lea to add 16 to a value in a register? Isn't addi.l simply better here
You would always use LEA here, as LEA is much smaller.
Both LEA and ADDI.L produce the same result.
LEA is 4 bytes long; ADDI.L is 6 bytes long.
LEA saves you 2 bytes.

In addition, LEA makes its result available to subsequent instructions. This means you can use the register affected by the LEA in the next instruction without taking a latency hit in the pipeline.
Please read the CF V4 manual regarding result forwarding.
Quote:
movem.l will move these registers in *reverse* order.
WRONG!
The order of the registers is from data register 0 to data register 7, then from address register 0 to address register 7.

