All times are UTC-06:00




PostPosted: Fri Dec 14, 2007 3:18 am 

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
There is plenty of free information available on how to write PPC code. But I've never seen good tips or knowledge about how to write fast code.

When I think about writing a memcpy, there are numerous "rules" that you need to follow to write fast code for PPC.
I have never seen them written down anywhere.

For example:
  • DCBZ is great for increasing performance on e300/G3/G4/G5, but it hurts performance badly on CELL. (I have searched a lot for an explanation of why it hurts on CELL but found the answer nowhere.)

  • DCBT is great for increasing performance on G3/G4/G5/CELL, but it hurts performance on e300 and Power4.

  • Using 4 registers for the main copy loop gives the best performance on e300 and CELL, while on 970 and Power using only 2 registers is much better.
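One way such rules could be collected in a form that code can use is a small per-CPU tuning table. The following is only a sketch: the struct, the enum and `find_tuning` are hypothetical names, the entries simply restate the three rules above, and anything the rules do not mention is marked as unknown rather than guessed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Tri-state, because the rules above do not cover every CPU. */
enum effect { UNKNOWN = 0, HELPS, HURTS };

/* Hypothetical record of copy-loop tuning hints for one CPU family. */
struct copy_tuning {
    const char *cpu;     /* CPU name as used in the post */
    enum effect dcbz;    /* effect of DCBZ in the copy loop */
    enum effect dcbt;    /* effect of DCBT prefetching */
    int copy_regs;       /* registers in the main loop, 0 = not stated */
};

static const struct copy_tuning tuning_table[] = {
    { "e300",   HELPS,   HURTS, 4 },
    { "G3",     HELPS,   HELPS, 0 },
    { "G4",     HELPS,   HELPS, 0 },
    { "G5",     HELPS,   HELPS, 2 },  /* G5 is the PowerPC 970 */
    { "CELL",   HURTS,   HELPS, 4 },
    /* Assumption: "Power" in the register rule includes Power4. */
    { "Power4", UNKNOWN, HURTS, 2 },
};

/* Return the entry for the named CPU, or NULL if it is not listed. */
const struct copy_tuning *find_tuning(const char *cpu)
{
    assert(cpu != NULL);
    for (size_t i = 0; i < sizeof tuning_table / sizeof tuning_table[0]; i++) {
        if (strcmp(tuning_table[i].cpu, cpu) == 0)
            return &tuning_table[i];
    }
    return NULL;
}
```

A copy routine could call `find_tuning("CELL")` once at startup and pick a loop variant from the result. The values are one person's measurements, so they should be treated as starting points to benchmark, not as facts.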
I wonder if it would make sense to "collect" this type of knowledge for the public?
Maybe a wiki where experienced developers can contribute tips and tricks?

What do you think, would this make sense?
I think a collection of tips is better than everybody having to find this out by trial and error on their own.

BBRV maybe you could help and get compiler writer tips from the PPC vendors (IBM, Freescale)?



Cheers
Gunnar


PostPosted: Fri Dec 14, 2007 5:08 am 
Site Admin

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1589
Location: Austin, TX
Quote:
When I think about writing a memcpy, there are numerous "rules" that you need to follow to write fast code for PPC.
I have never seen them written down anywhere.
Aren't a lot of them in the PowerPC Compiler Writer's Guide and the 32-bit Programming Environments Manual, and common sense if you read the docs for the chips?
Quote:
Using 4 registers for the main copy loop gives the best performance on e300 and CELL, while on 970 and Power using only 2 registers is much better.
Wow, now that IS esoteric. Why is that?

Quote:
I wonder if it would make sense to "collect" this type of knowledge for the public?
Maybe a wiki where experienced developers can contribute tips and tricks?

What do you think, would this make sense?
I think a collection of tips is better than everybody having to find this out by trial and error on their own.

BBRV, maybe you could help and get compiler writer tips from the PPC vendors (IBM, Freescale)?
I think this is a good idea; the problem is finding a way to organise the information. How do you categorize something like "how many registers should I use for a loop in a memcpy implementation"? How many times does someone need to read that before it's implemented in glibc, BSD libc, newlib, and then... is common enough? :)

Right now to fill some slow parts of my day I am looking at a better way to organise Power Developer as the "Xoops Style CMS with PHPBB forum" is really not cutting it these days. However the alternatives (Drupal, a true Wiki..) come with some mind-boggling caveats and a need for some very specific, very dedicated maintenance (like hiring 5 guys to run the site, review articles etc. and write custom modules and keep track of the source code issues etc.).

Perhaps you could collect the information and we can work out a way to run a site with that information in.

_________________
Matt Sealey


PostPosted: Tue Dec 18, 2007 8:00 pm 

Joined: Thu Nov 18, 2004 11:48 am
Posts: 110
Quote:
  • DCBZ is great for increasing performance on e300/G3/G4/G5, but it hurts performance badly on CELL. (I have searched a lot for an explanation of why it hurts on CELL but found the answer nowhere.)
dcbz working fine on G5? Quite strange. dcbzl should work better on G5 (and probably Cell).

UPDATE: I asked segher about that, and he told me that dcbz clearing just 32 bytes instead of a whole line was an Apple workaround to keep closed-source applications working; on Linux dcbz should clear a full cache line... So now I'm puzzled ^^;

lu


PostPosted: Wed Dec 19, 2007 1:35 pm 

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
Quote:
Quote:
  • DCBZ is great for increasing performance on e300/G3/G4/G5, but it hurts performance badly on CELL. (I have searched a lot for an explanation of why it hurts on CELL but found the answer nowhere.)
dcbz working fine on G5? Quite strange. dcbzl should work better on G5 (and probably Cell).

UPDATE: I asked segher about that, and he told me that dcbz clearing just 32 bytes instead of a whole line was an Apple workaround to keep closed-source applications working; on Linux dcbz should clear a full cache line... So now I'm puzzled ^^;

lu
Segher is right.

To be bug-compatible with old PPC software, the 970 can operate in two modes. A bit in its configuration register selects either "normal" or "compatible" mode.

a) In normal mode, DCBZ clears the whole cache line (128 bytes). Linux operates in this mode.

b) Mac OS uses compatibility mode, to keep old programs working that assume every CPU has 32-byte cache lines.
All PowerPC CPUs used by Apple in the last 10 years (60x, 750, 74xx) had 32-byte cache lines.
So there is a fair amount of PPC-optimized software that makes the risky assumption that every CPU has a 32-byte cache line.

In compatibility mode, DCBZ clears only one cache line sector (32 bytes).

This keeps old programs working.
The reason is that an optimized copy routine, like the one for the EFIKA, was always written for PPC CPUs with a 32-byte cache line.
Such a routine clears the cache line and then copies 32 bytes.
If you run such a routine on a CPU with 128-byte cache lines, it malfunctions.

A typical 32-byte copy loop takes one iteration to copy a whole cache line on G2/G3/G4, but it takes four iterations to copy a whole line on G5.

- In the 1st iteration, the DCBZ clears the 128-byte cache line and the copy sets bytes 0-31.
- In the 2nd iteration, the DCBZ clears the cache line (again all 128 bytes) and the copy sets bytes 32-63.
The problem is that bytes 0-31 are now lost, overwritten by the DCBZ.
- In the 3rd iteration, the DCBZ clears the cache line (again all 128 bytes) and the copy sets bytes 64-95.
Again, bytes 0-63 are now lost, overwritten by the last DCBZ.
- In the 4th and last iteration for that line, the DCBZ again clears the whole line (all 128 bytes) and the copy sets the last sector, bytes 96-127.
After the last byte of the cache line has been touched, the CPU can flush the cache line out to memory.

As we can clearly see, the first 3 x 32 bytes end up as zero (cleared by DCBZ), and only the last 32 bytes of the cache line hold the correct values.
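The four iterations above can be replayed in plain C. This is only a simulation, assuming DCBZ can be modelled as a `memset` of the aligned block that contains the target address; `sim_dcbz`, `legacy_copy` and `run_demo` are made-up names, and no real cache instruction is executed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LINE_SIZE 128  /* cache line size on the G5 */
#define CHUNK     32   /* line size the legacy loop assumes */

/* Model of DCBZ: zero the aligned block of `block` bytes containing `off`.
 * Offsets are relative to a buffer assumed to start on a line boundary. */
static void sim_dcbz(unsigned char *buf, size_t off, size_t block)
{
    memset(buf + (off / block) * block, 0, block);
}

/* Legacy copy loop: one DCBZ, then a 32-byte copy, per iteration. */
void legacy_copy(unsigned char *dst, const unsigned char *src,
                 size_t len, size_t dcbz_block)
{
    assert(len % CHUNK == 0 && len % dcbz_block == 0);
    for (size_t off = 0; off < len; off += CHUNK) {
        sim_dcbz(dst, off, dcbz_block);
        memcpy(dst + off, src + off, CHUNK);
    }
}

/* Copy one 128-byte line and return how many bytes arrived intact. */
size_t run_demo(size_t dcbz_block)
{
    unsigned char src[LINE_SIZE], dst[LINE_SIZE];
    size_t ok = 0;

    for (size_t i = 0; i < LINE_SIZE; i++)
        src[i] = (unsigned char)(i + 1);   /* never zero */
    memset(dst, 0xFF, sizeof dst);

    legacy_copy(dst, src, LINE_SIZE, dcbz_block);

    for (size_t i = 0; i < LINE_SIZE; i++)
        ok += (dst[i] == src[i]);
    return ok;
}
```

With a 128-byte DCBZ (`run_demo(128)`) only the last 32 bytes survive, while with the 32-byte compatibility behavior (`run_demo(32)`) all 128 bytes are copied correctly.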


On Linux, DCBZ keeps its original behavior of clearing the whole cache line.
This is no problem on Linux, as there is hardly any software optimized for PPC, and almost all of the software is open source and can be updated quickly anyway.

On Mac OS there is software that uses DCBZ as a speed optimization but incorrectly assumes that the CPU has a 32-byte cache line.

I hope that I could explain it somehow. :-)

This DCBZ compatibility behavior and the extra DCBZL instruction are only available on the 970 CPU.
They were added specifically to be "bug-compatible" with Mac software.
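Since the number of bytes DCBZ clears depends on the CPU and its mode, a copy routine could measure it once instead of assuming 32 bytes. A minimal sketch of that idea follows: fill an aligned window with 0xFF, zero one block, and count the zero bytes. `probe_zero_size`, `PROBE_MAX` and the `sim_*` stand-ins are hypothetical; `real_dcbz` is only compiled on PowerPC and assumes normal cacheable memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PROBE_MAX 512  /* assumed upper bound on the zeroed block size */

/* A function that zeroes the cache block containing addr. */
typedef void (*zero_block_fn)(unsigned char *addr);

#if defined(__powerpc__) || defined(__PPC__)
/* The real instruction; only valid on cacheable memory. */
void real_dcbz(unsigned char *addr)
{
    __asm__ volatile ("dcbz 0,%0" : : "r"(addr) : "memory");
}
#endif

/* Stand-ins for testing on other machines: 32-byte and 128-byte blocks. */
void sim_zero32(unsigned char *addr)
{
    memset((unsigned char *)((uintptr_t)addr & ~(uintptr_t)31), 0, 32);
}

void sim_zero128(unsigned char *addr)
{
    memset((unsigned char *)((uintptr_t)addr & ~(uintptr_t)127), 0, 128);
}

/* Return how many bytes one call to zero_block clears. */
size_t probe_zero_size(zero_block_fn zero_block)
{
    static unsigned char raw[2 * PROBE_MAX];
    /* Round up to a PROBE_MAX boundary so the zeroed block stays inside. */
    uintptr_t aligned = ((uintptr_t)raw + PROBE_MAX - 1)
                        & ~(uintptr_t)(PROBE_MAX - 1);
    unsigned char *win = (unsigned char *)aligned;
    size_t zeroed = 0;

    assert(zero_block != NULL);
    memset(win, 0xFF, PROBE_MAX);
    zero_block(win);
    for (size_t i = 0; i < PROBE_MAX; i++)
        zeroed += (win[i] == 0);
    return zeroed;
}
```

On a PowerPC build, `probe_zero_size(real_dcbz)` would be expected to report 32 or 128 depending on the CPU and mode described above. On other machines only the simulated variants are available.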

Cheers
gunnar


PostPosted: Sat Dec 22, 2007 2:37 pm 
Genesi

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1422
We ought to turn this sort of information into an AppNote. We can follow the Freescale format...


R&B :)

_________________
http://bbrv.blogspot.com


PostPosted: Thu Dec 27, 2007 10:21 am 

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
Quote:
There is plenty of free information available on how to write PPC code. But I've never seen good tips or knowledge about how to write fast code.

When I think about writing a memcpy, there are numerous "rules" that you need to follow to write fast code for PPC.
I have never seen them written down anywhere.

For example:
  • DCBZ is great for increasing performance on e300/G3/G4/G5, but it hurts performance badly on CELL. (I have searched a lot for an explanation of why it hurts on CELL but found the answer nowhere.)

    Update:

    I realized why DCBZ was not working as I expected on CELL.
    It is actually quite simple :-)

    The CELL has "a sort of" 256-byte-wide 2nd-level cache.
    If you unroll the work loop to process 256 bytes per iteration, use 2 x DCBZ and 2 x DCBT in the loop for clearing and prefetching, and have your SRC/DST pointers aligned on a 256-byte boundary, then the CPU starts to rock ...

    Quite a simple answer. :-)
    I hope this info is of help to someone.
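The loop shape described in this update might look roughly like the C sketch below. It is not the routine from the post: `copy_256`, `zero_line` and the self-test are illustrative names, the prefetch uses the GCC/Clang builtin `__builtin_prefetch` (which compilers typically lower to DCBT on PowerPC), the real DCBZ is only emitted when building for PowerPC, and a tuned implementation would more likely be written in assembly. Source and destination must not overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE  128  /* cache line size assumed for CELL */
#define BLOCK 256  /* bytes handled per loop iteration */

/* Zero one destination line so it need not be read from memory first. */
static inline void zero_line(unsigned char *p)
{
#if defined(__powerpc__) || defined(__PPC__)
    __asm__ volatile ("dcbz 0,%0" : : "r"(p) : "memory");
#else
    (void)p;  /* elsewhere: skip, the copy overwrites the line anyway */
#endif
}

/* Copy len bytes in 256-byte steps: 2 x prefetch, 2 x zero, then copy. */
void copy_256(unsigned char *dst, const unsigned char *src, size_t len)
{
    assert((uintptr_t)dst % BLOCK == 0 && (uintptr_t)src % BLOCK == 0);
    assert(len % BLOCK == 0);

    for (size_t off = 0; off < len; off += BLOCK) {
        if (off + BLOCK < len) {
            /* Prefetch both lines of the next source block. */
            __builtin_prefetch(src + off + BLOCK);
            __builtin_prefetch(src + off + BLOCK + LINE);
        }
        zero_line(dst + off);
        zero_line(dst + off + LINE);
        memcpy(dst + off, src + off, BLOCK);
    }
}

/* Returns 1 if a 1 KiB copy through copy_256 reproduces the source. */
int copy_256_selftest(void)
{
    static _Alignas(BLOCK) unsigned char src[1024], dst[1024];

    for (size_t i = 0; i < sizeof src; i++)
        src[i] = (unsigned char)(i * 7 + 1);
    copy_256(dst, src, sizeof src);
    return memcmp(dst, src, sizeof src) == 0;
}
```

The sketch only shows the structure (alignment checks, two prefetches and two line clears per 256-byte block). Whether it is actually faster on a given CPU has to be measured.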

    Cheers
    Gunnar


PowerDeveloper.org: Copyright © 2004-2012, Genesi USA, Inc. The Power Architecture and Power.org wordmarks and the Power and Power.org logos and related marks are trademarks and service marks licensed by Power.org.
All other names and trademarks used are property of their respective owners. Privacy Policy