Quote:
Quote:
- DCBZ is great for increasing performance on e300/G3/G4/G5 - but it will hurt performance badly on CELL. (I have searched a lot for an explanation of why it hurts on CELL but found the answer nowhere.)
dcbz on g5 working fine? Quite strange. dcbzl on G5 (and probably Cell) should work better.
UPDATE: I asked segher about that and he told me that dcbz clearing just 32 bytes instead of a whole line was an Apple workaround to let closed-source applications keep working; on Linux dcbz should clear a whole cache line... So, now I'm puzzled ^^;
lu
Segher is right.
To be bug-compatible with old PPC software, the 970 can operate in two modes. A bit in its config register selects either "normal" or "compatible" mode.
a) In normal mode, DCBZ clears the whole cache line (128 bytes). Linux operates in this mode.
b) Mac OS uses the compatibility mode, so that old programs which assume every CPU has 32-byte cache lines keep working.
All PowerPC CPUs used by Apple in the last 10 years (60x, 750, 74xx) had 32-byte cache lines.
So there is a fair amount of PPC-optimized software which makes the risky assumption that every CPU has a 32-byte cache line.
In compatibility mode, DCBZ clears only one cache line sector (32 bytes).
This keeps those old programs working.
The reason is this: an optimized copy routine like the one for the EFIKA was always written for PPC CPUs with a 32-byte cache line.
Such a copy clears the cache line with DCBZ and then copies 32 bytes into it.
If you run such a routine on a CPU with a 128-byte cache line, it will malfunction.
A typical 32-byte copy loop takes one iteration to copy a whole cache line on G2/G3/G4, but it takes four iterations to copy a whole line on G5.
- In the 1st iteration the DCBZ clears the 128-byte cache line and the copy sets the first 32 bytes.
- In the 2nd iteration the DCBZ clears the cache line (again all 128 bytes) and the copy then sets bytes 32-63.
The problem is that bytes 0-31 are now lost, as they were overwritten by the DCBZ again.
- In the 3rd iteration the DCBZ clears the cache line (again all 128 bytes) and the copy then sets bytes 64-95.
And again the problem is that bytes 0-63 are now lost, as they were overwritten by the last DCBZ.
- In the 4th and last iteration for that line, the DCBZ again clears the whole line (all 128 bytes) and the copy sets the last sector, bytes 96-127.
After the last byte of the cache line has been touched, the CPU can flush the cache line out to memory.
The problem, as we can clearly see, is that the first 3x32 bytes are set to zero (by DCBZ) and only the last 32 bytes of the cache line have the correct value.
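The four iterations above can be replayed in plain C. This is only a rough sketch: sim_dcbz, copy_with_dcbz and correct_bytes_after_copy are made-up names for illustration, the cache is modelled as an ordinary byte array, and the destination is assumed to start on a line boundary. It does not execute a real dcbz; it just zeroes the aligned block that the instruction would zero for a given block size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 128  /* one G5 cache line */
#define CHUNK      32   /* what the old copy loop believes a line is */

/* Hypothetical model of dcbz: zero the aligned block of zero_size bytes
 * that contains the byte at `offset`. */
void sim_dcbz(uint8_t *mem, size_t offset, size_t zero_size)
{
    size_t start = offset - (offset % zero_size);
    memset(mem + start, 0, zero_size);
}

/* Copy loop written for 32-byte lines: "dcbz" the destination,
 * then store 32 bytes into it. */
void copy_with_dcbz(uint8_t *dst, const uint8_t *src, size_t len,
                    size_t zero_size)
{
    assert(len % CHUNK == 0 && zero_size % CHUNK == 0);
    for (size_t off = 0; off < len; off += CHUNK) {
        sim_dcbz(dst, off, zero_size);
        memcpy(dst + off, src + off, CHUNK);
    }
}

/* Copy one 128-byte line and count how many destination bytes
 * ended up equal to the source. */
size_t correct_bytes_after_copy(size_t zero_size)
{
    uint8_t src[LINE_BYTES], dst[LINE_BYTES];
    size_t ok = 0;

    for (size_t i = 0; i < LINE_BYTES; i++)
        src[i] = (uint8_t)(i + 1);      /* 1..128, never zero */
    memset(dst, 0xFF, sizeof dst);      /* stale data in the line */

    copy_with_dcbz(dst, src, LINE_BYTES, zero_size);

    for (size_t i = 0; i < LINE_BYTES; i++)
        if (dst[i] == src[i])
            ok++;
    return ok;
}
```

Calling correct_bytes_after_copy(32) models the compatibility behavior and all 128 bytes survive. Calling correct_bytes_after_copy(128) models the whole-line behavior and only the last 32 bytes survive, as described above.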
On Linux the DCBZ operates with the original behavior of clearing the whole cache line.
This is no problem on Linux, as there is nearly no software optimized for PPC, and nearly all of the software is open source and could be updated quickly anyway.
On Mac OS there is software which uses DCBZ for speed optimization but incorrectly assumes the CPU has a 32-byte cache line.
I hope that I could explain it somehow. :-)
This DCBZ compatibility behavior and the extra DCBZL instruction are only available in the 970 CPU.
They were added specifically to be "bug-compatible" with Mac software.
Cheers
gunnar