PostPosted: Fri Jan 04, 2008 7:58 am
Joined: Thu Nov 18, 2004 11:48 am
Posts: 110
Quote:
Name one. Last I checked all the main distros ship 2.6 - SuSE 10.3, Fedora Core 8, Debian 4.0 is shipping with glibc 2.3.6 (argh!).
Gentoo
Quote:
Yeah we should all run out and spend 8 days compiling an Efika with all the apps and benchmarks we need just so we can test the performance characteristics of an improved memcpy :)
It took me one day to have the Efika self-build a base glibc system.
Using a cross-distcc setup (I set one up at my place at polito) it takes practically no effort and cuts the build time impressively.
Quote:
I think it'd be easier if we could ship patches against the current *distributions* and not the mainline code, so we can all see benefits this year and not in 2009.
I think it's saner to have both.


PostPosted: Fri Jan 04, 2008 10:22 am
Joined: Mon Aug 21, 2006 2:57 pm
Posts: 38
Location: Austin, TX, USA
Quote:
Name one. Last I checked all the main distros ship 2.6 - SuSE 10.3, Fedora Core 8, Debian 4.0 is shipping with glibc 2.3.6 (argh!).
Fedora 8 has glibc 2.7; what are you talking about?

Debian unstable seems to have glibc 2.7 (which is very nice to see, given that glibc still seems to lack a maintainer for PPC). I can't check SuSE since I don't have a machine with it installed.

It would definitely make sense to use the new frameworks to implement features like these.


PostPosted: Thu Mar 13, 2008 7:13 am
Joined: Thu Mar 13, 2008 3:17 am
Posts: 6
Hello,

This is a very interesting thread, and I was wondering how I could use it to speed up my own Cell project.

The speed of memcpy is obviously very much dependent on the alignment of source and dest, and the crucial parameter is the difference (source - dest). I made a few tests with the glibc memcpy from Yellow Dog Linux 5 and got these results for 20 MByte copies:

(a-b)%16 = 16/0: 2038.80 MBytes/s
(a-b)%16 = 12:   1772.10 MBytes/s
(a-b)%16 = 8:    1622.55 MBytes/s
(a-b)%16 = 4:    1983.63 MBytes/s
(a-b)%16 = 3:     308.11 MBytes/s
(a-b)%16 = 2:     307.59 MBytes/s
(a-b)%16 = 1:     308.15 MBytes/s

There are only four sweet spots (16, 12, 8, 4); all others perform badly. Of course, one can try to provide aligned data, but this is often impossible, e.g. with 3-byte RGB images.

Inspired by the solutions provided here I wrote a version using AltiVec and vec_perm, which is faster than the glibc version and works well for any alignment, i.e. even across odd addresses. This is the speed I get:

(a-b)%16 = 16/0: 2418.76 MBytes/s
(a-b)%16 = 8:    2163.13 MBytes/s
(a-b)%16 = 2:    2163.32 MBytes/s
(a-b)%16 = 1:    2184.64 MBytes/s

With the exception of 16-byte alignment, all others are equal.
I haven't seen the freevec or patched kernel versions, so my solution may be obsolete. You can download a copy from
<http://www.hs-furtwangen.de/~dersch/memcpy_cell.c>
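
Stripped down, the core of the trick is the classic lvsl/vec_perm realignment idiom. A minimal sketch (illustrative only, not the file above; it assumes the destination is 16-byte aligned and the length is a multiple of 16):
Code:
#include <altivec.h>
#include <stddef.h>

/* Sketch of the lvsl/vec_perm realignment idiom (NOT the posted
 * code).  dst must be 16-byte aligned, len a multiple of 16.  The
 * lookahead load may read up to 16 bytes past the end of src, which
 * is why the real version restricts its qword fetches to allowed
 * memory regions. */
static void vperm_copy(unsigned char *dst, const unsigned char *src,
                       size_t len)
{
    /* Permute control that shifts any src misalignment back to 0. */
    vector unsigned char perm = vec_lvsl(0, src);
    vector unsigned char prev = vec_ld(0, src);  /* aligned load */

    while (len >= 16) {
        vector unsigned char next = vec_ld(16, src);
        vec_st(vec_perm(prev, next, perm), 0, dst);
        prev = next;
        src += 16;
        dst += 16;
        len -= 16;
    }
}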
Btw: I do not get the figures cited earlier in this thread, e.g. a pure char copy never gets faster than 90 MBytes/s in my tests, and the memcpy_ppc32 version reaches ~700 MBytes/s.

Regards, and thanks for the many suggestions
on this board!

Helmut Dersch


PostPosted: Thu Mar 13, 2008 8:28 am
Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
Quote:
Hello,

This is a very interesting thread, and I was wondering how I could use it to speed up my own Cell project.

The speed of memcpy is obviously very much dependent on the alignment of source and dest
Hi Helmut,

Nice work!

On which platform did you test?
Did you use a PlayStation 3 or an IBM Cell blade?
Did you compare against 32-bit or 64-bit glibc?


Here are some benchmark results on a PlayStation 3 for a 16 MB copy at various alignments, comparing a 64-bit glibc 2.7 build with the same build plus the CELL patch:

Alignment 0-0
glibc memcpy 1645 MB/sec
CELL memcpy 5979 MB/sec

Alignment 7-0
glibc memcpy 831 MB/sec
CELL memcpy 2559 MB/sec

Alignment 17-11
glibc memcpy 832 MB/sec
CELL memcpy 2559 MB/sec

The CELL memcpy is scalar only and should by now be included in the glibc source. Please mind that the numbers given above are bus saturation: 5980 MB/sec means 2990 MB/sec read + 2990 MB/sec written.


The unaligned case is tricky to optimize: the usual scalar approach is to load aligned, shift and OR the pieces together, and store aligned again. Unfortunately, the appropriate shift instruction forms are microcoded on Cell.
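
In C, that shift-and-OR idea looks roughly like this (a sketch with made-up names, not the glibc code; it assumes big-endian PPC64, a destination that is already 8-byte aligned, a source that is off bytes past an 8-byte boundary, and a whole number of doublewords; the last iteration reads one aligned doubleword past the copied region):
Code:
#include <stdint.h>
#include <stddef.h>

static void merge_copy(uint64_t *dst, const uint64_t *src_aligned,
                       unsigned off /* 1..7 */, size_t ndwords)
{
    unsigned ls = 8 * off;   /* bits to shift left  */
    unsigned rs = 64 - ls;   /* bits to shift right */
    uint64_t prev = *src_aligned++;

    for (size_t i = 0; i < ndwords; i++) {
        uint64_t next = *src_aligned++;
        /* big-endian: the tail of prev supplies the leading bytes,
         * the head of next supplies the trailing bytes */
        *dst++ = (prev << ls) | (next >> rs);
        prev = next;
    }
}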


Helmut, your AltiVec solution can of course avoid this nicely. If you add cache prefetching to your routine you should be able to increase performance significantly.
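
For example, prefetching could be bolted onto a C copy loop along these lines (a sketch, not tuned code; 128 is the Cell PPE cache-line size, the 6-lines-ahead distance is what worked for me, and len is assumed to be a multiple of 128; dcbt is only a hint, so prefetching a little past the end of src is harmless):
Code:
#include <stddef.h>
#include <string.h>

#define LINE     128
#define PF_AHEAD 6

static void copy_prefetched(char *dst, const char *src, size_t len)
{
    for (size_t i = 0; i < len; i += LINE) {
        /* Prefetch the source line PF_AHEAD lines ahead. */
        __asm__ volatile ("dcbt 0,%0" : : "r" (src + i + PF_AHEAD * LINE));
        memcpy(dst + i, src + i, LINE);  /* stands in for the real kernel */
    }
}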


Cheers
Gunnar


PostPosted: Thu Mar 13, 2008 9:10 am
Joined: Tue Nov 02, 2004 2:11 am
Posts: 161

Helmut,

Below is the new code of the glibc scalar Cell memory-copy routine. The prefetching might be of help for you.

For optimal performance I would recommend prefetching the SRC 6 cache lines ahead and clearing the DST 4 cache lines ahead.
Code:
#define r0 0
#define r3 3
#define r4 4
#define r5 5
#define r6 6
#define r7 7
#define r8 8
#define r9 9
#define r10 10
#define r11 11
#define r12 12


#define PREFETCH_AHEAD 6
#define ZERO_AHEAD 4

/* memcpy routine optimized for CELL-BE-PPC v2.0
 *
 * The CELL PPC core has 1 integer unit and 1 load/store unit.
 * CELL:
 * 1st level data cache = 32K
 * 2nd level data cache = 512K
 * 3rd level data cache = 0K
 * At a 3.2 GHz clock rate the latency to the 2nd level cache is >36
 * clocks, the latency to memory is >400 clocks.
 * To improve copy performance we need to prefetch source data
 * far ahead to hide this latency.
 * For best performance, instruction forms ending in "." like "andi."
 * should be avoided, as they are implemented in microcode on CELL.
 * The code below is loop-unrolled for the CELL cache line of 128 bytes.
 */


#include <sysdep.h>
#include <bp-sym.h>
#include <bp-asm.h>


EALIGN (BP_SYM (memcpy), 5, 0)
CALL_MCOUNT 3

dcbt 0,r4 /* Prefetch ONE SRC cacheline */
cmpldi cr1,r5,16 /* is size < 16 ? */
mr r6,r3
blt+ cr1,.Lshortcopy

.Lbigcopy:
neg r8,r3 /* LS 4 bits = # bytes to 16-byte dest bdry */
clrldi r8,r8,64-4 /* align to 16-byte boundary */
sub r7,r4,r3
cmpldi cr0,r8,0
beq+ .Ldst_aligned

.Ldst_unaligned:
mtcrf 0x01,r8 /* put #bytes to boundary into cr7 */
subf r5,r8,r5

bf cr7*4+3,1f
lbzx r0,r7,r6 /* copy 1 byte */
stb r0,0(r6)
addi r6,r6,1
1: bf cr7*4+2,2f
lhzx r0,r7,r6 /* copy 2 byte */
sth r0,0(r6)
addi r6,r6,2
2: bf cr7*4+1,4f
lwzx r0,r7,r6 /* copy 4 byte */
stw r0,0(r6)
addi r6,r6,4
4: bf cr7*4+0,8f
ldx r0,r7,r6 /* copy 8 byte */
std r0,0(r6)
addi r6,r6,8
8:
add r4,r7,r6

.Ldst_aligned:

cmpdi cr5,r5,128-1

neg r7,r6
addi r6,r6,-8 /* prepare for stdu */
addi r4,r4,-8 /* prepare for ldu */

clrldi r7,r7,64-7 /* align to cacheline boundary */
ble+ cr5,.Llessthancacheline


cmpldi cr6,r7,0
subf r5,r7,r5
srdi r7,r7,4 /* divide size by 16 */
srdi r10,r5,7 /* number of cache lines to copy */


cmpldi r10,0
li r11,0 /* number cachelines to copy with prefetch */
beq .Lnocacheprefetch

cmpldi r10,PREFETCH_AHEAD
li r12,128+8 /* prefetch distance*/
ble .Llessthanmaxprefetch

subi r11,r10,PREFETCH_AHEAD
li r10,PREFETCH_AHEAD
.Llessthanmaxprefetch:

mtctr r10
.LprefetchSRC:
dcbt r12,r4
addi r12,r12,128
bdnz .LprefetchSRC
.Lnocacheprefetch:


mtctr r7
cmpldi cr1,r5,128
clrldi r5,r5,64-7

beq cr6,.Lcachelinealigned /* */
.Laligntocacheline:
ld r9,0x08(r4)
ldu r7,0x10(r4)
std r9,0x08(r6)
stdu r7,0x10(r6)
bdnz .Laligntocacheline


.Lcachelinealigned: /* copy whole cache lines */


blt- cr1,.Llessthancacheline /* size <128 */

.Louterloop:
cmpdi r11,0
mtctr r11
beq- .Lendloop

li r11,128*ZERO_AHEAD +8 /* DCBZ dist */

.align 4
/* Copy whole cachelines, optimized by prefetching SRC cacheline */
.Lloop: /* Copy aligned body */
dcbt r12,r4 /* PREFETCH SOURCE some cache lines ahead*/
ld r9, 0x08(r4)
dcbz r11,r6
ld r7, 0x10(r4) /* 4 register stride copy */
ld r8, 0x18(r4) /* 4 are optimal to hide 1st level cache latency */
ld r0, 0x20(r4)
std r9, 0x08(r6)
std r7, 0x10(r6)
std r8, 0x18(r6)
std r0, 0x20(r6)
ld r9, 0x28(r4)
ld r7, 0x30(r4)
ld r8, 0x38(r4)
ld r0, 0x40(r4)
std r9, 0x28(r6)
std r7, 0x30(r6)
std r8, 0x38(r6)
std r0, 0x40(r6)
ld r9, 0x48(r4)
ld r7, 0x50(r4)
ld r8, 0x58(r4)
ld r0, 0x60(r4)
std r9, 0x48(r6)
std r7, 0x50(r6)
std r8, 0x58(r6)
std r0, 0x60(r6)
ld r9, 0x68(r4)
ld r7, 0x70(r4)
ld r8, 0x78(r4)
ldu r0, 0x80(r4)
std r9, 0x68(r6)
std r7, 0x70(r6)
std r8, 0x78(r6)
stdu r0, 0x80(r6)

bdnz .Lloop
.Lendloop:


cmpdi r10,0
sldi r10,r10,2 /* adjust from 128 to 32 byte stride */
beq- .Lendloop2
mtctr r10
.Lloop2: /* Copy aligned body */
ld r9, 0x08(r4)
ld r7, 0x10(r4)
ld r8, 0x18(r4)
ldu r0, 0x20(r4)
std r9, 0x08(r6)
std r7, 0x10(r6)
std r8, 0x18(r6)
stdu r0, 0x20(r6)

bdnz .Lloop2

.Lendloop2:


.Llessthancacheline: /* less than a cache line to do? */
cmpldi cr0,r5,16
srdi r7,r5,4 /* divide size by 16 */
blt- .Ldo_lt16
mtctr r7
.Lcopy_remaining:
ld r8,0x08(r4)
ldu r7,0x10(r4)
std r8,0x08(r6)
stdu r7,0x10(r6)
bdnz .Lcopy_remaining


.Ldo_lt16: /* less than 16 ? */
cmpldi cr0,r5,0 /* copy remaining bytes (0-15) */
beqlr+ /* no rest to copy */
addi r4,r4,8
addi r6,r6,8
.Lshortcopy: /* SIMPLE COPY to handle size =< 15 bytes */
mtcrf 0x01,r5
sub r7,r4,r6
bf- cr7*4+0,8f
ldx r0,r7,r6 /* copy 8 byte */
std r0,0(r6)
addi r6,r6,8
8:
bf cr7*4+1,4f
lwzx r0,r7,r6 /* copy 4 byte */
stw r0,0(r6)
addi r6,r6,4
4:
bf cr7*4+2,2f
lhzx r0,r7,r6 /* copy 2 byte */
sth r0,0(r6)
addi r6,r6,2
2:
bf cr7*4+3,1f
lbzx r0,r7,r6 /* copy 1 byte */
stb r0,0(r6)
1: blr


PostPosted: Thu Mar 13, 2008 10:47 am
Joined: Thu Mar 13, 2008 3:17 am
Posts: 6
>On which platform did you test?
>Did you use a PlayStation 3 or an IBM Cell blade?
>Did you compare against 32-bit or 64-bit glibc?

Hi Gunnar,

I am using the PS3 with Yellow Dog Linux 5 and a 32-bit userspace. The program I posted is 32-bit only; the pointer casts have to be changed for a 64-bit environment.

Thanks for the additional code; I will try that
tomorrow and post the results here.

Regards

Helmut


PostPosted: Thu Mar 13, 2008 1:26 pm
Joined: Thu Mar 13, 2008 3:17 am
Posts: 6
>Please mind that the numbers given above are bus saturation:
>5980 MB/sec means 2990 MB/sec read + 2990 MB/sec written.

I was referring to memory moved, so my numbers have to be multiplied by 2. With this easy improvement :-) and your suggestion to use cache preloading (I'm inserting the line

asm volatile ("dcbt 0,%0" : : "r" (&s[48]));

where the 48 was optimized by trial and error), I am now getting 6200-6300 MBytes/sec, independent of alignment. I'll clean up the code and post it tomorrow.

Regards

Helmut Dersch


PostPosted: Fri Mar 14, 2008 3:26 am
Joined: Thu Mar 13, 2008 3:17 am
Posts: 6
I have uploaded a revised altivec-memcpy to
<http://www.hs-furtwangen.de/~dersch/memcpy_cell.c>

Changes:
- added cache preloading (thanks to Gunnar von Boehn)
- code for 32-bit and 64-bit address space
- restricted qword fetches to allowed memory regions
- bugfix for the aligned copy
- speed now given as read+write (i.e. twice the previous numbers)

Measuring speed is more difficult than I thought: I was simply copying a 20 MByte block repeatedly between two locations when I noticed that the speed depends on whether I had previously written to and initialized the source region. The difference is 6380 vs. 4340 MByte/s. The same happens with the standard memcpy (unaligned 618/476; aligned 3900/1430). This seems to be a gcc optimization artefact, since the difference disappears without optimization. I am afraid that even the lower numbers may somehow be affected. Anyway: a significant increase in speed for unaligned memcpy can be achieved.
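
To make the runs comparable I now fault in and initialize both buffers before timing, roughly like this (a sketch; note that on Linux, malloc'd memory that has never been written can be backed by copy-on-write zero pages, which reads differently from dirtied memory, so this may be part of the effect):
Code:
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#define SZ (20u << 20)  /* 20 MByte, as in the tests above */

int main(void)
{
    char *src = malloc(SZ);
    char *dst = malloc(SZ);
    if (!src || !dst)
        return 1;

    /* Dirty every page of both buffers before any measurement. */
    memset(src, 0xA5, SZ);
    memset(dst, 0x5A, SZ);

    /* ... time the memcpy variants here ... */
    puts("buffers initialized");

    free(src);
    free(dst);
    return 0;
}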

Regards

Helmut Dersch


PostPosted: Fri Mar 14, 2008 6:28 am
Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
Hi Helmut,

If you use DCBZ to clear the DST cache lines, the memory utilization on the bus will go down. A good working value is to DCBZ 4 cache lines ahead. If your memcpy copies 20 MB, it will move 60 MB over the bus without DCBZ; with DCBZ it only needs to move 40 MB to do the same work.

The reason is that the CPU of course has to read the SRC, but in addition, when the CPU copies even 1 byte from SRC to DST, it needs to burst in the DST cache line too. When you use DCBZ on the DST, the CPU knows that the cache line allocated for that area is the newest and will not read the DST prior to it being overwritten.

At ~6000 MB/sec your memcopy is very good. Your limit at the moment is the speed of the 2nd level cache; you will not be able to increase this using the CPU. With DCBZ the memcopy will not speed up, but bus saturation will go down, leaving more free memory bandwidth for the SPEs.
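
In C the shape of it is roughly this (a sketch, not the glibc routine; it assumes a 128-byte-aligned dst and a len that is a multiple of 128; dcbz zeroes a whole cache line, so it must never reach past the destination buffer):
Code:
#include <stddef.h>
#include <string.h>

#define LINE       128  /* Cell PPE cache-line size */
#define PF_AHEAD   6    /* dcbt distance, in lines */
#define ZERO_AHEAD 4    /* dcbz distance, in lines */

static void copy_dcbz(char *dst, const char *src, size_t len)
{
    size_t i = 0;

    /* Main body: prefetch SRC ahead, zero DST ahead, copy a line. */
    for (; i + ZERO_AHEAD * LINE < len; i += LINE) {
        __asm__ volatile ("dcbt 0,%0" : : "r" (src + i + PF_AHEAD * LINE));
        __asm__ volatile ("dcbz 0,%0" : : "r" (dst + i + ZERO_AHEAD * LINE));
        memcpy(dst + i, src + i, LINE);
    }
    /* Last ZERO_AHEAD lines: no dcbz, it would fall past the end. */
    for (; i < len; i += LINE)
        memcpy(dst + i, src + i, LINE);
}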

What is the amount of memory that your application normally needs to copy? If you have a free SPE, you can use that one for memcopy, because an SPE can copy memory much faster than the PPE. You should be able to get ~24 GB/sec.

If you want/need to do the copy with the PPE, then I would recommend disassembling the compiled copy and checking which load instructions the compiler uses. Copying big amounts of data will move all your current data and code out of the CPU cache. AltiVec has load instructions which limit the cache thrashing to one way of the cache. The CELL has an 8-way 2nd level cache, so if your code uses non-thrashing loads you will keep 7/8 of your cache.
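
These are the LRU-hinted loads/stores, lvxl/stvxl (vec_ldl/vec_stl as intrinsics). A sketch of a copy that uses them (illustration only; both pointers assumed 16-byte aligned, len a multiple of 16):
Code:
#include <altivec.h>
#include <stddef.h>

/* vec_ldl/vec_stl mark the touched lines least-recently-used, so a
 * large streaming copy recycles one way of the 8-way L2 instead of
 * evicting everything else. */
static void streaming_copy(unsigned char *dst, const unsigned char *src,
                           size_t len)
{
    for (size_t i = 0; i < len; i += 16)
        vec_stl(vec_ldl(0, src + i), 0, dst + i);
}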

Cheers
Gunnar


PostPosted: Fri Mar 14, 2008 11:40 am
Joined: Thu Mar 13, 2008 3:17 am
Posts: 6
Hello Gunnar,

I have implemented and posted the DCBZ variant, together with a few other changes, at the same address, and am now getting a consistent 5830 MBytes/sec, independent of alignment and previous memory accesses. Thanks a lot for this info, and for the additional suggestions about not thrashing the CPU cache, which I have not tried yet.

>What is the amount of memory that your application normally needs to copy?
>If you have a free SPE, you can use that one for memcopy.

We are working on a fast panorama stitcher, which assembles many (>300) images into gigapixel panoramas; the 6 SPUs are busy warping the input images, while the PPU is responsible for blending. This involves a lot of copying and pasting, so a fast memcpy will surely help. Typical block sizes are one pixel row (10 kB-200 kB). Btw: an already quite fast preview version can be downloaded from my website
<http://www.fh-furtwangen.de/~dersch/PTS ... .1a.tar.gz>

Thanks again for helping!

Regards

Helmut Dersch


PostPosted: Sat Mar 15, 2008 2:46 am
Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
Quote:
Hello Gunnar,

I have implemented and posted the DCBZ variant, together with a few other changes, at the same address, and am now getting a consistent 5830 MBytes/sec, independent of alignment and previous memory accesses.
Hi Helmut,

I see that you added the DCBZ so that it clears somewhat ahead:

asm volatile ("dcbz 0,%0" : : "r" (&d[16])); // clear dst cache

BTW, in my tests it showed that clearing one cache line ahead is too short for the relatively high latency of the CELL. I was used to clearing one cache line ahead on other PowerPC systems, but on CELL a distance of 1 line reduced performance instead of improving it. This confused me a lot! For some time I thought that DCBZ was counterproductive on CELL. After some more testing, I realized that, at least on my system (an IBM Cell blade), giving the CELL more time for the DCBZ (by using a longer distance) is good for performance.

I found that for the systems I use, a good distance is 4 lines ahead.

The further ahead we go with the DCBZ and DCBT, the more time we give the system for them. There is a disadvantage to going too far ahead, though: at some point the 8-way 2nd level cache will lose the prefetched lines, because we overwrite other lines that were either cleared or prefetched.

It also showed that putting the DCBZ after the first load can reduce this a little, as it ensures that one line is moved to the 1st level cache before the DCBZ is executed.

Your results may vary, but maybe you can check whether increasing the DCBZ distance improves your performance.

For my system it was optimal to prefetch 6 cache lines ahead (6*128 bytes) and to clear 4 cache lines ahead (4*128 bytes).

Of course your PlayStation has slightly different timings than the IBM blade. The IBM blade has faster memory, which increases memcopy performance (I get 7 GB/sec with the same scalar code that gets 6 GB/sec on the PS3). On the other hand, on an IBM blade you have a NUMA situation, where one CELL core sometimes wants to work in the memory of the other core. For that case, and for the cache-coherency protocol between the two CELL chips, it is important to CLR and PREFETCH far enough ahead.

Do you want to run your code later on blades too?

Cheers
Gunnar


PostPosted: Sat Mar 15, 2008 4:58 am
Joined: Thu Mar 13, 2008 3:17 am
Posts: 6
>Your results may vary, but maybe you can check whether
>increasing the DCBZ distance improves your performance.

The values I used are the optimum between 0 and 128 (for both dcbz and dcbt) on the PS3.
Btw: I had made a mistake and cleared the destination cache beyond the destination range; this is corrected now.
I also added a (rather trivial) altivec-memset (also used quite often in my program), which yields 6000 MBytes/sec compared to 820 MBytes/sec in glibc.
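
Stripped down, it is just this kind of loop (a sketch, not the posted file; it assumes a 16-byte-aligned dst and a length that is a multiple of 16; for the all-zero case, dcbz on whole cache lines would be faster still):
Code:
#include <altivec.h>
#include <stddef.h>

static void vec_memset16(unsigned char *dst, unsigned char c, size_t len)
{
    /* Splat the fill byte across a vector (GCC vector-literal
     * syntax), then store 16 bytes per iteration. */
    vector unsigned char v = { c, c, c, c, c, c, c, c,
                               c, c, c, c, c, c, c, c };

    for (size_t i = 0; i < len; i += 16)
        vec_st(v, 0, dst + i);
}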

This dcbz/dcbt trick is really neat, and I guess I can use it in a lot of other cases.

>Do you want to run your code later on blades too?

Good idea; I will check that.

Regards

Helmut


PostPosted: Fri Mar 21, 2008 6:18 pm
Genesi
Joined: Fri Sep 24, 2004 1:39 am
Posts: 1422
...how did it go?

_________________
http://bbrv.blogspot.com

