PostPosted: Fri Jan 04, 2008 7:58 am 
Joined: Thu Nov 18, 2004 11:48 am
Posts: 110
Quote:
Name one. Last I checked all the main distros ship 2.6 - SuSE 10.3, Fedora Core 8, Debian 4.0 is shipping with glibc 2.3.6 (argh!).
Gentoo
Quote:
Yeah we should all run out and spend 8 days compiling an Efika with all the apps and benchmarks we need just so we can test the performance characteristics of an improved memcpy :)
It took me one day to have the Efika self-build a base glibc system.
Using a cross-distcc setup (I set one up at my place at Polito), it takes pretty much no effort and cuts the build time impressively.
Quote:
I think it'd be easier if we could ship patches against the current *distributions* and not the mainline code, so we can all see benefits this year and not in 2009.
I think it's saner to have both.


PostPosted: Fri Jan 04, 2008 10:22 am 
Joined: Mon Aug 21, 2006 2:57 pm
Posts: 38
Location: Austin, TX, USA
Quote:
Name one. Last I checked all the main distros ship 2.6 - SuSE 10.3, Fedora Core 8, Debian 4.0 is shipping with glibc 2.3.6 (argh!).
Fedora 8 has glibc 2.7, what are you talking about?

Debian/unstable seems to have glibc 2.7 (which is very nice to see, given that it still seems to lack maintainership on PPC). I can't check with SuSE since I don't have an installed machine.

It would definitely make sense to use the new frameworks to do features like these.


PostPosted: Thu Mar 13, 2008 7:13 am 
Joined: Thu Mar 13, 2008 3:17 am
Posts: 6
Hello,

This is a very interesting thread, and I was wondering how I could use it to speed up my own Cell project.

The speed of memcpy obviously depends very much on the alignment of source and dest, and the crucial parameter is the difference (source-dest). I made a few tests with the glibc memcpy from Yellow Dog Linux 5 and got these results for 20MByte copies:

(a-b)%16 = 16/0: 2038.80 MBytes/s
(a-b)%16 = 12:   1772.10 MBytes/s
(a-b)%16 = 8:    1622.55 MBytes/s
(a-b)%16 = 4:    1983.63 MBytes/s
(a-b)%16 = 3:     308.11 MBytes/s
(a-b)%16 = 2:     307.59 MBytes/s
(a-b)%16 = 1:     308.15 MBytes/s

There are only four sweet spots (16, 12, 8, 4); all the others perform badly. Of course, one can try to provide aligned data, but this is often impossible, e.g. for 3-byte RGB images.
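
For reference, the measurement loop is essentially the following (a simplified sketch of my test, not the exact program; buffer handling and timing are condensed, and the SIZE/REPS names are just placeholders):

Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define SIZE (20 * 1024 * 1024)   /* 20 MBytes per copy */
#define REPS 16

/* Time REPS copies of SIZE bytes at a given src/dst misalignment and
   print the throughput in MBytes/s (counting copied bytes only). */
static void bench(unsigned char *dst, unsigned char *src, int offset)
{
    struct timeval t0, t1;
    memset(src + offset, 1, SIZE);          /* touch the source first */
    gettimeofday(&t0, NULL);
    for (int r = 0; r < REPS; r++)
        memcpy(dst, src + offset, SIZE);
    gettimeofday(&t1, NULL);
    double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;
    printf("(a-b)%%16 = %2d: %8.2f MBytes/s\n",
           offset, REPS * (double)SIZE / (1024.0 * 1024.0) / s);
}

int main(void)
{
    /* over-allocate so the source pointer can be shifted */
    unsigned char *src = malloc(SIZE + 64), *dst = malloc(SIZE + 64);
    for (int off = 0; off < 16; off++)
        bench(dst, src, off);
    free(src);
    free(dst);
    return 0;
}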

Inspired by the solutions provided here, I wrote a version using Altivec and vec_perm, which is faster than the glibc version and works well for any alignment, i.e. even across odd addresses. This is the speed I get:

(a-b)%16 = 16/0: 2418.76 MBytes/s
(a-b)%16 = 8:    2163.13 MBytes/s
(a-b)%16 = 2:    2163.32 MBytes/s
(a-b)%16 = 1:    2184.64 MBytes/s

With the exception of 16-byte alignment, all the others are equal.
I haven't seen the freevec or patched kernel versions, so my solution may be obsolete. You can download a copy from
<http://www.hs-furtwangen.de/~dersch/memcpy_cell.c>
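
In outline, the inner loop works like this (a simplified sketch of the idea; the downloadable file additionally handles edge cases, 64-bit pointers, and restricts the fetches):

Code:
#include <altivec.h>
#include <stddef.h>

/* Sketch of an alignment-agnostic copy: load 16-byte-aligned quadwords
   around the (possibly unaligned) source and merge them with vec_perm;
   stores go to an aligned destination.
   Assumes: dst is 16-byte aligned, n is a multiple of 16, and reading
   up to 16 bytes past the end of src is safe (the full version keeps
   its quadword fetches inside allowed memory regions). */
void copy_vec_perm(unsigned char *dst, const unsigned char *src, size_t n)
{
    vector unsigned char mask = vec_lvsl(0, src); /* permute mask from src alignment */
    vector unsigned char prev = vec_ld(0, src);   /* first aligned quadword */
    for (size_t i = 0; i < n; i += 16) {
        vector unsigned char next = vec_ld(i + 16, src);
        vec_st(vec_perm(prev, next, mask), i, dst);
        prev = next;
    }
}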
Btw: I do not get the figures cited earlier in this thread, e.g. a pure char copy never gets faster than 90MBytes/s in my tests, and the memcpy_ppc32 version is at ~700MBytes/s.

Regards, and thanks for the many suggestions
on this board!

Helmut Dersch


PostPosted: Thu Mar 13, 2008 8:28 am 
Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
Quote:
Hello,

This is a very interesting thread, and I was wondering how I could use it to speed up my own Cell project.

The speed of memcpy obviously depends very much on the alignment of source and dest [...]
Hi Helmut,

Nice work!

Which platform did you test on?
Did you use a PlayStation 3 or some IBM Cell blade?
Did you compare to 32-bit or 64-bit glibc?


Here are some benchmark results on PlayStation 3 for a 16MB copy at various alignments, comparing a glibc 2.7 64-bit build with the CELL patch for the same:

Alignment 0-0
glibc memcpy 1645 MB/sec
CELL memcpy 5979 MB/sec

Alignment 7-0
glibc memcpy 831 MB/sec
CELL memcpy 2559 MB/sec

Alignment 17-11
glibc memcpy 832 MB/sec
CELL memcpy 2559 MB/sec

The CELL memcpy is scalar only and should by now be included in the glibc source. Please mind that the numbers given above are bus saturation: 5980 MB/sec = 2990 MB read + 2990 MB written.


The unaligned case is tricky to optimize: the usual scalar approach is to load aligned, shift and OR, and then store aligned again. Unfortunately, the appropriate shift instruction forms are microcoded on Cell.


Helmut, your Altivec solution can of course avoid this nicely. If you add cache prefetching to your routine you should be able to increase performance significantly.
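
From C, such a prefetch hint can be issued with a bit of inline assembly, e.g. (a sketch only; the helper name and cache-line constant are mine, and the best distance has to be tuned by experiment):

Code:
#define CACHE_LINE 128   /* CELL cache line size in bytes */

/* Hint the CPU to fetch the source some cache lines ahead of the copy
   loop; on CELL the L2/memory latency is high, so the distance must
   be generous. */
static inline void prefetch_src(const unsigned char *p, int lines_ahead)
{
    __asm__ volatile ("dcbt 0,%0" : : "r" (p + lines_ahead * CACHE_LINE));
}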


Cheers
Gunnar


PostPosted: Thu Mar 13, 2008 9:10 am 
Joined: Tue Nov 02, 2004 2:11 am
Posts: 161

Helmut,

Below is the new code of the glibc scalar CELL memory copy routine. The prefetching might be of help for you.

For optimal performance I would recommend prefetching the SRC 6 cache lines ahead and clearing the DST 4 cache lines ahead.
Code:
#define r0 0
#define r3 3
#define r4 4
#define r5 5
#define r6 6
#define r7 7
#define r8 8
#define r9 9
#define r10 10
#define r11 11
#define r12 12


#define PREFETCH_AHEAD 6
#define ZERO_AHEAD 4

/* memcpy routine optimized for CELL-BE-PPC v2.0
*
* The CELL PPC core has 1 integer unit and 1 load/store unit
* CELL:
* 1st level data cache = 32K
* 2nd level data cache = 512K
* 3rd level data cache = 0K
* With a 3.2 GHz clock rate the latency to the 2nd level cache is >36 clocks,
* latency to memory is >400 clocks
* To improve copy performance we need to prefetch source data
* far ahead to hide this latency
* For best performance, instruction forms ending in "." like "andi."
* should be avoided as they are implemented in microcode on CELL.
* The code below is loop-unrolled for the CELL cache line of 128 bytes
*/


#include <sysdep.h>
#include <bp-sym.h>
#include <bp-asm.h>


EALIGN (BP_SYM (memcpy), 5, 0)
CALL_MCOUNT 3

dcbt 0,r4 /* Prefetch ONE SRC cacheline */
cmpldi cr1,r5,16 /* is size < 16 ? */
mr r6,r3
blt+ cr1,.Lshortcopy

.Lbigcopy:
neg r8,r3 /* LS 3 bits = # bytes to 8-byte dest bdry */
clrldi r8,r8,64-4 /* align to 16-byte boundary */
sub r7,r4,r3
cmpldi cr0,r8,0
beq+ .Ldst_aligned

.Ldst_unaligned:
mtcrf 0x01,r8 /* put #bytes to boundary into cr7 */
subf r5,r8,r5

bf cr7*4+3,1f
lbzx r0,r7,r6 /* copy 1 byte */
stb r0,0(r6)
addi r6,r6,1
1: bf cr7*4+2,2f
lhzx r0,r7,r6 /* copy 2 byte */
sth r0,0(r6)
addi r6,r6,2
2: bf cr7*4+1,4f
lwzx r0,r7,r6 /* copy 4 byte */
stw r0,0(r6)
addi r6,r6,4
4: bf cr7*4+0,8f
ldx r0,r7,r6 /* copy 8 byte */
std r0,0(r6)
addi r6,r6,8
8:
add r4,r7,r6

.Ldst_aligned:

cmpdi cr5,r5,128-1

neg r7,r6
addi r6,r6,-8 /* prepare for stdu */
addi r4,r4,-8 /* prepare for ldu */

clrldi r7,r7,64-7 /* align to cacheline boundary */
ble+ cr5,.Llessthancacheline


cmpldi cr6,r7,0
subf r5,r7,r5
srdi r7,r7,4 /* divide size by 16 */
srdi r10,r5,7 /* number of cache lines to copy */


cmpldi r10,0
li r11,0 /* number cachelines to copy with prefetch */
beq .Lnocacheprefetch

cmpldi r10,PREFETCH_AHEAD
li r12,128+8 /* prefetch distance*/
ble .Llessthanmaxprefetch

subi r11,r10,PREFETCH_AHEAD
li r10,PREFETCH_AHEAD
.Llessthanmaxprefetch:

mtctr r10
.LprefetchSRC:
dcbt r12,r4
addi r12,r12,128
bdnz .LprefetchSRC
.Lnocacheprefetch:


mtctr r7
cmpldi cr1,r5,128
clrldi r5,r5,64-7

beq cr6,.Lcachelinealigned /* */
.Laligntocacheline:
ld r9,0x08(r4)
ldu r7,0x10(r4)
std r9,0x08(r6)
stdu r7,0x10(r6)
bdnz .Laligntocacheline


.Lcachelinealigned: /* copy whole cache lines */


blt- cr1,.Llessthancacheline /* size <128 */

.Louterloop:
cmpdi r11,0
mtctr r11
beq- .Lendloop

li r11,128*ZERO_AHEAD +8 /* DCBZ dist */

.align 4
/* Copy whole cachelines, optimized by prefetching SRC cacheline */
.Lloop: /* Copy aligned body */
dcbt r12,r4 /* PREFETCH SOURCE some cache lines ahead*/
ld r9, 0x08(r4)
dcbz r11,r6
ld r7, 0x10(r4) /* 4 register stride copy */
ld r8, 0x18(r4) /* 4 are optimal to hide 1st level cache latency */
ld r0, 0x20(r4)
std r9, 0x08(r6)
std r7, 0x10(r6)
std r8, 0x18(r6)
std r0, 0x20(r6)
ld r9, 0x28(r4)
ld r7, 0x30(r4)
ld r8, 0x38(r4)
ld r0, 0x40(r4)
std r9, 0x28(r6)
std r7, 0x30(r6)
std r8, 0x38(r6)
std r0, 0x40(r6)
ld r9, 0x48(r4)
ld r7, 0x50(r4)
ld r8, 0x58(r4)
ld r0, 0x60(r4)
std r9, 0x48(r6)
std r7, 0x50(r6)
std r8, 0x58(r6)
std r0, 0x60(r6)
ld r9, 0x68(r4)
ld r7, 0x70(r4)
ld r8, 0x78(r4)
ldu r0, 0x80(r4)
std r9, 0x68(r6)
std r7, 0x70(r6)
std r8, 0x78(r6)
stdu r0, 0x80(r6)

bdnz .Lloop
.Lendloop:


cmpdi r10,0
sldi r10,r10,2 /* adjust from 128 to 32 byte stride */
beq- .Lendloop2
mtctr r10
.Lloop2: /* Copy aligned body */
ld r9, 0x08(r4)
ld r7, 0x10(r4)
ld r8, 0x18(r4)
ldu r0, 0x20(r4)
std r9, 0x08(r6)
std r7, 0x10(r6)
std r8, 0x18(r6)
stdu r0, 0x20(r6)

bdnz .Lloop2

.Lendloop2:


.Llessthancacheline: /* less than cache to do ? */
cmpldi cr0,r5,16
srdi r7,r5,4 /* divide size by 16 */
blt- .Ldo_lt16
mtctr r7
.Lcopy_remaining:
ld r8,0x08(r4)
ldu r7,0x10(r4)
std r8,0x08(r6)
stdu r7,0x10(r6)
bdnz .Lcopy_remaining


.Ldo_lt16: /* less than 16 ? */
cmpldi cr0,r5,0 /* copy remaining bytes (0-15) */
beqlr+ /* no rest to copy */
addi r4,r4,8
addi r6,r6,8
.Lshortcopy: /* SIMPLE COPY to handle size =< 15 bytes */
mtcrf 0x01,r5
sub r7,r4,r6
bf- cr7*4+0,8f
ldx r0,r7,r6 /* copy 8 byte */
std r0,0(r6)
addi r6,r6,8
8:
bf cr7*4+1,4f
lwzx r0,r7,r6 /* copy 4 byte */
stw r0,0(r6)
addi r6,r6,4
4:
bf cr7*4+2,2f
lhzx r0,r7,r6 /* copy 2 byte */
sth r0,0(r6)
addi r6,r6,2
2:
bf cr7*4+3,1f
lbzx r0,r7,r6 /* copy 1 byte */
stb r0,0(r6)
1: blr


PostPosted: Thu Mar 13, 2008 10:47 am 
Joined: Thu Mar 13, 2008 3:17 am
Posts: 6
> Which platform did you test on?
> Did you use a PlayStation 3 or some IBM Cell blade?
> Did you compare to 32-bit or 64-bit glibc?

Hi Gunnar,

I am using the PS3 with Yellow Dog Linux 5 and 32-bit userspace. The program which I posted is 32-bit only; the pointer casts have to be changed for a 64-bit environment.
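
The fix is simple enough: compute the alignment through uintptr_t instead of a fixed-width cast, roughly like this (a trivial sketch; the helper name is arbitrary):

Code:
#include <stdint.h>

/* Alignment test that works in both 32-bit and 64-bit userspace,
   instead of casting pointers to a fixed-width integer type. */
static inline int qword_offset(const void *p)
{
    return (int)((uintptr_t)p & 15);   /* offset within a 16-byte quadword */
}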

Thanks for the additional code; I will try that
tomorrow and post the results here.

Regards

Helmut


PostPosted: Thu Mar 13, 2008 1:26 pm 
Joined: Thu Mar 13, 2008 3:17 am
Posts: 6
> Please mind that the numbers given above are bus saturation.
> This means 5980 MB/sec = 2990 MB read + 2990 MB written.

I was referring to moved memory, so my numbers have to be multiplied by 2. With this easy improvement :-) and your suggestion to use cache preloading (I am inserting the line

asm volatile ("dcbt 0,%0" : : "r" (&s[48]));

where the 48 was optimized by trial and error), I am now getting 6200-6300 MBytes/sec, independent of alignment. I'll clean up the code and post it tomorrow.

Regards

Helmut Dersch


PostPosted: Fri Mar 14, 2008 3:26 am 
Joined: Thu Mar 13, 2008 3:17 am
Posts: 6
I have uploaded a revised Altivec memcpy to
<http://www.hs-furtwangen.de/~dersch/memcpy_cell.c>
Changes:
- added cache preloading (thanks to Gunnar von Boehn)
- code for 32-bit and 64-bit address space
- restricted qword fetches to allowed memory regions
- bugfix for the aligned copy
- speed now given in read+write (i.e. twice the previous numbers)

Measuring speed is more difficult than I thought: I was simply copying a 20MByte block repeatedly between two locations when I noticed that the speed depends on whether I previously wrote to and initialized the source region. The difference is 6380 vs. 4340 MByte/s. The same happens with the standard memcpy (unaligned 618/476; aligned 3900/1430). This seems to be a gcc optimization artefact, since the difference disappears without optimization. I am afraid that even the lower numbers may somehow be affected. Anyway: a significant increase in speed for unaligned memcpy can be achieved.

Regards

Helmut Dersch


PostPosted: Fri Mar 14, 2008 6:28 am 
Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
Hi Helmut,

If you use DCBZ to clear the DST cache line, the memory utilization on the bus will go down. A good working value is to DCBZ 4 cache lines ahead. If your memcpy copies 20MB, it will move 60MB over the bus without DCBZ; with DCBZ it only needs to move 40MB to do the same work (20MB read from SRC + 20MB read to fill the DST lines + 20MB written, and DCBZ eliminates the middle term).

The reason for this is that the CPU of course has to read the SRC, but in addition, when the CPU copies even 1 byte from SRC to DST, it has to burst in the whole DST cache line too. When you use DCBZ on the DST, the CPU knows that the cache line allocated for that area is the newest copy and will not read the DST prior to it being overwritten.

With ~6000 MB/sec your memcopy is very good. Your limit at the moment is the speed of the 2nd level cache; you will not be able to increase this using the CPU. If you use DCBZ, the memcopy will not speed up, but bus saturation will go down, leaving more free memory bandwidth for the SPEs.

What is the amount of memory that your application normally needs to copy? If you have a free SPE then you can use that one for memcopy, because the SPE can copy memory much faster than the PPC; you should be able to get ~24GB/sec.

If you want/need to do the copy with the PPC, then I would recommend looking at the compiled copy and checking which load instructions the compiler uses. If you copy big amounts of data, this will move all your current data and code out of the CPU cache. Altivec has load instructions which limit the cache thrashing to 1 way of the cache. The CELL has an 8-way 2nd level cache; if your code uses non-thrashing loads, you will keep 7/8 of your cache.
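
In C these are available as the vec_ldl/vec_stl intrinsics (the lvxl/stvxl "last use" forms). A minimal sketch of the idea, assuming aligned pointers and a size that is a multiple of 16:

Code:
#include <altivec.h>
#include <stddef.h>

/* Copy with LRU-hinted loads and stores: the touched cache lines are
   marked least-recently-used, so a big copy evicts at most one way of
   the 8-way L2 instead of thrashing the whole cache.
   Assumes: src and dst are 16-byte aligned, n is a multiple of 16. */
void copy_nonthrashing(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i += 16) {
        vector unsigned char v = vec_ldl(i, src);  /* lvxl: LRU-hinted load   */
        vec_stl(v, i, dst);                        /* stvxl: LRU-hinted store */
    }
}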

Cheers
Gunnar


PostPosted: Fri Mar 14, 2008 11:40 am 
Joined: Thu Mar 13, 2008 3:17 am
Posts: 6
Hello Gunnar,

I have implemented and posted the DCBZ variant together with a few other changes at the same address, and am now consistently getting 5830MBytes/sec independent of alignment and previous memory accesses. Thanks a lot for this info, and for the additional suggestions about preserving the CPU cache, which I have not tried yet.

> What is the amount of memory that your application normally needs to copy?
> If you have a free SPE then you can use that one for memcopy.

We are working on a fast panorama stitcher, which assembles many (>300) images into gigapixel panoramas; the 6 SPUs are busy warping the input images, while the PPU is responsible for blending. This involves a lot of copying and pasting, and a fast memcpy will surely help. Typical block sizes are one pixel row (10kB-200kB). Btw: an already quite fast preview version can be downloaded from my website
<http://www.fh-furtwangen.de/~dersch/PTS ... .1a.tar.gz>

Thanks again for helping!

Regards

Helmut Dersch


PostPosted: Sat Mar 15, 2008 2:46 am 
Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
Quote:
Hello Gunnar,

I have implemented and posted the DCBZ variant together with a few other changes at the same address, and am now consistently getting 5830MBytes/sec independent of alignment and previous memory accesses.
Hi Helmut,

I see that you added the DCBZ so that it clears somewhat ahead:

asm volatile ("dcbz 0,%0" : : "r" (&d[16])); // clear dst cache

BTW, in my tests it turned out that clearing one cache line ahead was too short for the relatively high latency of the CELL. I was used to clearing one cache line ahead on other PowerPC systems, but on CELL a distance of 1 line reduced performance instead of improving it. This confused me a lot! For some time I thought that DCBZ was counterproductive on CELL. After some more testing, I realized that at least on my system (IBM Cell blade), giving the CELL more time for the DCBZ (using a longer distance) is good for performance.

I found that for the systems I use, a good distance is 4 lines ahead.

The further ahead we go with the DCBZ and DCBT, the more time we give the system for them. But there is a disadvantage to going too far ahead: at some point the 8-way 2nd level cache will lose the prefetched lines, because we overwrite other lines that were either cleared or prefetched.

It also turned out that putting the DCBZ after the first load can reduce this a little, as it ensures that one line gets moved to the 1st level cache before the DCBZ is executed.

Your results may vary, but maybe you can check and test whether increasing the DCBZ distance improves your performance.

For my system it was optimal to prefetch 6 cache lines ahead (6*128 byte) and to clear 4 cache lines ahead (4*128 byte).
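
In C with inline assembly, the shape of one iteration of the aligned bulk loop is roughly this (a sketch only; the pointer names and the stand-in copy body are mine, and the distances are the values that worked for me):

Code:
#include <stddef.h>

#define CACHE_LINE     128  /* CELL cache line size in bytes */
#define PREFETCH_AHEAD 6    /* dcbt distance, in cache lines */
#define ZERO_AHEAD     4    /* dcbz distance, in cache lines */

/* One iteration of the aligned bulk loop: prefetch the source far
   ahead, establish the destination line with dcbz so it is never
   fetched from memory, then copy one 128-byte line.
   Assumes: d and s are 128-byte aligned, and d plus ZERO_AHEAD lines
   is still inside the destination buffer. */
static inline void copy_line(unsigned char *d, const unsigned char *s)
{
    __asm__ volatile ("dcbt 0,%0" : : "r" (s + PREFETCH_AHEAD * CACHE_LINE));
    __asm__ volatile ("dcbz 0,%0" : : "r" (d + ZERO_AHEAD * CACHE_LINE) : "memory");

    const unsigned long *src = (const unsigned long *)s;
    unsigned long *dst = (unsigned long *)d;
    for (size_t i = 0; i < CACHE_LINE / sizeof(unsigned long); i++)
        dst[i] = src[i];    /* stands in for the unrolled ld/std body */
}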

Of course your PlayStation has slightly different timings than the IBM blade. The IBM blade has faster memory, which increases memcopy performance (I get 7GB/sec with the same scalar code that gets 6GB/sec on the PS3). On the other hand, on the IBM blade you have a NUMA situation, where one CELL core sometimes wants to work in the memory of the other core. For this case, and for the cache coherency protocol of the two CELL chips, it is important to CLR and PREFETCH far enough ahead.

Do you want to run your code later on blades too?

Cheers
Gunnar


PostPosted: Sat Mar 15, 2008 4:58 am 
Joined: Thu Mar 13, 2008 3:17 am
Posts: 6
> Your results may vary, but maybe you can check and test
> whether increasing the DCBZ distance improves your performance.

The values I used are the optimum between 0 and 128 (both for dcbz and dcbt) on the PS3.
Btw: I made a mistake and cleared the dest cache beyond the destination range; this is corrected now.
I also added a (rather trivial) Altivec memset (also used quite often in my program), which yields 6000 MBytes/sec compared to 820 MBytes/sec in glibc.
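
The memset really is trivial; the aligned bulk is essentially just this (a sketch, assuming a 16-byte-aligned pointer and a multiple-of-16 size; the posted file handles the edges):

Code:
#include <altivec.h>
#include <stddef.h>

/* Splat the fill byte into a vector once, then store 16 bytes per
   iteration.  Assumes: p is 16-byte aligned, n is a multiple of 16. */
void memset_vec(unsigned char *p, unsigned char c, size_t n)
{
    unsigned char tmp[16] __attribute__((aligned(16)));
    for (int i = 0; i < 16; i++)
        tmp[i] = c;
    vector unsigned char v = vec_ld(0, tmp);

    for (size_t i = 0; i < n; i += 16)
        vec_st(v, i, p);
}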

This dcbz/dcbt trick is really neat, and I guess I can use it in a lot of other cases.

>Do you want to run your code later on blades too?

Good idea; I will check that.

Regards

Helmut


PostPosted: Fri Mar 21, 2008 6:18 pm 
Genesi

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1422
...how did it go?

_________________
http://bbrv.blogspot.com

