PostPosted: Wed Nov 14, 2007 4:43 am 
Offline

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
Hello,

Almost all Linux applications are developed in C or C++. People often believe that the C compiler is good enough to guarantee good performance. Unfortunately this is not the case; especially on PowerPC, manual optimization can make a huge difference.

Here is an example of memcpy on PowerPC...

a) Normal C routine working on Byte

150 MB/sec

b) Normal C routine working on Long (32bit)

800 MB/sec

c) Normal C routine working on quad (64bit)

1000 MB/sec

** This is the best performance that you can achieve by algorithm design alone, using the C language **

d) Normal C routine working on quad (64bit), with two ASM cache instructions added

1380 MB/sec

e) ASM routine better optimized for this PPC architecture

2750 MB/sec

From 150 MB/sec to 2750 MB/sec is quite a difference.

As you can see, by using optimized code you can achieve nearly 20 times better performance!
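
For illustration, variants a) and c) might look roughly like this in C (a minimal sketch; the exact routines behind the numbers above were not posted):

Code:
#include <stddef.h>
#include <stdint.h>

/* a) byte-wise copy */
void copy_bytes(void *dst, const void *src, size_t n)
{
    uint8_t *d = dst;
    const uint8_t *s = src;
    while (n--)
        *d++ = *s++;
}

/* c) 64-bit copy; assumes n is a multiple of 8 and both
   pointers are 8-byte aligned */
void copy_u64(void *dst, const void *src, size_t n)
{
    uint64_t *d = dst;
    const uint64_t *s = src;
    size_t i;
    for (i = 0; i < n / 8; i++)
        d[i] = s[i];
}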

Gunnar


PostPosted: Wed Nov 14, 2007 7:04 am 
Offline

Joined: Tue Feb 14, 2006 2:01 pm
Posts: 75
Location: Germany
could you please provide the code? I'm interested in this.


PostPosted: Wed Nov 14, 2007 8:49 am 
Offline

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348
FWIW, AltiVec could do even more! Here are the results of the latest memcpy in freevec:
Code:
$ ./bench_memcpy -s -v -t 1 -l 1000000 -M 20000
Test: memcpy
Will do scalar (glibc) tests
Will do vector (AltiVec/VMX) tests
Number of threads : 1
Number of loops : 1000000
Minimum buffer size : 10
Maximum buffer size : 20000

Scalar memcpy()

title = memcpy_scalar, minsize = 10, maxsize = 20000
process = 1/1, size = 10, arraysize = 629145
1/1 10 bytes 0.11 sec 86.70 MB/s 0/0
process = 1/1, size = 20, arraysize = 314572
1/1 20 bytes 0.12 sec 158.95 MB/s 0/0
process = 1/1, size = 40, arraysize = 157286
1/1 40 bytes 0.14 sec 272.48 MB/s 0/0
process = 1/1, size = 80, arraysize = 78643
1/1 80 bytes 0.17 sec 448.79 MB/s 0/0
process = 1/1, size = 160, arraysize = 39321
1/1 160 bytes 0.23 sec 663.43 MB/s 0/0
process = 1/1, size = 320, arraysize = 19660
1/1 320 bytes 0.35 sec 871.93 MB/s 0/0
process = 1/1, size = 640, arraysize = 9830
1/1 640 bytes 0.59 sec 1034.49 MB/s 0/0
process = 1/1, size = 1280, arraysize = 4915
1/1 1280 bytes 1.08 sec 1130.28 MB/s 0/0
process = 1/1, size = 2560, arraysize = 2457
1/1 2560 bytes 2.04 sec 1196.77 MB/s 0/0
process = 1/1, size = 5120, arraysize = 1228
1/1 5120 bytes 3.98 sec 1226.84 MB/s 0/0
process = 1/1, size = 10240, arraysize = 614
1/1 10240 bytes 7.84 sec 1245.62 MB/s 0/0

AltiVec memcpy()

title = memcpy_vector, minsize = 10, maxsize = 20000
process = 1/1, size = 10, arraysize = 629145
1/1 10 bytes 0.07 sec 136.24 MB/s 0/0
process = 1/1, size = 20, arraysize = 314572
1/1 20 bytes 0.11 sec 173.40 MB/s 0/0
process = 1/1, size = 40, arraysize = 157286
1/1 40 bytes 0.12 sec 317.89 MB/s 0/0
process = 1/1, size = 80, arraysize = 78643
1/1 80 bytes 0.11 sec 693.58 MB/s 0/0
process = 1/1, size = 160, arraysize = 39321
1/1 160 bytes 0.12 sec 1271.57 MB/s 0/0
process = 1/1, size = 320, arraysize = 19660
1/1 320 bytes 0.14 sec 2179.83 MB/s 0/0
process = 1/1, size = 640, arraysize = 9830
1/1 640 bytes 0.20 sec 3051.76 MB/s 0/0
process = 1/1, size = 1280, arraysize = 4915
1/1 1280 bytes 0.33 sec 3699.10 MB/s 0/0
process = 1/1, size = 2560, arraysize = 2457
1/1 2560 bytes 0.56 sec 4359.65 MB/s 0/0
process = 1/1, size = 5120, arraysize = 1228
1/1 5120 bytes 1.05 sec 4650.30 MB/s 0/0
process = 1/1, size = 10240, arraysize = 614
1/1 10240 bytes 2.02 sec 4834.47 MB/s 0/0
The code of the new library will be released RSN (real soon now), as I'm working on the final 1.0 version. I had to fix all the bugs so that it can be used at least as an LD_PRELOAD replacement until it gets into glibc proper.


PostPosted: Wed Nov 14, 2007 9:17 am 
Offline

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
Quote:
FWIW, AltiVec could do even more! Here are the results of the latest memcpy in freevec:
I'm not sure if AltiVec will make a noticeable difference in my case. BTW, my numbers refer to an array size of 16 MB. A 16 MB array means I'm really measuring memory speed, not cache speed.

But I agree with you that using AltiVec has advantages.
E.g. by using LVXL load instructions you can minimize cache thrashing, which is good of course.

The catch is to know how to use the cache lines on this CPU.
My big speed example of getting close to 3.0 GB/sec was achieved by alternately streaming two cache lines.

This is needed to use the memory interface better.
By interleaving more cache lines you can get even higher results up to 6 GB/sec on this PPC-CPU.

The point is that GCC on its own will never do this automatically.

Cheers
Gunnar


PostPosted: Wed Nov 14, 2007 9:37 am 
Offline

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348
Quote:
I'm not sure if AltiVec will make a noticeable difference in my case. BTW, my numbers refer to an array size of 16 MB. A 16 MB array means I'm really measuring memory speed, not cache speed.
I tried running memcpy on 16 MB buffers; AltiVec shows no advantage for such big sizes, even with prefetching. Basically the bus is saturated all the time.
Quote:
The catch is to know how to use the cache lines on this CPU.
My big speed example of getting close to 3.0 GB/sec was achieved by alternately streaming two cache lines.
What kind of CPU is this? I'd be interested to know what tricks you use to achieve this performance.
Quote:
This is needed to use the memory interface better.
By interleaving more cache lines you can get even higher results up to 6 GB/sec on this PPC-CPU.
Also, what kind of benchmark do you do? Do you copy a single 16MB chunk many times and measure the throughput? Or small parts of this 16MB buffer at a time?

Konstantinos


PostPosted: Wed Nov 14, 2007 9:45 am 
Offline

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
Quote:
Quote:
I'm not sure if AltiVec will make a noticeable difference in my case. BTW, my numbers refer to an array size of 16 MB. A 16 MB array means I'm really measuring memory speed, not cache speed.
I tried running memcpy on 16 MB buffers; AltiVec shows no advantage for such big sizes, even with prefetching. Basically the bus is saturated all the time.
Yes, I agree.
I just tested AltiVec and got only a few percent improvement.
Quote:
Quote:
The catch is to know how to use the cache lines on this CPU.
My big speed example of getting close to 3.0 GB/sec was achieved by alternately streaming two cache lines.
What kind of CPU is this? I'd be interested to know what tricks you use to achieve this performance.
Cell
Quote:
Quote:
This is needed to use the memory interface better.
By interleaving more cache lines you can get even higher results up to 6 GB/sec on this PPC-CPU.
Also, what kind of benchmark do you do? Do you copy a single 16MB chunk many times and measure the throughput? Or small parts of this 16MB buffer at a time?

Konstantinos
I copy 16 MB once and measure the time using the DEC register.
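
For reference, a minimal timing sketch (not the exact harness used for the numbers above). The DEC register is normally only readable in supervisor mode; from user space the time base (mftbu/mftb) ticks at the same rate and can be read instead:

Code:
#include <stdint.h>

/* Read the 64-bit time base on 32-bit PowerPC; retry if the upper
   half rolled over between the two reads. */
static inline uint64_t read_timebase(void)
{
    uint32_t hi, lo, tmp;
    do {
        asm volatile ("mftbu %0" : "=r" (hi));
        asm volatile ("mftb  %0" : "=r" (lo));
        asm volatile ("mftbu %0" : "=r" (tmp));
    } while (hi != tmp);
    return ((uint64_t)hi << 32) | lo;
}

/* MB/s = (bytes / 1e6) / ((ticks_after - ticks_before) / timebase_hz) */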


PostPosted: Wed Nov 14, 2007 10:34 am 
Offline

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
Quote:
could you please provide the code? I'm interested in this.
Okay, here is a good memcpy for 32-bit PPC cores (G2/G3/G4) with 32-byte cache lines:
Code:
// To achieve optimal memcpy performance several things need to be considered:
//
// a) Source data should be prefetched to avoid memory latency bubbles.
//    The routine below uses the "dcbt" instruction for this.
// b) To improve write speed it is advisable to align the destination properly.
//    The destination is aligned to a 32-bit boundary first, and for bigger
//    copies it is aligned to a 32-byte (cache line) boundary.
//
// The copy routine first aligns the destination to 16 bit,
// then aligns the destination to 32 bit.
// For copies over a certain size (>= 256 bytes) the destination is
// aligned to a whole cache line of 32 bytes.
// After the fast cache-line copy block the remaining part is copied
// using 32-bit copies and 8-bit copies.

#include <stddef.h>
#include <stdint.h>

typedef uint8_t  uint8;     // fixed-width integer types used below
typedef uint16_t uint16;
typedef uint32_t uint32;

// Note: the pointer arithmetic on void* below relies on the GCC
// extension that treats sizeof(void) as 1.

void *memcpy_ppc32(void *dst, const void *src, size_t size){
    void *ret = dst;                            // memcpy() returns the original destination
    uint32 i;
    uint32 *src32, *dst32;

    if(size<4) goto memcpy_less4;               // tiny copy? no need to align

    if( (uint32)dst & 1) {                      // align destination to 16 bit
        *((uint8*)dst++) = *((uint8*)src++);
        size--;
    }
    if( (uint32)dst & 2) {                      // align destination to 32 bit
        *((uint16*)dst) = *((uint16*)src);
        src+=2;
        dst+=2;
        size -= 2;
    }

    if(size>=256){                              // use the cache-line, prefetching routine for sizes >= 256
        while( (uint32)dst & 31) {              // align dest to 256 bit (for the 32-byte cache line)
            *((uint32*)dst) = *((uint32*)src);
            src+=4;
            dst+=4;
            size -= 4;
        }

        src32 = (uint32 *)src;
        dst32 = (uint32 *)dst;

        for (i=size/(8*sizeof(uint32));i;i--) { // now well aligned: copy one cache line (32 bytes) per iteration
            // establish the current destination line without fetching it from memory;
            // zeroing the current line is safe because the loop writes all 32 bytes of it
            asm volatile ("dcbz 0,%0" : : "r" (dst32) : "memory");
            // prefetch the source two cache lines ahead
            asm volatile ("dcbt 0,%0" : : "r" (&src32[16]));

            dst32[0] = src32[0];
            dst32[1] = src32[1];
            dst32[2] = src32[2];
            dst32[3] = src32[3];
            dst32[4] = src32[4];
            dst32[5] = src32[5];
            dst32[6] = src32[6];
            dst32[7] = src32[7];

            src32+=8;                           // 8 x 32-bit words
            dst32+=8;
        }
        size &= 8*sizeof(uint32)-1;             // bytes left over after the cache-line loop
        src = (uint8 *)src32;
        dst = (uint8 *)dst32;
    }

    for (i=size/sizeof(uint32);i;i--) {         // copy all remaining 32-bit words
        *((uint32*)dst) = *((uint32*)src);
        src+=4;
        dst+=4;
    }
    size &= sizeof(uint32)-1;
memcpy_less4:
    while (size--) {                            // copy all remaining bytes (at most 3)
        *((uint8*)dst++) = *((uint8*)src++);
    }
    return ret;                                 // original destination pointer, like memcpy()
}

The G5/970, POWER4/5/6 and Cell have 128-byte cache lines.
For good performance on these CPUs the copy needs to be changed to copy 128 bytes in one go.

The "dcbz" instruction helps the G2 and G3 CPUs a lot.
For the G5/Cell you should remove it to get full speed.

Depending on the memory controller of your CPU you sometimes can get a lot more speed if your loop uses two cache lines per iteration.
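
A rough sketch of such an interleaved inner loop for a CPU with 32-byte cache lines (hypothetical code, not the exact routine behind the numbers above):

Code:
#include <stddef.h>
#include <stdint.h>

/* Copies two 32-byte cache lines (64 bytes) per iteration so that two
   destination lines are established and two source prefetches are in
   flight at once. Assumes dst and src are 32-byte aligned and size is
   a multiple of 64. */
void copy_two_lines(void *dst, const void *src, size_t size)
{
    uint32_t *d = dst;
    const uint32_t *s = src;
    size_t i;

    for (i = size / 64; i; i--) {
        asm volatile ("dcbz 0,%0" : : "r" (d)     : "memory"); /* establish dst line 0 */
        asm volatile ("dcbz 0,%0" : : "r" (&d[8]) : "memory"); /* establish dst line 1 */
        asm volatile ("dcbt 0,%0" : : "r" (&s[16]));           /* prefetch src 2 lines ahead */
        asm volatile ("dcbt 0,%0" : : "r" (&s[24]));           /* prefetch src 3 lines ahead */

        d[0]  = s[0];   d[1]  = s[1];   d[2]  = s[2];   d[3]  = s[3];
        d[4]  = s[4];   d[5]  = s[5];   d[6]  = s[6];   d[7]  = s[7];
        d[8]  = s[8];   d[9]  = s[9];   d[10] = s[10];  d[11] = s[11];
        d[12] = s[12];  d[13] = s[13];  d[14] = s[14];  d[15] = s[15];

        s += 16;
        d += 16;
    }
}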

The PA Semi CPU has a cache line of 64 bytes.
For this one you need to adapt your loop size accordingly.

If you use AltiVec instructions inside the big cache-line copy loop then you can make use of the LVXL instruction, which can read memory into the CPU without thrashing your 2nd-level cache as much. This is sometimes very useful.

The AltiVec code would look like this; it copies two cache lines (64 bytes) per iteration:

" lvxl %%v1,0,%src    \n"
" lvxl %%v2,i16,%src  \n"
" lvxl %%v3,i32,%src  \n"
" lvxl %%v4,i48,%src  \n"
" addi %src,%src,64   \n"
" stvx %%v1,0,%dst    \n"
" stvx %%v2,i16,%dst  \n"
" stvx %%v3,i32,%dst  \n"
" stvx %%v4,i48,%dst  \n"
" addi %dst,%dst,64   \n"

%src and %dst stand for the source and destination pointer registers; i16, i32 and i48 are placeholders for registers holding the constants 16, 32 and 48.
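
The same idea can also be written with AltiVec C intrinsics instead of raw inline asm. Below is a minimal sketch (the helper name and alignment assumptions are mine, not the freevec code); it compiles to lvxl/stvx and needs -maltivec:

Code:
#include <stddef.h>
#include <altivec.h>

/* Copies two 32-byte cache lines (64 bytes) per iteration using LRU-hinted
   loads. Assumes src and dst are 16-byte aligned and size is a multiple of 64. */
void copy_altivec64(void *dst, const void *src, size_t size)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t i;

    for (i = size / 64; i; i--) {
        vector unsigned char v1 = vec_ldl( 0, s);   /* lvxl: load and mark line LRU */
        vector unsigned char v2 = vec_ldl(16, s);
        vector unsigned char v3 = vec_ldl(32, s);
        vector unsigned char v4 = vec_ldl(48, s);
        vec_st(v1,  0, d);                          /* stvx */
        vec_st(v2, 16, d);
        vec_st(v3, 32, d);
        vec_st(v4, 48, d);
        s += 64;
        d += 64;
    }
}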

Please ask if you have more questions.

Cheers
Gunnar


Last edited by gunnar on Mon Dec 10, 2007 10:26 am, edited 1 time in total.

PostPosted: Thu Dec 06, 2007 7:27 am 
Offline

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161

Quick update:

I have looked at the Linux kernel code a bit.
It's not difficult to improve its performance on PPC.
The Linux kernel has a copy function which is used to copy between kernel and user space.
As this function copies a lot of data, its performance has a direct influence on network and filesystem performance.

Improving its speed was actually easy, as you can see:

[bar charts comparing the original and improved kernel copy routine]

Especially on the Cell, improving this function results in noticeable performance improvements for the whole Linux system.

I'll do some more testing and then publish the patch soon.

Cheers
Gunnar


PostPosted: Fri Dec 07, 2007 7:29 am 
Offline
Site Admin

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1589
Location: Austin, TX
Quote:
I'll do some more testing and then publish the patch soon.
Neat! A couple curious questions..

How does this affect CPUs with a 32-byte cache line like the G3 or G4? Does the unrolling to 128-byte chunks affect them in a negative or positive way?

Not technically about the optimization, but can the patch be rolled into a single function (which has dcbz for G2/G3/G4 and not for G5/Cell etc.) using the CPU-feature stuff in the kernel, which would improve performance across the board? (For SMP etc. and AltiVec context switches on a single-processor G3, the functions are turned into no-ops dynamically on startup by looking for magic "markers" in the code.)

I am impressed by how much that C-language routine speeds stuff up; would it benefit at all from more assembly (i.e. don't trust the compiler, even if the code is simple)? Do you have the output of the GCC-generated code (or, I'm sure, you might have access to one created by the IBM XL compiler :)

I am also surprised that the already-optimized memcpy in the arch/powerpc kernel branch is slower? Am I thinking of the wrong portion of code here?

I'd also love to see the benefits of this on a lowly processor like the e300 core or a Pegasos G3 (I or II) just to see how the improvements take. While optimizing for Cell and G5 is all well and good, if they have a negative impact on slower processors it may delay acceptance of the concept and the patch. We certainly do not want our Efikas to become slower for basic operations..

I know both you and Konstantinos did a lot of great work on this stuff, taking advantage of AltiVec and even explicitly optimizing just to take advantage of PowerPC in general. I remember you both had problems with the GNU libc guys; is there a chance these optimizations will be rolled into the "glibc ports" library as was intended? There is already support for the IBM 44x chips and their strangenesses, but no ready optimizations which just improve performance for further ranges of chips even though they are eminently possible..

_________________
Matt Sealey


PostPosted: Fri Dec 07, 2007 7:47 am 
Offline
Genesi

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1422
Hi Gunnar, that is a great effort. If you can take a look at this core:

IBM PowerPC 460 cores

Tundra recently licensed that core among others -- Power Up!.

The PowerPC market environment could change dramatically once we see others with licenses bringing these chips on the market.

Keep up the good work!

R&B :)

_________________
http://bbrv.blogspot.com


PostPosted: Fri Dec 07, 2007 10:23 am 
Offline

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
Quote:
Quote:
I'll do some more testing and then publish the patch soon.
Neat! A couple curious questions..

How does this affect CPUs with a 32-byte cache line like the G3 or G4? Does the unrolling to 128-byte chunks affect them in a negative or positive way?
Yes, the 128-byte routine is no good for the G3/G4.

Each CPU group (G2/G3/G4, PA Semi, 970, Cell, POWER) should have its own routine.

How best to do this depends on the situation; I'll try to explain.
We have two cases which we can optimize:

A) Linux
B) glibc

Linux is easy.
We can write one optimized copy for each architecture (G2/G3/G4/G5/Cell/POWER ...).
The Linux copy function would be changed to be only a jump to the real routine; we can set this jump at Linux boot-up to point to the optimal version. (Sounds like AmigaOS, doesn't it?)


The copy function in glibc could be solved just the same way: either with a separate libc for each sub-architecture, or with one lib using a pointer scheme just like the one explained above for Linux.

I see a small challenge in using the optimal function for statically compiled programs.
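
A minimal sketch of the pointer scheme described above (the names are hypothetical; the real kernel and glibc mechanisms differ in detail):

Code:
#include <stddef.h>
#include <string.h>

/* The tuned routine for 32-byte cache line cores, as posted earlier. */
void *memcpy_ppc32(void *dst, const void *src, size_t size);

/* Fallback for CPUs without a tuned routine. */
static void *memcpy_generic(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* All callers go through this pointer; the cost is one indirect branch. */
void *(*memcpy_best)(void *, const void *, size_t) = memcpy_generic;

/* Run once at startup, after the CPU has been identified
   (the detection itself is not shown). */
void memcpy_select(unsigned int cache_line_size)
{
    if (cache_line_size == 32)          /* G2/G3/G4 class cores */
        memcpy_best = memcpy_ppc32;
    /* 64-byte (PA Semi) and 128-byte (970/Cell/POWER) variants would go here. */
}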

Quote:
I am impressed by how much that C-language routine speeds stuff up; would it benefit at all from more assembly (i.e. don't trust the compiler, even if the code is simple)? Do you have the output of the GCC-generated code (or, I'm sure, you might have access to one created by the IBM XL compiler :)
Yes, writing the function in assembly will be better.
By using assembly you can use some optimization hacks to save some compares, and you can use the optimal number of registers in the loop for each CPU. (Each CPU has a slightly different load/store engine and a different number of rename registers, so the optimal copy looks slightly different.)

The above-mentioned tricks give you some improvement over the simple C code.

Quote:
I am also surprised that the already-optimized memcpy in the arch/powerpc kernel branch is slower? Am I thinking of the wrong portion of code here?
The Linux copy function that I measured is the often-used function that copies to/from userland. There are other functions for copies inside kernel land too.
My explanation for this is that the glibc function and the kernel were mainly optimized by IBM developers with a focus on the big POWER CPUs.

The hardware memory prefetching feature of the POWER4 probably uses more transistors than a whole G3 CPU has got. :-)

I feel there is still some room for performance improvement for Efika/G3/G4/Cell systems, both in glibc and in the Linux kernel.

Quote:
I'd also love to see the benefits of this on a lowly processor like the e300 core or a Pegasos G3 (I or II) just to see how the improvements take. While optimizing for Cell and G5 is all well and good, if they have a negative impact on slower processors it may delay acceptance of the concept and the patch. We certainly do not want our Efikas to become slower for basic operations..
I think improving the memcpy function by 20%-30% for Efika/Pegasos is no problem.
Quote:
I know both you and Konstantinos did a lot of great work on this stuff, ...
Thanks, but actually I did nothing clever here. :-)

My code is only a reuse of what was probably already in the old Amiga PowerUp libs.

I think in Linux land people generally don't put as much energy into performance optimization as on the Amiga. IBM puts energy into optimizing for their POWER4 or POWER6 CPUs. But the fact that other PPC CPUs might need different routines was, I think, overlooked.


Quote:
I remember you both had problems with the GNU libc guys; is there a chance these optimizations will be rolled into the "glibc ports" library as was intended? There is already support for the IBM 44x chips and their strangenesses, but no ready optimizations which just improve performance for further ranges of chips even though they are eminently possible..
Yes, I think we will see this small patch both in glibc and in Linux soon.
I have good contacts with some friendly Linux and glibc hackers, and I now know how to do the commenting and indenting of my routines so that they get accepted more easily. :-)


Cheers
Gunnar


PostPosted: Mon Dec 10, 2007 9:33 am 
Offline

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
Quote:
I'd also love to see the benefits of this on a lowly processor like the e300 core
As promised, some numbers for the EFIKA.
I had a look at the Linux/glibc performance on the EFIKA at the weekend, and here are the results:
the performance of glibc and the Linux kernel is compared to a small ASM PPC copy routine that I have tuned for the EFIKA CPU.

We can see that there is some room for performance improvement on EFIKA :-)

[bar chart: EFIKA memcpy throughput for glibc, the Linux kernel routine and the tuned ASM routine]

The Linux copy function is used A LOT:
e.g. for networking, filesystems, etc. This small patch will speed all of this up a bit on the EFIKA.


Cheers
Gunnar


PostPosted: Mon Dec 10, 2007 9:46 am 
Offline

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
Hi BBRV,
Quote:
Hi Gunnar, that is a great effort. If you can take a look at this core:

IBM PowerPC 460 cores
I have no physical access to such a CPU,
but looking at the schematics it looks to me like a typical 603/e300 core.
As the CPU core seems to share its main characteristics with the CPU used in the EFIKA, I would expect the same performance effects.

Cheers
Gunnar


PostPosted: Tue Dec 11, 2007 6:48 pm 
Offline
Genesi

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1422
Yes, you are correct.

Thank you Gunnar.

BTW, the 460 is what Tundra licensed: Power Up!

There is change afoot....

R&B :)

_________________
http://bbrv.blogspot.com


PostPosted: Fri Dec 14, 2007 10:08 am 
Offline

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
I tweaked the tiny EFIKA memcpy a little bit and got nearly 350 MB/sec now! I have updated the above bar chart accordingly.

I don't know how I did this but the routine is now 100% faster than the glibc code. :-)

350 MB/sec is quite nice; it's even a bit more than I got out of the AmigaONE-G4.

I wonder if the EFIKA can do even more?
How about a little competition here:
I could post my code and people could try to improve it?

Anyone up for the challenge?

Cheers
Gunnar

