Quote:
Quote:
I'll do some more testing and then publish the patch soon.
Neat! A couple curious questions..
How does this affect CPUs with a 32-byte cache line like the G3 or G4? Does the unrolling to 128-byte chunks affect them in a negative or positive way?
Yes, the 128-byte routine is no good for the G3/G4.
Each CPU group (G2/G3/G4, PA-Semi, 970, CELL, POWER)
should have its own routine.
How to do this best depends on the situation, I'll try to explain:
We have two cases which we can optimize:
A) Linux
B) glibc
Linux is easy.
We can write an optimized copy for each architecture (G2/G3/G4/G5/CELL/POWER ...).
The Linux copy function would be changed to be only a jump to the real function. We can set this jump at Linux boot-up to point to the optimal version. (Sounds like AmigaOS, doesn't it?)
The copy function in glibc could be solved the same way: either with a separate libc for each subarchitecture, or with one lib using a pointer scheme just like the one explained above for Linux.
I see a small challenge in using the optimal function for statically compiled programs.
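The boot-time "jump" idea above can be sketched in C as a single indirect function pointer that is installed once at startup. This is a minimal illustration only; the routine names and the cpu_type codes are made up for the example, and the real per-CPU routines would of course be the hand-tuned copies, not plain memcpy:

```c
#include <string.h>

/* Hypothetical per-CPU copy routines -- names are illustrative only.
 * In reality each would be a hand-tuned assembly version. */
static void *memcpy_g3(void *d, const void *s, size_t n)   { return memcpy(d, s, n); }
static void *memcpy_g4(void *d, const void *s, size_t n)   { return memcpy(d, s, n); }
static void *memcpy_cell(void *d, const void *s, size_t n) { return memcpy(d, s, n); }

/* The "jump" set at boot: one indirect pointer that all callers use. */
static void *(*best_memcpy)(void *, const void *, size_t) = memcpy_g3;

/* Called once at startup, e.g. after identifying the CPU from the
 * processor version register or /proc/cpuinfo (cpu_type codes are
 * invented here). */
void select_memcpy(int cpu_type)
{
    switch (cpu_type) {
    case 1:  best_memcpy = memcpy_g4;   break;  /* 74xx */
    case 2:  best_memcpy = memcpy_cell; break;  /* Cell PPE */
    default: best_memcpy = memcpy_g3;   break;  /* safe fallback */
    }
}
```

After selection, every copy goes through `best_memcpy(dst, src, n)` with only one extra indirect branch of overhead.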
Quote:
I am impressed by how that C-language routine speeds stuff up, would it benefit at all with more assembly (i.e. don't trust the compiler, even if the code is simple)? Do you have an output of the gcc generated code (or I'm sure you might have access to one created by the IBM xl compiler :)
Yes, writing the function in assembly will be better.
By using assembly you can use some optimized tricks to save some compares, and you can use the optimal number of registers in the loop for each CPU. (Each CPU has a slightly different LOAD/STORE engine and a different number of rename registers, so the optimal copy looks slightly different.)
These tricks will give you some improvement over the simple C code.
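To make the "chunk size per CPU" point concrete, here is a hedged sketch in C of the unrolling idea: copying in 32-byte chunks (one G3/G4 cache line) with several temporaries so the loads and stores can overlap. A 128-byte variant for G5/Cell would simply use more temporaries per iteration. Alignment handling and the tail copy are omitted; this is not the actual patch:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch only: copy 'bytes' (assumed a multiple of 32, 8-byte
 * aligned) in 32-byte chunks.  Loading into four temporaries before
 * storing gives the LOAD/STORE engine independent operations to
 * overlap; the number of temporaries is the per-CPU tuning knob. */
void copy32(uint64_t *dst, const uint64_t *src, size_t bytes)
{
    size_t i, n = bytes / 32;
    for (i = 0; i < n; i++) {
        uint64_t a = src[0], b = src[1], c = src[2], d = src[3];
        dst[0] = a; dst[1] = b; dst[2] = c; dst[3] = d;
        src += 4;
        dst += 4;
    }
}
```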
Quote:
I am also surprised that the already-optimized memcpy in the arch/powerpc kernel branch is slower? Am I thinking of the wrong portion of code here?
The Linux copy function that I measured is the often-used function that copies to/from userland. There are other functions for copies inside kernel land too.
My explanation for this is that the glibc function and the kernel copy were mainly optimized by IBM developers with a focus on the big POWER CPUs.
The hardware memory prefetching feature of the POWER4 probably uses more transistors than an entire G3 CPU has. :-)
I feel there is still some room for performance improvement for Efika/G3/G4/CELL systems, both in glibc and in the Linux kernel.
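Cores without an aggressive hardware prefetcher (G3/G4) can recover part of that benefit with software prefetch hints. A sketch using GCC's `__builtin_prefetch`, which on PowerPC typically emits a `dcbt` (data cache block touch); the prefetch distance here is an assumption that would need per-CPU tuning:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: same 32-byte-chunk copy as before, but touching the cache
 * line 128 bytes (4 x 32-byte lines) ahead of the current read so
 * the data is already in the cache when the loop reaches it.
 * Assumes 'bytes' is a multiple of 32 and 8-byte alignment. */
void copy_prefetch(uint64_t *dst, const uint64_t *src, size_t bytes)
{
    size_t i, n = bytes / 32;
    for (i = 0; i < n; i++) {
        __builtin_prefetch(src + 16, 0, 0);  /* read hint, 128 bytes ahead */
        dst[0] = src[0]; dst[1] = src[1];
        dst[2] = src[2]; dst[3] = src[3];
        src += 4;
        dst += 4;
    }
}
```

The prefetch is only a hint, so the loop stays correct even past the end of the buffer; only the ideal distance differs between cores.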
Quote:
I'd also love to see the benefits of this on a lowly processor like the e300 core or a Pegasos G3 (I or II) just to see how the improvements take. While optimizing for Cell and G5 is all well and good, if they have a negative impact on slower processors it may delay acceptance of the concept and the patch. We certainly do not want our Efikas to become slower for basic operations..
I think improving the memcpy function by 20%-30% for Efika/Pegasos is no problem.
Quote:
I know both you and Konstantinos did a lot of great work on this stuff, ...
Thanks but actually I did nothing clever here. :-)
My code is only a reuse of what was probably already in the old Amiga PowerUp libs.
I think that in Linuxland people generally do not put as much energy into performance optimization as on the Amiga. IBM puts energy into optimizing for their POWER4 or POWER6 CPUs, but the fact that other PPC CPUs might need different routines was overlooked.
Quote:
I remember you both had problems with the GNU libc guys; is there a chance these optimizations will be rolled into the "glibc ports" library as was intended? There is already support for the IBM 44x chips and their strangenesses, but no ready optimizations which just improve performance for further ranges of chips even though they are eminently possible..
Yes I think we will see this small patch both in glibc and in Linux soon.
I have good contacts with some friendly Linux and glibc hackers. And I now know how to do the commenting and indenting of my routines so they get accepted more easily. :-)
Cheers
Gunnar