Power Developer
https://powerdeveloper.org/forums/

Altivec update...
https://powerdeveloper.org/forums/viewtopic.php?f=23&t=142
Page 1 of 1

Author:  markos [ Mon Feb 07, 2005 8:44 am ]
Post subject:  Altivec update...

(pasting this from an email due to high demand :-)

So, sorry for the delay, but I was at a Debian-Edu/Skolelinux meeting
in Oslo, Norway, and I just came back to Greece.

So, to fill you all in what I'm doing right now wrt the altivec
vectorization of Debian/Linux in general:

* I have right now a pretty generic way for applications to use a
single binary for both vectorized and scalar code. This means the
same program will autodetect Altivec if present and use the optimized
code. This is for applications/executables only. It is actually a known way, but it tried to take it further by using function pointers to overload the functions I want to use...
* For the same thing to happen in a library (e.g. glibc), it's much
more complicated to do in a proper and consistent way, but I'm
looking into it, and I'm in constant communication with people more
knowledgeable than I am in these things.
* On the matter of optimizations, I've been doing optimized versions
of common routines that are used in many common programs (i've
started with the package coreutils in debian), and also I began
working with the vectorization of encryption routines (those that can
be parallelizable at least). I've made an altivec version of MD5sum
algorithm but due to the nature of this particular algorithm (it's
not parallelizable at all), there is no speed advantage. But there
are other routines (in library libmcrypt) which are parallelizable,
though the process of vectorizing these algorithms is not an easy
thing to do, and i have to ensure that
* Once I've found how to do altivec auto-detection and optimization in
libraries, it will be a simple thing to include the available
optimizations in glibc and other libraries.
* After basic optimizations are in place, we can work more on
optimizing applications like MySQL, PostgreSQL, apache, etc.
* I can definitely say from my benchmarks so far that the minimum
speed increase will be from 400% (in cases where vectorization is
difficult) to 3200%!! The average is about 1000% (10x). This will be
directly visible to the user in applications that require a lot of
processing power and I'm pretty sure that this is what everyone wants to see eventually :-)
One thing I can say for sure, is that when this thing hits the street,
it will take the market by storm and will totally change the way
people view PowerPC as a potential Linux platform. Esp. for servers,
given the fact that a simple G4 board will probably outperform or at
the very least actively compete a dual Xeon in MySQL/Apache
performance!
* I'm thinking of proposing a website/section in some existing
website, where people will vote on some
applications/routines/libraries they want vectorized and the altivec
developers will work on the optimizations.
* [upd] Once I've tidied up the code/patches, I'll commit it to the pegasos CVS repository on alioth.

Of course, the only problem with this is that I have to work on this
on my spare time, and the past 3 months have been quite busy. This
was the only reason I have not had time to post to penguinppc.org.
The fact I'm also working on a Debian-edu project here in Greece is
also a factor that takes quite a lot of my time.
But I'm confident I can have some results at least on some core
packages of Debian in the next couple of months.

Regards

Konstantinos

Author:  hobold [ Tue Feb 08, 2005 5:09 am ]
Post subject:  Re: Altivec update...

So the examples were convincing. :)

Feel free to ask any and all questions that come up during optimization. I don't think I will write real code, but I will gladly share everything I know about algorithms, mathematical simplifications, the low downs of the pipeline, compiler pitfalls and whatnot.

If I could empower you and other actual contributors with the necessary knowledge, I think I'd have achieved something more significant than by vectorizing a few important routines. (Maybe you have heard of the analogy: give a man a fish to feed him for a day; but give him a fishing pole to keep him fed.)

Holger

Author:  markos [ Thu Feb 10, 2005 3:53 am ]
Post subject:  Re: Altivec update...

Quote:
So the examples were convincing. :)

Feel free to ask any and all questions that come up during optimization. I don't think I will write real code, but I will gladly share everything I know about algorithms, mathematical simplifications, the low downs of the pipeline, compiler pitfalls and whatnot.
Actually I do have a question that has been bothering me...
Hm, Here goes: I'd like to know how I could skip branch evaluation so that I could save myself the cpu cycles to execute the "if" statement. I'll explain:

For very small sizes, altivec is useless and it's even more slower than the scaler version. Suppose I have the routine memset(), where the altivec one is not very efficient at small sizes compared to the scalar one.

I found that it's much better to put an if case like that:
Code:
if (size < N)
// run scalar code
else
// run altivec code.
where N must be found specifically on a per-routine cases with benchmarking and optimisation. Eg, for memset(), N is between 16-32. The thing is that even with this approach, memset() is still slightly slower than the scalar version (because of the if check). Is there some clever way to avoid these extra lost cycles?

Also, I was thinking of trying to optimize the scalar loop for the N bytes even more. Again I'll explain:

instead of doing sth like:
Code:
char *buf, c;
for (i=0; i < size; i++) { // size < N
buf = c;
}


I was wondering if maybe this loop would be made faster, eg by spliting into longs, then into words and then the remainining bytes.
Could Altivec help in this case? would it be fast enough?
Mind you, I'd prefer if I would not use asm. I prefer C, it's fast enough and it's much more easily maintainable, plus I don't know asm :-)

Hope you have some nice hints on these "problems" :-)
Konstantinos

Author:  hobold [ Fri Feb 11, 2005 4:18 am ]
Post subject:  Re: Altivec update...

Quote:
The thing is that even with this approach, memset() is still slightly slower than the scalar version (because of the if check). Is there some clever way to avoid these extra lost cycles?
It is possible to reduce the impact of the check somewhat, but the solution is unwieldy. The basic idea is to have a special version of memcpy() that collects statistics on the parameters it is called with from any particular caller.

With this, one can do a profiling run and eventually classify all callers of memcpy into three categories: a) almost always small size => use scalar variant unconditionally, b) almost always large size (i.e. above threshold) => use vector variant unconditionally (but make sure it handles small sizes correctly), c) strongly varying sizes => use conditional variant.

This is quite a bit of work, so I would recommend it only after you have harvested other low hanging fruit. The other drawback is that this is based on profile directed feedback, so you have to be reasonably confident that statistics are similar over a large range of possible uses of the software.
Quote:
Also, I was thinking of trying to optimize the scalar loop for the N bytes even more. Again I'll explain:

instead of doing sth like:
Code:
char *buf, c;
for (i=0; i < size; i++) { // size < N
buf = c;
}


I was wondering if maybe this loop would be made faster, eg by spliting into longs, then into words and then the remainining bytes.
This has the same problem that it saves computational instructions, but requires more branches. In other words, the only way to find out if this saves time is to try it, because ultimately the speed depends on the predictability (or lack thereof) of the branches.
Quote:
Could Altivec help in this case? would it be fast enough?

AltiVec can help. But again it takes a lot of conditional branches to detect the particular special case. So maybe you shouldn't try to be too smart, but use a brute force strategy instead ... illustration for the case where you want to copy less than 16 bytes:
Code:

// fetch vectors containing first and last source byte
vector unsigned char data1 = vec_ld(0, source);
vector unsigned char data2 = vec_ld(size - 1, source);

// prepare alignment of source and destination
vector unsigned char alignsrc = vec_lvsl(0, source);
vector unsigned char aligndest = vec_lvsr(0, destination);

// left align source data
data1 = vec_perm(data1, data2, alignsrc);
// align data to destination address
data1 = vec_perm(data1, data1, aligndest);

// write destination byte by byte
for (int i = size - 1; i >= 0; i--) {
vec_ste(data1, i, destination);
}

As usual, the above code is untested (just typed in the browser), so beware. If you don't mind code bloat, you can also try the following to replace the loop:
Code:

switch(size) {
case 15:
vec_ste(data1, 15, destination);
case 14:
vec_ste(data1, 14, destination);
case 13:
vec_ste(data1, 13, destination);
case 12:
vec_ste(data1, 12, destination);

...

}

This is more expensive than it looks like, depending on how the compiler chooses to implement the switch. There is also a hidden cost in each vec_ste, because the offset parameter cannot be an immediate constant, but has to reside in a GPR. So the values 15, 14, 13, ... will need extra instructions to be loaded into registers.

It is possible to overengineer the switch variant further: in case you know the destination alignment at compile time (or by distinguishing between yet more special cases), you can use "store element" type instructions of 16 and 32 bit operand width. This would reduce the number of store instructions by up to a factor of four. But I can't imagine there's much of a gain after the fairly deep branch tree needed to arrive at the correct special case.
Quote:
Mind you, I'd prefer if I would not use asm. I prefer C, it's fast enough and it's much more easily maintainable, plus I don't know asm :-)
Don't worry, in case of AltiVec, assembly usually gains very little over well crafted C, so it is very rarely a good idea to resort to asm.

Oops, after writing the above I realized that you talked about memset, not memcpy. So obviously there is no need to read source data and align it. You merely want to transfer a single byte into a vector register and splat it there, just like in the earlier examples.

Author:  hobold [ Fri Feb 11, 2005 4:39 am ]
Post subject:  Re: Altivec update...

Hmm on second thought, memset never has to care about alignment, so maybe it would pay to try and make the overengineered vec_ste trick work ... hmm, that isn't so simple ... and you could as well do it with scalar code rather than AltiVec:
Code:
// replicate the data byte four times in an int
// (if the compiler is smart, this takes two bitfield instructions)
data |= data << 8;
data |= data << 16;

// in case of odd address, write single byte to make it even
if ((size >= 1) && (1 == (long(destination) & 1))) {
((char*)destination)[0] = char(data);
destination += 1;
size -= 1;
}

// in case of address indivisible by four, write a two-byte quantity
if ((size >= 2) && (2 == (long(destination) & 2))) {
((short*)destination)[0] = short(data);
destination += 2;
size -= 2;
}

// loop, writing four bytes in each iteration as an int
while (size >= 4) {
((int*)destination)[0] = int(data);
destination += 4;
size -= 4;
}

// if there are at least two bytes left to write, store a short
if (size >= 2) {
((short*)destination)[0] = short(data);
destination += 2;
size -= 2;
}

// there might still be one byte left to be written
if (size >= 1) {
((char*)destination)[0] = char(data);
}
Eight byte quantities could be written in the form of floating point double precision values (or 64 bit longs on PPC64), and sixteen byte quantities could be written as vectors.

In any case, there are quite a few quite complicated checks, so it is not clear to me that this is a win. Beware that benchmark loops for testing purposes might have much more regular branch patterns than real code. Don't let artificial timing tests fool you with branches.

Author:  MeisterTom [ Sat Feb 12, 2005 10:31 am ]
Post subject:  Re: Altivec update...

Wohoo, Vector Debian sounds promising !

Author:  Neko [ Tue Feb 15, 2005 8:15 am ]
Post subject:  Re: Altivec update...

One thing that validates any code you have and how fast it might turn out to be in the pipeline (notwithstanding the memory subsystem or cache) is to code it, and then run it through SimG4+.

It is on every ODW, on the Debian installation along with a lot of example code - unfortunately I think the version we shipped timed out at Christmas, and a lot of people have simply installed another Linux over the top too.

Hopefully those tools will be available to you in the very near future so you can resume using them and optimising for the G4 :)

Back to my point; regardless of using profiling tools like SimG4+ and the Performance Counter Kernel Module, there are other features of the G4 which come in handy when thinking about memory accessing. One of them is the Store Gathering feature, which combines writes of bytes, shorts and longs in pairs, in order to make better use of memory bandwidth.

If we consider hobold's code, then your entire memory copy/set function is basically a loop of 16-byte stores from the vector unit to memory. When you finally get to the end you will either have finished, or have anything up to 15 bytes left to write, but the branching and conditionals to check what kind of store to make is going to kill your performance. Instead you could just test to see if doing simple byte writes (or perhaps shorts if you could handle the potential "last byte" and risk another branch) and let the SG feature handle it for you?

Neko

Author:  hobold [ Wed Feb 16, 2005 3:51 am ]
Post subject:  Re: Altivec update...

Store gathering only kicks in on cache misses, it is more correctly called store miss merging. The biggest benefit can only be had when a complete cache block has been gathered _before_ the first transaction on the bus has been initiated. In such a particularly favourable case, the gathered data can be written out to memory directly, rather than being merged with data that has to be read in from memory first (you have to combine old and new data of the cache block).

It takes time to merge stores, so if you have 32 individual byte stores, the first store miss will probably initiate its memory read transaction before the 32nd byte is stored. Storing a pair of vectors is the only reasonably reliable way to have store miss merging eliminate the unnecessary memory read access.

Of course, in any particular special case things might work better than my scenario above, so using SimG4+ would be a good way to find out what actually happens.

Author:  bbrv [ Fri Feb 18, 2005 1:33 am ]
Post subject:  Re: Altivec update...

@markos, hobold, this discussion is truly fantastic. We hope many can develop around /on the promise of this effort. GREAT! The new CPUs coming out on future versions of the Pegasos will also benefit and keep this platform relevant for a long time => low power, low temperature, high reliability, and now super altivecerated! :-)

R&B :-)

Author:  magnetic [ Fri Feb 18, 2005 12:45 pm ]
Post subject:  Re: Altivec update...

Excellent forum guys..

This kind of thing is exactly what the PPC community needs. The best aspect of the G4 is the Altivec unit and you are illustrating why. :)

@bbrv

I dont think that term you coined is in danger of being put on any ad materials for Freescale.. :twisted:

magnetic

Author:  markos [ Sun Feb 27, 2005 7:25 am ]
Post subject:  Re: Altivec update...

One more thing...

I'd like to do the following but I haven't found anything related... So far, all examples I've seen with Altivec were about applying the same operation over all elements. Is it possible to do one operation but on the elements as a group? Allow me to illustrate, what I want to do:

Suppose I have these 1-byte elements:

| A0| A1| A2| A3| A4| A5| A6| A7| A8| A9| A10| A11| A12| A13| A14| A15|

Now, I'd like to shift/rotate ALL elements by one position, which means, I'd like to have one of the results:

| - | A0| A1| A2| A3| A4| A5| A6| A7| A8| A9| A10| A11| A12| A13| A14|

or

| A15| A0| A1| A2| A3| A4| A5| A6| A7| A8| A9| A10| A11| A12| A13| A14|

Is this possible with Altivec? It would make things sooo much simpler with what I'm trying to do now :-)
I suppose I could use permutation, but I thought I'd ask for a cheaper (in terms of cpu cycles) solution.

Thanks again,

Konstantinos

Author:  Neko [ Sun Feb 27, 2005 10:21 pm ]
Post subject:  Re: Altivec update...

Quote:
One more thing...

Suppose I have these 1-byte elements:

| A0| A1| A2| A3| A4| A5| A6| A7| A8| A9| A10| A11| A12| A13| A14| A15|

Now, I'd like to shift/rotate ALL elements by one position, which means, I'd like to have one of the results:

| - | A0| A1| A2| A3| A4| A5| A6| A7| A8| A9| A10| A11| A12| A13| A14|

or

| A15| A0| A1| A2| A3| A4| A5| A6| A7| A8| A9| A10| A11| A12| A13| A14|

Is this possible with Altivec?
Vector Permute. That's all you need to know :)

Neko

Author:  hobold [ Mon Feb 28, 2005 1:51 am ]
Post subject:  Re: Altivec update...

If the shift amount is invariant, you can use vec_sld (it is mapped to an opcode that takes an immediate value). If the shift amount is variable and residing in a vector register, use vec_slo/vec_sro. If the shift amount is variable and residing in a scalar register, you'll have to use vec_lvsl/vec_lvsr followed by permute.

Page 1 of 1 All times are UTC-06:00
Powered by phpBB® Forum Software © phpBB Group
http://www.phpbb.com/