All times are UTC-06:00




Post new topic  Reply to topic  [ 12 posts ] 
Author Message
PostPosted: Sat Apr 19, 2008 5:50 am 
Offline

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348
Image

Matrix inversion is one of the most time consuming routines in 3D -and other math-related- applications. I managed to beat the scalar version by ~300% and also beat the highly optimized routine found in cellperformance.com by a small margin!

For detailed math & code analysis and benchmarks, please see

http://www.freevec.org/function/inverse ... rtitioning

I have a couple more matrix functions to optimize and then I will start on integrating these to Mesa/OpenGL.


Top
   
PostPosted: Mon Apr 21, 2008 1:33 am 
Offline

Joined: Mon Jan 08, 2007 3:40 am
Posts: 195
Location: Pinto, Madrid, Spain
Quote:
I managed to beat the scalar version by ~300% and also beat the highly optimized routine found in cellperformance.com by a small margin!
Konstantinos, I don't understand a word of this magic you show to us, but it looks impressive, and for sure is a big step forward for Altivec applications. I hope you can find more people that are able to understand your work, so their congratulations are more worth than mine. Translating this mathematical expressions into computer language is indeed work of art.

Congratulations!


Top
   
PostPosted: Mon Apr 21, 2008 8:54 am 
Offline

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348
Quote:
Quote:
I managed to beat the scalar version by ~300% and also beat the highly optimized routine found in cellperformance.com by a small margin!
Konstantinos, I don't understand a word of this magic you show to us, but it looks impressive, and for sure is a big step forward for Altivec applications. I hope you can find more people that are able to understand your work, so their congratulations are more worth than mine. Translating this mathematical expressions into computer language is indeed work of art.

Congratulations!
Hi Juan,

thanks for your kind words it really is nice to be appreciated :)
I have to say though, it may be all math and uninteresting to many, but when it will get integrated eg. into Mesa, I think many people will appreciate it even more :D
Stay tuned :)


Top
   
PostPosted: Mon Apr 21, 2008 12:47 pm 
Offline

Joined: Tue Jan 31, 2006 1:18 am
Posts: 49
Location: Bialystok, Poland
http://www.freevec.org/function/inverse ... rtitioning

Interesting article. I have one comment to it - you use a permutation mask on the last stage of calculation. If loading this mask from memory hurts performance, it can be easily generated algorythmically Here are the steps:

1. Generate a word with all zeros (splat operator).
2. Generate a word with all eights (splat too).
3. Make a word with 8 zeros followed by 8 eights (sld operator).
4. Generate a linearly increasing mask [0, 1, 2, 3, 4...]. I do not remember operator name now (have no PDF at hand).
5. Add words from 3. and 4.

It is 5 AltiVec instructions. I expect it to be faster on G4 even if mask is retrieved from L1 cache.

_________________
http://krashan.ppa.pl


Top
   
PostPosted: Mon Apr 21, 2008 4:20 pm 
Offline
Site Admin

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1589
Location: Austin, TX
Quote:
1. Generate a word with all zeros (splat operator).
2. Generate a word with all eights (splat too).
3. Make a word with 8 zeros followed by 8 eights (sld operator).
4. Generate a linearly increasing mask [0, 1, 2, 3, 4...]. I do not remember operator name now (have no PDF at hand).
5. Add words from 3. and 4.

It is 5 AltiVec instructions. I expect it to be faster on G4 even if mask is retrieved from L1 cache.
Konstantinos, is this the same thing you were asking on IRC about!?

Amazing stuff anyway. It's kind of staggering what you guys can come up with if you're working on the same code..

I wonder if we could get you guys to cooperate on this more. I think one of the better things about MacOS is the provisions of things like vecLib etc. - FIR/FFT, linear algebra all in a standard system library. If that library was available under Linux (and, why not MorphOS too?) with some Apple-compatibility stubs it could really benefit developers who need instant, speedy ways to optimize their applications.. Grzegorz you have all the experience in FIR :)

_________________
Matt Sealey


Top
   
 Post subject:
PostPosted: Tue Apr 22, 2008 12:26 am 
Offline
Genesi

Joined: Mon Jan 30, 2006 2:28 am
Posts: 409
Location: Finland
Hi all.

First of all, great work! This is really cool stuff.

Secondly, what IRC channel are you guys in? I want in on the fun as soon as I have some more time (in a couple of weeks). Creating a library like Matt pointed out makes sense doing. I'd like to help. I'd also like to expand on this for cryptographic applications.


Best regards,
Johan

_________________
Johan Dams, Genesi USA Inc.
Director, Software Engineering

Yep, I have a blog... PurpleAlienPlanet


Top
   
PostPosted: Tue Apr 22, 2008 2:57 am 
Offline

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348
Quote:
http://www.freevec.org/function/inverse ... rtitioning

Interesting article. I have one comment to it - you use a permutation mask on the last stage of calculation. If loading this mask from memory hurts performance, it can be easily generated algorythmically Here are the steps:

1. Generate a word with all zeros (splat operator).
2. Generate a word with all eights (splat too).
3. Make a word with 8 zeros followed by 8 eights (sld operator).
4. Generate a linearly increasing mask [0, 1, 2, 3, 4...]. I do not remember operator name now (have no PDF at hand).
5. Add words from 3. and 4.

It is 5 AltiVec instructions. I expect it to be faster on G4 even if mask is retrieved from L1 cache.
Hi,
I tried sth in the same lines (4 instructions more, I already had v0 splat), as I thought that it would be faster to create the mask instead of loading it. But strangely enough, the code became a bit slower (1.47 sec for 10000000 loops instead of 1.37 sec). I can't really explain why this happened, but I reverted loading the mask instead of creating it.


Top
   
 Post subject:
PostPosted: Tue Apr 22, 2008 3:08 am 
Offline

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348
Quote:
Hi all.

First of all, great work! This is really cool stuff.

Secondly, what IRC channel are you guys in? I want in on the fun as soon as I have some more time (in a couple of weeks). Creating a library like Matt pointed out makes sense doing. I'd like to help. I'd also like to expand on this for cryptographic applications.


Best regards,
Johan
Hi Johan,

Cryptography is one of the areas that is mostly left untouched wrt to AltiVec optimizations, and this is strange since it was built for such stuff (sure there are a few routines optimized but compared to the multitude of encryption/compression routines out there this is nothing).

But from one point of view this is understandable as it is quite hard for someone to vectorize such an algorithm, and it is not always obvious that it is indeed possible (eg md5sum is not directly possible to vectorize and yet I know that some genious developer in Freescale has actually done that, vectorized the unvectorizable!). Still there are many more algorithms that are easier to tackle (I for one have given zlib 25% speed increase in decompression by tackling only 2 out of 6 time consuming functions).

What I would really really like to see is a reference ppc-oriented distro that really shows the platform's potential. Something that will not have to carry the usual baggage but be slim, optimized and only offer the grounds for derivative platforms.
But it should carry optimized versions of libc, libmcrypt, openssl, mesa3d, etc.

I've done some thinking on this and it's doable and not even that difficult (we don't have to reinvent the wheel, we could just base our work on sth existing), but we should refrain from trying to have everything under the sun like most distros do :)

As for the IRC channels, I'm on Freenode as markos_ and in quite a few channels, so if you're around just /query and we'll continue the discussion in a channel :)


Top
   
PostPosted: Tue Apr 22, 2008 3:27 am 
Offline

Joined: Mon Jan 08, 2007 3:40 am
Posts: 195
Location: Pinto, Madrid, Spain
Quote:
you use a permutation mask on the last stage of calculation. If loading this mask from memory hurts performance, it can be easily generated algorythmically. It is 5 AltiVec instructions.
Genius! More! more!


Top
   
 Post subject:
PostPosted: Tue Apr 22, 2008 3:34 am 
Offline
Genesi

Joined: Mon Jan 30, 2006 2:28 am
Posts: 409
Location: Finland
Hi
Quote:
Cryptography is one of the areas that is mostly left untouched wrt to AltiVec optimizations, and this is strange since it was built for such stuff (sure there are a few routines optimized but compared to the multitude of encryption/compression routines out there this is nothing).
It's just that Cryptography is my main research interest, especially applied to embedded systems. It would be nice to have an embedded chip in the future with Altivec unit to provide fast secure communication in a wireless sensor network for instance.
My main research is in Elliptic Curve cryptography, where, as an example, the Montgomery Modular Multiplication can be vectorized. In the non-ECC field, AES would be a prime candidate to test on the Altivec.


Johan.

_________________
Johan Dams, Genesi USA Inc.
Director, Software Engineering

Yep, I have a blog... PurpleAlienPlanet


Top
   
 Post subject:
PostPosted: Tue Apr 22, 2008 3:53 am 
Offline

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348
Quote:
Hi
It's just that Cryptography is my main research interest, especially applied to embedded systems. It would be nice to have an embedded chip in the future with Altivec unit to provide fast secure communication in a wireless sensor network for instance.
My main research is in Elliptic Curve cryptography, where, as an example, the Montgomery Modular Multiplication can be vectorized. In the non-ECC field, AES would be a prime candidate to test on the Altivec.


Johan.
Wow! That is great! That was my drawback since I haven't studied any cryptography and i had to do the reading from scratch, but if you can help with the actual algorithms I would be glad to work on the altivec implementation! Whenever you feel like join IRC and we could start asap! :D


Top
   
 Post subject:
PostPosted: Tue Apr 22, 2008 7:28 am 
Offline
Site Admin

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1589
Location: Austin, TX
Quote:
Wow! That is great!
In the meantime, why not start at the low end with some transformations that make heavy use of vperm for table lookups or other clever uses other than simply processing masses of data at once?

Algorithms already tuned to AltiVec that you can find on Google are.. the minor stuff like base64 encoding and decoding, and there are fast AES implementations which use *h u g e* tables. You may not be speeding up anything but the table lookup :)

However the full unrolled version is quite a neat thing; every part of the rounds are as follows;
Code:
s0 = GETU32(pt ) ^ rk[0];
s1 = GETU32(pt + 4) ^ rk[1];
s2 = GETU32(pt + 8) ^ rk[2];
s3 = GETU32(pt + 12) ^ rk[3];
t0 = Te0[s0 >> 24] ^ Te1[(s1 >> 16) & 0xff] ^ Te2[(s2 >> 8) & 0xff] ^ Te3[s3 & 0xff] ^ rk[ 4];
t1 = Te0[s1 >> 24] ^ Te1[(s2 >> 16) & 0xff] ^ Te2[(s3 >> 8) & 0xff] ^ Te3[s0 & 0xff] ^ rk[ 5];
t2 = Te0[s2 >> 24] ^ Te1[(s3 >> 16) & 0xff] ^ Te2[(s0 >> 8) & 0xff] ^ Te3[s1 & 0xff] ^ rk[ 6];
t3 = Te0[s3 >> 24] ^ Te1[(s0 >> 16) & 0xff] ^ Te2[(s1 >> 8) & 0xff] ^ Te3[s2 & 0xff] ^ rk[ 7];
You can load in the pt indices to a vector (contiguous) and then the rk elements (which are contiguous anyway, too!) into another, and vec_xor.

This gives you a vector with {s0,s1,s2,s3} which will come in handy. You can create a mask of 24, 16, 8 and a mask of all 0xff with some tricks, easily.

The shift right and can be applied to all of the elements after the permute.

Then you have to do a table lookup and load 4 elements from these locations in memory.. the rk[] elements you can cheat with since they are 16 bytes contiguous in memory :)

Then you have to xor them together which is going to be a major pain because you can't do that in AltiVec, however.. couldn't the data be loaded in twice and then permuted into the 4th element each time (i.e. shift each 32-bit element down and xor and then only care about that element) then store that element?

_________________
Matt Sealey


Top
   
Display posts from previous:  Sort by  
Post new topic  Reply to topic  [ 12 posts ] 

All times are UTC-06:00


Who is online

Users browsing this forum: No registered users and 9 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum

Search for:
Jump to:  
PowerDeveloper.org: Copyright © 2004-2012, Genesi USA, Inc. The Power Architecture and Power.org wordmarks and the Power and Power.org logos and related marks are trademarks and service marks licensed by Power.org.
All other names and trademarks used are property of their respective owners. Privacy Policy
Powered by phpBB® Forum Software © phpBB Group