Quote:
Wow! That is great!
In the meantime, why not start at the low end with some transformations that make heavy use of vperm for table lookups, or other clever uses beyond simply processing masses of data at once?
The algorithms already tuned for AltiVec that you can find on Google are.. minor stuff like base64 encoding and decoding, and there are
fast AES implementations which use *h u g e* tables, so you may not be speeding up anything but the table lookup :)
However, the fully unrolled version is quite a neat thing; the setup and each round look like this:
Code:
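/* initial AddRoundKey: XOR the four big-endian plaintext words with rk[0..3] */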
s0 = GETU32(pt ) ^ rk[0];
s1 = GETU32(pt + 4) ^ rk[1];
s2 = GETU32(pt + 8) ^ rk[2];
s3 = GETU32(pt + 12) ^ rk[3];
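
/* first round: each output word XORs four table lookups with a round-key word */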
t0 = Te0[s0 >> 24] ^ Te1[(s1 >> 16) & 0xff] ^ Te2[(s2 >> 8) & 0xff] ^ Te3[s3 & 0xff] ^ rk[ 4];
t1 = Te0[s1 >> 24] ^ Te1[(s2 >> 16) & 0xff] ^ Te2[(s3 >> 8) & 0xff] ^ Te3[s0 & 0xff] ^ rk[ 5];
t2 = Te0[s2 >> 24] ^ Te1[(s3 >> 16) & 0xff] ^ Te2[(s0 >> 8) & 0xff] ^ Te3[s1 & 0xff] ^ rk[ 6];
t3 = Te0[s3 >> 24] ^ Te1[(s0 >> 16) & 0xff] ^ Te2[(s1 >> 8) & 0xff] ^ Te3[s2 & 0xff] ^ rk[ 7];
You can load the four pt words into one vector (they are contiguous) and the rk elements (which are contiguous anyway, too!) into another, and vec_xor them.
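Something like this, say, assuming pt and rk are 16-byte aligned and we're on big-endian PPC (so a plain vector load gives the same bytes as GETU32); the variable names are just made up:
Code:
#include <altivec.h>

/* assuming: const unsigned char *pt, const unsigned int *rk,
   both 16-byte aligned, big-endian PowerPC */
vector unsigned int state = vec_ld(0, (const unsigned int *)pt); /* pt words */
vector unsigned int keys  = vec_ld(0, (const unsigned int *)rk); /* rk[0..3] */
state = vec_xor(state, keys);           /* state = {s0, s1, s2, s3} in one go */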
This gives you a vector with {s0,s1,s2,s3}, which will come in handy. You can create shift-count vectors of 24, 16 and 8 and a mask of all-0xff words with some tricks, easily.
The shift right and AND can then be applied to all of the elements at once, after a permute lines up the right s word in each lane.
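A sketch of those tricks (splat immediates only cover -16..15, so 16 and 24 have to be built up), reusing state from above:
Code:
/* splat immediates are limited to -16..15, so build 16 and 24 from 8 */
vector unsigned int sh8  = vec_splat_u32(8);
vector unsigned int sh16 = vec_add(sh8, sh8);
vector unsigned int sh24 = vec_add(sh16, sh8);
/* all-ones words shifted right by 24 leaves 0x000000ff in each word */
vector unsigned int ff   = vec_sr(vec_splat_u32(-1), sh24);

/* rotate so each lane holds the right s word: for the Te1 column,
   t0 wants s1, t1 wants s2, t2 wants s3, t3 wants s0 */
vector unsigned int rot1 = vec_sld(state, state, 4);        /* {s1,s2,s3,s0} */
vector unsigned int idx1 = vec_and(vec_sr(rot1, sh16), ff); /* Te1 indices  */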
Then you have to do the table lookups and gather 4 elements from scattered locations in memory.. the rk[] elements you can cheat with, since they are 16 contiguous bytes in memory :)
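Since there's no gather instruction, the lookups themselves drop back to scalar code.. a sketch reusing the idx1 vector from above:
Code:
/* no gather in AltiVec: spill the indices, look up in scalar code,
   reload the four results as one vector */
unsigned int idx[4] __attribute__((aligned(16)));
unsigned int out[4] __attribute__((aligned(16)));
vec_st(idx1, 0, idx);
out[0] = Te1[idx[0]];
out[1] = Te1[idx[1]];
out[2] = Te1[idx[2]];
out[3] = Te1[idx[3]];
vector unsigned int te1col = vec_ld(0, out);
/* the rk "cheat" is just another aligned load: rk[4..7] in one go */
vector unsigned int rkvec  = vec_ld(16, (const unsigned int *)rk);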
Then you have to xor them together, which is going to be a major pain because AltiVec has no horizontal XOR across a vector's elements. However.. couldn't the data be loaded in twice and then permuted so the partial results line up each time (i.e. shift each 32-bit element down, xor, and then only care about one element), then store that element?
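That fold does work; a sketch, assuming te_vals holds the four gathered words {Te0[..],Te1[..],Te2[..],Te3[..]} for one output word:
Code:
/* two rotate-and-XOR steps; afterwards every lane holds the same
   4-way XOR, so whichever element vec_ste picks is correct */
unsigned int t0;
vector unsigned int v = te_vals;
v = vec_xor(v, vec_sld(v, v, 8));   /* lanes: a^c, b^d, c^a, d^b */
v = vec_xor(v, vec_sld(v, v, 4));   /* every lane: a^b^c^d */
vec_ste(v, 0, &t0);                 /* store a single 32-bit element */
t0 ^= rk[4];                        /* fold in the round key in scalar */
(Though if you instead keep one vector per Te table, with the lanes being the four output words as in the sketches above, the combine is just plain vec_xors and no fold is needed at all.)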