Power Developer :: libfreevec released!

// isolates the rightmost 'true' element of a vector bool char
vector bool char RightSingularChar(vector bool char Mask) {

vector unsigned char pack32_8 = (vector unsigned char)(0,0,0,0, 0,0,0,0, 0,0,0,0, 0,4,8,12);
vector unsigned char unpack8_32 = (vector unsigned char)(12,12,12,12, 13,13,13,13, 14,14,14,14, 15,15,15,15);

// generate bool int mask from bool char mask
vector bool int intMask = vec_cmpgt((vector unsigned int)Mask, vec_splat_u32(0));

// compress; only the rightmost 32 bit element is really significant here
intMask = vec_perm(intMask, intMask, pack32_8);

// isolate rightmost bits of char mask and int mask
intMask = vec_xor(intMask, vec_and(intMask, vec_sub((vector unsigned int)intMask, vec_splat_u32(1))));
Mask = vec_xor(Mask, vec_and(Mask, (vector bool char) vec_sub((vector unsigned int)Mask, vec_splat_u32(1))));

// splat isolated bits
intMask = (vector bool int)vec_cmpeq((vector unsigned char)intMask, vec_splat_u8(1));
Mask = vec_cmpeq((vector unsigned char)Mask, vec_splat_u8(1));

// unpack intMask to full size again
intMask = vec_perm(intMask, intMask, unpack8_32);

// combine Mask and intMask
return vec_and((vector bool char)intMask, Mask);
}

// isolates the rightmost 'true' element of a vector bool char
vector bool char RightSingularChar(vector bool char Mask) {
vector uint8_t pack32_8 = { 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,4,8,12 };
vector uint8_t unpack8_32 = { 12,12,12,12, 13,13,13,13, 14,14,14,14, 15,15,15,15 };
vector uint8_t vone8 = vec_splat_u8(1);
vector uint32_t vzero32 = vec_splat_u32(0), vone32 = vec_splat_u32(1);

// generate bool int mask from bool char mask
vector bool int intMask = vec_cmpgt((vector uint32_t)Mask, vzero32);

// compress; only the rightmost 32 bit element is really significant here
intMask = vec_perm(intMask, intMask, pack32_8);

// isolate rightmost bits of char mask and int mask
vector bool int tempMask = vec_sub((vector uint32_t)intMask, vone32);
intMask = vec_xor(intMask, vec_and(intMask, tempMask));
tempMask = vec_sub((vector uint32_t)Mask, vone32);
Mask = vec_xor(Mask, vec_and(Mask, (vector bool char) tempMask));

// splat isolated bits
intMask = (vector bool int)vec_cmpeq((vector uint8_t)intMask, vone8);
Mask = vec_cmpeq((vector uint8_t)Mask, vone8);

// unpack intMask to full size again
intMask = vec_perm(intMask, intMask, unpack8_32);

// combine Mask and intMask
return vec_and((vector bool char)intMask, Mask);
}

vector bool char vmask = { ff,ff,ff,ff, ff,ff,ff,0, ff,ff,ff,ff, ff,ff,ff,ff };

vector bool char vres = RightSingularChar(vmask);

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ff

Author:	markos [ Mon Sep 26, 2005 12:20 pm ]
Post subject:	libfreevec released!
Hi everyone! I'd like to announce the public release of libfreevec, after a long period of development, frustration and a mix of success/failures stories. After many rewrites, I can finally present something that can at least be used with mostly good results. First, a short info: libfreevec is a free (LGPL) library with hand-optimized replacement routines for GLIBC, such as memcpy(), strlen(), etc. These routines have been written specifically to take advantage of the AltiVec (a.k.a Velocity Engine or VMX), and will only work on processors that include this unit. This means they will not work on older processors, such as 603, 604, 750 (G3) or the POWER family of CPUs. Check the site for more information, and download the source (soon a Debian package will be available): http://freevec.org/ Thanks to Genesi for sponsoring this effort and for creating the ODW, a great development platform! Regards Konstantinos Margaritis PS. For the impatient amongst you, make sure you set CFLAGS="-O3 -maltivec -funroll-loops -DHAVE_ALTIVEC_H" to compile for max performance.

Author:	gunnar [ Mon Sep 26, 2005 12:37 pm ]
Post subject:	Re: libfreevec released!
Great news! Thanks Markos! I'll need to spy on your memcmp code Cheers Gunnar

Author:	markos [ Mon Sep 26, 2005 12:55 pm ]
Post subject:	Re: libfreevec released!
Quote: Great news! Thanks Markos! I'll need to spy on your memcmp code Cheers Gunnar Hehe, i already sent you 2 custom memcmp versions

Author:	bbrv [ Mon Sep 26, 2005 1:55 pm ]
Post subject:	Re: libfreevec released!
GREAT! R&B

Author:	RockmanX [ Mon Sep 26, 2005 11:56 pm ]
Post subject:	Re: libfreevec released!
I wonder if there will be any libfreevec-enabled sources of some big pieces of software... I guess it is already possible to use it, but I'm not a developer, so I'd dare not try to mess up with code... Would something like mplayer take benefits from libfreevec? I apologize for any possible manifestation of ignorance...

Power Developer https://powerdeveloper.org/forums/

libfreevec released! https://powerdeveloper.org/forums/viewtopic.php?f=23&t=387	Page 1 of 2

Author:	lu_zero [ Thu Sep 29, 2005 2:13 am ]
Post subject:	Re: libfreevec released!
the gentoo ebuild is available in portage, you have to properly unmask it in order to emerge it echo "sys-libs/libfreevec -ppc" >>/etc/portage/package.keywords The next week I'll have a look on how will work on mplayer and ffmpeg if nobody else will do before Just to restate: DO NOT PRELOAD IT

Author:	hobold [ Fri Sep 30, 2005 2:37 am ]
Post subject:	Re: libfreevec released!
I worked out a little trick that can be used to scan along a vector, i.e. one can find the leftmost or rightmost 'true' element of a vector of bools. It is based on the observation that the expression "X & (X-1)" clears the rightmost "1" bit. Consequently, "(X & (X-1)) ^ X" isolates that rightmost bit. The trick might be useful for a string compare. Say you found the vector where the two strings differ, but all you have is a vector int mask that is true for the differing elements. Then you could proceed like this: 1. permute the mask into 32 bits of bool char, reversing the order of elements. example permute vector: (0,0,0,0, 0,0,0,0, 0,0,0,0, 12,8,4,0) 2. subtract 1 (32 bit wide elements) 3. AND the results of (1) and (2) 4. XOR the results of (3) and (1) => now the least significant "1" bit is isolated 5. "compare equal" byte-wise the result of (4) with (1,1,1, ...) 6. permute the result of (5) to get a bool int mask in correct order. permute vector: (15,15,15,15, 14,14,14,14, 13,13,13,13, 12,12,12,12) => now we have a vector bool int mask that is set for the leftmost mismatching element 7. the differing vectors of the input strings can now both be ANDed with the mask resulting from (6) 8. compare the masked vectors (with suitable "compare any"), using the scalar boolean return value 9. return two times the result of (8 ) minus 1 for a correct -1/+1 result

Author:	markos [ Fri Sep 30, 2005 8:24 am ]
Post subject:	Re: libfreevec released!
Hey Hobold! Thanks, that looks like a nice way, the nice thing is that it will totally eliminate branching, I'll try to implement it over the weekend, will post here for comments Thanks again! Konstantinos

Author:	hobold [ Fri Sep 30, 2005 12:59 pm ]
Post subject:	Re: libfreevec released!
Something like this has been haunting me ever since I encountered the X & (X-1) trick. It wasn't until today that all pieces fell into place. Here's an exhaustively tested routine generalized to handle vector bool char. It performs the described scan on int elements, and also individually on the four chars of each 32bit lane. The AND of those two is then the overall scan of a vector bool char. For lack of a better name, I call bool vectors with exactly one element set to true "singular masks". I believe they deserve to have a name, because there surely must be a few nice tricks that can be done once you have a singular mask; at least one will follow below. Beware, the example routine scans from right to left, because that is the 'natural' scan order of the underlying method. Byte-reversing permutes are required to scan the other way. Code: // isolates the rightmost 'true' element of a vector bool char vector bool char RightSingularChar(vector bool char Mask) { vector unsigned char pack32_8 = (vector unsigned char)(0,0,0,0, 0,0,0,0, 0,0,0,0, 0,4,8,12); vector unsigned char unpack8_32 = (vector unsigned char)(12,12,12,12, 13,13,13,13, 14,14,14,14, 15,15,15,15); // generate bool int mask from bool char mask vector bool int intMask = vec_cmpgt((vector unsigned int)Mask, vec_splat_u32(0)); // compress; only the rightmost 32 bit element is really significant here intMask = vec_perm(intMask, intMask, pack32_8); // isolate rightmost bits of char mask and int mask intMask = vec_xor(intMask, vec_and(intMask, vec_sub((vector unsigned int)intMask, vec_splat_u32(1)))); Mask = vec_xor(Mask, vec_and(Mask, (vector bool char) vec_sub((vector unsigned int)Mask, vec_splat_u32(1)))); // splat isolated bits intMask = (vector bool int)vec_cmpeq((vector unsigned char)intMask, vec_splat_u8(1)); Mask = vec_cmpeq((vector unsigned char)Mask, vec_splat_u8(1)); // unpack intMask to full size again intMask = vec_perm(intMask, intMask, unpack8_32); // combine Mask and intMask return vec_and((vector bool char)intMask, Mask); } Once you have a singular mask, you can find the index of the 'true' element like this: 1. AND the singular mask with vec_lvsl(0, NULL); 2. vec_sum4s (add bytes across each 32 bit lane) 3. vec_sums (add ints across a vector) the byte index is now in the rightmost element of the result Edit: as usual, my code comes with an 'artistic license': use it as you wish, but credit me, please.

Author:	markos [ Sun Oct 02, 2005 12:52 am ]
Post subject:	Re: libfreevec released!
Quote: Here's an exhaustively tested routine generalized to handle vector bool char. It performs the described scan on int elements, and also individually on the four chars of each 32bit lane. The AND of those two is then the overall scan of a vector bool char. Due to some strange bug in AltiVec optimization in gcc, which makes it consume HUGE amounts of RAM (it ran out of memory on my system, with 1.5GB total memory!!!), I had to rewrite your code a bit, apparently gcc 4.0 on Linux doesn't like inlined altivec intrinsics that much. Code: // isolates the rightmost 'true' element of a vector bool char vector bool char RightSingularChar(vector bool char Mask) { vector uint8_t pack32_8 = { 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,4,8,12 }; vector uint8_t unpack8_32 = { 12,12,12,12, 13,13,13,13, 14,14,14,14, 15,15,15,15 }; vector uint8_t vone8 = vec_splat_u8(1); vector uint32_t vzero32 = vec_splat_u32(0), vone32 = vec_splat_u32(1); // generate bool int mask from bool char mask vector bool int intMask = vec_cmpgt((vector uint32_t)Mask, vzero32); // compress; only the rightmost 32 bit element is really significant here intMask = vec_perm(intMask, intMask, pack32_8); // isolate rightmost bits of char mask and int mask vector bool int tempMask = vec_sub((vector uint32_t)intMask, vone32); intMask = vec_xor(intMask, vec_and(intMask, tempMask)); tempMask = vec_sub((vector uint32_t)Mask, vone32); Mask = vec_xor(Mask, vec_and(Mask, (vector bool char) tempMask)); // splat isolated bits intMask = (vector bool int)vec_cmpeq((vector uint8_t)intMask, vone8); Mask = vec_cmpeq((vector uint8_t)Mask, vone8); // unpack intMask to full size again intMask = vec_perm(intMask, intMask, unpack8_32); // combine Mask and intMask return vec_and((vector bool char)intMask, Mask); } The problem is that I can't get it to work like this: Assuming I pass this mask to this routine: Code: vector bool char vmask = { ff,ff,ff,ff, ff,ff,ff,0, ff,ff,ff,ff, ff,ff,ff,ff }; If I call RightSingularChar on vmask, like this: Code: vector bool char vres = RightSingularChar(vmask); I get this result (if I print vres): Code: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ff Maybe it's my fault in rewriting it, but i tried to make as few changes as possible. Also, in rewriting this code for left-to-right, where should I use the permutes? in the initial mask (tried that, didn't work), or inside the algorithm itself? Konstantinos

Author:	hobold [ Sun Oct 02, 2005 11:20 pm ]
Post subject:	Re: libfreevec released!
The algorithm works exactly as intended in your example: it keeps the rightmost 0xff char element, but clears all other bytes to zero. So to find the 0x00, you'd have to flip all bits (or reverse the sense of the compare that generates the mask). It may be possible to write an analogous routine that isolates rightmost 0x00 values rather than 0xff values ... based on 'X \| (X+1)'. That would save one bit flip. For scanning the other way, just byte-revert the mask before you pass it to the scan routine. Then reverse the resulting singular mask again with the same permute. If you only want to count the index, then you may not need the second byte reversal at the end, but instead use a (15, 14, 13, 12, ...) vector for the counting trick instead of the suggested (0, 1, 2, 3, ...). (The byte-reversed index vector can be had by xor'ing the result of a vec_lvsl(0, NULL) byte-wise with 0x0f.)

Author:	markos [ Mon Oct 03, 2005 2:21 am ]
Post subject:	Re: libfreevec released!
Quote: The algorithm works exactly as intended in your example: it keeps the rightmost 0xff char element, but clears all other bytes to zero. So to find the 0x00, you'd have to flip all bits (or reverse the sense of the compare that generates the mask). It may be possible to write an analogous routine that isolates rightmost 0x00 values rather than 0xff values ... based on 'X \| (X+1)'. That would save one bit flip. For scanning the other way, just byte-revert the mask before you pass it to the scan routine. Then reverse the resulting singular mask again with the same permute. If you only want to count the index, then you may not need the second byte reversal at the end, but instead use a (15, 14, 13, 12, ...) vector for the counting trick instead of the suggested (0, 1, 2, 3, ...). (The byte-reversed index vector can be had by xor'ing the result of a vec_lvsl(0, NULL) byte-wise with 0x0f.) Indeed you were right!! It worked! I've integrated this into memcmp in libfreevec and it seems to really fly!!! I'll add this to the rest of routines that work similarly (memchr, strlen, etc). Many Thanks!!! Konstantinos

Author:	bbrv [ Mon Oct 03, 2005 2:55 am ]
Post subject:	Re: libfreevec released!
Konstantinos, how about some performance charts posted there reflecting the improvement once you get these integrated? Great! Thanks Holger! R&B

Author:	hobold [ Mon Oct 03, 2005 7:41 am ]
Post subject:	Re: libfreevec released!
If you want to scan for 0x00, I can work out the respective routine as well. It may save only a single instruction, but every little bit helps. BTW, your version of the routine looks a lot nicer than mine and reads more clearly. I've been too hasty in throwing this together once I saw how it worked. I don't have an overview what types of other 'scan' or 'collect' operations might be useful to you, so if you have specific wishes, bring them on! Maybe there are more neat little tricks that can be applied. Such tricks, if they exist, should particularly help short strings. With any luck, the vectorized routine can then beat the scalar code across the board, not just for large data sets.

Page 1 of 2	All times are UTC-06:00
Powered by phpBB® Forum Software © phpBB Group http://www.phpbb.com/

Author:	lu_zero [ Tue Oct 04, 2005 6:13 am ]
Post subject:	Re: libfreevec released!
Quote: Due to some strange bug in AltiVec optimization in gcc, which makes it consume HUGE amounts of RAM (it ran out of memory on my system, with 1.5GB total memory!!!), I had to rewrite your code a bit, apparently gcc 4.0 on Linux doesn't like inlined altivec intrinsics that much. the gcc intrinsics are quite large macros, nesting them will exponentially consume memory on the preprocessing step. My rule of thumb is to just have at most 2 level of nesting to avoid that side effect. The compilation step should remove most of the temp variables (hopefully); lu