Power Developer • libfreevec released!

Unanswered topics | Active topics

Board index » »

All times are UTC-06:00

libfreevec released!

Post new topic Reply to topic

Page 1 of 2

[ 25 posts ]

Print view

Previous topic | Next topic

Author

Message

markos

Post subject: libfreevec released!

PostPosted: Mon Sep 26, 2005 12:20 pm

Offline

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348

Hi everyone!
I'd like to announce the public release of libfreevec, after a long period of development, frustration and a mix of success/failures stories. After many rewrites, I can finally present something that can at least be used with mostly good results.

First, a short info:
libfreevec is a free (LGPL) library with hand-optimized replacement routines for GLIBC, such as memcpy(), strlen(), etc. These routines have been written specifically to take advantage of the AltiVec (a.k.a Velocity Engine or VMX), and will only work on processors that include this unit. This means they will not work on older processors, such as 603, 604, 750 (G3) or the POWER family of CPUs.

Check the site for more information, and download the source (soon a Debian package will be available):

http://freevec.org/

Thanks to Genesi for sponsoring this effort and for creating the ODW, a great development platform!

Regards

Konstantinos Margaritis

PS. For the impatient amongst you, make sure you set
CFLAGS="-O3 -maltivec -funroll-loops -DHAVE_ALTIVEC_H"
to compile for max performance.

Top

Profile

Reply with quote

gunnar

Post subject: Re: libfreevec released!

PostPosted: Mon Sep 26, 2005 12:37 pm

Offline

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161

Great news!
Thanks Markos!

I'll need to spy on your memcmp code

Cheers
Gunnar

Top

Profile

Reply with quote

markos

Post subject: Re: libfreevec released!

PostPosted: Mon Sep 26, 2005 12:55 pm

Offline

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348

Quote:

Great news!
Thanks Markos!

I'll need to spy on your memcmp code

Cheers
Gunnar

Hehe, i already sent you 2 custom memcmp versions

Top

Profile

Reply with quote

bbrv

Post subject: Re: libfreevec released!

PostPosted: Mon Sep 26, 2005 1:55 pm

Offline

Genesi

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1422

GREAT!

R&B

Top

Profile

Reply with quote

RockmanX

Post subject: Re: libfreevec released!

PostPosted: Mon Sep 26, 2005 11:56 pm

Offline

Joined: Tue Dec 14, 2004 3:52 am
Posts: 42
Location: Italy - from Mexico

I wonder if there will be any libfreevec-enabled sources of some big pieces of software...
I guess it is already possible to use it, but I'm not a developer, so I'd dare not try to mess up with code...
Would something like mplayer take benefits from libfreevec?

I apologize for any possible manifestation of ignorance...

Top

Profile

Reply with quote

lu_zero

Post subject: Re: libfreevec released!

PostPosted: Thu Sep 29, 2005 2:13 am

Offline

Joined: Thu Nov 18, 2004 11:48 am
Posts: 110

the gentoo ebuild is available in portage, you have to properly unmask it in order to emerge it

echo "sys-libs/libfreevec -ppc" >>/etc/portage/package.keywords

The next week I'll have a look on how will work on mplayer and ffmpeg if nobody else will do before

Just to restate: DO NOT PRELOAD IT

Top

Profile

Reply with quote

hobold

Post subject: Re: libfreevec released!

PostPosted: Fri Sep 30, 2005 2:37 am

Offline

Joined: Mon Oct 11, 2004 12:49 am
Posts: 35

I worked out a little trick that can be used to scan along a vector, i.e. one can find the leftmost or rightmost 'true' element of a vector of bools.

It is based on the observation that the expression "X & (X-1)" clears the rightmost "1" bit. Consequently, "(X & (X-1)) ^ X" isolates that rightmost bit.

The trick might be useful for a string compare. Say you found the vector where the two strings differ, but all you have is a vector int mask that is true for the differing elements. Then you could proceed like this:

1. permute the mask into 32 bits of bool char, reversing the order of elements.
example permute vector: (0,0,0,0, 0,0,0,0, 0,0,0,0, 12,8,4,0)

2. subtract 1 (32 bit wide elements)

3. AND the results of (1) and (2)

4. XOR the results of (3) and (1)
=> now the least significant "1" bit is isolated

5. "compare equal" byte-wise the result of (4) with (1,1,1, ...)

6. permute the result of (5) to get a bool int mask in correct order.
permute vector: (15,15,15,15, 14,14,14,14, 13,13,13,13, 12,12,12,12)
=> now we have a vector bool int mask that is set for the leftmost mismatching element

7. the differing vectors of the input strings can now both be ANDed with the mask resulting from (6)

8. compare the masked vectors (with suitable "compare any"), using the scalar boolean return value

9. return two times the result of (8 ) minus 1 for a correct -1/+1 result

Top

Profile

Reply with quote

markos

Post subject: Re: libfreevec released!

PostPosted: Fri Sep 30, 2005 8:24 am

Offline

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348

Hey Hobold!
Thanks, that looks like a nice way, the nice thing is that it will totally eliminate branching, I'll try to implement it over the weekend, will post here for comments

Thanks again!

Konstantinos

Top

Profile

Reply with quote

hobold

Post subject: Re: libfreevec released!

PostPosted: Fri Sep 30, 2005 12:59 pm

Offline

Joined: Mon Oct 11, 2004 12:49 am
Posts: 35

Something like this has been haunting me ever since I encountered the X & (X-1) trick. It wasn't until today that all pieces fell into place.

Here's an exhaustively tested routine generalized to handle vector bool char. It performs the described scan on int elements, and also individually on the four chars of each 32bit lane. The AND of those two is then the overall scan of a vector bool char.

For lack of a better name, I call bool vectors with exactly one element set to true "singular masks". I believe they deserve to have a name, because there surely must be a few nice tricks that can be done once you have a singular mask; at least one will follow below.

Beware, the example routine scans from right to left, because that is the 'natural' scan order of the underlying method. Byte-reversing permutes are required to scan the other way.

Code:

// isolates the rightmost 'true' element of a vector bool char

vector bool char RightSingularChar(vector bool char Mask) {

  vector unsigned char pack32_8 = (vector unsigned char)(0,0,0,0, 0,0,0,0, 0,0,0,0, 0,4,8,12);

  vector unsigned char unpack8_32 = (vector unsigned char)(12,12,12,12, 13,13,13,13, 14,14,14,14, 15,15,15,15);

  // generate bool int mask from bool char mask

  vector bool int intMask = vec_cmpgt((vector unsigned int)Mask, vec_splat_u32(0));

  // compress; only the rightmost 32 bit element is really significant here

  intMask = vec_perm(intMask, intMask, pack32_8);

  // isolate rightmost bits of char mask and int mask

  intMask = vec_xor(intMask, vec_and(intMask, vec_sub((vector unsigned int)intMask, vec_splat_u32(1))));

  Mask = vec_xor(Mask, vec_and(Mask, (vector bool char) vec_sub((vector unsigned int)Mask, vec_splat_u32(1))));

  // splat isolated bits

  intMask = (vector bool int)vec_cmpeq((vector unsigned char)intMask, vec_splat_u8(1));

  Mask = vec_cmpeq((vector unsigned char)Mask, vec_splat_u8(1));

  // unpack intMask to full size again

  intMask = vec_perm(intMask, intMask, unpack8_32);

  // combine Mask and intMask

  return vec_and((vector bool char)intMask, Mask);

}

Once you have a singular mask, you can find the index of the 'true' element like this:

1. AND the singular mask with vec_lvsl(0, NULL);

2. vec_sum4s (add bytes across each 32 bit lane)

3. vec_sums (add ints across a vector)

the byte index is now in the rightmost element of the result

Edit: as usual, my code comes with an 'artistic license': use it as you wish, but credit me, please.

Top

Profile

Reply with quote

markos

Post subject: Re: libfreevec released!

PostPosted: Sun Oct 02, 2005 12:52 am

Offline

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348

Quote:

Here's an exhaustively tested routine generalized to handle vector bool char. It performs the described scan on int elements, and also individually on the four chars of each 32bit lane. The AND of those two is then the overall scan of a vector bool char.

Due to some strange bug in AltiVec optimization in gcc, which makes it consume HUGE amounts of RAM (it ran out of memory on my system, with 1.5GB total memory!!!), I had to rewrite your code a bit, apparently gcc 4.0 on Linux doesn't like inlined altivec intrinsics that much.

Code:

// isolates the rightmost 'true' element of a vector bool char 

vector bool char RightSingularChar(vector bool char Mask) { 

   vector uint8_t pack32_8 = { 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,4,8,12 }; 

   vector uint8_t unpack8_32 = { 12,12,12,12, 13,13,13,13, 14,14,14,14, 15,15,15,15 };

   vector uint8_t vone8 = vec_splat_u8(1);

   vector uint32_t vzero32 = vec_splat_u32(0), vone32 = vec_splat_u32(1);

   // generate bool int mask from bool char mask 

   vector bool int intMask = vec_cmpgt((vector uint32_t)Mask, vzero32); 

   // compress; only the rightmost 32 bit element is really significant here 

   intMask = vec_perm(intMask, intMask, pack32_8); 

   // isolate rightmost bits of char mask and int mask 

   vector bool int tempMask = vec_sub((vector uint32_t)intMask, vone32);

   intMask = vec_xor(intMask, vec_and(intMask, tempMask));

   tempMask = vec_sub((vector uint32_t)Mask, vone32);

   Mask = vec_xor(Mask, vec_and(Mask, (vector bool char) tempMask)); 

   // splat isolated bits 

   intMask = (vector bool int)vec_cmpeq((vector uint8_t)intMask, vone8); 

   Mask = vec_cmpeq((vector uint8_t)Mask, vone8); 

   // unpack intMask to full size again 

   intMask = vec_perm(intMask, intMask, unpack8_32); 

   // combine Mask and intMask 

   return vec_and((vector bool char)intMask, Mask);

}

The problem is that I can't get it to work like this:
Assuming I pass this mask to this routine:

Code:

vector bool char vmask = { ff,ff,ff,ff, ff,ff,ff,0, ff,ff,ff,ff, ff,ff,ff,ff };

If I call RightSingularChar on vmask, like this:

Code:

vector bool char vres = RightSingularChar(vmask);

I get this result (if I print vres):

Code:

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ff

Maybe it's my fault in rewriting it, but i tried to make as few changes as possible.

Also, in rewriting this code for left-to-right, where should I use the permutes? in the initial mask (tried that, didn't work), or inside the algorithm itself?

Konstantinos

Top

Profile

Reply with quote

hobold

Post subject: Re: libfreevec released!

PostPosted: Sun Oct 02, 2005 11:20 pm

Offline

Joined: Mon Oct 11, 2004 12:49 am
Posts: 35

The algorithm works exactly as intended in your example: it keeps the rightmost 0xff char element, but clears all other bytes to zero. So to find the 0x00, you'd have to flip all bits (or reverse the sense of the compare that generates the mask).

It may be possible to write an analogous routine that isolates rightmost 0x00 values rather than 0xff values ... based on 'X | (X+1)'. That would save one bit flip.

For scanning the other way, just byte-revert the mask before you pass it to the scan routine. Then reverse the resulting singular mask again with the same permute. If you only want to count the index, then you may not need the second byte reversal at the end, but instead use a (15, 14, 13, 12, ...) vector for the counting trick instead of the suggested (0, 1, 2, 3, ...). (The byte-reversed index vector can be had by xor'ing the result of a vec_lvsl(0, NULL) byte-wise with 0x0f.)

Top

Profile

Reply with quote

markos

Post subject: Re: libfreevec released!

PostPosted: Mon Oct 03, 2005 2:21 am

Offline

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348

Quote:

Indeed you were right!! It worked! I've integrated this into memcmp in libfreevec and it seems to really fly!!!

I'll add this to the rest of routines that work similarly (memchr, strlen, etc).

Many Thanks!!!

Konstantinos

Top

Profile

Reply with quote

bbrv

Post subject: Re: libfreevec released!

PostPosted: Mon Oct 03, 2005 2:55 am

Offline

Genesi

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1422

Konstantinos, how about some performance charts posted there reflecting the improvement once you get these integrated? Great!

Thanks Holger!

R&B

Top

Profile

Reply with quote

hobold

Post subject: Re: libfreevec released!

PostPosted: Mon Oct 03, 2005 7:41 am

Offline

Joined: Mon Oct 11, 2004 12:49 am
Posts: 35

If you want to scan for 0x00, I can work out the respective routine as well. It may save only a single instruction, but every little bit helps.

BTW, your version of the routine looks a lot nicer than mine and reads more clearly. I've been too hasty in throwing this together once I saw how it worked.

I don't have an overview what types of other 'scan' or 'collect' operations might be useful to you, so if you have specific wishes, bring them on! Maybe there are more neat little tricks that can be applied.

Such tricks, if they exist, should particularly help short strings. With any luck, the vectorized routine can then beat the scalar code across the board, not just for large data sets.

Top

Profile

Reply with quote

lu_zero

Post subject: Re: libfreevec released!

PostPosted: Tue Oct 04, 2005 6:13 am

Offline

Joined: Thu Nov 18, 2004 11:48 am
Posts: 110

Quote:

the gcc intrinsics are quite large macros, nesting them will exponentially consume memory on the preprocessing step.

My rule of thumb is to just have at most 2 level of nesting to avoid that side effect.

The compilation step should remove most of the temp variables (hopefully);

lu

Top

Profile

Reply with quote

Post new topic Reply to topic

Page 1 of 2

[ 25 posts ]

Board index » »

All times are UTC-06:00

Who is online

Users browsing this forum: No registered users and 3 guests

You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum