Power Developer https://powerdeveloper.org/forums/ |
|
Altivec strfill() benchmarks https://powerdeveloper.org/forums/viewtopic.php?f=23&t=58 |
Page 1 of 1 |
Author: | markos [ Sun Oct 31, 2004 11:38 am ] |
Post subject: | Altivec strfill() benchmarks |
(reposted here, on Sven's advice) Hi guys, I've just completed my first altivec program, a very simple one indeed and I have to say I'm really impressed! I did an optimized version of a common routine strfill(), quite common (and used in this form in MySQL), which fills a given string with a given char. Here are the benchmarks for both scalar and altivec versions, with different string sizes: Code:
#size scalar altivec
(Note: I used both 'random' sizes and powers of 2, but this didn't seem to have any impact.)13 5 2 16 6 1 27 12 3 64 23 2 90 29 3 128 43 3 185 59 5 256 81 6 347 111 8 512 164 11 831 277 18 2048 686 40 3981 1316 86 and here is the code that produced this output: Code:
#include <altivec.h>
I have to say, this code has bugs and is not very optimized, I'm a newbie when it comes to Altivec. But it did convince me that even a newbie like me can write fast Altivec code! I'm posting here for comments and maybe ideas to make it faster, safer etc, so if I've done something exceedingly stupid in this code, please tell me #include <stdio.h> #include <time.h> // This one was shamelessy stolen and adapted from Apple's Altivec tutorial vector unsigned char inline vec_ldsplatchar(unsigned char splatchar) { vector unsigned char splatmap = vec_lvsl(0, &splatchar); vector unsigned char result = vec_lde(0, &splatchar); splatmap = vec_splat(splatmap, 0); return vec_perm(result, result, splatmap); } unsigned char *vec_strfill(unsigned char *s, int len, char p) { int i; vector unsigned char sm = vec_ldsplatchar(p); vector unsigned char *v1 = (vector unsigned char *)s; vector unsigned char vec_a; for (i=0; i < len-1; i = i+16) { vec_a = vec_ld(0, v1); vec_a = vec_splat(sm, 0); vec_st(vec_a, i, s); } return s; } unsigned char *strfill(unsigned char *s, int len,char fill) { while (len--) *s++ = fill; *(s) = '\0'; return(s); } /* strfill */ int main( void ) { int i, j, k, max, loops = 100000000; time_t dt1, dt2, t1, t0; int sizes[] = {13,16,27,64,90,128,185,256,347,512,831,2048,3981,8192,10000}; printf("#size\tscalar\taltivec\n"); for (k = 0; k < 16; k++) { max = sizes[k]; unsigned char __attribute__ ((aligned(16))) test[max]; unsigned char __attribute__ ((aligned(16))) splatchar = rand(); t0 = time(NULL); for (j=0; j < loops; j++) { test[max] = '\0'; strfill(test, max, splatchar); } t1 = time(NULL); dt1 = t1 - t0; t0 = time(NULL); for (j=0; j < loops; j++) { vec_strfill(test, max, splatchar); } t1 = time(NULL); dt2 = t1 - t0; printf("%ld\t%ld\t%ld\n", max, dt1, dt2); } return 0; } Also, I used a simple time() function, as I couldn't get pmon to work for me on 2.6.8. Thanks to Luca for his excellent comments and pointers! Thanks to Genesi for designing such a nice system as Pegasos 2, the more I use it, the more I like it Regards Konstantinos |
Author: | hobold [ Mon Nov 01, 2004 9:45 am ] |
Post subject: | Re: Altivec strfill() benchmarks |
Here's a variant that handles alignment. It is untested, but it compiles. Convincing yourself that it does indeed work (or that it is buggy) is left as an enlightening exercise for the reader. Code:
/* All AltiVec loads and stores silently round down their addresses to a
This is perhaps not the ultimately fastest way to do things, but it should come quite close and is hopefully readable enough to serve as an illustrative example.
multiple of 16. Handling of arbitrary string buffers must take that into account. general case: partial vector at start and end of string |.....--+-------+-------+-----..| so we must handle a partial vector first, then a number of whole vectors, and another partial vector at the end special case: first and last char of string within same vector |.----..| partial vectors are handled by loading the original value, merging in the modified bytes with vec_sel, and storing back the modified vector */ unsigned char *vec_strfill(unsigned char *s, int len, char p) { unsigned char* ptrlast = s + (len - 1); unsigned char* ptrfirst = s; // for inline char -> vector char conversion union { vector unsigned char v; unsigned char c[16]; } envelope; envelope.c[0] = p; // this code fails for len <= 0 when first and last positions are in // different vectors, but that is an undefined case anyway assert (len > 0); // these maske are 0xff where we want to modify chars in memory, 0 otherwise // (vec_lvsr creates a vector value based on the address alignment; these are // usually used with vec_perm, but I compare them to a threshold value to // create a mask vector of 0xff/0x00 chars) vector unsigned char firstmask; firstmask = (vector unsigned char) vec_cmpgt(vec_lvsr(0, ptrfirst), vec_splat_u8(15)); // (the type cast works around a bug in Apple GCC) // vec_splat can only take values from -16 to +15, so I have to build the // needed value 17 manually (vec_splat_u8(15) is a common subexpression) vector unsigned char lastmask; lastmask = (vector unsigned char) vec_cmplt(vec_lvsr(0, ptrlast), vec_add(vec_splat_u8(15), vec_splat_u8(2))); // (the type cast works around a bug in Apple GCC) // build a vector full of fill chars vector unsigned char fillchar = vec_splat(envelope.v, 0); // first and last chars in same vector? (xor used as bit-wise compare) if ((((long)ptrlast ^ (long)ptrfirst) & ~15L) == 0) { firstmask = vec_and(firstmask, lastmask); // simply combine masks // read, modify (based on mask), write vec_st(vec_sel(vec_ld(0, ptrfirst), fillchar, firstmask), 0, ptrfirst); return s; } // handle first vector: read, modify (based on mask), write vec_st(vec_sel(vec_ld(0, ptrfirst), fillchar, firstmask), 0, ptrfirst); // it can happen that firstmask is all 0xff, so it might pay off to test // whether s is aligned and go directly to the while() loop // reduce remaining length by the amount of chars covered by first vector len -= 16 - ((int)ptrfirst & 15); // now store a number of aligned vectors while (len >= 16) { vec_st(fillchar, 0, ptrfirst); ptrfirst += 16; len -= 16; } // finally there might be a last partial vector // (if lastmask is all 0xff, then len is already zero here) if (len > 0) { // read, modify (based on mask), write vec_st(vec_sel(vec_ld(0, ptrlast), fillchar, lastmask), 0, ptrlast); } return s; } |
Author: | markos [ Tue Nov 02, 2004 3:23 am ] |
Post subject: | Re: Altivec strfill() benchmarks |
Quote: Here's a variant that handles alignment. It is untested, but it compiles. Convincing yourself that it does indeed work (or that it is buggy) is left as an enlightening exercise for the reader.
Ok, I took the benchmark a little further, added this routine as altivec-ng, and added also memset() from libmoto. I #defined strfill() as follows:(snip) This is perhaps not the ultimately fastest way to do things, but it should come quite close and is hopefully readable enough to serve as an illustrative example. Code:
#define vec_strfill_libmoto(s, len, p) ((unsigned char *)memset(s, p, len))
and I also added these checks:Code:
if ( strcmp(test1, test2) == 0) {
at the end of each size loop.printf("test1 = test2)\n"); } if ( strcmp(test1, test3) == 0) { printf("test1 = test3)\n"); } if ( strcmp(test1, test4) == 0) { printf("test1 = test4)\n"); } if ( strcmp(test2, test3) == 0) { printf("test2 = test3)\n"); } if ( strcmp(test2, test4) == 0) { printf("test2 = test4)\n"); } if ( strcmp(test3, test4) == 0) { printf("test3 = test4)\n"); } Here are the results: Code:
#size scalar altivec altivec-ng libmoto
I included my version for reference, though I know it's not correct. I intend to work on Holger's version to correct it and also to understand it, but as for production code, I think that libmoto speaks for itself 13 6 1 5 5 test1 = test3) test1 = test4) test3 = test4) 16 7 1 5 6 test1 = test2) test1 = test3) test1 = test4) test2 = test3) test2 = test4) test3 = test4) 27 11 2 4 4 test1 = test3) test1 = test4) test3 = test4) 64 27 2 5 4 test1 = test2) test1 = test4) test2 = test4) 90 37 3 6 4 test1 = test4) 128 58 4 7 6 test1 = test2) test1 = test4) test2 = test4) 185 75 6 7 8 test1 = test4) 256 106 7 9 7 test1 = test2) test1 = test4) test2 = test4) 347 141 10 10 10 test1 = test4) 512 208 14 13 11 test1 = test2) test1 = test4) test2 = test4) 831 338 21 20 15 test1 = test4) 2048 832 53 43 30 test1 = test2) test1 = test4) test2 = test4) 3981 1617 102 79 55 test1 = test4) 8192 3329 208 161 108 test1 = test2) test1 = test4) test2 = test4) Konstantinos |
Author: | hobold [ Tue Nov 02, 2004 4:53 am ] |
Post subject: | Re: Altivec strfill() benchmarks |
Quote: I included my version for reference, though I know it's not correct. I intend to work on Holger's version to correct it and also to understand it, but as for production code, I think that libmoto speaks for itself It does, as it should!If you are serious about an "ultimate" routine, you can start by replacing the loop of aligned stores with this: Code:
// now store a number of aligned vectors
This will penalize short strings marginally, but help very notably with longer strings. You could also do cache hinting if you do not expect the target strings to reside in cache. This can often gain a factor of two in that case, but it will slow down the cache resident case. while (len >= 64) {   vec_st(fillchar, 0, ptrfirst);   len -= 64;   vec_st(fillchar, 16, ptrfirst);   vec_st(fillchar, 32, ptrfirst);   vec_st(fillchar, 48, ptrfirst);   ptrfirst += 64;  }  if (len >= 32) {   vec_st(fillchar, 0, ptrfirst);   len -= 32;   vec_st(fillchar, 16, ptrfirst);   ptrfirst += 32;  }  if (len >= 16) {   vec_st(fillchar, 0, ptrfirst);   ptrfirst += 16;   len -= 16;  } (The ultimate tuning would come from profiling each call of strfill(), and replace the callee site with calls to specialized routines for either short strings, or long cache resident strings, or long uncached strings, respectively. But that would border on ridiculous over-engineering. Can be fun, though. ) There are more opportunities to shave off a cycle here and there, but the biggest gains come from vectorizing at all. In any real application, programmer time is usually best spent by doing five solidly tuned routines rather than a single perfectly tuned one. However, the first lesson of AltiVec tuning was found by yourself: check if tuned library routines exist and use those. Minimal programming effort for a potentially optimal performance gain. |
Author: | hobold [ Tue Nov 02, 2004 10:44 am ] |
Post subject: | Re: Altivec strfill() benchmarks |
In a vain attempt to regain my pride with more untested code I made another variant that is more seriously aimed at ultimate speed. Even if I had commented it thoroughly, it wouldn't be nearly as readable as my first variant. Code:
unsigned char *vec_strfill(unsigned char *s, int len, char p) {
I am not totally happy with the approach shown in here ... there ought to be a way to do things with fewer conditional branches. I am also leaving a bit of performance on the table when the string buffer starts on an aligned address. Still, initialization overhead should be a few cycles less for every call, and the unrolled loop should help with long strings.
assert(len > 0); // sanity check // build a vector full of copies of p vector unsigned char fillchar = vec_splat(vec_lvsl(0,(unsigned char*)&p), 0); fillchar = vec_perm(vec_ld(0, (unsigned char*)&p), fillchar, fillchar); vector unsigned char firstmask, lastmask; { // local scope limits lifetime of temporary variables vector unsigned char ones, zeros; // to build constant values ones = (vector unsigned char) vec_cmpeq(fillchar, fillchar); // 0xff zeros = (vector unsigned char) vec_sub(fillchar, fillchar); // 0x00 // the instructions chosen here go to the VSIU and can potentially be // executed in parallel with a pair going to VPU and LSU firstmask = vec_perm(zeros, ones, vec_lvsr(0, s)); // lastmask will be zero when the string's end is aligned lastmask = vec_perm(ones, zeros, vec_lvsr(len, s)); } // check if firstmask and lastmask have to be combined if (len <= (((int)s & 15) ^ 15)) { vec_st(vec_sel(vec_ld(0,s), fillchar, vec_and(firstmask, lastmask)), 0,s); return s; } // check if a last partial vector needs to be handled if (vec_all_eq(vec_splat_u8(0), lastmask)) { // no, string ends with aligned vector, so just do first partial vector vec_st(vec_sel(vec_ld(0, s), fillchar, firstmask), 0, s); } else { // do both partial vectors in parallel for better pipeline utilization vector unsigned char firstdata = vec_ld(0, s); vector unsigned char lastdata = vec_ld(len, s); firstdata = vec_sel(firstdata, fillchar, firstmask); lastdata = vec_sel(lastdata, fillchar, lastmask); vec_st(firstdata, 0, s); vec_st(lastdata, len, s); } // handle remaining aligned vectors len -= 1; // number of chars handled in first vector len -= ((int)s & 15 ^ 15); len &= ~15; // number of chars handled in last vector // when there are many short strings, it might pay to do an early exit here if (len == 0) return s; unsigned char* cursor = s + 16; // skip first vector (already processed) // now store a number of aligned vectors while (len >= 64) { vec_st(fillchar, 0, cursor); len -= 64; vec_st(fillchar, 16, cursor); vec_st(fillchar, 32, cursor); vec_st(fillchar, 48, cursor); cursor += 64; } if (len >= 32) { vec_st(fillchar, 0, cursor); len -= 32; vec_st(fillchar, 16, cursor); cursor += 32; } if (len > 0) { vec_st(fillchar, 0, cursor); cursor += 16; len -= 16; } assert(len == 0); // sanity check return s; } |
Author: | lu_zero [ Thu Nov 18, 2004 1:52 pm ] |
Post subject: | New Subject in Same Thread (Change Me!) |
Looks like I'm late and there aren't any space for improvement that I could think about ^^; Just a notice : Altivec intrinsics are implemented as quite long macros, if you nest them make sure you don't pass 2 levels of nesting or you could easly end up with different gb of preprocessed source. gcc _should_ remove temporary variable for you so using them could solve many headache to people with less ram with a little/no overhead. |
Page 1 of 1 | All times are UTC-06:00 |
Powered by phpBB® Forum Software © phpBB Group http://www.phpbb.com/ |