Power Developer https://powerdeveloper.org/forums/ |
|
memchr() vectorized too https://powerdeveloper.org/forums/viewtopic.php?f=23&t=177 |
Page 1 of 1 |
Author: | markos [ Sat Mar 12, 2005 7:21 am ] |
Post subject: | memchr() vectorized too |
I just finished memchr(), the code will be commited shortly to the cvs. The nice thing is that strlen is pretty much the same to memchr() so we have 2 vectorized functions actually Code:
$ ./altivectorize -v -s -g --tests memchr --norandom --loops 1000000
I think, that from now on i'll just post the benchmarks, with very few comments
Altivec is supported Verbose mode on Will do both scalar and vector tests Will also do glibc tests loops: 1000000 output file: will do tests: memchr test: memchr #size arrays glibc altivec (Effective bandwidth) 7 599186 0.050 (133.5 MB/s) 0.090 (74.2 MB/s) (0.6x) 13 325000 0.120 (103.3 MB/s) 0.100 (124.0 MB/s) (1.2x) 16 262144 0.080 (190.7 MB/s) 0.100 (152.6 MB/s) (0.8x) 20 209715 0.130 (146.7 MB/s) 0.160 (119.2 MB/s) (0.8x) 27 155344 0.140 (183.9 MB/s) 0.180 (143.1 MB/s) (0.8x) 35 119837 0.160 (208.6 MB/s) 0.180 (185.4 MB/s) (0.9x) 43 97542 0.120 (341.7 MB/s) 0.210 (195.3 MB/s) (0.6x) 54 77672 0.170 (302.9 MB/s) 0.190 (271.0 MB/s) (0.9x) 64 65536 0.200 (305.2 MB/s) 0.210 (290.6 MB/s) (1.0x) 90 46603 0.250 (343.3 MB/s) 0.200 (429.2 MB/s) (1.2x) 128 32768 0.340 (359.0 MB/s) 0.220 (554.9 MB/s) (1.5x) 185 22672 0.470 (375.4 MB/s) 0.280 (630.1 MB/s) (1.7x) 256 16384 0.640 (381.5 MB/s) 0.290 (841.9 MB/s) (2.2x) 347 12087 0.850 (389.3 MB/s) 0.340 (973.3 MB/s) (2.5x) 512 8192 1.170 (417.3 MB/s) 0.350 (1395.1 MB/s) (3.3x) 831 5047 1.840 (430.7 MB/s) 0.500 (1585.0 MB/s) (3.7x) 2048 2048 4.770 (409.5 MB/s) 1.060 (1842.6 MB/s) (4.5x) 3981 1053 8.920 (425.6 MB/s) 1.890 (2008.8 MB/s) (4.7x) 8192 512 18.420 (424.1 MB/s) 3.530 (2213.2 MB/s) (5.2x) 13488 311 30.380 (423.4 MB/s) 5.960 (2158.2 MB/s) (5.1x) 16384 256 38.550 (405.3 MB/s) 7.230 (2161.1 MB/s) (5.3x) 38893 108 92.030 (403.0 MB/s) 17.510 (2118.3 MB/s) (5.3x) 65536 64 153.320 (407.6 MB/s) 33.110 (1887.6 MB/s) (4.6x) 105001 40 252.450 (396.7 MB/s) 48.970 (2044.9 MB/s) (5.2x) 262144 16 738.860 (338.4 MB/s) 205.150 (1218.6 MB/s) (3.6x) 600000 7 3137.570 (182.4 MB/s) 2000.820 (286.0 MB/s) (1.6x) 1134355 4 7318.030 (147.8 MB/s) 6049.960 (178.8 MB/s) (1.2x) 2097152 2 14181.550 (141.0 MB/s) 12319.100 (162.3 MB/s) (1.2x) |
Author: | markos [ Sun Mar 13, 2005 7:32 pm ] |
Post subject: | Re: memchr() vectorized too |
Same results without the --norandom flag (minimize cache hits). Code:
$ ./altivectorize -v -s -g --tests memchr --loops 1000000
Altivec is supported Verbose mode on Will do both scalar and vector tests Will also do glibc tests loops: 1000000 output file: will do tests: memchr test: memchr #size arrays glibc altivec (Effective bandwidth) 7 599186 0.160 (41.7 MB/s) 0.190 (35.1 MB/s) (0.8x) 13 325000 0.230 (53.9 MB/s) 0.220 (56.4 MB/s) (1.0x) 16 262144 0.610 (25.0 MB/s) 0.560 (27.2 MB/s) (1.1x) 20 209715 0.240 (79.5 MB/s) 0.280 (68.1 MB/s) (0.9x) 27 155344 0.210 (122.6 MB/s) 0.280 (92.0 MB/s) (0.7x) 35 119837 0.270 (123.6 MB/s) 0.260 (128.4 MB/s) (1.0x) 43 97542 0.290 (141.4 MB/s) 0.290 (141.4 MB/s) (1.0x) 54 77672 0.320 (160.9 MB/s) 0.300 (171.7 MB/s) (1.1x) 64 65536 0.970 (62.9 MB/s) 1.080 (56.5 MB/s) (0.9x) 90 46603 0.370 (232.0 MB/s) 0.310 (276.9 MB/s) (1.2x) 128 32768 1.430 (85.4 MB/s) 1.370 (89.1 MB/s) (1.0x) 185 22672 0.600 (294.0 MB/s) 0.360 (490.1 MB/s) (1.7x) 256 16384 2.010 (121.5 MB/s) 1.950 (125.2 MB/s) (1.0x) 347 12087 1.110 (298.1 MB/s) 0.550 (601.7 MB/s) (2.0x) 512 8192 3.950 (123.6 MB/s) 3.100 (157.5 MB/s) (1.3x) 831 5047 2.460 (322.2 MB/s) 0.860 (921.5 MB/s) (2.9x) 2048 2048 13.710 (142.5 MB/s) 11.320 (172.5 MB/s) (1.2x) 3981 1053 9.440 (402.2 MB/s) 1.960 (1937.0 MB/s) (4.8x) 8192 512 52.950 (147.5 MB/s) 43.840 (178.2 MB/s) (1.2x) 13488 311 48.320 (266.2 MB/s) 17.400 (739.3 MB/s) (2.8x) 16384 256 106.520 (146.7 MB/s) 89.950 (173.7 MB/s) (1.2x) 38893 108 218.540 (169.7 MB/s) 159.490 (232.6 MB/s) (1.4x) 65536 64 431.160 (145.0 MB/s) 349.550 (178.8 MB/s) (1.2x) 105001 40 620.610 (161.4 MB/s) 470.980 (212.6 MB/s) (1.3x) 262144 16 1691.410 (147.8 MB/s) 1387.750 (180.1 MB/s) (1.2x) 600000 7 3827.830 (149.5 MB/s) 3122.020 (183.3 MB/s) (1.2x) 1134355 4 7697.620 (140.5 MB/s) 6415.220 (168.6 MB/s) (1.2x) 2097152 2 14585.110 (137.1 MB/s) 11960.200 (167.2 MB/s) (1.2x) |
Author: | markos [ Mon Mar 14, 2005 9:37 pm ] |
Post subject: | memrchr(), and memmove() vectorized. |
I just finished memrchr() and memmove(). The code is in the CVS. Actually there's also memcmp(), seems to work, but it's terribly slow, I must work on it a little more to see what I'm doing wrong. memcmp() does exist on libmotovec and i'm trying to read the code now to see how it's done there, because its performance is what's expected from altivec . If someone could look at memcmp() and explain the process there, I'd be more than grateful. I tried to read it but as I think I said too many times, I'm not an assembly guy Anyway, here are some benchmarks for memrchr() and memmove(): Code:
$ ./altivectorize -v -s -g --tests memrchr --norandom --loops 1000000
Altivec is supported Verbose mode on Will do both scalar and vector tests Will also do glibc tests loops: 1000000 output file: will do tests: memrchr test: memrchr #size arrays glibc altivec (Effective bandwidth) 7 599186 0.050 (133.5 MB/s) 0.100 (66.8 MB/s) (0.5x) 13 325000 0.160 (77.5 MB/s) 0.100 (124.0 MB/s) (1.6x) 16 262144 0.090 (169.5 MB/s) 0.140 (109.0 MB/s) (0.6x) 20 209715 0.100 (190.7 MB/s) 0.240 (79.5 MB/s) (0.4x) 27 155344 0.110 (234.1 MB/s) 0.180 (143.1 MB/s) (0.6x) 35 119837 0.160 (208.6 MB/s) 0.190 (175.7 MB/s) (0.8x) 43 97542 0.180 (227.8 MB/s) 0.210 (195.3 MB/s) (0.9x) 54 77672 0.190 (271.0 MB/s) 0.190 (271.0 MB/s) (1.0x) 64 65536 0.200 (305.2 MB/s) 0.250 (244.1 MB/s) (0.8x) 90 46603 0.290 (296.0 MB/s) 0.230 (373.2 MB/s) (1.3x) 128 32768 0.360 (339.1 MB/s) 0.290 (420.9 MB/s) (1.2x) 185 22672 0.500 (352.9 MB/s) 0.270 (653.4 MB/s) (1.9x) 256 16384 0.660 (369.9 MB/s) 0.320 (762.9 MB/s) (2.1x) 347 12087 0.870 (380.4 MB/s) 0.310 (1067.5 MB/s) (2.8x) 512 8192 1.180 (413.8 MB/s) 0.420 (1162.6 MB/s) (2.8x) 831 5047 2.100 (377.4 MB/s) 0.600 (1320.8 MB/s) (3.5x) 2048 2048 5.090 (383.7 MB/s) 1.090 (1791.9 MB/s) (4.7x) 3981 1053 9.850 (385.4 MB/s) 1.960 (1937.0 MB/s) (5.0x) 8192 512 19.960 (391.4 MB/s) 4.330 (1804.3 MB/s) (4.6x) 13488 311 32.860 (391.5 MB/s) 6.670 (1928.5 MB/s) (4.9x) 16384 256 40.020 (390.4 MB/s) 7.880 (1982.9 MB/s) (5.1x) 38893 108 95.040 (390.3 MB/s) 20.350 (1822.7 MB/s) (4.7x) (more...) $ ./altivectorize -v -s -g --tests memmove --norandom --loops 1000000 Altivec is supported Verbose mode on Will do both scalar and vector tests Will also do glibc tests loops: 1000000 output file: will do tests: memmove #size arrays glibc altivec (Effective bandwidth) 7 599186 0.130 (51.4 MB/s) 0.120 (55.6 MB/s) (1.1x) 13 325000 0.150 (82.7 MB/s) 0.150 (82.7 MB/s) (1.0x) 16 262144 0.150 (101.7 MB/s) 0.160 (95.4 MB/s) (0.9x) 20 209715 0.160 (119.2 MB/s) 0.230 (82.9 MB/s) (0.7x) 27 155344 0.160 (160.9 MB/s) 0.170 (151.5 MB/s) (0.9x) 35 119837 0.180 (185.4 MB/s) 0.160 (208.6 MB/s) (1.1x) 43 97542 0.200 (205.0 MB/s) 0.180 (227.8 MB/s) (1.1x) 54 77672 0.210 (245.2 MB/s) 0.150 (343.3 MB/s) (1.4x) 64 65536 0.180 (339.1 MB/s) 0.220 (277.4 MB/s) (0.8x) 90 46603 0.210 (408.7 MB/s) 0.200 (429.2 MB/s) (1.1x) 128 32768 0.240 (508.6 MB/s) 0.230 (530.7 MB/s) (1.0x) 185 22672 0.270 (653.4 MB/s) 0.180 (980.2 MB/s) (1.5x) 256 16384 0.350 (697.5 MB/s) 0.280 (871.9 MB/s) (1.2x) 347 12087 0.420 (787.9 MB/s) 0.220 (1504.2 MB/s) (1.9x) 512 8192 0.560 (871.9 MB/s) 0.350 (1395.1 MB/s) (1.6x) 831 5047 0.930 (852.2 MB/s) 0.400 (1981.3 MB/s) (2.3x) 2048 2048 2.130 (917.0 MB/s) 0.960 (2034.5 MB/s) (2.2x) 3981 1053 3.770 (1007.0 MB/s) 1.050 (3615.8 MB/s) (3.6x) 8192 512 8.960 (871.9 MB/s) 3.340 (2339.1 MB/s) (2.7x) 13488 311 15.190 (846.8 MB/s) 5.410 (2377.7 MB/s) (2.8x) 16384 256 17.670 (884.3 MB/s) 6.760 (2311.4 MB/s) (2.6x) 38893 108 38.800 (956.0 MB/s) 17.900 (2072.1 MB/s) (2.2x) (more...) |
Author: | hobold [ Tue Mar 15, 2005 3:02 am ] |
Post subject: | Re: memchr() vectorized too |
I think memcmp() is tricky, because there are three varying conditions even before you compary any two chars to each other: alignments of either source and string length. I see some danger of doing way too many conditional branches. My first attempt would be to treat both strings as unaligned and handle them the usual way (vec_lvsl(ptr), vec_ld(0, ptr), vec_ld(15, ptr), vec_perm). Compare for equality (vec_all_eq) and loop until string length is reached. Special care has to be taken for the last partial vector; it might be easiest to do that in scalar code (but that would mean no speedup whatsoever for string sizes < 16). To vectorize this, I'd use a boolean mask to AND with, setting all extraneous chars to zero. The mask can probably be created by comparing the result of vec_lvsl or vec_lvsr of the length (cast to void*) to an appropriate threshold. If at any time the vec_all_eq is false, you need to locate the first offending char position. This is one of the few things that AltiVec has no direct support for. I suspect it might be fastest to do that in scalar code, four bytes at a time with unsigned int compares. All G4 and G5 PPC models have reasonably efficient support for unaligned scalar loads, so performance should be okay. In the worst case, four scalar comparisons have to be done to find the difference in the last quarter of a vector. |
Author: | markos [ Tue Mar 15, 2005 12:22 pm ] |
Post subject: | Re: memchr() vectorized too |
This is the memcmp() i wrote: Code:
int memcmp_vec_aligned(const void *s1, const void *s2, size_t len) {
and these are the results i got:unsigned char* s1first = s1; unsigned char* s2first = s2; vector unsigned char s1vector, s2vector; //printf("s1first = %x\ts2first = %x\tlen = %ld\n", s1first, s2first, len); while (len >= 16) { s1vector = vec_ld(0, s1first); s2vector = vec_ld(0, s2first); if (vec_any_ne(s1vector, s2vector)) { //printf("Vectors not equal, calling memcmp\n"); return memcmp(s1first, s2first, 16); } s1first += 16; s2first += 16; len -= 16; //printf("s1first = %x\ts2first = %x\tlen = %ld\n", s1first, s2first, len); } if (len > 0) { //printf("Stuff left, calling memcmp()\n"); return memcmp(s1first, s2first, len); } return 0; } int memcmp_vec(const void *s1, const void *s2, size_t len) { if (len <= MEMCMP_THRESHOLD) { return memcmp(s1, s2, len); } // for inline char -> vector char conversion unsigned char* s1first = s1; unsigned char* s2first = s2; unsigned char *temp1; int temp2, result; int s1offset = (int)s1first & 15; int s2offset = (int)s2first & 15; vector unsigned char s1vector, s2vector; vector unsigned char MSQ, LSQ; vector unsigned char mask; #ifdef DEBUG printf("srcfirst = %x\tsrclast = %x\n", s1first, s1first+len-1); printf("dstfirst = %x\tdstlast = %x\n", s2first, s2first+len-1); printf("s1offset = %ld\ts2offset = %ld\n", s1offset, s2offset); #endif if (s1offset == 0 && s2offset == 0) { // Both buffers are 16-byte aligned, just compare the aligned values return memcmp_vec_aligned(s1first, s2first, len); } else { // Use standard memcmp to copy the few bytes at the beginning result = memcmp(s1first, s2first, s1offset); if (result) return result; // Advance both pointers appropriately, now s1first is 16-byte aligned s1first += s1offset; s2first += s1offset; len -= s1offset; // recalculate the s2offset s2offset = 16 - ((int)s2first & 15); if (s2offset == 0) { return memcmp_vec_aligned(s1first, s2first, len); } while (len >= 16) { MSQ = vec_ld(0, s2first); // most significant quadword LSQ = vec_ld(15, s2first); // least significant quadword mask = vec_lvsl(0, s2first); // create the permute mask if (vec_any_ne(vec_ld(0, s1first), vec_perm(MSQ, LSQ, mask))) return memcmp(s1first, s2first, 16); s1first += 16; s2first += 16; len -= 16; } if (len > 0) { return memcmp(s1first, s2first, len); } } return 0; } Code:
$ ./altivectorize -v -s -g --tests memcmp --norandom --loops 1000000
Now, I really can't understand why I get such abysmal performance.
Altivec is supported Verbose mode on Will do both scalar and vector tests Will also do glibc tests loops: 1000000 output file: will do tests: memcmp #size arrays glibc altivec (Effective bandwidth) 7 599186 0.020 (333.8 MB/s) 0.100 (66.8 MB/s) (0.2x) 13 325000 0.030 (413.3 MB/s) 0.130 (95.4 MB/s) (0.2x) 16 262144 0.020 (762.9 MB/s) 0.130 (117.4 MB/s) (0.2x) 20 209715 0.030 (635.8 MB/s) 0.170 (112.2 MB/s) (0.2x) 27 155344 0.040 (643.7 MB/s) 0.190 (135.5 MB/s) (0.2x) 35 119837 0.040 (834.5 MB/s) 0.140 (238.4 MB/s) (0.3x) 43 97542 0.050 (820.2 MB/s) 0.190 (215.8 MB/s) (0.3x) 54 77672 0.040 (1287.5 MB/s) 0.160 (321.9 MB/s) (0.2x) 64 65536 0.040 (1525.9 MB/s) 0.140 (436.0 MB/s) (0.3x) 90 46603 0.060 (1430.5 MB/s) 0.200 (429.2 MB/s) (0.3x) 128 32768 0.070 (1743.9 MB/s) 0.160 (762.9 MB/s) (0.4x) 185 22672 0.090 (1960.3 MB/s) 0.210 (840.1 MB/s) (0.4x) 256 16384 0.120 (2034.5 MB/s) 0.210 (1162.6 MB/s) (0.6x) 347 12087 0.170 (1946.6 MB/s) 0.300 (1103.1 MB/s) (0.6x) 512 8192 0.250 (1953.1 MB/s) 0.320 (1525.9 MB/s) (0.8x) 831 5047 0.390 (2032.1 MB/s) 0.520 (1524.0 MB/s) (0.8x) 2048 2048 0.920 (2123.0 MB/s) 1.150 (1698.4 MB/s) (0.8x) 3981 1053 1.990 (1907.8 MB/s) 2.570 (1477.3 MB/s) (0.8x) 8192 512 3.630 (2152.2 MB/s) 5.060 (1544.0 MB/s) (0.7x) 13488 311 6.250 (2058.1 MB/s) 10.020 (1283.7 MB/s) (0.6x) 16384 256 7.940 (1967.9 MB/s) 10.630 (1469.9 MB/s) (0.7x) |
Author: | afleming [ Thu Mar 17, 2005 9:06 am ] |
Post subject: | Re: memchr() vectorized too |
I thought about the scalar memcmp a bit, and I figure that it probably unrolls the loop, and could theoretically do 16 bytes every 8 or 9 cycles. With the improvements I suggested on IRC, the unaligned case is going to look something like this in assembly: lvx s1 lvx s2+16 permute compare branch decrement len (and update CR) increment index branch On a G4, this will take at least 10 cycles per iteration. More, actually, since the G4 will have to take a cycle to redirect flow on the looping branch. This means that the scalar code will outperform the simple vector loop. You'll have to unroll the loop to beat it. Something like this: a1 = vec_ld(0,s1) LSQ = vec_ld(16,s2) a2 = vec_ld(16,s1) b2 = vec_ld(32,s2) if (vec_any_ne(a1, vec_perm(MSQ, LSQ, mask))) return memcmp(s1, s2, 16) if (vec_any_ne(a2, vec_perm(LSQ, b2, mask))) return memcmp(s1+16, s1+16, 16) s1+=32 s2+=32 len -= 16 (loop) Of course, this needs to be scheduled properly by the compiler, but we'll just hope it does that automatically. |
Page 1 of 1 | All times are UTC-06:00 |
Powered by phpBB® Forum Software © phpBB Group http://www.phpbb.com/ |