 Post subject: memchr() vectorized too
PostPosted: Sat Mar 12, 2005 7:21 am 

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348
I just finished memchr(); the code will be committed to CVS shortly. The nice thing is that strlen() is pretty much the same as memchr(), so we actually get two vectorized functions :-)
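Since strlen() falls out of the same code, here is a rough sketch (an illustration only, not the committed CVS code) of the kind of aligned inner loop the two functions share: compare 16 bytes at a time with vec_any_eq and drop to a byte-wise search only once a hit is flagged. strlen() would run the same loop looking for '\0' with no length bound.
Code:
#include <altivec.h>
#include <stddef.h>

/* Rough sketch only (not the committed code): the aligned inner loop of a
 * vectorized memchr(). strlen() is essentially the same loop searching for
 * '\0' with no length check, which is why both come almost for free. */
static const unsigned char *memchr_vec_core(const unsigned char *p,
                                            unsigned char c, size_t len)
{
    /* splat the search byte into all 16 lanes */
    union { unsigned char b[16]; vector unsigned char v; } u;
    size_t i;
    for (i = 0; i < 16; i++)
        u.b[i] = c;

    while (len >= 16) {                 /* p assumed 16-byte aligned here */
        vector unsigned char chunk = vec_ld(0, p);
        if (vec_any_eq(chunk, u.v)) {
            for (i = 0; i < 16; i++)    /* a hit is in here somewhere */
                if (p[i] == c)
                    return p + i;
        }
        p += 16;
        len -= 16;
    }
    while (len--) {                     /* scalar tail */
        if (*p == c)
            return p;
        p++;
    }
    return NULL;
}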
Code:
$ ./altivectorize -v -s -g --tests memchr --norandom --loops 1000000
Altivec is supported
Verbose mode on
Will do both scalar and vector tests
Will also do glibc tests
loops: 1000000
output file:
will do tests: memchr
test: memchr
#size arrays glibc altivec (Effective bandwidth)
7 599186 0.050 (133.5 MB/s) 0.090 (74.2 MB/s) (0.6x)
13 325000 0.120 (103.3 MB/s) 0.100 (124.0 MB/s) (1.2x)
16 262144 0.080 (190.7 MB/s) 0.100 (152.6 MB/s) (0.8x)
20 209715 0.130 (146.7 MB/s) 0.160 (119.2 MB/s) (0.8x)
27 155344 0.140 (183.9 MB/s) 0.180 (143.1 MB/s) (0.8x)
35 119837 0.160 (208.6 MB/s) 0.180 (185.4 MB/s) (0.9x)
43 97542 0.120 (341.7 MB/s) 0.210 (195.3 MB/s) (0.6x)
54 77672 0.170 (302.9 MB/s) 0.190 (271.0 MB/s) (0.9x)
64 65536 0.200 (305.2 MB/s) 0.210 (290.6 MB/s) (1.0x)
90 46603 0.250 (343.3 MB/s) 0.200 (429.2 MB/s) (1.2x)
128 32768 0.340 (359.0 MB/s) 0.220 (554.9 MB/s) (1.5x)
185 22672 0.470 (375.4 MB/s) 0.280 (630.1 MB/s) (1.7x)
256 16384 0.640 (381.5 MB/s) 0.290 (841.9 MB/s) (2.2x)
347 12087 0.850 (389.3 MB/s) 0.340 (973.3 MB/s) (2.5x)
512 8192 1.170 (417.3 MB/s) 0.350 (1395.1 MB/s) (3.3x)
831 5047 1.840 (430.7 MB/s) 0.500 (1585.0 MB/s) (3.7x)
2048 2048 4.770 (409.5 MB/s) 1.060 (1842.6 MB/s) (4.5x)
3981 1053 8.920 (425.6 MB/s) 1.890 (2008.8 MB/s) (4.7x)
8192 512 18.420 (424.1 MB/s) 3.530 (2213.2 MB/s) (5.2x)
13488 311 30.380 (423.4 MB/s) 5.960 (2158.2 MB/s) (5.1x)
16384 256 38.550 (405.3 MB/s) 7.230 (2161.1 MB/s) (5.3x)
38893 108 92.030 (403.0 MB/s) 17.510 (2118.3 MB/s) (5.3x)
65536 64 153.320 (407.6 MB/s) 33.110 (1887.6 MB/s) (4.6x)
105001 40 252.450 (396.7 MB/s) 48.970 (2044.9 MB/s) (5.2x)
262144 16 738.860 (338.4 MB/s) 205.150 (1218.6 MB/s) (3.6x)
600000 7 3137.570 (182.4 MB/s) 2000.820 (286.0 MB/s) (1.6x)
1134355 4 7318.030 (147.8 MB/s) 6049.960 (178.8 MB/s) (1.2x)
2097152 2 14181.550 (141.0 MB/s) 12319.100 (162.3 MB/s) (1.2x)
I think that from now on I'll just post the benchmarks, with very few comments :-)


PostPosted: Sun Mar 13, 2005 7:32 pm 

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348
The same benchmark as above, but without the --norandom flag (to minimize cache hits).
Code:
$ ./altivectorize -v -s -g --tests memchr --loops 1000000
Altivec is supported
Verbose mode on
Will do both scalar and vector tests
Will also do glibc tests
loops: 1000000
output file:
will do tests: memchr
test: memchr
#size arrays glibc altivec (Effective bandwidth)
7 599186 0.160 (41.7 MB/s) 0.190 (35.1 MB/s) (0.8x)
13 325000 0.230 (53.9 MB/s) 0.220 (56.4 MB/s) (1.0x)
16 262144 0.610 (25.0 MB/s) 0.560 (27.2 MB/s) (1.1x)
20 209715 0.240 (79.5 MB/s) 0.280 (68.1 MB/s) (0.9x)
27 155344 0.210 (122.6 MB/s) 0.280 (92.0 MB/s) (0.7x)
35 119837 0.270 (123.6 MB/s) 0.260 (128.4 MB/s) (1.0x)
43 97542 0.290 (141.4 MB/s) 0.290 (141.4 MB/s) (1.0x)
54 77672 0.320 (160.9 MB/s) 0.300 (171.7 MB/s) (1.1x)
64 65536 0.970 (62.9 MB/s) 1.080 (56.5 MB/s) (0.9x)
90 46603 0.370 (232.0 MB/s) 0.310 (276.9 MB/s) (1.2x)
128 32768 1.430 (85.4 MB/s) 1.370 (89.1 MB/s) (1.0x)
185 22672 0.600 (294.0 MB/s) 0.360 (490.1 MB/s) (1.7x)
256 16384 2.010 (121.5 MB/s) 1.950 (125.2 MB/s) (1.0x)
347 12087 1.110 (298.1 MB/s) 0.550 (601.7 MB/s) (2.0x)
512 8192 3.950 (123.6 MB/s) 3.100 (157.5 MB/s) (1.3x)
831 5047 2.460 (322.2 MB/s) 0.860 (921.5 MB/s) (2.9x)
2048 2048 13.710 (142.5 MB/s) 11.320 (172.5 MB/s) (1.2x)
3981 1053 9.440 (402.2 MB/s) 1.960 (1937.0 MB/s) (4.8x)
8192 512 52.950 (147.5 MB/s) 43.840 (178.2 MB/s) (1.2x)
13488 311 48.320 (266.2 MB/s) 17.400 (739.3 MB/s) (2.8x)
16384 256 106.520 (146.7 MB/s) 89.950 (173.7 MB/s) (1.2x)
38893 108 218.540 (169.7 MB/s) 159.490 (232.6 MB/s) (1.4x)
65536 64 431.160 (145.0 MB/s) 349.550 (178.8 MB/s) (1.2x)
105001 40 620.610 (161.4 MB/s) 470.980 (212.6 MB/s) (1.3x)
262144 16 1691.410 (147.8 MB/s) 1387.750 (180.1 MB/s) (1.2x)
600000 7 3827.830 (149.5 MB/s) 3122.020 (183.3 MB/s) (1.2x)
1134355 4 7697.620 (140.5 MB/s) 6415.220 (168.6 MB/s) (1.2x)
2097152 2 14585.110 (137.1 MB/s) 11960.200 (167.2 MB/s) (1.2x)


PostPosted: Mon Mar 14, 2005 9:37 pm 

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348
I just finished memrchr() and memmove(). The code is in CVS.
Actually, there's also a memcmp(); it seems to work, but it's terribly slow, so I have to work on it a little more to see what I'm doing wrong.

memcmp() does exist in libmotovec, and I'm trying to read that code now to see how it's done there, because its performance is what you'd expect from AltiVec :-). If someone could look at that memcmp() and explain the approach, I'd be more than grateful. I tried to read it, but as I think I've said too many times already, I'm not an assembly guy :-)

Anyway, here are some benchmarks for memrchr() and memmove():
Code:
$ ./altivectorize -v -s -g --tests memrchr --norandom --loops 1000000
Altivec is supported
Verbose mode on
Will do both scalar and vector tests
Will also do glibc tests
loops: 1000000
output file:
will do tests: memrchr
test: memrchr
#size arrays glibc altivec (Effective bandwidth)
7 599186 0.050 (133.5 MB/s) 0.100 (66.8 MB/s) (0.5x)
13 325000 0.160 (77.5 MB/s) 0.100 (124.0 MB/s) (1.6x)
16 262144 0.090 (169.5 MB/s) 0.140 (109.0 MB/s) (0.6x)
20 209715 0.100 (190.7 MB/s) 0.240 (79.5 MB/s) (0.4x)
27 155344 0.110 (234.1 MB/s) 0.180 (143.1 MB/s) (0.6x)
35 119837 0.160 (208.6 MB/s) 0.190 (175.7 MB/s) (0.8x)
43 97542 0.180 (227.8 MB/s) 0.210 (195.3 MB/s) (0.9x)
54 77672 0.190 (271.0 MB/s) 0.190 (271.0 MB/s) (1.0x)
64 65536 0.200 (305.2 MB/s) 0.250 (244.1 MB/s) (0.8x)
90 46603 0.290 (296.0 MB/s) 0.230 (373.2 MB/s) (1.3x)
128 32768 0.360 (339.1 MB/s) 0.290 (420.9 MB/s) (1.2x)
185 22672 0.500 (352.9 MB/s) 0.270 (653.4 MB/s) (1.9x)
256 16384 0.660 (369.9 MB/s) 0.320 (762.9 MB/s) (2.1x)
347 12087 0.870 (380.4 MB/s) 0.310 (1067.5 MB/s) (2.8x)
512 8192 1.180 (413.8 MB/s) 0.420 (1162.6 MB/s) (2.8x)
831 5047 2.100 (377.4 MB/s) 0.600 (1320.8 MB/s) (3.5x)
2048 2048 5.090 (383.7 MB/s) 1.090 (1791.9 MB/s) (4.7x)
3981 1053 9.850 (385.4 MB/s) 1.960 (1937.0 MB/s) (5.0x)
8192 512 19.960 (391.4 MB/s) 4.330 (1804.3 MB/s) (4.6x)
13488 311 32.860 (391.5 MB/s) 6.670 (1928.5 MB/s) (4.9x)
16384 256 40.020 (390.4 MB/s) 7.880 (1982.9 MB/s) (5.1x)
38893 108 95.040 (390.3 MB/s) 20.350 (1822.7 MB/s) (4.7x)
(more...)

$ ./altivectorize -v -s -g --tests memmove --norandom --loops 1000000
Altivec is supported
Verbose mode on
Will do both scalar and vector tests
Will also do glibc tests
loops: 1000000
output file:
will do tests: memmove
#size arrays glibc altivec (Effective bandwidth)
7 599186 0.130 (51.4 MB/s) 0.120 (55.6 MB/s) (1.1x)
13 325000 0.150 (82.7 MB/s) 0.150 (82.7 MB/s) (1.0x)
16 262144 0.150 (101.7 MB/s) 0.160 (95.4 MB/s) (0.9x)
20 209715 0.160 (119.2 MB/s) 0.230 (82.9 MB/s) (0.7x)
27 155344 0.160 (160.9 MB/s) 0.170 (151.5 MB/s) (0.9x)
35 119837 0.180 (185.4 MB/s) 0.160 (208.6 MB/s) (1.1x)
43 97542 0.200 (205.0 MB/s) 0.180 (227.8 MB/s) (1.1x)
54 77672 0.210 (245.2 MB/s) 0.150 (343.3 MB/s) (1.4x)
64 65536 0.180 (339.1 MB/s) 0.220 (277.4 MB/s) (0.8x)
90 46603 0.210 (408.7 MB/s) 0.200 (429.2 MB/s) (1.1x)
128 32768 0.240 (508.6 MB/s) 0.230 (530.7 MB/s) (1.0x)
185 22672 0.270 (653.4 MB/s) 0.180 (980.2 MB/s) (1.5x)
256 16384 0.350 (697.5 MB/s) 0.280 (871.9 MB/s) (1.2x)
347 12087 0.420 (787.9 MB/s) 0.220 (1504.2 MB/s) (1.9x)
512 8192 0.560 (871.9 MB/s) 0.350 (1395.1 MB/s) (1.6x)
831 5047 0.930 (852.2 MB/s) 0.400 (1981.3 MB/s) (2.3x)
2048 2048 2.130 (917.0 MB/s) 0.960 (2034.5 MB/s) (2.2x)
3981 1053 3.770 (1007.0 MB/s) 1.050 (3615.8 MB/s) (3.6x)
8192 512 8.960 (871.9 MB/s) 3.340 (2339.1 MB/s) (2.7x)
13488 311 15.190 (846.8 MB/s) 5.410 (2377.7 MB/s) (2.8x)
16384 256 17.670 (884.3 MB/s) 6.760 (2311.4 MB/s) (2.6x)
38893 108 38.800 (956.0 MB/s) 17.900 (2072.1 MB/s) (2.2x)
(more...)


PostPosted: Tue Mar 15, 2005 3:02 am 

Joined: Mon Oct 11, 2004 12:49 am
Posts: 35
I think memcmp() is tricky, because there are three varying conditions even before you compare any two chars: the alignment of each of the two sources, and the string length. I see some danger of ending up with way too many conditional branches.

My first attempt would be to treat both strings as unaligned and handle them the usual way (vec_lvsl(ptr), vec_ld(0, ptr), vec_ld(15, ptr), vec_perm). Compare for equality (vec_all_eq) and loop until the string length is reached. Special care has to be taken with the last partial vector; it might be easiest to handle it in scalar code (but that would mean no speedup whatsoever for string sizes < 16). To vectorize it, I'd AND with a boolean mask that sets all extraneous chars to zero. The mask can probably be created by comparing the result of vec_lvsl or vec_lvsr of the length (cast to void*) against an appropriate threshold.
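As a rough sketch of that tail-masking idea (the names are mine, not code from this thread): assuming the last n < 16 bytes of each operand have already been brought into a vector register via the lvsl/vec_perm load, a keep-mask built from lvsl of the length zeroes the extraneous lanes so a full-width vec_all_eq still gives the right answer.
Code:
#include <altivec.h>
#include <stddef.h>

/* Sketch only: mask off the lanes beyond the remaining length n (0 < n < 16)
 * so the final partial vectors can be compared with a full vec_all_eq. */
static int tail_equal(vector unsigned char a, vector unsigned char b, size_t n)
{
    /* lane indices {0,1,...,15}: lvsl with a zero base and zero offset */
    vector unsigned char idx  = vec_lvsl(0, (const unsigned char *)0);
    /* lvsl of the length gives {n, n+1, ...}; splat lane 0 to get {n,...,n} */
    vector unsigned char nvec = vec_splat(vec_lvsl((int)n, (const unsigned char *)0), 0);
    /* lanes with index < n hold valid data */
    vector bool char keep = vec_cmplt(idx, nvec);

    a = vec_and(a, keep);
    b = vec_and(b, keep);
    return vec_all_eq(a, b);
}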

If at any time the vec_all_eq is false, you need to locate the first offending char position. This is one of the few things that AltiVec has no direct support for. I suspect it might be fastest to do that in scalar code, four bytes at a time with unsigned int compares. All G4 and G5 PPC models have reasonably efficient support for unaligned scalar loads, so performance should be okay. In the worst case, four scalar comparisons have to be done to find the difference in the last quarter of a vector.
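A sketch of that scalar fallback (a hypothetical helper, not code from this thread): once a 16-byte block is known to differ, scan it four bytes at a time and then narrow down to the byte, returning the usual memcmp()-style result.
Code:
#include <string.h>

/* a and b point at a 16-byte block that the vector compare already flagged
 * as different: find the first differing byte, four bytes at a time. */
static int first_diff_16(const unsigned char *a, const unsigned char *b)
{
    int i, j;
    for (i = 0; i < 16; i += 4) {
        unsigned int wa, wb;
        memcpy(&wa, a + i, 4);   /* compiles to plain (possibly unaligned) loads */
        memcpy(&wb, b + i, 4);
        if (wa != wb) {
            for (j = 0; j < 4; j++)
                if (a[i + j] != b[i + j])
                    return a[i + j] < b[i + j] ? -1 : 1;
        }
    }
    return 0;   /* not reached if the block really differs */
}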


PostPosted: Tue Mar 15, 2005 12:22 pm 

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348
This is the memcmp() I wrote:
Code:
int memcmp_vec_aligned(const void *s1, const void *s2, size_t len) {
    const unsigned char *s1first = s1;
    const unsigned char *s2first = s2;

    vector unsigned char s1vector, s2vector;

    //printf("s1first = %p\ts2first = %p\tlen = %ld\n", s1first, s2first, (long)len);
    while (len >= 16) {
        s1vector = vec_ld(0, s1first);
        s2vector = vec_ld(0, s2first);
        if (vec_any_ne(s1vector, s2vector)) {
            //printf("Vectors not equal, calling memcmp\n");
            return memcmp(s1first, s2first, 16);
        }
        s1first += 16;
        s2first += 16;
        len -= 16;
        //printf("s1first = %p\ts2first = %p\tlen = %ld\n", s1first, s2first, (long)len);
    }

    if (len > 0) {
        //printf("Stuff left, calling memcmp()\n");
        return memcmp(s1first, s2first, len);
    }
    return 0;
}

int memcmp_vec(const void *s1, const void *s2, size_t len) {

    if (len <= MEMCMP_THRESHOLD) {
        return memcmp(s1, s2, len);
    }

    const unsigned char *s1first = s1;
    const unsigned char *s2first = s2;

    int result;
    int s1offset = (int)s1first & 15;
    int s2offset = (int)s2first & 15;
    int head;

    vector unsigned char MSQ, LSQ;
    vector unsigned char mask;

#ifdef DEBUG
    printf("srcfirst = %p\tsrclast = %p\n", s1first, s1first + len - 1);
    printf("dstfirst = %p\tdstlast = %p\n", s2first, s2first + len - 1);
    printf("s1offset = %d\ts2offset = %d\n", s1offset, s2offset);
#endif

    if (s1offset == 0 && s2offset == 0) {
        // Both buffers are 16-byte aligned, just compare the aligned values
        return memcmp_vec_aligned(s1first, s2first, len);
    } else {
        // Use standard memcmp() for the bytes up to the next 16-byte
        // boundary of s1 (that is 16 - s1offset bytes, not s1offset)
        head = (16 - s1offset) & 15;
        result = memcmp(s1first, s2first, head);
        if (result)
            return result;

        // Advance both pointers appropriately, now s1first is 16-byte aligned
        s1first += head;
        s2first += head;
        len -= head;
        // recalculate the s2 offset
        s2offset = (int)s2first & 15;
        if (s2offset == 0) {
            return memcmp_vec_aligned(s1first, s2first, len);
        }

        while (len >= 16) {
            MSQ = vec_ld(0, s2first);     // most significant quadword
            LSQ = vec_ld(15, s2first);    // least significant quadword
            mask = vec_lvsl(0, s2first);  // create the permute mask
            if (vec_any_ne(vec_ld(0, s1first), vec_perm(MSQ, LSQ, mask)))
                return memcmp(s1first, s2first, 16);
            s1first += 16;
            s2first += 16;
            len -= 16;
        }

        if (len > 0) {
            return memcmp(s1first, s2first, len);
        }
    }
    return 0;
}
and these are the results I got:
Code:
$ ./altivectorize -v -s -g --tests memcmp --norandom --loops 1000000
Altivec is supported
Verbose mode on
Will do both scalar and vector tests
Will also do glibc tests
loops: 1000000
output file:
will do tests: memcmp
#size arrays glibc altivec (Effective bandwidth)
7 599186 0.020 (333.8 MB/s) 0.100 (66.8 MB/s) (0.2x)
13 325000 0.030 (413.3 MB/s) 0.130 (95.4 MB/s) (0.2x)
16 262144 0.020 (762.9 MB/s) 0.130 (117.4 MB/s) (0.2x)
20 209715 0.030 (635.8 MB/s) 0.170 (112.2 MB/s) (0.2x)
27 155344 0.040 (643.7 MB/s) 0.190 (135.5 MB/s) (0.2x)
35 119837 0.040 (834.5 MB/s) 0.140 (238.4 MB/s) (0.3x)
43 97542 0.050 (820.2 MB/s) 0.190 (215.8 MB/s) (0.3x)
54 77672 0.040 (1287.5 MB/s) 0.160 (321.9 MB/s) (0.2x)
64 65536 0.040 (1525.9 MB/s) 0.140 (436.0 MB/s) (0.3x)
90 46603 0.060 (1430.5 MB/s) 0.200 (429.2 MB/s) (0.3x)
128 32768 0.070 (1743.9 MB/s) 0.160 (762.9 MB/s) (0.4x)
185 22672 0.090 (1960.3 MB/s) 0.210 (840.1 MB/s) (0.4x)
256 16384 0.120 (2034.5 MB/s) 0.210 (1162.6 MB/s) (0.6x)
347 12087 0.170 (1946.6 MB/s) 0.300 (1103.1 MB/s) (0.6x)
512 8192 0.250 (1953.1 MB/s) 0.320 (1525.9 MB/s) (0.8x)
831 5047 0.390 (2032.1 MB/s) 0.520 (1524.0 MB/s) (0.8x)
2048 2048 0.920 (2123.0 MB/s) 1.150 (1698.4 MB/s) (0.8x)
3981 1053 1.990 (1907.8 MB/s) 2.570 (1477.3 MB/s) (0.8x)
8192 512 3.630 (2152.2 MB/s) 5.060 (1544.0 MB/s) (0.7x)
13488 311 6.250 (2058.1 MB/s) 10.020 (1283.7 MB/s) (0.6x)
16384 256 7.940 (1967.9 MB/s) 10.630 (1469.9 MB/s) (0.7x)
Now, I really can't understand why I get such abysmal performance. :-(


PostPosted: Thu Mar 17, 2005 9:06 am 

Joined: Thu Mar 17, 2005 8:53 am
Posts: 1
I thought about the scalar memcmp a bit, and I figure it probably unrolls its loop and could theoretically do 16 bytes every 8 or 9 cycles. With the improvements I suggested on IRC, the unaligned vector case is going to look something like this in assembly:

lvx s1
lvx s2+16
permute
compare
branch
decrement len (and update CR)
increment index
branch

On a G4, this will take at least 10 cycles per iteration. More, actually, since the G4 will have to take a cycle to redirect flow on the looping branch.

This means that the scalar code will outperform the simple vector loop. You'll have to unroll the loop to beat it. Something like this:

// MSQ and mask are set up before the loop; the trailing load becomes
// the next iteration's MSQ
a1 = vec_ld(0,s1)
LSQ = vec_ld(16,s2)
a2 = vec_ld(16,s1)
b2 = vec_ld(32,s2)

if (vec_any_ne(a1, vec_perm(MSQ, LSQ, mask)))
    return memcmp(s1, s2, 16)

if (vec_any_ne(a2, vec_perm(LSQ, b2, mask)))
    return memcmp(s1+16, s2+16, 16)

MSQ = b2
s1 += 32
s2 += 32
len -= 32

(loop)

Of course, this needs to be scheduled properly by the compiler, but we'll just hope it does that automatically.
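For reference, a compilable version of that unrolled loop (the framing is mine: it assumes, as in the earlier memcmp_vec() code, that s1 has already been brought to a 16-byte boundary, that s2 is not aligned, and that anything left over or mismatching is handed back to the scalar memcmp()):
Code:
#include <altivec.h>
#include <string.h>

static int memcmp_vec_unrolled(const unsigned char *s1,
                               const unsigned char *s2, size_t len)
{
    vector unsigned char mask = vec_lvsl(0, s2);   /* s2 assumed unaligned */
    vector unsigned char MSQ  = vec_ld(0, s2);
    vector unsigned char LSQ, b2, a1, a2;

    while (len >= 32) {
        a1  = vec_ld(0, s1);
        LSQ = vec_ld(16, s2);
        a2  = vec_ld(16, s1);
        b2  = vec_ld(32, s2);   /* same aligned block as s2+31, since s2 is unaligned */

        if (vec_any_ne(a1, vec_perm(MSQ, LSQ, mask)))
            return memcmp(s1, s2, 16);
        if (vec_any_ne(a2, vec_perm(LSQ, b2, mask)))
            return memcmp(s1 + 16, s2 + 16, 16);

        MSQ = b2;               /* carry the trailing quadword into the next round */
        s1 += 32;
        s2 += 32;
        len -= 32;
    }
    return len ? memcmp(s1, s2, len) : 0;
}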

