Power Developer :: memcpy() vectorized (plus benchmarks)

Author:	markos [ Thu Mar 10, 2005 10:38 pm ]
Post subject:	memcpy() vectorized (plus benchmarks)
Here are benchmarks for memcpy(). I'll post a comparison to libmotovec right after.

Code:
$ ./altivectorize -v -s -g --norandom --loops 1000000
Altivec is supported
Verbose mode on
Will do both scalar and vector tests
Will also do glibc tests
loops: 1000000
output file:
will do tests: memcpy
#size arrays glibc altivec (Effective bandwidth)
7 599186 0.130 (51.4 MB/s) 0.090 (74.2 MB/s)
13 325000 0.140 (88.6 MB/s) 0.110 (112.7 MB/s)
16 262144 0.140 (109.0 MB/s) 0.150 (101.7 MB/s)
20 209715 0.140 (136.2 MB/s) 0.210 (90.8 MB/s)
27 155344 0.140 (183.9 MB/s) 0.150 (171.7 MB/s)
35 119837 0.150 (222.5 MB/s) 0.140 (238.4 MB/s)
43 97542 0.170 (241.2 MB/s) 0.160 (256.3 MB/s)
54 77672 0.180 (286.1 MB/s) 0.140 (367.8 MB/s)
64 65536 0.160 (381.5 MB/s) 0.130 (469.5 MB/s)
90 46603 0.190 (451.7 MB/s) 0.180 (476.8 MB/s)
128 32768 0.210 (581.3 MB/s) 0.140 (871.9 MB/s)
185 22672 0.250 (705.7 MB/s) 0.170 (1037.8 MB/s)
256 16384 0.320 (762.9 MB/s) 0.180 (1356.3 MB/s)
347 12087 0.370 (894.4 MB/s) 0.200 (1654.6 MB/s)
512 8192 0.520 (939.0 MB/s) 0.250 (1953.1 MB/s)
831 5047 0.770 (1029.2 MB/s) 0.350 (2264.3 MB/s)
2048 2048 1.940 (1006.8 MB/s) 0.760 (2569.9 MB/s)
3981 1053 3.580 (1060.5 MB/s) 0.980 (3874.1 MB/s)
8192 512 7.080 (1103.5 MB/s) 2.560 (3051.8 MB/s)
13488 311 11.450 (1123.4 MB/s) 4.220 (3048.1 MB/s)
16384 256 14.040 (1112.9 MB/s) 4.980 (3137.6 MB/s)
38893 108 34.160 (1085.8 MB/s) 16.700 (2221.0 MB/s)
65536 64 62.160 (1005.5 MB/s) 30.040 (2080.6 MB/s)
105001 40 101.610 (985.5 MB/s) 43.490 (2302.5 MB/s)
262144 16 259.710 (962.6 MB/s) 121.090 (2064.6 MB/s)
600000 7 1300.250 (440.1 MB/s) 939.560 (609.0 MB/s)
1134355 4 3007.870 (359.7 MB/s) 2913.140 (371.4 MB/s)
2097152 2 6011.970 (332.7 MB/s) 5841.110 (342.4 MB/s)

The code is available in the same CVS repo as before:

Code:
$ cvs -z3 -d:pserver:anonymous@cvs.alioth.debian.org:/cvsroot/pegasos co altivectorize

As for the code, you'll notice that it's quite fast even for small sizes (sometimes even faster).
Also, since alignment issues are taken care of by using the original memcpy() for copying the offset bytes, you'll notice that the speed of the routine is pretty much constant regardless of alignment.
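
The alignment handling described above can be pictured roughly as follows. This is only a minimal sketch, not the code from the altivectorize repository: the function name copy_head_body_tail is made up, it assumes that the destination is the pointer being aligned, and a plain 16-byte block copy stands in for the AltiVec loads and stores (vec_ld/vec_st) so that it compiles on any platform.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VEC_BYTES 16  /* width of one AltiVec vector register */

/* Hypothetical illustration: copy the unaligned "offset bytes" with the
   ordinary memcpy(), then move the rest in whole 16-byte blocks. */
void *copy_head_body_tail(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Head: bytes needed to bring dst up to a 16-byte boundary
       (assumption: the destination is the pointer that gets aligned). */
    size_t head = (VEC_BYTES - ((uintptr_t)d & (VEC_BYTES - 1)))
                  & (VEC_BYTES - 1);
    if (head > n)
        head = n;
    memcpy(d, s, head);
    d += head;
    s += head;
    n -= head;

    /* Body: dst is now aligned. A real AltiVec routine would use
       vec_ld/vec_st here (and vec_perm if src is still misaligned);
       a 16-byte temporary stands in for the vector register. */
    size_t blocks = n / VEC_BYTES;
    for (size_t i = 0; i < blocks; i++) {
        unsigned char v[VEC_BYTES];
        memcpy(v, s, VEC_BYTES);
        memcpy(d, v, VEC_BYTES);
        s += VEC_BYTES;
        d += VEC_BYTES;
    }
    n -= blocks * VEC_BYTES;

    /* Tail: fewer than 16 bytes left, again left to memcpy(). */
    memcpy(d, s, n);
    return dst;
}
```

Because the head and tail are always shorter than one vector, the main loop sees the same aligned destination whatever the original offset was, which would be consistent with the roughly constant speed mentioned above.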

Author:	markos [ Fri Mar 11, 2005 8:07 am ]
Post subject:	Re: memcpy() vectorized (plus benchmarks)
Quote:
I'll post a comparison to libmotovec right after.

I linked the benchmark app to libmotovec, so that it used libmotovec's memcpy() as the default one (in the glibc column).

Code:
$ ./altivectorize -v -s -g --norandom --loops 1000000
Altivec is supported
Verbose mode on
Will do both scalar and vector tests
Will also do glibc tests
loops: 1000000
output file:
will do tests: memcpy
#size arrays glibc altivec (Effective bandwidth)
7 599186 0.090 (74.2 MB/s) 0.080 (83.4 MB/s)
13 325000 0.110 (112.7 MB/s) 0.100 (124.0 MB/s)
16 262144 0.130 (117.4 MB/s) 0.100 (152.6 MB/s)
20 209715 0.110 (173.4 MB/s) 0.160 (119.2 MB/s)
27 155344 0.120 (214.6 MB/s) 0.130 (198.1 MB/s)
35 119837 0.110 (303.4 MB/s) 0.100 (333.8 MB/s)
43 97542 0.130 (315.4 MB/s) 0.130 (315.4 MB/s)
54 77672 0.120 (429.2 MB/s) 0.130 (396.1 MB/s)
64 65536 0.130 (469.5 MB/s) 0.120 (508.6 MB/s)
90 46603 0.160 (536.4 MB/s) 0.180 (476.8 MB/s)
128 32768 0.180 (678.2 MB/s) 0.160 (762.9 MB/s)
185 22672 0.190 (928.6 MB/s) 0.170 (1037.8 MB/s)
256 16384 0.230 (1061.5 MB/s) 0.190 (1285.0 MB/s)
347 12087 0.230 (1438.8 MB/s) 0.210 (1575.8 MB/s)
512 8192 0.300 (1627.6 MB/s) 0.280 (1743.9 MB/s)
831 5047 0.410 (1932.9 MB/s) 0.330 (2401.5 MB/s)
2048 2048 0.780 (2504.0 MB/s) 0.800 (2441.4 MB/s)
3981 1053 1.360 (2791.6 MB/s) 0.970 (3914.0 MB/s)
8192 512 2.440 (3201.8 MB/s) 2.790 (2800.2 MB/s)
13488 311 4.310 (2984.5 MB/s) 4.200 (3062.7 MB/s)
16384 256 5.280 (2959.3 MB/s) 5.110 (3057.7 MB/s)
38893 108 16.990 (2183.1 MB/s) 15.880 (2335.7 MB/s)
65536 64 28.670 (2180.0 MB/s) 27.190 (2298.6 MB/s)
105001 40 49.370 (2028.3 MB/s) 46.730 (2142.9 MB/s)
262144 16 138.550 (1804.4 MB/s) 129.870 (1925.0 MB/s)
600000 7 1040.730 (549.8 MB/s) 1045.950 (547.1 MB/s)
1134355 4 2742.170 (394.5 MB/s) 2868.980 (377.1 MB/s)

I think the numbers speak for themselves. Who said assembly is better than C?
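
For anyone reading the tables: the "Effective bandwidth" figures appear to be the block size times the loop count, divided by the measured time, with 1 MB taken as 2^20 bytes. This is inferred from the rows themselves rather than from the altivectorize source, and it assumes the time column is in seconds. The helper name below is made up for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: reproduces the "(… MB/s)" figures in the tables,
   assuming bandwidth = size * loops / time and 1 MB = 2^20 bytes. */
double effective_bandwidth_mb(size_t size, unsigned long loops, double seconds)
{
    const double bytes_per_mb = 1024.0 * 1024.0;
    double total_bytes = (double)size * (double)loops;
    return total_bytes / bytes_per_mb / seconds;
}
```

With size 7, 1000000 loops and a time of 0.130 this gives about 51.4, which matches the glibc figure in the first row of the first table.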

Author:	gunnar [ Wed May 17, 2006 10:34 am ]
Post subject:
Hmm, I'm getting somewhat different results.

Memorybench - copying a block of 80 MB from a -> b
glibc: 366.8876 MB/sec
Freevec: 381.2597 MB/sec
Motovec: 637.9746 MB/sec
FC64: 611.4247 MB/sec

Cachebench - copying a block of 8 KB from a -> b
glibc: 2637.2605 MB/sec
Freevec: 5510.4557 MB/sec
Motovec: 7355.0409 MB/sec
FC64: 4557.8185 MB/sec

*FC64 is a really simple copy loop using float registers. Its loop is unrolled to copy 64 bytes (2 cache lines) per iteration. The copy gets its speed by using a dcbt to prefetch the next two cache lines while copying the current ones.

I think the improved memory throughput from 360 to 630 MB/sec does have a huge impact on many applications. Freevec does not improve the memory throughput as much as libmotovec does. But maybe I'm doing something wrong with freevec here.

For some more PowerPC memory benchmarks see here: glibc benchmarks

Cheers
Gunnar
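
To make the FC64 description a bit more concrete, here is a rough sketch of an unrolled 64-byte copy loop with prefetching. It is not Gunnar's FC64 code: the function name is made up, 64-bit integer temporaries stand in for the float registers, a 32-byte cache line is assumed (so that 64 bytes are two lines), and GCC/Clang's __builtin_prefetch is used as a portable stand-in for the PowerPC dcbt instruction.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_BYTES 64  /* bytes copied per loop iteration, as in the post */
#define LINE_BYTES  32  /* assumed cache line size (64 bytes = 2 lines) */

/* Hypothetical illustration of an unrolled copy loop with prefetch. */
void *copy_unrolled64(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= BLOCK_BYTES) {
        uint64_t r0, r1, r2, r3, r4, r5, r6, r7;

#if defined(__GNUC__) || defined(__clang__)
        /* Stand-in for dcbt: hint the next two cache lines, but only
           while they are still inside the source buffer. */
        if (n >= 2 * BLOCK_BYTES) {
            __builtin_prefetch(s + BLOCK_BYTES);
            __builtin_prefetch(s + BLOCK_BYTES + LINE_BYTES);
        }
#endif
        /* Eight 8-byte loads (the original uses float registers)... */
        memcpy(&r0, s +  0, 8);
        memcpy(&r1, s +  8, 8);
        memcpy(&r2, s + 16, 8);
        memcpy(&r3, s + 24, 8);
        memcpy(&r4, s + 32, 8);
        memcpy(&r5, s + 40, 8);
        memcpy(&r6, s + 48, 8);
        memcpy(&r7, s + 56, 8);

        /* ...followed by eight 8-byte stores. */
        memcpy(d +  0, &r0, 8);
        memcpy(d +  8, &r1, 8);
        memcpy(d + 16, &r2, 8);
        memcpy(d + 24, &r3, 8);
        memcpy(d + 32, &r4, 8);
        memcpy(d + 40, &r5, 8);
        memcpy(d + 48, &r6, 8);
        memcpy(d + 56, &r7, 8);

        s += BLOCK_BYTES;
        d += BLOCK_BYTES;
        n -= BLOCK_BYTES;
    }

    /* Remaining bytes (fewer than 64) are copied one at a time. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

The fixed-size 8-byte memcpy() calls are the usual portable way to write an unaligned load or store in C; compilers typically turn each one into a single instruction.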

Author:	pvdabeel [ Thu May 18, 2006 9:50 pm ]
Post subject:
What would be the easiest way to get your applications to use these altivec optimized functions instead of the glibc provided ones?

Author:	gunnar [ Fri May 19, 2006 2:08 am ]
Post subject:
Quote:
What would be the easiest way to get your applications to use these altivec optimized functions instead of the glibc provided ones?

The easiest for all users and the best for Linux would be if Linux used the C library functions that Mac OS X uses. Apple has optimized functions for every PowerPC CPU (G3/G4/G5). OS X benchmarks the functions for your system on startup and then uses the optimal function for you. The Apple routines are up to 80% faster than the "simpleminded" ones that Linux uses. As the Apple sources are free, there is actually no excuse for Linux not to use them.

If you don't want to wait for Linux to use proper PPC functions but want to compile your application with the Altivec routines, then you just need to include the library on the linker command line prior to the compiler's libc library. Examples of how this can be done for each compiler are in the motovec readme.

Cheers
Gunnar
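
The "benchmark on startup, then use the fastest routine" idea mentioned above could look roughly like this. It is only a sketch of the general technique as the post describes it, not Apple's implementation: all names are made up, and the two candidates are just libc's memcpy() and a naive byte loop, where a real setup would list per-CPU optimized routines.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef void *(*copy_fn)(void *, const void *, size_t);

/* Candidate 1: whatever memcpy() the C library provides. */
static void *copy_libc(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Candidate 2: a deliberately simple byte-by-byte loop. */
static void *copy_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}

static volatile unsigned char sink;  /* keeps the copies from being optimized away */

/* Time one candidate on an 8 KB block; returns CPU seconds used. */
static double time_copy(copy_fn fn)
{
    static unsigned char src[8192], dst[8192];
    clock_t start = clock();
    for (int i = 0; i < 2000; i++) {
        fn(dst, src, sizeof src);
        sink ^= dst[(size_t)i % sizeof dst];
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Run once at startup and return the fastest candidate. */
copy_fn pick_fastest_copy(void)
{
    copy_fn candidates[] = { copy_libc, copy_bytes };
    copy_fn best = candidates[0];
    double best_time = time_copy(best);

    for (size_t i = 1; i < sizeof candidates / sizeof candidates[0]; i++) {
        double t = time_copy(candidates[i]);
        if (t < best_time) {
            best_time = t;
            best = candidates[i];
        }
    }
    return best;
}
```

An application would call pick_fastest_copy() once, keep the returned function pointer, and route its copies through it. Linking libmotovec ahead of libc, as described above, needs no such code because the replacement simply takes over the memcpy() symbol.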

Power Developer https://powerdeveloper.org/forums/

memcpy() vectorized (plus benchmarks) https://powerdeveloper.org/forums/viewtopic.php?f=23&t=175

All times are UTC-06:00