OK, I finished the Altivec benchmark code, along with a first test of the strfill() routine (which is really memset() with a '\0' appended at the end).
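For reference, a minimal scalar sketch of what such a routine could look like. The name strfill_scalar() and its signature are assumptions for illustration, not the actual code in the repository:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical signature: fill the first len bytes of s with c, then
 * terminate the string. The caller must provide at least len + 1 bytes. */
char *strfill_scalar(char *s, size_t len, int c)
{
    memset(s, c, len);   /* the actual fill */
    s[len] = '\0';       /* the extra terminator that memset() lacks */
    return s;
}
```

Calling strfill_scalar(buf, 5, 'x') on a buffer of at least 6 bytes leaves the string "xxxxx" in buf.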
Here is the output of the program:
Code:
$ ./altivectorize -v -s -g --norandom --loops 1000000
Altivec is supported
Verbose mode on
Will do both scalar and vector tests
Will also do glibc tests
loops: 1000000
output file:
#size arrays scalar glibc altivec (Effective bandwidth)
7 599186 0.030 (222.5 MB/s) 0.060 (111.3 MB/s) 0.140 (47.7 MB/s)
13 325000 0.090 (137.8 MB/s) 0.060 (206.6 MB/s) 0.100 (124.0 MB/s)
16 262144 0.090 (169.5 MB/s) 0.050 (305.2 MB/s) 0.080 (190.7 MB/s)
20 209715 0.100 (190.7 MB/s) 0.040 (476.8 MB/s) 0.080 (238.4 MB/s)
27 155344 0.110 (234.1 MB/s) 0.050 (515.0 MB/s) 0.090 (286.1 MB/s)
35 119837 0.120 (278.2 MB/s) 0.050 (667.6 MB/s) 0.070 (476.8 MB/s)
43 97542 0.130 (315.4 MB/s) 0.060 (683.5 MB/s) 0.070 (585.8 MB/s)
54 77672 0.140 (367.8 MB/s) 0.070 (735.7 MB/s) 0.080 (643.7 MB/s)
64 65536 0.150 (406.9 MB/s) 0.060 (1017.3 MB/s) 0.080 (762.9 MB/s)
90 46603 0.180 (476.8 MB/s) 0.080 (1072.9 MB/s) 0.080 (1072.9 MB/s)
128 32768 0.230 (530.7 MB/s) 0.090 (1356.3 MB/s) 0.080 (1525.9 MB/s)
185 22672 0.320 (551.3 MB/s) 0.100 (1764.3 MB/s) 0.100 (1764.3 MB/s)
256 16384 0.400 (610.4 MB/s) 0.130 (1878.0 MB/s) 0.110 (2219.5 MB/s)
347 12087 0.530 (624.4 MB/s) 0.160 (2068.3 MB/s) 0.120 (2757.7 MB/s)
512 8192 0.880 (554.9 MB/s) 0.260 (1878.0 MB/s) 0.150 (3255.2 MB/s)
831 5047 1.930 (410.6 MB/s) 0.410 (1932.9 MB/s) 0.170 (4661.8 MB/s)
2048 2048 3.410 (572.8 MB/s) 0.800 (2441.4 MB/s) 0.260 (7512.0 MB/s)
3981 1053 5.540 (685.3 MB/s) 1.710 (2220.2 MB/s) 0.460 (8253.4 MB/s)
8192 512 11.240 (695.1 MB/s) 3.110 (2512.1 MB/s) 0.790 (9889.2 MB/s)
13488 311 18.690 (688.2 MB/s) 5.580 (2305.2 MB/s) 1.240 (10373.5 MB/s)
16384 256 22.840 (684.1 MB/s) 6.730 (2321.7 MB/s) 1.430 (10926.6 MB/s)
38893 108 65.790 (563.8 MB/s) 20.240 (1832.6 MB/s) 14.860 (2496.0 MB/s)
65536 64 111.540 (560.3 MB/s) 36.530 (1710.9 MB/s) 25.530 (2448.1 MB/s)
105001 40 179.650 (557.4 MB/s) 55.760 (1795.9 MB/s) 40.730 (2458.6 MB/s)
262144 16 456.450 (547.7 MB/s) 149.500 (1672.2 MB/s) 118.930 (2102.1 MB/s)
600000 7 1824.510 (313.6 MB/s) 1528.040 (374.5 MB/s) 779.820 (733.8 MB/s)
1134355 4 4706.650 (229.8 MB/s) 4936.750 (219.1 MB/s) 2651.260 (408.0 MB/s)
2097152 2 9408.000 (212.6 MB/s) 10181.350 (196.4 MB/s) 6009.540 (332.8 MB/s)
The next run is for data picked randomly from a large pool, so that the chance of it already being in the cache is minimised.
Code:
$ ./altivectorize -v -s -g --loops 1000000
Altivec is supported
Verbose mode on
Will do both scalar and vector tests
Will also do glibc tests
loops: 1000000
output file:
#size arrays scalar glibc altivec (Effective bandwidth)
7 599186 0.210 (31.8 MB/s) 0.160 (41.7 MB/s) 0.200 (33.4 MB/s)
13 325000 0.220 (56.4 MB/s) 0.160 (77.5 MB/s) 0.200 (62.0 MB/s)
16 262144 0.600 (25.4 MB/s) 0.690 (22.1 MB/s) 0.560 (27.2 MB/s)
20 209715 0.220 (86.7 MB/s) 0.150 (127.2 MB/s) 0.190 (100.4 MB/s)
27 155344 0.210 (122.6 MB/s) 0.150 (171.7 MB/s) 0.200 (128.7 MB/s)
35 119837 0.390 (85.6 MB/s) 0.170 (196.3 MB/s) 0.170 (196.3 MB/s)
43 97542 0.330 (124.3 MB/s) 0.200 (205.0 MB/s) 0.210 (195.3 MB/s)
54 77672 0.290 (177.6 MB/s) 0.420 (122.6 MB/s) 0.220 (234.1 MB/s)
64 65536 0.940 (64.9 MB/s) 1.150 (53.1 MB/s) 0.950 (64.2 MB/s)
90 46603 0.260 (330.1 MB/s) 0.190 (451.7 MB/s) 0.190 (451.7 MB/s)
128 32768 1.090 (112.0 MB/s) 1.110 (110.0 MB/s) 0.850 (143.6 MB/s)
185 22672 0.370 (476.8 MB/s) 0.220 (802.0 MB/s) 0.190 (928.6 MB/s)
256 16384 1.660 (147.1 MB/s) 2.010 (121.5 MB/s) 1.330 (183.6 MB/s)
347 12087 0.850 (389.3 MB/s) 0.390 (848.5 MB/s) 0.310 (1067.5 MB/s)
512 8192 2.880 (169.5 MB/s) 3.260 (149.8 MB/s) 2.560 (190.7 MB/s)
831 5047 1.680 (471.7 MB/s) 0.660 (1200.8 MB/s) 0.450 (1761.1 MB/s)
2048 2048 9.380 (208.2 MB/s) 9.760 (200.1 MB/s) 5.080 (384.5 MB/s)
3981 1053 6.330 (599.8 MB/s) 1.860 (2041.2 MB/s) 1.070 (3548.2 MB/s)
8192 512 35.400 (220.7 MB/s) 36.160 (216.1 MB/s) 19.630 (398.0 MB/s)
13488 311 28.640 (449.1 MB/s) 15.610 (824.0 MB/s) 7.800 (1649.1 MB/s)
16384 256 70.920 (220.3 MB/s) 72.020 (217.0 MB/s) 38.100 (410.1 MB/s)
38893 108 138.070 (268.6 MB/s) 137.350 (270.0 MB/s) 70.470 (526.3 MB/s)
65536 64 282.470 (221.3 MB/s) 294.320 (212.4 MB/s) 154.810 (403.7 MB/s)
105001 40 405.400 (247.0 MB/s) 397.400 (252.0 MB/s) 204.320 (490.1 MB/s)
262144 16 1105.890 (226.1 MB/s) 1169.290 (213.8 MB/s) 613.710 (407.4 MB/s)
600000 7 2488.380 (230.0 MB/s) 2632.060 (217.4 MB/s) 1361.240 (420.4 MB/s)
1134355 4 4963.660 (217.9 MB/s) 5405.220 (200.1 MB/s) 2860.420 (378.2 MB/s)
2097152 2 9470.490 (211.2 MB/s) 10690.570 (187.1 MB/s) 5541.520 (360.9 MB/s)
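A rough sketch of the random-pool idea, assuming the pool is one big buffer holding "arrays" blocks of "size" bytes back to back (the two columns in the output). The helper name and the layout are guesses for illustration; the real benchmark may organise its pool differently:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: return a pointer to a randomly chosen block of
 * `size` bytes inside a pool holding `arrays` such blocks back to back. */
char *pick_random_array(char *pool, size_t size, size_t arrays)
{
    size_t idx;

    if (arrays == 0)
        return pool;                 /* nothing to choose from */
    idx = (size_t)rand() % arrays;   /* random block index */
    return pool + idx * size;        /* start of that block */
}
```

Picking a different block on every loop means consecutive calls rarely touch the same cache lines, which is the point of the random mode.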
It's interesting to see that even in cases where we don't hit the cache, Altivec is still almost 2x faster than both the scalar and glibc versions at the larger sizes. I'll probably create some graphs with this data to post here.
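The bandwidth figures in the tables are consistent with size * loops / time, expressed in units of 2^20 bytes per second. A small sketch of that calculation follows; the function name is made up for illustration, and the formula is inferred from the printed numbers rather than taken from the benchmark source:

```c
#include <stddef.h>

/* Hypothetical helper: effective bandwidth in MB/s (1 MB = 2^20 bytes),
 * given the array size in bytes, the loop count and the elapsed seconds. */
double effective_bandwidth(size_t size, unsigned long loops, double seconds)
{
    if (seconds <= 0.0)
        return 0.0;   /* timer too coarse to measure this run */
    return ((double)size * (double)loops) / seconds / (1024.0 * 1024.0);
}
```

For example, size 7 with 1000000 loops in 0.030 s gives about 222.5 MB/s, matching the first row of the --norandom run.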
The code is part of the pegasos project on alioth (http://alioth.debian.org/projects/pegasos/) and is available from anonymous CVS right now:
Code:
cvs -z3 -d:pserver:anonymous@cvs.alioth.debian.org:/cvsroot/pegasos co altivectorize
but today I'll spend some time converting it to svn, so by tomorrow you should use:
Code:
svn co svn://svn.d-i.alioth.debian.org/svn/pegasos altivectorize
(yes, I know, lame name, but after a couple of beers it seemed fine at the time :-/)
Apart from the Altivec routines themselves, the benchmark autodetects Altivec and uses the appropriate routine if available. It also compiles on x86, but of course there is no Altivec there.
However, it would be useful to see if/how it works on a G3 for example.
The Altivec detection works in 3 steps:
a) detect if gcc supports -maltivec and -mabi=altivec (compile time)
b) detect altivec.h (compile time)
c) detect if PPC_FEATURE_HAS_ALTIVEC is enabled in AT_HWCAP (run time).
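Step c) could be sketched like this. Note the assumptions: getauxval() only exists in newer glibc (2.16 and later), so the actual benchmark may well read the auxiliary vector some other way, and altivec_available() is a made-up name. On anything that is not Linux on PowerPC the function simply reports no Altivec, which also lets it build on x86:

```c
#if defined(__linux__) && (defined(__powerpc__) || defined(__PPC__))
#include <sys/auxv.h>      /* getauxval(), AT_HWCAP */
#include <asm/cputable.h>  /* PPC_FEATURE_HAS_ALTIVEC */
#define HAVE_HWCAP_CHECK 1
#endif

/* Hypothetical helper: returns 1 if the kernel reports Altivec support
 * in AT_HWCAP, 0 otherwise. */
int altivec_available(void)
{
#ifdef HAVE_HWCAP_CHECK
    return (getauxval(AT_HWCAP) & PPC_FEATURE_HAS_ALTIVEC) != 0;
#else
    return 0;   /* not Linux/PowerPC, so no Altivec to detect */
#endif
}
```

The result can then be used once at startup to choose between the scalar and the vector routine.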
So, comments, suggestions and flames are welcome.
Konstantinos