Power Developer :: Altivec benchmark (what a novel title :-)

Author:	markos [ Mon Mar 07, 2005 2:33 am ]
Post subject:	Altivec benchmark (what a novel title :-)

Ok, I finished the Altivec benchmark code, with a first test of the strfill() routine (which is essentially memset() with a '\0' at the end).
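For reference, a minimal scalar sketch of what such a routine could look like, going only by the description above (memset() plus a terminating '\0'). The name strfill_scalar and its signature are assumptions made for illustration; the actual code in the altivectorize tree may well differ.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical scalar reference, not the benchmark's own code:
 * fill the first len bytes of s with c, then terminate the string.
 * The buffer must hold at least len + 1 bytes. */
char *strfill_scalar(char *s, int c, size_t len)
{
    memset(s, c, len);
    s[len] = '\0';
    return s;
}
```

A vector version would presumably handle the unaligned head and tail with scalar stores and the aligned middle with 16-byte vector stores.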
Here is the output of the program:

Code:

$ ./altivectorize -v -s -g --norandom --loops 1000000

Altivec is supported

Verbose mode on

Will do both scalar and vector tests

Will also do glibc tests

loops: 1000000

output file:

#size   arrays  scalar                  glibc                   altivec (Effective bandwidth)

7       599186  0.030 (222.5 MB/s)      0.060 (111.3 MB/s)      0.140 (47.7 MB/s)

13      325000  0.090 (137.8 MB/s)      0.060 (206.6 MB/s)      0.100 (124.0 MB/s)

16      262144  0.090 (169.5 MB/s)      0.050 (305.2 MB/s)      0.080 (190.7 MB/s)

20      209715  0.100 (190.7 MB/s)      0.040 (476.8 MB/s)      0.080 (238.4 MB/s)

27      155344  0.110 (234.1 MB/s)      0.050 (515.0 MB/s)      0.090 (286.1 MB/s)

35      119837  0.120 (278.2 MB/s)      0.050 (667.6 MB/s)      0.070 (476.8 MB/s)

43      97542   0.130 (315.4 MB/s)      0.060 (683.5 MB/s)      0.070 (585.8 MB/s)

54      77672   0.140 (367.8 MB/s)      0.070 (735.7 MB/s)      0.080 (643.7 MB/s)

64      65536   0.150 (406.9 MB/s)      0.060 (1017.3 MB/s)     0.080 (762.9 MB/s)

90      46603   0.180 (476.8 MB/s)      0.080 (1072.9 MB/s)     0.080 (1072.9 MB/s)

128     32768   0.230 (530.7 MB/s)      0.090 (1356.3 MB/s)     0.080 (1525.9 MB/s)

185     22672   0.320 (551.3 MB/s)      0.100 (1764.3 MB/s)     0.100 (1764.3 MB/s)

256     16384   0.400 (610.4 MB/s)      0.130 (1878.0 MB/s)     0.110 (2219.5 MB/s)

347     12087   0.530 (624.4 MB/s)      0.160 (2068.3 MB/s)     0.120 (2757.7 MB/s)

512     8192    0.880 (554.9 MB/s)      0.260 (1878.0 MB/s)     0.150 (3255.2 MB/s)

831     5047    1.930 (410.6 MB/s)      0.410 (1932.9 MB/s)     0.170 (4661.8 MB/s)

2048    2048    3.410 (572.8 MB/s)      0.800 (2441.4 MB/s)     0.260 (7512.0 MB/s)

3981    1053    5.540 (685.3 MB/s)      1.710 (2220.2 MB/s)     0.460 (8253.4 MB/s)

8192    512     11.240 (695.1 MB/s)     3.110 (2512.1 MB/s)     0.790 (9889.2 MB/s)

13488   311     18.690 (688.2 MB/s)     5.580 (2305.2 MB/s)     1.240 (10373.5 MB/s)

16384   256     22.840 (684.1 MB/s)     6.730 (2321.7 MB/s)     1.430 (10926.6 MB/s)

38893   108     65.790 (563.8 MB/s)     20.240 (1832.6 MB/s)    14.860 (2496.0 MB/s)

65536   64      111.540 (560.3 MB/s)    36.530 (1710.9 MB/s)    25.530 (2448.1 MB/s)

105001  40      179.650 (557.4 MB/s)    55.760 (1795.9 MB/s)    40.730 (2458.6 MB/s)

262144  16      456.450 (547.7 MB/s)    149.500 (1672.2 MB/s)   118.930 (2102.1 MB/s)

600000  7       1824.510 (313.6 MB/s)   1528.040 (374.5 MB/s)   779.820 (733.8 MB/s)

1134355 4       4706.650 (229.8 MB/s)   4936.750 (219.1 MB/s)   2651.260 (408.0 MB/s)

2097152 2       9408.000 (212.6 MB/s)   10181.350 (196.4 MB/s)  6009.540 (332.8 MB/s)
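The bandwidth column appears to be size x loops / time, with the time in seconds and 1 MB taken as 2^20 bytes. This is inferred from the numbers above, not from the benchmark source, and the function name below is made up for the example.

```c
#include <stddef.h>

/* Effective bandwidth in MB/s (1 MB = 1048576 bytes), assuming each loop
 * fills one array of `size` bytes and `seconds` is the total elapsed time. */
double effective_bandwidth_mb(size_t size, unsigned long loops, double seconds)
{
    return ((double)size * (double)loops) / seconds / 1048576.0;
}
```

For the first row, 7 bytes x 1000000 loops in 0.030 s works out to about 222.5 MB/s, which matches the scalar column.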

And this is the same test with data picked at random from a large pool, to minimise the chance of it already being in the cache.

Code:

$ ./altivectorize -v -s -g --loops 1000000

Altivec is supported

Verbose mode on

Will do both scalar and vector tests

Will also do glibc tests

loops: 1000000

output file:

#size   arrays  scalar                  glibc                   altivec (Effective bandwidth)

7       599186  0.210 (31.8 MB/s)       0.160 (41.7 MB/s)       0.200 (33.4 MB/s)

13      325000  0.220 (56.4 MB/s)       0.160 (77.5 MB/s)       0.200 (62.0 MB/s)

16      262144  0.600 (25.4 MB/s)       0.690 (22.1 MB/s)       0.560 (27.2 MB/s)

20      209715  0.220 (86.7 MB/s)       0.150 (127.2 MB/s)      0.190 (100.4 MB/s)

27      155344  0.210 (122.6 MB/s)      0.150 (171.7 MB/s)      0.200 (128.7 MB/s)

35      119837  0.390 (85.6 MB/s)       0.170 (196.3 MB/s)      0.170 (196.3 MB/s)

43      97542   0.330 (124.3 MB/s)      0.200 (205.0 MB/s)      0.210 (195.3 MB/s)

54      77672   0.290 (177.6 MB/s)      0.420 (122.6 MB/s)      0.220 (234.1 MB/s)

64      65536   0.940 (64.9 MB/s)       1.150 (53.1 MB/s)       0.950 (64.2 MB/s)

90      46603   0.260 (330.1 MB/s)      0.190 (451.7 MB/s)      0.190 (451.7 MB/s)

128     32768   1.090 (112.0 MB/s)      1.110 (110.0 MB/s)      0.850 (143.6 MB/s)

185     22672   0.370 (476.8 MB/s)      0.220 (802.0 MB/s)      0.190 (928.6 MB/s)

256     16384   1.660 (147.1 MB/s)      2.010 (121.5 MB/s)      1.330 (183.6 MB/s)

347     12087   0.850 (389.3 MB/s)      0.390 (848.5 MB/s)      0.310 (1067.5 MB/s)

512     8192    2.880 (169.5 MB/s)      3.260 (149.8 MB/s)      2.560 (190.7 MB/s)

831     5047    1.680 (471.7 MB/s)      0.660 (1200.8 MB/s)     0.450 (1761.1 MB/s)

2048    2048    9.380 (208.2 MB/s)      9.760 (200.1 MB/s)      5.080 (384.5 MB/s)

3981    1053    6.330 (599.8 MB/s)      1.860 (2041.2 MB/s)     1.070 (3548.2 MB/s)

8192    512     35.400 (220.7 MB/s)     36.160 (216.1 MB/s)     19.630 (398.0 MB/s)

13488   311     28.640 (449.1 MB/s)     15.610 (824.0 MB/s)     7.800 (1649.1 MB/s)

16384   256     70.920 (220.3 MB/s)     72.020 (217.0 MB/s)     38.100 (410.1 MB/s)

38893   108     138.070 (268.6 MB/s)    137.350 (270.0 MB/s)    70.470 (526.3 MB/s)

65536   64      282.470 (221.3 MB/s)    294.320 (212.4 MB/s)    154.810 (403.7 MB/s)

105001  40      405.400 (247.0 MB/s)    397.400 (252.0 MB/s)    204.320 (490.1 MB/s)

262144  16      1105.890 (226.1 MB/s)   1169.290 (213.8 MB/s)   613.710 (407.4 MB/s)

600000  7       2488.380 (230.0 MB/s)   2632.060 (217.4 MB/s)   1361.240 (420.4 MB/s)

1134355 4       4963.660 (217.9 MB/s)   5405.220 (200.1 MB/s)   2860.420 (378.2 MB/s)

2097152 2       9470.490 (211.2 MB/s)   10690.570 (187.1 MB/s)  5541.520 (360.9 MB/s)

It's interesting to see that even in cases where we don't hit the cache, Altivec is still almost 2x faster. I'll probably create some graphs with this data to post here.

The code is part of the pegasos project on Alioth (http://alioth.debian.org/projects/pegasos/) and is available from anonymous CVS right now:

Code:

cvs -z3 -d:pserver:anonymous@cvs.alioth.debian.org:/cvsroot/pegasos co altivectorize

but today I'll spend some time converting it to svn, so from tomorrow you should use:

Code:

svn co svn://svn.d-i.alioth.debian.org/svn/pegasos altivectorize

(yes, I know, lame name, but after a couple of beers it seemed fine at the time :-/)

Apart from the Altivec routines, the benchmark is written to autodetect Altivec and use the appropriate routine if available. It also compiles on x86, but of course there is no Altivec there.

However, it would be useful to see if/how it works on a G3 for example.

The Altivec detection works in 3 steps:
a) detect if gcc supports -maltivec and -mabi=altivec (compile time)
b) detect altivec.h (compile time)
c) detect if PPC_FEATURE_HAS_ALTIVEC is enabled in AT_HWCAP (run time).
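A rough sketch of the compile-time and run-time parts of such a check. It uses getauxval(), which glibc only gained years after this thread, so it is a stand-in: how the benchmark itself reads AT_HWCAP is not stated here. The function names are invented for the example, and the fallback value of PPC_FEATURE_HAS_ALTIVEC is the one from the Linux powerpc headers.

```c
#if defined(__linux__)
#include <sys/auxv.h>   /* getauxval(), glibc 2.16 or later */
#endif

/* Normally provided by the powerpc headers; defined here only as a
 * fallback so the file also builds on other architectures. */
#ifndef PPC_FEATURE_HAS_ALTIVEC
#define PPC_FEATURE_HAS_ALTIVEC 0x10000000
#endif

/* Compile-time side (steps a and b): GCC defines __ALTIVEC__ when the
 * file is built with -maltivec. */
int altivec_compiled_in(void)
{
#ifdef __ALTIVEC__
    return 1;
#else
    return 0;
#endif
}

/* Run-time side (step c): test the AT_HWCAP bit, on PowerPC Linux only. */
int altivec_at_runtime(void)
{
#if defined(__linux__) && (defined(__powerpc__) || defined(__PPC__))
    return (getauxval(AT_HWCAP) & PPC_FEATURE_HAS_ALTIVEC) != 0;
#else
    return 0;   /* not PowerPC Linux: nothing to detect */
#endif
}
```

A dispatcher would call the vector routine only when both functions return 1, and fall back to the scalar one otherwise, which is also what should happen on a G3.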

So, comments, suggestions and flames welcome

Konstantinos

Author:	hobold [ Mon Mar 07, 2005 3:57 am ]
Post subject:	Re: Altivec benchmark (what a novel title :-)
Quote:
    It's interesting to see that even in cases where we don't hit the cache, Altivec is still almost 2x faster.

This is probably due to the store miss merging feature of the hardware. I guess that glibc uses the old trick with floating point registers to store 8 bytes at a time, but that still requires four store instructions to fill a single cache line. Apparently those cannot be collapsed into a single write burst transaction, but two vector stores can. Scalar code would have to use explicit cache hints (dcbz) to prevent the read transaction caused by cache line allocation.

Author:	Neko [ Mon Mar 07, 2005 5:34 am ]
Post subject:	Re: Altivec benchmark (what a novel title :-)
Quote:
    Quote:
        It's interesting to see that even in cases where we don't hit the cache, Altivec is still almost 2x faster.
    This is probably due to the store miss merging feature of the hardware. I guess that glibc uses the old trick with floating point registers to store 8 bytes at a time, but that still requires four store instructions to fill a single cache line. Apparently those cannot be collapsed into a single write burst transaction, but two vector stores can. Scalar code would have to use explicit cache hints (dcbz) to prevent the read transaction caused by cache line allocation.

Would it be safe in generic code like glibc to use Data Streams? I forget what the behaviour is but if you set up a new stream doesn't it kill the first one set up if there are already the maximum in use? Or does it simply NOP? I'm not sure if that behaviour would be friendly in a system library or not.

Neko

Author:	markos [ Mon Mar 07, 2005 5:53 am ]
Post subject:	Re: Altivec benchmark (what a novel title :-)
Quote:
    Would it be safe in generic code like glibc to use Data Streams? I forget what the behaviour is but if you set up a new stream doesn't it kill the first one set up if there are already the maximum in use? Or does it simply NOP? I'm not sure if that behaviour would be friendly in a system library or not.

One of the tests I want to make is running two Altivec tests in parallel, to see what effect heavy Altivec use in 2 processes has on each of them. I don't expect 50% performance of course, but I'm not expecting tremendous drops either. Still, numbers will speak the truth.

Author:	hobold [ Mon Mar 07, 2005 7:22 am ]
Post subject:	Re: Altivec benchmark (what a novel title :-)
Quote:
    Would it be safe in generic code like glibc to use Data Streams?

Probably not, because the dst instruction (and its companions) are part of AltiVec.

Quote:
    I forget what the behaviour is but if you set up a new stream doesn't it kill the first one set up if there are already the maximum in use? Or does it simply NOP? I'm not sure if that behaviour would be friendly in a system library or not.

You always pass a stream identifier with each prefetch instruction. There are four streams that you must manage yourself. The convention is that the OS uses stream numbers from 3 downward, while user code uses stream number 0 upward. This minimizes collisions.

There are also scalar prefetch instructions that are present on most PPC models (there is a subset that is architected to be present on any PPC model). But using those has its own set of pitfalls, because they depend on cache line size. In the most extreme case programs can break when cache lines are of unexpected size. This is particularly true for dcbz which is architected to have the visible side effect of zeroing out a cache line.

Author:	Neko [ Mon Mar 07, 2005 5:36 pm ]
Post subject:	Re: Altivec benchmark (what a novel title :-)
Quote:
    Quote:
        Would it be safe in generic code like glibc to use Data Streams?
    Probably not, because the dst instruction (and its companions) are part of AltiVec.

Well the whole point of these benchmarks is to roll the AV code in. For a large buffer where the data isn't going to fit in the cache, data streams are going to help here specifically.

Quote:
    convention is that the OS uses stream numbers from 3 downward, while user code uses stream number 0 upward. This minimizes collisions.

2 each then at worst case. And if you reuse an already existing stream, it's re-appropriated..? That's what I would define as unfriendly.. unless there's some context to it.

Quote:
    There are also scalar prefetch instructions that are present on most PPC models (there is a subset that is architected to be present on any PPC model).

I always wondered what the big difference between data streams and the cache handling functionality was. Is it just a concession to define some cache prefetch stuff beyond the PowerPC subset defined as the minimum, or is there an advantage over using them? Apple and IBM are discouraging the use of datastreams on the G5 in favour of the "old" cache handling. Probably because IBM botched it in my opinion

Neko

Altivec benchmark (what a novel title :-) https://powerdeveloper.org/forums/viewtopic.php?f=23&t=167

Author:	hobold [ Tue Mar 08, 2005 5:27 am ]
Post subject:	Re: Altivec benchmark (what a novel title :-)
Quote:
    And if you reuse an already existing stream, it's re-appropriated..? That's what I would define as unfriendly.. unless there's some context to it.

Yep, new orders for one of the prefetch engines override the previous stream. And there is no architected context.

BUT. The rule for using stream prefetch is to prefetch small overlapping blocks. This has several effects:
- the stream is kept in synch with the computation
- the stream is restarted quickly after an interruption
- the stream does not pollute too much cache ahead of time
=> effectively, the running program _is_ the stream context

I should add that a data stream can be stopped at any time for no particular reason. For example all streams will stop at page borders, because it would require a trip to the MMU to determine the subsequent physical address.

The stream prefetch engines are a fairly low level feature, so the software will have to make up for some limitations of the hardware. Consider the stream prefetch to be a combination of hardware and software. The prefetch instructions must be used in a certain way to make the best use of this feature. We are not on a CISC processor here.

Quote:
    I always wondered what the big difference between data streams and the cache handling functionality was. Is it just a concession to define some cache prefetch stuff beyond the PowerPC subset defined as the minimum, or is there an advantage over using them?

The main difference between traditional prefetch instructions and AltiVec data streams is that the streams are even more asynchronous than the 'data cache block ???' instructions. There is no guarantee how quickly a stream prefetch instruction can fulfill its task. You do have the advantage of saving instruction bandwidth as compared to issuing a cache block touch for every cache line, which is particularly important on CPU models with a single LSU. But the stream prefetch hardware cannot do address translation, so it is limited to a physical page when it increments or decrements its internal pointer. The scalar prefetch instructions always run their address through the MMU.

Quote:
    Apple and IBM are discouraging the use of datastreams on the G5 in favour of the "old" cache handling. Probably because IBM botched it in my opinion

Well, there are still cases when data streams are beneficial on a G5, but they are implemented in a fairly limited way. At the same time, PPC970 has automatic stream detection for sequential accesses. Then there are its huge cache lines of 128 bytes, which pretty much fit the description of "small block of data". So there is less of a need for software directed data streams on G5, because each scalar prefetch instruction can pull in a lot more data.

I guess one could conclude that data streams were too low level, because they are not equally useful over a wide range of processor models. OTOH, I like the ability to specify block sizes and strides; pulling in a single cache line is just not as powerful.
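The "small overlapping blocks" rule might be sketched roughly as follows, using GCC's vec_dst and vec_dss intrinsics from altivec.h. The function, the block sizes and the choice of stream 0 are all assumptions picked for illustration; built without -maltivec the prefetch calls simply drop out and only the plain loop remains.

```c
#include <stddef.h>

#ifdef __ALTIVEC__
#include <altivec.h>
#endif

#define BLOCK_BYTES   512   /* bytes consumed per outer iteration (assumed) */
#define PREFETCH_VECS 8     /* 16-byte vectors per stream block: 128 bytes */
#define PREFETCH_BLKS 8     /* blocks per request: 8 * 128 = 1024 bytes */

/* Sum a byte buffer while keeping a short prefetch stream running ahead. */
unsigned long sum_with_stream(const unsigned char *p, size_t n)
{
    unsigned long sum = 0;
    size_t i, j, end;

    for (i = 0; i < n; i += BLOCK_BYTES) {
#ifdef __ALTIVEC__
        /* Control word: block size in vectors, block count, byte stride.
         * Re-issued every iteration on stream 0, so each 1024-byte request
         * overlaps the previous one by half and a stream that was stopped
         * (page border, interrupt) is restarted right away. */
        vec_dst(p + i,
                (PREFETCH_VECS << 24) | (PREFETCH_BLKS << 16) | (PREFETCH_VECS * 16),
                0);
#endif
        end = (i + BLOCK_BYTES < n) ? i + BLOCK_BYTES : n;
        for (j = i; j < end; j++)
            sum += p[j];
    }
#ifdef __ALTIVEC__
    vec_dss(0);   /* stop stream 0 once the work is done */
#endif
    return sum;
}
```

Because the request is repeated from the current position on every pass, the program itself carries the stream context, and losing a stream to another user of the same stream number costs at most one block of prefetching.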

All times are UTC-06:00