Power Developer
https://powerdeveloper.org/forums/

Altivec benchmark (what a novel title :-)
https://powerdeveloper.org/forums/viewtopic.php?f=23&t=167

Author:  markos [ Mon Mar 07, 2005 2:33 am ]
Post subject:  Altivec benchmark (what a novel title :-)

Ok, I finished the altivec benchmark code, with a first test of the strfill() routine (which is essentially memset() with a '\0' appended at the end).
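The scalar version is basically just this (simplified sketch, not the exact code from CVS; names and prototype are illustrative only):
Code:
/* Simplified scalar strfill(): fill len bytes of dst with c and
 * terminate with '\0'; dst must have room for len + 1 bytes.
 * (Illustrative sketch, not the benchmark's actual code.) */
#include <string.h>

char *strfill(char *dst, int c, size_t len)
{
    memset(dst, c, len);
    dst[len] = '\0';
    return dst;
}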
Here is the output of the program:
Code:
$ ./altivectorize -v -s -g --norandom --loops 1000000
Altivec is supported
Verbose mode on
Will do both scalar and vector tests
Will also do glibc tests
loops: 1000000
output file:
#size arrays scalar glibc altivec (Effective bandwidth)
7 599186 0.030 (222.5 MB/s) 0.060 (111.3 MB/s) 0.140 (47.7 MB/s)
13 325000 0.090 (137.8 MB/s) 0.060 (206.6 MB/s) 0.100 (124.0 MB/s)
16 262144 0.090 (169.5 MB/s) 0.050 (305.2 MB/s) 0.080 (190.7 MB/s)
20 209715 0.100 (190.7 MB/s) 0.040 (476.8 MB/s) 0.080 (238.4 MB/s)
27 155344 0.110 (234.1 MB/s) 0.050 (515.0 MB/s) 0.090 (286.1 MB/s)
35 119837 0.120 (278.2 MB/s) 0.050 (667.6 MB/s) 0.070 (476.8 MB/s)
43 97542 0.130 (315.4 MB/s) 0.060 (683.5 MB/s) 0.070 (585.8 MB/s)
54 77672 0.140 (367.8 MB/s) 0.070 (735.7 MB/s) 0.080 (643.7 MB/s)
64 65536 0.150 (406.9 MB/s) 0.060 (1017.3 MB/s) 0.080 (762.9 MB/s)
90 46603 0.180 (476.8 MB/s) 0.080 (1072.9 MB/s) 0.080 (1072.9 MB/s)
128 32768 0.230 (530.7 MB/s) 0.090 (1356.3 MB/s) 0.080 (1525.9 MB/s)
185 22672 0.320 (551.3 MB/s) 0.100 (1764.3 MB/s) 0.100 (1764.3 MB/s)
256 16384 0.400 (610.4 MB/s) 0.130 (1878.0 MB/s) 0.110 (2219.5 MB/s)
347 12087 0.530 (624.4 MB/s) 0.160 (2068.3 MB/s) 0.120 (2757.7 MB/s)
512 8192 0.880 (554.9 MB/s) 0.260 (1878.0 MB/s) 0.150 (3255.2 MB/s)
831 5047 1.930 (410.6 MB/s) 0.410 (1932.9 MB/s) 0.170 (4661.8 MB/s)
2048 2048 3.410 (572.8 MB/s) 0.800 (2441.4 MB/s) 0.260 (7512.0 MB/s)
3981 1053 5.540 (685.3 MB/s) 1.710 (2220.2 MB/s) 0.460 (8253.4 MB/s)
8192 512 11.240 (695.1 MB/s) 3.110 (2512.1 MB/s) 0.790 (9889.2 MB/s)
13488 311 18.690 (688.2 MB/s) 5.580 (2305.2 MB/s) 1.240 (10373.5 MB/s)
16384 256 22.840 (684.1 MB/s) 6.730 (2321.7 MB/s) 1.430 (10926.6 MB/s)
38893 108 65.790 (563.8 MB/s) 20.240 (1832.6 MB/s) 14.860 (2496.0 MB/s)
65536 64 111.540 (560.3 MB/s) 36.530 (1710.9 MB/s) 25.530 (2448.1 MB/s)
105001 40 179.650 (557.4 MB/s) 55.760 (1795.9 MB/s) 40.730 (2458.6 MB/s)
262144 16 456.450 (547.7 MB/s) 149.500 (1672.2 MB/s) 118.930 (2102.1 MB/s)
600000 7 1824.510 (313.6 MB/s) 1528.040 (374.5 MB/s) 779.820 (733.8 MB/s)
1134355 4 4706.650 (229.8 MB/s) 4936.750 (219.1 MB/s) 2651.260 (408.0 MB/s)
2097152 2 9408.000 (212.6 MB/s) 10181.350 (196.4 MB/s) 6009.540 (332.8 MB/s)
And this is the same test with data picked randomly from a large pool, so that the chance of it already being in the cache is minimised.
Code:
$ ./altivectorize -v -s -g --loops 1000000
Altivec is supported
Verbose mode on
Will do both scalar and vector tests
Will also do glibc tests
loops: 1000000
output file:
#size arrays scalar glibc altivec (Effective bandwidth)
7 599186 0.210 (31.8 MB/s) 0.160 (41.7 MB/s) 0.200 (33.4 MB/s)
13 325000 0.220 (56.4 MB/s) 0.160 (77.5 MB/s) 0.200 (62.0 MB/s)
16 262144 0.600 (25.4 MB/s) 0.690 (22.1 MB/s) 0.560 (27.2 MB/s)
20 209715 0.220 (86.7 MB/s) 0.150 (127.2 MB/s) 0.190 (100.4 MB/s)
27 155344 0.210 (122.6 MB/s) 0.150 (171.7 MB/s) 0.200 (128.7 MB/s)
35 119837 0.390 (85.6 MB/s) 0.170 (196.3 MB/s) 0.170 (196.3 MB/s)
43 97542 0.330 (124.3 MB/s) 0.200 (205.0 MB/s) 0.210 (195.3 MB/s)
54 77672 0.290 (177.6 MB/s) 0.420 (122.6 MB/s) 0.220 (234.1 MB/s)
64 65536 0.940 (64.9 MB/s) 1.150 (53.1 MB/s) 0.950 (64.2 MB/s)
90 46603 0.260 (330.1 MB/s) 0.190 (451.7 MB/s) 0.190 (451.7 MB/s)
128 32768 1.090 (112.0 MB/s) 1.110 (110.0 MB/s) 0.850 (143.6 MB/s)
185 22672 0.370 (476.8 MB/s) 0.220 (802.0 MB/s) 0.190 (928.6 MB/s)
256 16384 1.660 (147.1 MB/s) 2.010 (121.5 MB/s) 1.330 (183.6 MB/s)
347 12087 0.850 (389.3 MB/s) 0.390 (848.5 MB/s) 0.310 (1067.5 MB/s)
512 8192 2.880 (169.5 MB/s) 3.260 (149.8 MB/s) 2.560 (190.7 MB/s)
831 5047 1.680 (471.7 MB/s) 0.660 (1200.8 MB/s) 0.450 (1761.1 MB/s)
2048 2048 9.380 (208.2 MB/s) 9.760 (200.1 MB/s) 5.080 (384.5 MB/s)
3981 1053 6.330 (599.8 MB/s) 1.860 (2041.2 MB/s) 1.070 (3548.2 MB/s)
8192 512 35.400 (220.7 MB/s) 36.160 (216.1 MB/s) 19.630 (398.0 MB/s)
13488 311 28.640 (449.1 MB/s) 15.610 (824.0 MB/s) 7.800 (1649.1 MB/s)
16384 256 70.920 (220.3 MB/s) 72.020 (217.0 MB/s) 38.100 (410.1 MB/s)
38893 108 138.070 (268.6 MB/s) 137.350 (270.0 MB/s) 70.470 (526.3 MB/s)
65536 64 282.470 (221.3 MB/s) 294.320 (212.4 MB/s) 154.810 (403.7 MB/s)
105001 40 405.400 (247.0 MB/s) 397.400 (252.0 MB/s) 204.320 (490.1 MB/s)
262144 16 1105.890 (226.1 MB/s) 1169.290 (213.8 MB/s) 613.710 (407.4 MB/s)
600000 7 2488.380 (230.0 MB/s) 2632.060 (217.4 MB/s) 1361.240 (420.4 MB/s)
1134355 4 4963.660 (217.9 MB/s) 5405.220 (200.1 MB/s) 2860.420 (378.2 MB/s)
2097152 2 9470.490 (211.2 MB/s) 10690.570 (187.1 MB/s) 5541.520 (360.9 MB/s)
It's interesting to see that even in cases where we don't hit the cache, Altivec is still almost 2x faster. I'll probably create some graphs from this data to post here.

The code is part of the pegasos project in alioth (http://alioth.debian.org/projects/pegasos/), and available from anonymous cvs right now:
Code:
cvs -z3 -d:pserver:anonymous@cvs.alioth.debian.org:/cvsroot/pegasos co altivectorize
but today I'll spend some time converting it to svn, so by tomorrow you should use:
Code:
svn co svn://svn.d-i.alioth.debian.org/svn/pegasos altivectorize
(yes, I know, it's a lame name, but after a couple of beers it seemed fine at the time :-/)

Apart from the altivec routines, this benchmark is written to autodetect Altivec and use the appropriate routine if available. It also compiles on x86, but of course there is no Altivec there. :-)

However, it would be useful to see if/how it works on a G3 for example.

The Altivec detection works in 3 steps:
a) detect if gcc supports -maltivec and -mabi=altivec (compile time)
b) detect altivec.h (compile time)
c) detect whether PPC_FEATURE_HAS_ALTIVEC is set in AT_HWCAP (run time; sketched below).
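Roughly, that last check scans the ELF auxiliary vector for AT_HWCAP (simplified sketch, not the exact code from CVS; assumes 32-bit PowerPC Linux):
Code:
/* Simplified run-time Altivec check: look up AT_HWCAP in
 * /proc/self/auxv and test the PPC_FEATURE_HAS_ALTIVEC bit.
 * Illustrative sketch only; 32-bit PowerPC Linux specific. */
#include <stdio.h>
#include <elf.h>            /* Elf32_auxv_t, AT_HWCAP */
#include <asm/cputable.h>   /* PPC_FEATURE_HAS_ALTIVEC */

int have_altivec(void)
{
    Elf32_auxv_t entry;
    FILE *f = fopen("/proc/self/auxv", "r");
    int found = 0;

    if (!f)
        return 0;
    while (fread(&entry, sizeof(entry), 1, f) == 1) {
        if (entry.a_type == AT_HWCAP) {
            found = (entry.a_un.a_val & PPC_FEATURE_HAS_ALTIVEC) != 0;
            break;
        }
    }
    fclose(f);
    return found;
}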

So, comments, suggestions and flames welcome :-)

Konstantinos

Author:  hobold [ Mon Mar 07, 2005 3:57 am ]
Post subject:  Re: Altivec benchmark (what a novel title :-)

Quote:
It's interesting to see, that even in cases where we don't hit the cache, Altivec is still almost 2x faster.
This is probably due to the store miss merging feature of the hardware. I guess that glibc uses the old trick with floating point registers to store 8 bytes at a time, but that still requires four store instructions to fill a single cache line. Apparently those cannot be collapsed into a single write burst transaction, but two vector stores can.

Scalar code would have to use explicit cache hints (dcbz) to prevent the read transaction caused by cache line allocation.
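To illustrate, a vectorized fill only needs two 16-byte stores per 32-byte G4 cache line, roughly like this (hypothetical inner loop, not the benchmark's actual code; assumes a 16-byte aligned destination and a length that is a multiple of 32):
Code:
/* Hypothetical vectorized fill inner loop: two vec_st per 32-byte
 * cache line, which the store queue can merge into one write burst.
 * Assumes dst is 16-byte aligned and len is a multiple of 32. */
#include <altivec.h>
#include <stddef.h>

void vec_fill_zero(unsigned char *dst, size_t len)
{
    vector unsigned char zero = vec_splat_u8(0);
    size_t i;

    for (i = 0; i < len; i += 32) {
        vec_st(zero, 0,  dst + i);   /* first half of the line  */
        vec_st(zero, 16, dst + i);   /* second half of the line */
    }
}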

Author:  Neko [ Mon Mar 07, 2005 5:34 am ]
Post subject:  Re: Altivec benchmark (what a novel title :-)

Quote:
Quote:
It's interesting to see, that even in cases where we don't hit the cache, Altivec is still almost 2x faster.
This is probably due to the store miss merging feature of the hardware. I guess that glibc uses the old trick with floating point registers to store 8 bytes at a time, but that still requires four store instructions to fill a single cache line. Apparently those cannot be collapsed into a single write burst transaction, but two vector stores can.

Scalar code would have to use explicit cache hints (dcbz) to prevent the read transaction caused by cache line allocation.
Would it be safe in generic code like glibc to use Data Streams?

I forget what the behaviour is but if you set up a new stream doesn't it kill the first one set up if there are already the maximum in use? Or does it simply NOP? I'm not sure if that behaviour would be friendly in a system library or not.

Neko

Author:  markos [ Mon Mar 07, 2005 5:53 am ]
Post subject:  Re: Altivec benchmark (what a novel title :-)

Quote:
Would it be safe in generic code like glibc to use Data Streams?

I forget what the behaviour is but if you set up a new stream doesn't it kill the first one set up if there are already the maximum in use? Or does it simply NOP? I'm not sure if that behaviour would be friendly in a system library or not.
One of the tests I want to make is to run two tests that use Altivec in parallel, and see what effect heavily using Altivec in two processes has on each one. I don't expect 50% performance each, of course, but I'm not expecting tremendous drops either. Still, the numbers will speak the truth :-)

Author:  hobold [ Mon Mar 07, 2005 7:22 am ]
Post subject:  Re: Altivec benchmark (what a novel title :-)

Quote:
Would it be safe in generic code like glibc to use Data Streams?
Probably not, because the dst instruction (and its companions) are part of AltiVec.
Quote:
I forget what the behaviour is but if you set up a new stream doesn't it kill the first one set up if there are already the maximum in use? Or does it simply NOP? I'm not sure if that behaviour would be friendly in a system library or not.
You always pass a stream identifier with each prefetch instruction. There are four streams that you must manage yourself. The convention is that the OS uses stream numbers from 3 downward, while user code uses stream number 0 upward. This minimizes collisions.
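In code, user-level use of a stream under that convention looks roughly like this (hypothetical sketch; the control word packs a block size in 16-byte units, a block count, and a signed byte stride):
Code:
/* Hypothetical prefetch of one contiguous region on stream 0
 * (user code counts up from 0). Control word layout: block size
 * in vectors (bits 24-28), block count (bits 16-23), signed byte
 * stride (bits 0-15). */
#include <altivec.h>

#define DST_CTRL(size, count, stride) \
    (((size) << 24) | ((count) << 16) | ((stride) & 0xFFFF))

void touch_region(const unsigned char *p)
{
    /* 4 blocks of 2 vectors (32 bytes) each, contiguous stride */
    vec_dst(p, DST_CTRL(2, 4, 32), 0);

    /* ... consume the data ... */

    vec_dss(0);        /* stop stream 0 when done */
}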

There are also scalar prefetch instructions that are present on most PPC models (there is a subset that is architected to be present on any PPC model). But using those has its own set of pitfalls, because they depend on the cache line size. In the most extreme case, programs can break when cache lines are of an unexpected size. This is particularly true for dcbz, which is architected to have the visible side effect of zeroing out a cache line.
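For comparison, a scalar prefetch loop can be written with GCC's __builtin_prefetch(), which becomes a dcbt on PowerPC (hypothetical sketch; note the hard-coded 32-byte line size, which is exactly the portability pitfall mentioned above):
Code:
/* Hypothetical scalar prefetch loop: touch a cache line some distance
 * ahead of the current read position. __builtin_prefetch() compiles
 * to dcbt on PowerPC with GCC, and dcbt never faults, so prefetching
 * past the end of the buffer is harmless. The 32-byte step assumes a
 * G4-style cache line; dcbz is worse in this respect, since it
 * visibly zeroes whatever the line size happens to be. */
#include <stddef.h>

unsigned long sum_bytes(const unsigned char *src, size_t len)
{
    unsigned long sum = 0;
    size_t i, j;

    for (i = 0; i < len; i += 32) {          /* one 32-byte line per pass */
        __builtin_prefetch(src + i + 256);   /* touch a line ~8 lines ahead */
        for (j = i; j < i + 32 && j < len; j++)
            sum += src[j];
    }
    return sum;
}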

Author:  Neko [ Mon Mar 07, 2005 5:36 pm ]
Post subject:  Re: Altivec benchmark (what a novel title :-)

Quote:
Quote:
Would it be safe in generic code like glibc to use Data Streams?
Probably not, because the dst instruction (and its companions) are part of AltiVec.
Well, the whole point of these benchmarks is to roll the AltiVec code into glibc. For a large buffer where the data isn't going to fit in the cache, data streams specifically are going to help.
Quote:
convention is that the OS uses stream numbers from 3 downward, while user code uses stream number 0 upward. This minimizes collisions.
Two each at worst, then. And if you reuse an already existing stream, it's re-appropriated..? That's what I would define as unfriendly.. unless there's some context to it.
Quote:
There are also scalar prefetch instructions that are present on most PPC models (there is a subset that is architected to be present on any PPC model).
I always wondered what the big difference between data streams and the cache handling functionality was. Is it just a concession to define some cache prefetch stuff beyond the PowerPC subset defined as the minimum, or is there an advantage over using them?

Apple and IBM are discouraging the use of datastreams on the G5 in favour of the "old" cache handling. Probably because IBM botched it in my opinion :)

Neko

Author:  hobold [ Tue Mar 08, 2005 5:27 am ]
Post subject:  Re: Altivec benchmark (what a novel title :-)

Quote:
And if you reuse an already existing stream, it's re-appropriated..? That's what I would define as unfriendly.. unless there's some context to it.
Yep, new orders for one of the prefetch engines override the previous stream. And there is no architected context. BUT.

The rule for using stream prefetch is to prefetch small overlapping blocks. This has several effects:

- the stream is kept in synch with the computation
- the stream is restarted quickly after an interruption
- the stream does not pollute too much cache ahead of time
=> effectively, the running program _is_ the stream context
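In practice that means re-issuing the dst with a short block on every pass of the processing loop, something like this (hypothetical sketch, same control-word encoding as above):
Code:
/* Hypothetical processing loop that refreshes stream 0 on each
 * iteration with a short, overlapping prefetch block, so the stream
 * stays in sync with the computation and recovers quickly after it
 * is stopped. */
#include <altivec.h>
#include <stddef.h>

#define DST_CTRL(size, count, stride) \
    (((size) << 24) | ((count) << 16) | ((stride) & 0xFFFF))

void consume(const unsigned char *p, size_t len)
{
    size_t i;

    for (i = 0; i < len; i += 128) {
        /* prefetch 8 x 32-byte blocks (256 bytes) starting here,
         * overlapping what the next iteration will read */
        vec_dst(p + i, DST_CTRL(2, 8, 32), 0);

        /* ... process the 128 bytes at p + i ... */
    }
    vec_dss(0);
}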

I should add that a data stream can be stopped at any time for no particular reason. For example, all streams will stop at page boundaries, because continuing would require a trip to the MMU to determine the next physical address.

The stream prefetch engines are a fairly low level feature, so the software will have to make up for some limitations of the hardware. Consider the stream prefetch to be a combination of hardware and software. The prefetch instructions must be used in a certain way to make the best use of this feature. We are not on a CISC processor here.
Quote:
I always wondered what the big difference between data streams and the cache handling functionality was. Is it just a concession to define some cache prefetch stuff beyond the PowerPC subset defined as the minimum, or is there an advantage over using them?
The main difference between traditional prefetch instructions and AltiVec data streams is that the streams are even more asynchronous than the 'data cache block ???' instructions. There is no guarantee how quickly a stream prefetch instruction can fulfill its task. You do have the advantage of saving instruction bandwidth as compared to issuing a cache block touch for every cache line, which is particularly important on CPU models with a single LSU. But the stream prefetch hardware cannot do address translation, so it is limited to a physical page when it increments or decrements its internal pointer. The scalar prefetch instructions always run their address through the MMU.
Quote:
Apple and IBM are discouraging the use of datastreams on the G5 in favour of the "old" cache handling. Probably because IBM botched it in my opinion :)
Well, there are still cases when data streams are beneficial on a G5, but they are implemented in a fairly limited way. At the same time, PPC970 has automatic stream detection for sequential accesses. Then there are its huge cache lines of 128 bytes, which pretty much fit the description of "small block of data". So there is less of a need for software directed data streams on G5, because each scalar prefetch instruction can pull in a lot more data.

I guess one could conclude that data streams were too low level, because they are not equally useful over a wide range of processor models. OTOH, I like the ability to specify block sizes and strides; pulling in a single cache line is just not as powerful.
