Power Developer • memcpy() vectorized (plus benchmarks)

All times are UTC-06:00

markos

Post subject: memcpy() vectorized (plus benchmarks)

Posted: Thu Mar 10, 2005 10:38 pm

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348

Here are the benchmarks for the vectorized memcpy().
I'll post a comparison to libmotovec right after.

Code:

$ ./altivectorize -v -s -g --norandom --loops 1000000
Altivec is supported
Verbose mode on
Will do both scalar and vector tests
Will also do glibc tests
loops: 1000000
output file:
will do tests: memcpy
#size   arrays  glibc                   altivec (Effective bandwidth)
7       599186  0.130 (51.4 MB/s)       0.090 (74.2 MB/s)
13      325000  0.140 (88.6 MB/s)       0.110 (112.7 MB/s)
16      262144  0.140 (109.0 MB/s)      0.150 (101.7 MB/s)
20      209715  0.140 (136.2 MB/s)      0.210 (90.8 MB/s)
27      155344  0.140 (183.9 MB/s)      0.150 (171.7 MB/s)
35      119837  0.150 (222.5 MB/s)      0.140 (238.4 MB/s)
43      97542   0.170 (241.2 MB/s)      0.160 (256.3 MB/s)
54      77672   0.180 (286.1 MB/s)      0.140 (367.8 MB/s)
64      65536   0.160 (381.5 MB/s)      0.130 (469.5 MB/s)
90      46603   0.190 (451.7 MB/s)      0.180 (476.8 MB/s)
128     32768   0.210 (581.3 MB/s)      0.140 (871.9 MB/s)
185     22672   0.250 (705.7 MB/s)      0.170 (1037.8 MB/s)
256     16384   0.320 (762.9 MB/s)      0.180 (1356.3 MB/s)
347     12087   0.370 (894.4 MB/s)      0.200 (1654.6 MB/s)
512     8192    0.520 (939.0 MB/s)      0.250 (1953.1 MB/s)
831     5047    0.770 (1029.2 MB/s)     0.350 (2264.3 MB/s)
2048    2048    1.940 (1006.8 MB/s)     0.760 (2569.9 MB/s)
3981    1053    3.580 (1060.5 MB/s)     0.980 (3874.1 MB/s)
8192    512     7.080 (1103.5 MB/s)     2.560 (3051.8 MB/s)
13488   311     11.450 (1123.4 MB/s)    4.220 (3048.1 MB/s)
16384   256     14.040 (1112.9 MB/s)    4.980 (3137.6 MB/s)
38893   108     34.160 (1085.8 MB/s)    16.700 (2221.0 MB/s)
65536   64      62.160 (1005.5 MB/s)    30.040 (2080.6 MB/s)
105001  40      101.610 (985.5 MB/s)    43.490 (2302.5 MB/s)
262144  16      259.710 (962.6 MB/s)    121.090 (2064.6 MB/s)
600000  7       1300.250 (440.1 MB/s)   939.560 (609.0 MB/s)
1134355 4       3007.870 (359.7 MB/s)   2913.140 (371.4 MB/s)
2097152 2       6011.970 (332.7 MB/s)   5841.110 (342.4 MB/s)
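For anyone trying to read the bandwidth column: the figures in the table are consistent with size * loops / time, with the time taken as seconds and 1 MB counted as 2^20 bytes. That is inferred from the numbers, not taken from the altivectorize source, so treat the helper below (effective_bandwidth is a made-up name) as a sketch under that assumption.

```c
#include <assert.h>

/* Effective bandwidth in MB/s, counting 1 MB as 2^20 bytes.
 * Assumption: `seconds` is the total time for all `loops` copies
 * of `size_bytes` bytes each. The formula is inferred from the
 * table values, not from the benchmark's source code. */
double effective_bandwidth(double size_bytes, double loops, double seconds)
{
    assert(seconds > 0.0);
    return size_bytes * loops / seconds / (1024.0 * 1024.0);
}
```

With the first row (size 7, 1000000 loops, 0.130 s) this gives roughly 51.4, and with the 2048-byte altivec entry (0.760 s) roughly 2569.9, both matching the table.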

The code is available in the same cvs repo as before:

Code:

$ cvs -z3 -d:pserver:anonymous@cvs.alioth.debian.org:/cvsroot/pegasos co altivectorize

As for the code, you'll notice that it's quite fast even for small sizes (sometimes even faster than glibc). Also, since alignment issues are handled by using the original memcpy() to copy the unaligned offset bytes, the speed of the routine is pretty much constant regardless of alignment.
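To illustrate the alignment handling described above, here is a minimal sketch of the head/body/tail split. It is not the code from the cvs repo: sketch_memcpy and VEC_BYTES are names made up for illustration, and the 16-byte body copy uses plain memcpy() as a portable stand-in for the AltiVec loads and stores, so it only shows the structure and not the speedup.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VEC_BYTES 16  /* width of one AltiVec vector register */

/* Hypothetical sketch: copy the unaligned head with the ordinary
 * memcpy(), then move whole 16-byte blocks, then copy the tail. */
void *sketch_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Head: bytes needed to bring dst up to a 16-byte boundary. */
    size_t head = (VEC_BYTES - ((uintptr_t)d % VEC_BYTES)) % VEC_BYTES;
    if (head > n)
        head = n;
    memcpy(d, s, head);
    d += head; s += head; n -= head;

    /* Either dst is now aligned or nothing is left to copy. */
    assert(n == 0 || (uintptr_t)d % VEC_BYTES == 0);

    /* Body: one 16-byte block per iteration. A real AltiVec version
     * would use vector loads and stores here instead of memcpy(). */
    while (n >= VEC_BYTES) {
        memcpy(d, s, VEC_BYTES);
        d += VEC_BYTES; s += VEC_BYTES; n -= VEC_BYTES;
    }

    /* Tail: fewer than 16 bytes remain. */
    memcpy(d, s, n);
    return dst;
}
```

Note that only the destination is aligned here; a vector version still has to deal with a source that sits at a different offset, which this sketch sidesteps by using memcpy() for the blocks.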


markos

Post subject: Re: memcpy() vectorized (plus benchmarks)

Posted: Fri Mar 11, 2005 8:07 am

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348

Quote:

I'll post a comparison to libmotovec right after.

I linked the benchmark app against libmotovec, so that libmotovec's memcpy() is used as the default one. Its results are the ones shown in the "glibc" column.

Code:

$ ./altivectorize -v -s -g --norandom --loops 1000000
Altivec is supported
Verbose mode on
Will do both scalar and vector tests
Will also do glibc tests
loops: 1000000
output file:
will do tests: memcpy
#size   arrays  glibc                   altivec (Effective bandwidth)
7       599186  0.090 (74.2 MB/s)       0.080 (83.4 MB/s)
13      325000  0.110 (112.7 MB/s)      0.100 (124.0 MB/s)
16      262144  0.130 (117.4 MB/s)      0.100 (152.6 MB/s)
20      209715  0.110 (173.4 MB/s)      0.160 (119.2 MB/s)
27      155344  0.120 (214.6 MB/s)      0.130 (198.1 MB/s)
35      119837  0.110 (303.4 MB/s)      0.100 (333.8 MB/s)
43      97542   0.130 (315.4 MB/s)      0.130 (315.4 MB/s)
54      77672   0.120 (429.2 MB/s)      0.130 (396.1 MB/s)
64      65536   0.130 (469.5 MB/s)      0.120 (508.6 MB/s)
90      46603   0.160 (536.4 MB/s)      0.180 (476.8 MB/s)
128     32768   0.180 (678.2 MB/s)      0.160 (762.9 MB/s)
185     22672   0.190 (928.6 MB/s)      0.170 (1037.8 MB/s)
256     16384   0.230 (1061.5 MB/s)     0.190 (1285.0 MB/s)
347     12087   0.230 (1438.8 MB/s)     0.210 (1575.8 MB/s)
512     8192    0.300 (1627.6 MB/s)     0.280 (1743.9 MB/s)
831     5047    0.410 (1932.9 MB/s)     0.330 (2401.5 MB/s)
2048    2048    0.780 (2504.0 MB/s)     0.800 (2441.4 MB/s)
3981    1053    1.360 (2791.6 MB/s)     0.970 (3914.0 MB/s)
8192    512     2.440 (3201.8 MB/s)     2.790 (2800.2 MB/s)
13488   311     4.310 (2984.5 MB/s)     4.200 (3062.7 MB/s)
16384   256     5.280 (2959.3 MB/s)     5.110 (3057.7 MB/s)
38893   108     16.990 (2183.1 MB/s)    15.880 (2335.7 MB/s)
65536   64      28.670 (2180.0 MB/s)    27.190 (2298.6 MB/s)
105001  40      49.370 (2028.3 MB/s)    46.730 (2142.9 MB/s)
262144  16      138.550 (1804.4 MB/s)   129.870 (1925.0 MB/s)
600000  7       1040.730 (549.8 MB/s)   1045.950 (547.1 MB/s)
1134355 4       2742.170 (394.5 MB/s)   2868.980 (377.1 MB/s)

I think the numbers speak for themselves. Who said assembly is better than C?


gunnar

Posted: Wed May 17, 2006 10:34 am

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161

Hmm, I'm getting somewhat different results.

Memorybench - copying a block of 80 MB from a -> b

glibc: 366.8876 MB/sec
Freevec: 381.2597 MB/sec
Motovec: 637.9746 MB/sec
FC64: 611.4247 MB/sec

Cachebench - copying a block of 8 KB from a -> b

glibc: 2637.2605 MB/sec
Freevec: 5510.4557 MB/sec
Motovec: 7355.0409 MB/sec
FC64: 4557.8185 MB/sec

*FC64 is a really simple copy loop using the floating-point registers.
Its loop is unrolled to copy 64 bytes (2 cache lines) per iteration.
The copy gains its speed by using dcbt to prefetch the next two cache lines while copying the current ones.
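For readers who want to picture such a loop, here is a rough C sketch of the idea. It is not the actual FC64 code: fc64_copy is a made-up name, the buffers are assumed to be non-overlapping, 8-byte aligned and a multiple of 64 bytes long, and __builtin_prefetch (a GCC/Clang builtin) stands in for a hand-written dcbt. The 32-byte cache line is what the "64 bytes = 2 cache lines" figure above implies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of an FC64-style copy: 64 bytes per iteration
 * moved as eight doubles (so the compiler can keep them in
 * floating-point registers), with a prefetch of the next 64 bytes. */
void fc64_copy(double *dst, const double *src, size_t nbytes)
{
    assert(nbytes % 64 == 0);
    size_t blocks = nbytes / 64;

    for (size_t b = 0; b < blocks; b++) {
        const double *s = src + b * 8;
        double *d = dst + b * 8;

        /* Prefetch the next two 32-byte lines, unless this is the
         * last block and there is nothing left to fetch. */
        if (b + 1 < blocks) {
            __builtin_prefetch((const char *)s + 64);
            __builtin_prefetch((const char *)s + 96);
        }

        /* Load all eight values first, then store them. */
        double r0 = s[0], r1 = s[1], r2 = s[2], r3 = s[3];
        double r4 = s[4], r5 = s[5], r6 = s[6], r7 = s[7];
        d[0] = r0; d[1] = r1; d[2] = r2; d[3] = r3;
        d[4] = r4; d[5] = r5; d[6] = r6; d[7] = r7;
    }
}
```

Whether the compiler actually emits the intended float loads, stores and prefetch instructions depends on the target and optimization flags, so a version meant for benchmarking would more likely be written in assembly.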

I think improving the memory throughput from 360 to 630 MB/sec
would have a huge impact on many applications.

Freevec does not improve the memory throughput as much as libmotovec does.
But maybe I'm doing something wrong with Freevec here.

For some more PowerPC memory benchmarks see here:
glibc benchmarks

Cheers
Gunnar


pvdabeel

Posted: Thu May 18, 2006 9:50 pm

Joined: Thu Apr 07, 2005 10:40 am
Posts: 35

What would be the easiest way to get your applications to use these altivec optimized functions instead of the glibc provided ones?


gunnar

Posted: Fri May 19, 2006 2:08 am

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161

Quote:

What would be the easiest way to get your applications to use these altivec optimized functions instead of the glibc provided ones?

The easiest for all users, and the best for Linux, would be
if Linux used the libc functions that Mac OS X uses.
Apple has optimized functions for every PowerPC CPU (G3/G4/G5).
OS X benchmarks the functions for your system on startup and then uses the fastest one for you.
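The general idea of picking a routine at startup can be sketched with a function pointer. This is only an illustration of the approach, not how OS X implements it: copy_bytes, copy_libc, time_copy and pick_fastest_copy are all made-up names, and the clock()-based timing is far too crude for a real selection.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef void *(*copy_fn)(void *, const void *, size_t);

/* Hypothetical candidate 1: plain byte-by-byte loop. */
static void *copy_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
    return dst;
}

/* Hypothetical candidate 2: whatever the C library provides. */
static void *copy_libc(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Crude timing: CPU clock ticks for 2000 copies of a 4 KB buffer. */
static double time_copy(copy_fn fn)
{
    static unsigned char a[4096], b[4096];
    clock_t start = clock();
    for (int i = 0; i < 2000; i++)
        fn(b, a, sizeof a);
    return (double)(clock() - start);
}

/* Time every candidate once and return the quickest one. */
copy_fn pick_fastest_copy(void)
{
    copy_fn candidates[] = { copy_bytes, copy_libc };
    size_t count = sizeof candidates / sizeof candidates[0];
    copy_fn best = candidates[0];
    double best_time = time_copy(best);

    for (size_t i = 1; i < count; i++) {
        double t = time_copy(candidates[i]);
        if (t < best_time) {
            best_time = t;
            best = candidates[i];
        }
    }
    assert(best != NULL);
    return best;
}
```

An application would call pick_fastest_copy() once at startup, keep the returned pointer, and route its copies through it.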

The Apple routines are up to 80% faster than the "simpleminded" ones that Linux uses. As the Apple sources are free, there is actually no excuse for Linux not to use them.

If you don't want to wait for Linux to use proper PPC functions but want to build your application with the AltiVec library, then you just need to include it on the linker command line ahead of the compiler's libc library.
Examples of how this can be done for each compiler are in the motovec readme.

Cheers
Gunnar
