Power Developer • memchr() vectorized too

Unanswered topics | Active topics

Board index » »

All times are UTC-06:00

memchr() vectorized too

Post new topic Reply to topic

Page 1 of 1

[ 6 posts ]

Print view

Previous topic | Next topic

Author

Message

markos

Post subject: memchr() vectorized too

PostPosted: Sat Mar 12, 2005 7:21 am

Offline

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348

I just finished memchr(), the code will be commited shortly to the cvs. The nice thing is that strlen is pretty much the same to memchr() so we have 2 vectorized functions actually

Code:

$ ./altivectorize -v -s -g --tests memchr --norandom --loops 1000000

Altivec is supported

Verbose mode on

Will do both scalar and vector tests

Will also do glibc tests

loops: 1000000

output file:

will do tests: memchr

test: memchr

#size   arrays  glibc                   altivec (Effective bandwidth)

7       599186  0.050 (133.5 MB/s)      0.090 (74.2 MB/s) (0.6x)

13      325000  0.120 (103.3 MB/s)      0.100 (124.0 MB/s) (1.2x)

16      262144  0.080 (190.7 MB/s)      0.100 (152.6 MB/s) (0.8x)

20      209715  0.130 (146.7 MB/s)      0.160 (119.2 MB/s) (0.8x)

27      155344  0.140 (183.9 MB/s)      0.180 (143.1 MB/s) (0.8x)

35      119837  0.160 (208.6 MB/s)      0.180 (185.4 MB/s) (0.9x)

43      97542   0.120 (341.7 MB/s)      0.210 (195.3 MB/s) (0.6x)

54      77672   0.170 (302.9 MB/s)      0.190 (271.0 MB/s) (0.9x)

64      65536   0.200 (305.2 MB/s)      0.210 (290.6 MB/s) (1.0x)

90      46603   0.250 (343.3 MB/s)      0.200 (429.2 MB/s) (1.2x)

128     32768   0.340 (359.0 MB/s)      0.220 (554.9 MB/s) (1.5x)

185     22672   0.470 (375.4 MB/s)      0.280 (630.1 MB/s) (1.7x)

256     16384   0.640 (381.5 MB/s)      0.290 (841.9 MB/s) (2.2x)

347     12087   0.850 (389.3 MB/s)      0.340 (973.3 MB/s) (2.5x)

512     8192    1.170 (417.3 MB/s)      0.350 (1395.1 MB/s) (3.3x)

831     5047    1.840 (430.7 MB/s)      0.500 (1585.0 MB/s) (3.7x)

2048    2048    4.770 (409.5 MB/s)      1.060 (1842.6 MB/s) (4.5x)

3981    1053    8.920 (425.6 MB/s)      1.890 (2008.8 MB/s) (4.7x)

8192    512     18.420 (424.1 MB/s)     3.530 (2213.2 MB/s) (5.2x)

13488   311     30.380 (423.4 MB/s)     5.960 (2158.2 MB/s) (5.1x)

16384   256     38.550 (405.3 MB/s)     7.230 (2161.1 MB/s) (5.3x)

38893   108     92.030 (403.0 MB/s)     17.510 (2118.3 MB/s) (5.3x)

65536   64      153.320 (407.6 MB/s)    33.110 (1887.6 MB/s) (4.6x)

105001  40      252.450 (396.7 MB/s)    48.970 (2044.9 MB/s) (5.2x)

262144  16      738.860 (338.4 MB/s)    205.150 (1218.6 MB/s) (3.6x)

600000  7       3137.570 (182.4 MB/s)   2000.820 (286.0 MB/s) (1.6x)

1134355 4       7318.030 (147.8 MB/s)   6049.960 (178.8 MB/s) (1.2x)

2097152 2       14181.550 (141.0 MB/s)  12319.100 (162.3 MB/s) (1.2x)

I think, that from now on i'll just post the benchmarks, with very few comments

Top

Profile

Reply with quote

markos

Post subject: Re: memchr() vectorized too

PostPosted: Sun Mar 13, 2005 7:32 pm

Offline

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348

Same results without the --norandom flag (minimize cache hits).

Code:

$ ./altivectorize -v -s -g --tests memchr --loops 1000000

Altivec is supported

Verbose mode on

Will do both scalar and vector tests

Will also do glibc tests

loops: 1000000

output file:

will do tests: memchr

test: memchr

#size   arrays  glibc                   altivec (Effective bandwidth)

7       599186  0.160 (41.7 MB/s)       0.190 (35.1 MB/s) (0.8x)

13      325000  0.230 (53.9 MB/s)       0.220 (56.4 MB/s) (1.0x)

16      262144  0.610 (25.0 MB/s)       0.560 (27.2 MB/s) (1.1x)

20      209715  0.240 (79.5 MB/s)       0.280 (68.1 MB/s) (0.9x)

27      155344  0.210 (122.6 MB/s)      0.280 (92.0 MB/s) (0.7x)

35      119837  0.270 (123.6 MB/s)      0.260 (128.4 MB/s) (1.0x)

43      97542   0.290 (141.4 MB/s)      0.290 (141.4 MB/s) (1.0x)

54      77672   0.320 (160.9 MB/s)      0.300 (171.7 MB/s) (1.1x)

64      65536   0.970 (62.9 MB/s)       1.080 (56.5 MB/s) (0.9x)

90      46603   0.370 (232.0 MB/s)      0.310 (276.9 MB/s) (1.2x)

128     32768   1.430 (85.4 MB/s)       1.370 (89.1 MB/s) (1.0x)

185     22672   0.600 (294.0 MB/s)      0.360 (490.1 MB/s) (1.7x)

256     16384   2.010 (121.5 MB/s)      1.950 (125.2 MB/s) (1.0x)

347     12087   1.110 (298.1 MB/s)      0.550 (601.7 MB/s) (2.0x)

512     8192    3.950 (123.6 MB/s)      3.100 (157.5 MB/s) (1.3x)

831     5047    2.460 (322.2 MB/s)      0.860 (921.5 MB/s) (2.9x)

2048    2048    13.710 (142.5 MB/s)     11.320 (172.5 MB/s) (1.2x)

3981    1053    9.440 (402.2 MB/s)      1.960 (1937.0 MB/s) (4.8x)

8192    512     52.950 (147.5 MB/s)     43.840 (178.2 MB/s) (1.2x)

13488   311     48.320 (266.2 MB/s)     17.400 (739.3 MB/s) (2.8x)

16384   256     106.520 (146.7 MB/s)    89.950 (173.7 MB/s) (1.2x)

38893   108     218.540 (169.7 MB/s)    159.490 (232.6 MB/s) (1.4x)

65536   64      431.160 (145.0 MB/s)    349.550 (178.8 MB/s) (1.2x)

105001  40      620.610 (161.4 MB/s)    470.980 (212.6 MB/s) (1.3x)

262144  16      1691.410 (147.8 MB/s)   1387.750 (180.1 MB/s) (1.2x)

600000  7       3827.830 (149.5 MB/s)   3122.020 (183.3 MB/s) (1.2x)

1134355 4       7697.620 (140.5 MB/s)   6415.220 (168.6 MB/s) (1.2x)

2097152 2       14585.110 (137.1 MB/s)  11960.200 (167.2 MB/s) (1.2x)

Top

Profile

Reply with quote

markos

Post subject: memrchr(), and memmove() vectorized.

PostPosted: Mon Mar 14, 2005 9:37 pm

Offline

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348

I just finished memrchr() and memmove(). The code is in the CVS.
Actually there's also memcmp(), seems to work, but it's terribly slow, I must work on it a little more to see what I'm doing wrong.

memcmp() does exist on libmotovec and i'm trying to read the code now to see how it's done there, because its performance is what's expected from altivec

. If someone could look at memcmp() and explain the process there, I'd be more than grateful. I tried to read it but as I think I said too many times, I'm not an assembly guy

Anyway, here are some benchmarks for memrchr() and memmove():

Code:

$ ./altivectorize -v -s -g --tests memrchr --norandom --loops 1000000

Altivec is supported

Verbose mode on

Will do both scalar and vector tests

Will also do glibc tests

loops: 1000000

output file:

will do tests: memrchr

test: memrchr

#size   arrays  glibc                   altivec (Effective bandwidth)

7       599186  0.050 (133.5 MB/s)      0.100 (66.8 MB/s) (0.5x)

13      325000  0.160 (77.5 MB/s)       0.100 (124.0 MB/s) (1.6x)

16      262144  0.090 (169.5 MB/s)      0.140 (109.0 MB/s) (0.6x)

20      209715  0.100 (190.7 MB/s)      0.240 (79.5 MB/s) (0.4x)

27      155344  0.110 (234.1 MB/s)      0.180 (143.1 MB/s) (0.6x)

35      119837  0.160 (208.6 MB/s)      0.190 (175.7 MB/s) (0.8x)

43      97542   0.180 (227.8 MB/s)      0.210 (195.3 MB/s) (0.9x)

54      77672   0.190 (271.0 MB/s)      0.190 (271.0 MB/s) (1.0x)

64      65536   0.200 (305.2 MB/s)      0.250 (244.1 MB/s) (0.8x)

90      46603   0.290 (296.0 MB/s)      0.230 (373.2 MB/s) (1.3x)

128     32768   0.360 (339.1 MB/s)      0.290 (420.9 MB/s) (1.2x)

185     22672   0.500 (352.9 MB/s)      0.270 (653.4 MB/s) (1.9x)

256     16384   0.660 (369.9 MB/s)      0.320 (762.9 MB/s) (2.1x)

347     12087   0.870 (380.4 MB/s)      0.310 (1067.5 MB/s) (2.8x)

512     8192    1.180 (413.8 MB/s)      0.420 (1162.6 MB/s) (2.8x)

831     5047    2.100 (377.4 MB/s)      0.600 (1320.8 MB/s) (3.5x)

2048    2048    5.090 (383.7 MB/s)      1.090 (1791.9 MB/s) (4.7x)

3981    1053    9.850 (385.4 MB/s)      1.960 (1937.0 MB/s) (5.0x)

8192    512     19.960 (391.4 MB/s)     4.330 (1804.3 MB/s) (4.6x)

13488   311     32.860 (391.5 MB/s)     6.670 (1928.5 MB/s) (4.9x)

16384   256     40.020 (390.4 MB/s)     7.880 (1982.9 MB/s) (5.1x)

38893   108     95.040 (390.3 MB/s)     20.350 (1822.7 MB/s) (4.7x)

(more...)

$ ./altivectorize -v -s -g --tests memmove --norandom --loops 1000000

Altivec is supported

Verbose mode on

Will do both scalar and vector tests

Will also do glibc tests

loops: 1000000

output file:

will do tests: memmove

#size   arrays  glibc                   altivec (Effective bandwidth)

7       599186  0.130 (51.4 MB/s)       0.120 (55.6 MB/s) (1.1x)

13      325000  0.150 (82.7 MB/s)       0.150 (82.7 MB/s) (1.0x)

16      262144  0.150 (101.7 MB/s)      0.160 (95.4 MB/s) (0.9x)

20      209715  0.160 (119.2 MB/s)      0.230 (82.9 MB/s) (0.7x)

27      155344  0.160 (160.9 MB/s)      0.170 (151.5 MB/s) (0.9x)

35      119837  0.180 (185.4 MB/s)      0.160 (208.6 MB/s) (1.1x)

43      97542   0.200 (205.0 MB/s)      0.180 (227.8 MB/s) (1.1x)

54      77672   0.210 (245.2 MB/s)      0.150 (343.3 MB/s) (1.4x)

64      65536   0.180 (339.1 MB/s)      0.220 (277.4 MB/s) (0.8x)

90      46603   0.210 (408.7 MB/s)      0.200 (429.2 MB/s) (1.1x)

128     32768   0.240 (508.6 MB/s)      0.230 (530.7 MB/s) (1.0x)

185     22672   0.270 (653.4 MB/s)      0.180 (980.2 MB/s) (1.5x)

256     16384   0.350 (697.5 MB/s)      0.280 (871.9 MB/s) (1.2x)

347     12087   0.420 (787.9 MB/s)      0.220 (1504.2 MB/s) (1.9x)

512     8192    0.560 (871.9 MB/s)      0.350 (1395.1 MB/s) (1.6x)

831     5047    0.930 (852.2 MB/s)      0.400 (1981.3 MB/s) (2.3x)

2048    2048    2.130 (917.0 MB/s)      0.960 (2034.5 MB/s) (2.2x)

3981    1053    3.770 (1007.0 MB/s)     1.050 (3615.8 MB/s) (3.6x)

8192    512     8.960 (871.9 MB/s)      3.340 (2339.1 MB/s) (2.7x)

13488   311     15.190 (846.8 MB/s)     5.410 (2377.7 MB/s) (2.8x)

16384   256     17.670 (884.3 MB/s)     6.760 (2311.4 MB/s) (2.6x)

38893   108     38.800 (956.0 MB/s)     17.900 (2072.1 MB/s) (2.2x)

(more...)

Top

Profile

Reply with quote

hobold

Post subject: Re: memchr() vectorized too

PostPosted: Tue Mar 15, 2005 3:02 am

Offline

Joined: Mon Oct 11, 2004 12:49 am
Posts: 38

I think memcmp() is tricky, because there are three varying conditions even before you compary any two chars to each other: alignments of either source and string length. I see some danger of doing way too many conditional branches.

My first attempt would be to treat both strings as unaligned and handle them the usual way (vec_lvsl(ptr), vec_ld(0, ptr), vec_ld(15, ptr), vec_perm). Compare for equality (vec_all_eq) and loop until string length is reached. Special care has to be taken for the last partial vector; it might be easiest to do that in scalar code (but that would mean no speedup whatsoever for string sizes < 16). To vectorize this, I'd use a boolean mask to AND with, setting all extraneous chars to zero. The mask can probably be created by comparing the result of vec_lvsl or vec_lvsr of the length (cast to void*) to an appropriate threshold.

If at any time the vec_all_eq is false, you need to locate the first offending char position. This is one of the few things that AltiVec has no direct support for. I suspect it might be fastest to do that in scalar code, four bytes at a time with unsigned int compares. All G4 and G5 PPC models have reasonably efficient support for unaligned scalar loads, so performance should be okay. In the worst case, four scalar comparisons have to be done to find the difference in the last quarter of a vector.

Top

Profile

Reply with quote

markos

Post subject: Re: memchr() vectorized too

PostPosted: Tue Mar 15, 2005 12:22 pm

Offline

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348

This is the memcmp() i wrote:

Code:

int memcmp_vec_aligned(const void *s1, const void *s2, size_t len) {

    unsigned char* s1first = s1;    

    unsigned char* s2first = s2;

    vector unsigned char s1vector, s2vector; 

    //printf("s1first = %x\ts2first = %x\tlen = %ld\n", s1first, s2first, len);

    while (len >= 16) {

        s1vector = vec_ld(0, s1first);

        s2vector = vec_ld(0, s2first);

        if (vec_any_ne(s1vector, s2vector)) {

            //printf("Vectors not equal, calling memcmp\n");

            return memcmp(s1first, s2first, 16);

        }

        s1first += 16;

        s2first += 16;

        len -= 16;

        //printf("s1first = %x\ts2first = %x\tlen = %ld\n", s1first, s2first, len);

    }

    if (len > 0) {

        //printf("Stuff left, calling memcmp()\n");

        return memcmp(s1first, s2first, len);

    }

    return 0;

}

int memcmp_vec(const void *s1, const void *s2, size_t len) {

    if (len <= MEMCMP_THRESHOLD) {

        return memcmp(s1, s2, len);

    }

    // for inline  char -> vector char  conversion 

    unsigned char* s1first = s1;    

    unsigned char* s2first = s2;

    unsigned char *temp1;

    int temp2, result;

    int s1offset = (int)s1first & 15;

    int s2offset = (int)s2first & 15;

    vector unsigned char s1vector, s2vector; 

    vector unsigned char MSQ, LSQ;

    vector unsigned char mask;

#ifdef DEBUG

    printf("srcfirst = %x\tsrclast = %x\n", s1first, s1first+len-1);

    printf("dstfirst = %x\tdstlast = %x\n", s2first, s2first+len-1);

    printf("s1offset = %ld\ts2offset = %ld\n", s1offset, s2offset);

#endif

    if (s1offset == 0 && s2offset == 0) {

        // Both buffers are 16-byte aligned, just compare the aligned values

        return memcmp_vec_aligned(s1first, s2first, len);

    } else {    

        // Use standard memcmp to copy the few bytes at the beginning

        result = memcmp(s1first, s2first, s1offset);

        if (result)

            return result;

        // Advance both pointers appropriately, now s1first is 16-byte aligned

        s1first += s1offset;

        s2first += s1offset;

        len -= s1offset;

        // recalculate the s2offset

        s2offset = 16 - ((int)s2first & 15);

        if (s2offset == 0) {

            return memcmp_vec_aligned(s1first, s2first, len);

        }

        while (len >= 16) {

            MSQ = vec_ld(0, s2first);            // most significant quadword

            LSQ = vec_ld(15, s2first);           // least significant quadword

            mask = vec_lvsl(0, s2first);         // create the permute mask

            if (vec_any_ne(vec_ld(0, s1first), vec_perm(MSQ, LSQ, mask)))

                return memcmp(s1first, s2first, 16);

            s1first += 16;

            s2first += 16;

            len -= 16;

        }

        if (len > 0) {

            return memcmp(s1first, s2first, len);

        }

    }

    return 0; 

}

and these are the results i got:

Code:

$ ./altivectorize -v -s -g --tests memcmp --norandom --loops 1000000

Altivec is supported

Verbose mode on

Will do both scalar and vector tests

Will also do glibc tests

loops: 1000000

output file:

will do tests: memcmp

#size   arrays  glibc                   altivec (Effective bandwidth)

     599186  0.020 (333.8 MB/s)      0.100 (66.8 MB/s) (0.2x)

    325000  0.030 (413.3 MB/s)      0.130 (95.4 MB/s) (0.2x)

    262144  0.020 (762.9 MB/s)      0.130 (117.4 MB/s) (0.2x)

    209715  0.030 (635.8 MB/s)      0.170 (112.2 MB/s) (0.2x)

    155344  0.040 (643.7 MB/s)      0.190 (135.5 MB/s) (0.2x)

    119837  0.040 (834.5 MB/s)      0.140 (238.4 MB/s) (0.3x)

    97542   0.050 (820.2 MB/s)      0.190 (215.8 MB/s) (0.3x)

    77672   0.040 (1287.5 MB/s)     0.160 (321.9 MB/s) (0.2x)

    65536   0.040 (1525.9 MB/s)     0.140 (436.0 MB/s) (0.3x)

    46603   0.060 (1430.5 MB/s)     0.200 (429.2 MB/s) (0.3x)

   32768   0.070 (1743.9 MB/s)     0.160 (762.9 MB/s) (0.4x)

   22672   0.090 (1960.3 MB/s)     0.210 (840.1 MB/s) (0.4x)

   16384   0.120 (2034.5 MB/s)     0.210 (1162.6 MB/s) (0.6x)

   12087   0.170 (1946.6 MB/s)     0.300 (1103.1 MB/s) (0.6x)

   8192    0.250 (1953.1 MB/s)     0.320 (1525.9 MB/s) (0.8x)

   5047    0.390 (2032.1 MB/s)     0.520 (1524.0 MB/s) (0.8x)

  2048    0.920 (2123.0 MB/s)     1.150 (1698.4 MB/s) (0.8x)

  1053    1.990 (1907.8 MB/s)     2.570 (1477.3 MB/s) (0.8x)

  512     3.630 (2152.2 MB/s)     5.060 (1544.0 MB/s) (0.7x)

 311     6.250 (2058.1 MB/s)     10.020 (1283.7 MB/s) (0.6x)

 256     7.940 (1967.9 MB/s)     10.630 (1469.9 MB/s) (0.7x)

Now, I really can't understand why I get such abysmal performance.

Top

Profile

Reply with quote

afleming

Post subject: Re: memchr() vectorized too

PostPosted: Thu Mar 17, 2005 9:06 am

Offline

Joined: Thu Mar 17, 2005 8:53 am
Posts: 1

I thought about the scalar memcmp a bit, and I figure that it probably unrolls the loop, and could theoretically do 16 bytes every 8 or 9 cycles. With the improvements I suggested on IRC, the unaligned case is going to look something like this in assembly:

lvx s1
lvx s2+16
permute
compare
branch
decrement len (and update CR)
increment index
branch

On a G4, this will take at least 10 cycles per iteration. More, actually, since the G4 will have to take a cycle to redirect flow on the looping branch.

This means that the scalar code will outperform the simple vector loop. You'll have to unroll the loop to beat it. Something like this:

a1 = vec_ld(0,s1)
LSQ = vec_ld(16,s2)
a2 = vec_ld(16,s1)
b2 = vec_ld(32,s2)

if (vec_any_ne(a1, vec_perm(MSQ, LSQ, mask)))
return memcmp(s1, s2, 16)

if (vec_any_ne(a2, vec_perm(LSQ, b2, mask)))
return memcmp(s1+16, s1+16, 16)

s1+=32
s2+=32
len -= 16

(loop)

Of course, this needs to be scheduled properly by the compiler, but we'll just hope it does that automatically.

Top

Profile

Reply with quote

Post new topic Reply to topic

Page 1 of 1

[ 6 posts ]

Board index » »

All times are UTC-06:00

Who is online

Users browsing this forum: No registered users and 0 guests

You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum