All times are UTC-06:00




Posted: Sat Apr 30, 2005 3:58 am

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348
Ok, I wanted to try libfreevec in a real application, both to test that it actually works and as a proof of concept. So I built MySQL v4.0.22 from source (that was the one I already had set up) and produced 3 versions of it: scalar (original), a libfreevec version (mysqld statically linked with libfreevec) and a libmotovec version (mysqld statically linked with libmotovec).

Then I cd'd into the sql-bench folder and ran ./test-insert, which is quite demanding: it runs a lot of tests (which show that the lib works as expected) and takes a long time.

Here are the results, starting with the scalar (original) version:
Code:
$ ./test-insert --user=test
Testing server 'MySQL 4.0.24_Debian 2 log' at 2005-04-30 16:51:04

Testing the speed of inserting data into 1 table and do some selects on it.
The tests are done with a table that has 100000 rows.

Generating random keys
Creating tables
Inserting 100000 rows in order
Inserting 100000 rows in reverse order
Inserting 100000 rows in random order
Time for insert (300000): 73 wallclock secs (11.13 usr 3.21 sys + 0.00 cusr 0.00 csys = 14.34 CPU)

Testing insert of duplicates
Time for insert_duplicates (100000): 19 wallclock secs ( 3.98 usr 0.93 sys + 0.00 cusr 0.00 csys = 4.91 CPU)

Retrieving data from the table
Time for select_big (10:3000000): 50 wallclock secs (36.55 usr 4.48 sys + 0.00 cusr 0.00 csys = 41.03 CPU)
Time for order_by_big_key (10:3000000): 55 wallclock secs (37.78 usr 4.86 sys + 0.00 cusr 0.00 csys = 42.64 CPU)
Time for order_by_big_key_desc (10:3000000): 54 wallclock secs (38.20 usr 4.87 sys + 0.00 cusr 0.00 csys = 43.07 CPU)
Time for order_by_big_key_prefix (10:3000000): 51 wallclock secs (37.13 usr 4.53 sys + 0.00 cusr 0.00 csys = 41.66 CPU)
Time for order_by_big_key2 (10:3000000): 51 wallclock secs (37.29 usr 4.58 sys + 0.00 cusr 0.00 csys = 41.87 CPU)
Time for order_by_big_key_diff (10:3000000): 60 wallclock secs (37.25 usr 4.59 sys + 0.00 cusr 0.00 csys = 41.84 CPU)
Time for order_by_big (10:3000000): 60 wallclock secs (37.28 usr 4.60 sys + 0.00 cusr 0.00 csys = 41.88 CPU)
Time for order_by_range (500:125750): 6 wallclock secs ( 2.07 usr 0.18 sys + 0.00 cusr 0.00 csys = 2.25 CPU)
Time for order_by_key_prefix (500:125750): 3 wallclock secs ( 1.84 usr 0.19 sys + 0.00 cusr 0.00 csys = 2.03 CPU)
Time for order_by_key2_diff (500:250500): 6 wallclock secs ( 3.39 usr 0.37 sys + 0.00 cusr 0.00 csys = 3.76 CPU)
Time for select_diff_key (500:1000): 136 wallclock secs ( 0.50 usr 0.03 sys + 0.00 cusr 0.00 csys = 0.53 CPU)
Time for select_range_prefix (5010:42084): 3 wallclock secs ( 2.69 usr 0.23 sys + 0.00 cusr 0.00 csys = 2.92 CPU)
Time for select_range_key2 (5010:42084): 3 wallclock secs ( 2.69 usr 0.22 sys + 0.00 cusr 0.00 csys = 2.91 CPU)
Time for select_key_prefix (200000): 168 wallclock secs (81.64 usr 6.14 sys + 0.00 cusr 0.00 csys = 87.78 CPU)
Time for select_key (200000): 156 wallclock secs (78.69 usr 6.10 sys + 0.00 cusr 0.00 csys = 84.79 CPU)
Time for select_key_return_key (200000): 154 wallclock secs (77.64 usr 5.40 sys + 0.00 cusr 0.00 csys = 83.04 CPU)
Time for select_key2 (200000): 169 wallclock secs (81.65 usr 6.17 sys + 0.00 cusr 0.00 csys = 87.82 CPU)
Time for select_key2_return_key (200000): 162 wallclock secs (79.40 usr 5.28 sys + 0.00 cusr 0.00 csys = 84.68 CPU)
Time for select_key2_return_prim (200000): 165 wallclock secs (80.23 usr 5.52 sys + 0.00 cusr 0.00 csys = 85.75 CPU)

Test of compares with simple ranges
Time for select_range_prefix (20000:43500): 5 wallclock secs ( 3.15 usr 0.27 sys + 0.00 cusr 0.00 csys = 3.42 CPU)
Time for select_range_key2 (20000:43500): 4 wallclock secs ( 3.24 usr 0.27 sys + 0.00 cusr 0.00 csys = 3.51 CPU)
Time for select_group (111): 4 wallclock secs ( 0.05 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.05 CPU)
Time for min_max_on_key (15000): 9 wallclock secs ( 4.93 usr 0.38 sys + 0.00 cusr 0.00 csys = 5.31 CPU)
Time for min_max (60): 9 wallclock secs ( 0.04 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.04 CPU)
Time for count_on_key (100): 32 wallclock secs ( 0.09 usr 0.01 sys + 0.00 cusr 0.00 csys = 0.10 CPU)
Time for count (100): 31 wallclock secs ( 0.08 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.08 CPU)
Time for count_distinct_big (20): 4 wallclock secs ( 0.01 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.01 CPU)

Testing update of keys with functions
Time for update_of_key (50000): 22 wallclock secs ( 2.23 usr 0.58 sys + 0.00 cusr 0.00 csys = 2.81 CPU)
Time for update_of_key_big (501): 19 wallclock secs ( 0.06 usr 0.01 sys + 0.00 cusr 0.00 csys = 0.07 CPU)

Testing update with key
Time for update_with_key (300000): 63 wallclock secs ( 9.05 usr 3.11 sys + 0.00 cusr 0.00 csys = 12.16 CPU)
Time for update_with_key_prefix (100000): 21 wallclock secs ( 6.83 usr 1.20 sys + 0.00 cusr 0.00 csys = 8.03 CPU)

Testing update of all rows
Time for update_big (10): 26 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)

Testing left outer join
Time for outer_join_on_key (10:10): 4 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)
Time for outer_join (10:10): 4 wallclock secs ( 0.01 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.01 CPU)
Time for outer_join_found (10:10): 4 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)
Time for outer_join_not_found (500:10): 3 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)

Testing SELECT ... WHERE id in (10 values)
Time for select_in (500:5000) 0 wallclock secs ( 0.21 usr 0.02 sys + 0.00 cusr 0.00 csys = 0.23 CPU)

Time for select_join_in (500:5000) 0 wallclock secs ( 0.20 usr 0.02 sys + 0.00 cusr 0.00 csys = 0.22 CPU)

Testing SELECT ... WHERE id in (100 values)
Time for select_in (500:50000) 1 wallclock secs ( 0.80 usr 0.08 sys + 0.00 cusr 0.00 csys = 0.88 CPU)

Time for select_join_in (500:50000) 1 wallclock secs ( 0.81 usr 0.08 sys + 0.00 cusr 0.00 csys = 0.89 CPU)

Testing SELECT ... WHERE id in (1000 values)
Time for select_in (500:500000) 8 wallclock secs ( 6.80 usr 0.73 sys + 0.00 cusr 0.00 csys = 7.53 CPU)

Time for select_join_in (500:500000) 8 wallclock secs ( 6.78 usr 0.71 sys + 0.00 cusr 0.00 csys = 7.49 CPU)


Testing INSERT INTO ... SELECT
Time for insert_select_1_key (1): 5 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)
Time for insert_select_2_keys (1): 6 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)
Time for drop table(2): 0 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)

Testing delete
Time for delete_key (10000): 2 wallclock secs ( 0.38 usr 0.11 sys + 0.00 cusr 0.00 csys = 0.49 CPU)
Time for delete_range (12): 11 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)

Insert into table with 16 keys and with a primary key with 16 parts
Time for insert_key (100000): 582 wallclock secs (15.50 usr 2.18 sys + 0.00 cusr 0.00 csys = 17.68 CPU)

Testing update of keys
Time for update_of_primary_key_many_keys (256): 89 wallclock secs ( 0.04 usr 0.01 sys + 0.00 cusr 0.00 csys = 0.05 CPU)

Deleting rows from the table
Time for delete_big_many_keys (128): 477 wallclock secs ( 0.03 usr 0.01 sys + 0.00 cusr 0.00 csys = 0.04 CPU)

Deleting everything from table
Time for delete_all_many_keys (1): 477 wallclock secs ( 0.03 usr 0.01 sys + 0.00 cusr 0.00 csys = 0.04 CPU)

Inserting 100000 rows with multiple values
Time for multiple_value_insert (100000): 4 wallclock secs ( 0.61 usr 0.04 sys + 0.00 cusr 0.00 csys = 0.65 CPU)

Time for drop table(1): 0 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)

Total time: 3062 wallclock secs (831.03 usr 82.31 sys + 0.00 cusr 0.00 csys = 913.34 CPU)
Results with libfreevec:
Code:
$ ./test-insert --user=test
Testing server 'MySQL 4.0.21 log/' at 2005-04-30 15:56:22

Testing the speed of inserting data into 1 table and do some selects on it.
The tests are done with a table that has 100000 rows.

Generating random keys
Creating tables
Inserting 100000 rows in order
Inserting 100000 rows in reverse order
Inserting 100000 rows in random order
Time for insert (300000): 69 wallclock secs (10.93 usr 3.05 sys + 0.00 cusr 0.00 csys = 13.98 CPU)

Testing insert of duplicates
Time for insert_duplicates (100000): 18 wallclock secs ( 3.98 usr 0.97 sys + 0.00 cusr 0.00 csys = 4.95 CPU)

Retrieving data from the table
Time for select_big (10:3000000): 51 wallclock secs (36.91 usr 4.60 sys + 0.00 cusr 0.00 csys = 41.51 CPU)
Time for order_by_big_key (10:3000000): 55 wallclock secs (38.01 usr 4.77 sys + 0.00 cusr 0.00 csys = 42.78 CPU)
Time for order_by_big_key_desc (10:3000000): 55 wallclock secs (38.40 usr 4.95 sys + 0.00 cusr 0.00 csys = 43.35 CPU)
Time for order_by_big_key_prefix (10:3000000): 52 wallclock secs (37.20 usr 4.54 sys + 0.00 cusr 0.00 csys = 41.74 CPU)
Time for order_by_big_key2 (10:3000000): 51 wallclock secs (37.51 usr 4.56 sys + 0.00 cusr 0.00 csys = 42.07 CPU)
Time for order_by_big_key_diff (10:3000000): 59 wallclock secs (37.52 usr 4.59 sys + 0.00 cusr 0.00 csys = 42.11 CPU)
Time for order_by_big (10:3000000): 61 wallclock secs (37.66 usr 4.61 sys + 0.00 cusr 0.00 csys = 42.27 CPU)
Time for order_by_range (500:125750): 7 wallclock secs ( 2.05 usr 0.22 sys + 0.00 cusr 0.00 csys = 2.27 CPU)
Time for order_by_key_prefix (500:125750): 3 wallclock secs ( 1.86 usr 0.19 sys + 0.00 cusr 0.00 csys = 2.05 CPU)
Time for order_by_key2_diff (500:250500): 5 wallclock secs ( 3.41 usr 0.35 sys + 0.00 cusr 0.00 csys = 3.76 CPU)
Time for select_diff_key (500:1000): 137 wallclock secs ( 0.49 usr 0.02 sys + 0.00 cusr 0.00 csys = 0.51 CPU)
Time for select_range_prefix (5010:42084): 4 wallclock secs ( 2.72 usr 0.21 sys + 0.00 cusr 0.00 csys = 2.93 CPU)
Time for select_range_key2 (5010:42084): 3 wallclock secs ( 2.72 usr 0.22 sys + 0.00 cusr 0.00 csys = 2.94 CPU)
Time for select_key_prefix (200000): 165 wallclock secs (80.34 usr 6.09 sys + 0.00 cusr 0.00 csys = 86.43 CPU)
Time for select_key (200000): 155 wallclock secs (78.66 usr 6.07 sys + 0.00 cusr 0.00 csys = 84.73 CPU)
Time for select_key_return_key (200000): 155 wallclock secs (77.50 usr 5.43 sys + 0.00 cusr 0.00 csys = 82.93 CPU)
Time for select_key2 (200000): 167 wallclock secs (81.49 usr 6.06 sys + 0.00 cusr 0.00 csys = 87.55 CPU)
Time for select_key2_return_key (200000): 159 wallclock secs (78.75 usr 5.25 sys + 0.00 cusr 0.00 csys = 84.00 CPU)
Time for select_key2_return_prim (200000): 165 wallclock secs (80.52 usr 5.50 sys + 0.00 cusr 0.00 csys = 86.02 CPU)

Test of compares with simple ranges
Time for select_range_prefix (20000:43500): 4 wallclock secs ( 3.22 usr 0.25 sys + 0.00 cusr 0.00 csys = 3.47 CPU)
Time for select_range_key2 (20000:43500): 4 wallclock secs ( 3.19 usr 0.25 sys + 0.00 cusr 0.00 csys = 3.44 CPU)
Time for select_group (111): 3 wallclock secs ( 0.05 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.05 CPU)
Time for min_max_on_key (15000): 10 wallclock secs ( 4.82 usr 0.32 sys + 0.00 cusr 0.00 csys = 5.14 CPU)
Time for min_max (60): 10 wallclock secs ( 0.05 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.05 CPU)
Time for count_on_key (100): 32 wallclock secs ( 0.08 usr 0.01 sys + 0.00 cusr 0.00 csys = 0.09 CPU)
Time for count (100): 32 wallclock secs ( 0.09 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.09 CPU)
Time for count_distinct_big (20): 3 wallclock secs ( 0.01 usr 0.01 sys + 0.00 cusr 0.00 csys = 0.02 CPU)

Testing update of keys with functions
Time for update_of_key (50000): 24 wallclock secs ( 2.18 usr 0.58 sys + 0.00 cusr 0.00 csys = 2.76 CPU)
Time for update_of_key_big (501): 19 wallclock secs ( 0.04 usr 0.02 sys + 0.00 cusr 0.00 csys = 0.06 CPU)

Testing update with key
Time for update_with_key (300000): 60 wallclock secs ( 8.81 usr 3.08 sys + 0.00 cusr 0.00 csys = 11.89 CPU)
Time for update_with_key_prefix (100000): 20 wallclock secs ( 6.36 usr 1.14 sys + 0.00 cusr 0.00 csys = 7.50 CPU)

Testing update of all rows
Time for update_big (10): 22 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)

Testing left outer join
Time for outer_join_on_key (10:10): 3 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)
Time for outer_join (10:10): 4 wallclock secs ( 0.01 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.01 CPU)
Time for outer_join_found (10:10): 4 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)
Time for outer_join_not_found (500:10): 3 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)

Testing SELECT ... WHERE id in (10 values)
Time for select_in (500:5000) 0 wallclock secs ( 0.21 usr 0.02 sys + 0.00 cusr 0.00 csys = 0.23 CPU)

Time for select_join_in (500:5000) 0 wallclock secs ( 0.22 usr 0.01 sys + 0.00 cusr 0.00 csys = 0.23 CPU)

Testing SELECT ... WHERE id in (100 values)
Time for select_in (500:50000) 1 wallclock secs ( 0.81 usr 0.09 sys + 0.00 cusr 0.00 csys = 0.90 CPU)

Time for select_join_in (500:50000) 1 wallclock secs ( 0.82 usr 0.08 sys + 0.00 cusr 0.00 csys = 0.90 CPU)

Testing SELECT ... WHERE id in (1000 values)
Time for select_in (500:500000) 8 wallclock secs ( 6.98 usr 0.74 sys + 0.00 cusr 0.00 csys = 7.72 CPU)

Time for select_join_in (500:500000) 8 wallclock secs ( 6.94 usr 0.72 sys + 0.00 cusr 0.00 csys = 7.66 CPU)


Testing INSERT INTO ... SELECT
Time for insert_select_1_key (1): 4 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)
Time for insert_select_2_keys (1): 6 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)
Time for drop table(2): 1 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)

Testing delete
Time for delete_key (10000): 2 wallclock secs ( 0.38 usr 0.11 sys + 0.00 cusr 0.00 csys = 0.49 CPU)
Time for delete_range (12): 10 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)

Insert into table with 16 keys and with a primary key with 16 parts
Time for insert_key (100000): 612 wallclock secs (15.53 usr 2.22 sys + 0.00 cusr 0.00 csys = 17.75 CPU)

Testing update of keys
Time for update_of_primary_key_many_keys (256): 79 wallclock secs ( 0.04 usr 0.01 sys + 0.00 cusr 0.00 csys = 0.05 CPU)

Deleting rows from the table
Time for delete_big_many_keys (128): 427 wallclock secs ( 0.03 usr 0.01 sys + 0.00 cusr 0.00 csys = 0.04 CPU)

Deleting everything from table
Time for delete_all_many_keys (1): 427 wallclock secs ( 0.03 usr 0.01 sys + 0.00 cusr 0.00 csys = 0.04 CPU)

Inserting 100000 rows with multiple values
Time for multiple_value_insert (100000): 4 wallclock secs ( 0.60 usr 0.02 sys + 0.00 cusr 0.00 csys = 0.62 CPU)

Time for drop table(1): 0 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)

Total time: 3017 wallclock secs (830.19 usr 81.95 sys + 0.00 cusr 0.00 csys = 912.14 CPU)
And the libmotovec results:
Code:
$ ./test-insert --user=test
Testing server 'MySQL 4.0.21 log/' at 2005-04-30 17:48:36

Testing the speed of inserting data into 1 table and do some selects on it.
The tests are done with a table that has 100000 rows.

Generating random keys
Creating tables
Inserting 100000 rows in order
Inserting 100000 rows in reverse order
Inserting 100000 rows in random order
Time for insert (300000): 72 wallclock secs (11.64 usr 3.07 sys + 0.00 cusr 0.00 csys = 14.71 CPU)

Testing insert of duplicates
Time for insert_duplicates (100000): 20 wallclock secs ( 4.08 usr 0.98 sys + 0.00 cusr 0.00 csys = 5.06 CPU)

Retrieving data from the table
Time for select_big (10:3000000): 50 wallclock secs (36.69 usr 4.56 sys + 0.00 cusr 0.00 csys = 41.25 CPU)
Time for order_by_big_key (10:3000000): 55 wallclock secs (37.90 usr 4.92 sys + 0.00 cusr 0.00 csys = 42.82 CPU)
Time for order_by_big_key_desc (10:3000000): 54 wallclock secs (38.28 usr 4.85 sys + 0.00 cusr 0.00 csys = 43.13 CPU)
Time for order_by_big_key_prefix (10:3000000): 52 wallclock secs (37.13 usr 4.52 sys + 0.00 cusr 0.00 csys = 41.65 CPU)
Time for order_by_big_key2 (10:3000000): 51 wallclock secs (37.39 usr 4.64 sys + 0.00 cusr 0.00 csys = 42.03 CPU)
Time for order_by_big_key_diff (10:3000000): 58 wallclock secs (37.29 usr 4.58 sys + 0.00 cusr 0.00 csys = 41.87 CPU)
Time for order_by_big (10:3000000): 60 wallclock secs (37.50 usr 4.56 sys + 0.00 cusr 0.00 csys = 42.06 CPU)
Time for order_by_range (500:125750): 7 wallclock secs ( 2.03 usr 0.22 sys + 0.00 cusr 0.00 csys = 2.25 CPU)
Time for order_by_key_prefix (500:125750): 3 wallclock secs ( 1.87 usr 0.19 sys + 0.00 cusr 0.00 csys = 2.06 CPU)
Time for order_by_key2_diff (500:250500): 5 wallclock secs ( 3.40 usr 0.36 sys + 0.00 cusr 0.00 csys = 3.76 CPU)
Time for select_diff_key (500:1000): 128 wallclock secs ( 0.49 usr 0.04 sys + 0.00 cusr 0.00 csys = 0.53 CPU)
Time for select_range_prefix (5010:42084): 3 wallclock secs ( 2.67 usr 0.22 sys + 0.00 cusr 0.00 csys = 2.89 CPU)
Time for select_range_key2 (5010:42084): 3 wallclock secs ( 2.62 usr 0.24 sys + 0.00 cusr 0.00 csys = 2.86 CPU)
Time for select_key_prefix (200000): 165 wallclock secs (78.24 usr 6.26 sys + 0.00 cusr 0.00 csys = 84.50 CPU)
Time for select_key (200000): 153 wallclock secs (76.70 usr 6.22 sys + 0.00 cusr 0.00 csys = 82.92 CPU)
Time for select_key_return_key (200000): 154 wallclock secs (75.57 usr 5.56 sys + 0.00 cusr 0.00 csys = 81.13 CPU)
Time for select_key2 (200000): 162 wallclock secs (77.81 usr 6.30 sys + 0.00 cusr 0.00 csys = 84.11 CPU)
Time for select_key2_return_key (200000): 156 wallclock secs (74.02 usr 5.30 sys + 0.00 cusr 0.00 csys = 79.32 CPU)
Time for select_key2_return_prim (200000): 158 wallclock secs (75.68 usr 5.72 sys + 0.00 cusr 0.00 csys = 81.40 CPU)

Test of compares with simple ranges
Time for select_range_prefix (20000:43500): 4 wallclock secs ( 3.12 usr 0.26 sys + 0.00 cusr 0.00 csys = 3.38 CPU)
Time for select_range_key2 (20000:43500): 4 wallclock secs ( 3.08 usr 0.26 sys + 0.00 cusr 0.00 csys = 3.34 CPU)
Time for select_group (111): 4 wallclock secs ( 0.05 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.05 CPU)
Time for min_max_on_key (15000): 8 wallclock secs ( 4.59 usr 0.36 sys + 0.00 cusr 0.00 csys = 4.95 CPU)
Time for min_max (60): 8 wallclock secs ( 0.04 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.04 CPU)
Time for count_on_key (100): 32 wallclock secs ( 0.09 usr 0.01 sys + 0.00 cusr 0.00 csys = 0.10 CPU)
Time for count (100): 29 wallclock secs ( 0.08 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.08 CPU)
Time for count_distinct_big (20): 3 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)

Testing update of keys with functions
Time for update_of_key (50000): 22 wallclock secs ( 2.34 usr 0.59 sys + 0.00 cusr 0.00 csys = 2.93 CPU)
Time for update_of_key_big (501): 19 wallclock secs ( 0.06 usr 0.01 sys + 0.00 cusr 0.00 csys = 0.07 CPU)

Testing update with key
Time for update_with_key (300000): 62 wallclock secs ( 9.35 usr 3.10 sys + 0.00 cusr 0.00 csys = 12.45 CPU)
Time for update_with_key_prefix (100000): 21 wallclock secs ( 6.65 usr 1.19 sys + 0.00 cusr 0.00 csys = 7.84 CPU)

Testing update of all rows
Time for update_big (10): 23 wallclock secs ( 0.01 usr 0.01 sys + 0.00 cusr 0.00 csys = 0.02 CPU)

Testing left outer join
Time for outer_join_on_key (10:10): 3 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)
Time for outer_join (10:10): 4 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)
Time for outer_join_found (10:10): 4 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)
Time for outer_join_not_found (500:10): 3 wallclock secs ( 0.01 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.01 CPU)

Testing SELECT ... WHERE id in (10 values)
Time for select_in (500:5000) 0 wallclock secs ( 0.20 usr 0.01 sys + 0.00 cusr 0.00 csys = 0.21 CPU)

Time for select_join_in (500:5000) 0 wallclock secs ( 0.20 usr 0.02 sys + 0.00 cusr 0.00 csys = 0.22 CPU)

Testing SELECT ... WHERE id in (100 values)
Time for select_in (500:50000) 2 wallclock secs ( 0.81 usr 0.08 sys + 0.00 cusr 0.00 csys = 0.89 CPU)

Time for select_join_in (500:50000) 1 wallclock secs ( 0.80 usr 0.08 sys + 0.00 cusr 0.00 csys = 0.88 CPU)

Testing SELECT ... WHERE id in (1000 values)
Time for select_in (500:500000) 8 wallclock secs ( 6.84 usr 0.71 sys + 0.00 cusr 0.00 csys = 7.55 CPU)

Time for select_join_in (500:500000) 8 wallclock secs ( 6.81 usr 0.70 sys + 0.00 cusr 0.00 csys = 7.51 CPU)


Testing INSERT INTO ... SELECT
Time for insert_select_1_key (1): 4 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)
Time for insert_select_2_keys (1): 6 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)
Time for drop table(2): 0 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)

Testing delete
Time for delete_key (10000): 2 wallclock secs ( 0.36 usr 0.11 sys + 0.00 cusr 0.00 csys = 0.47 CPU)
Time for delete_range (12): 9 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)

Insert into table with 16 keys and with a primary key with 16 parts
Time for insert_key (100000): 630 wallclock secs (16.23 usr 2.29 sys + 0.00 cusr 0.00 csys = 18.52 CPU)

Testing update of keys
Time for update_of_primary_key_many_keys (256): 78 wallclock secs ( 0.04 usr 0.01 sys + 0.00 cusr 0.00 csys = 0.05 CPU)

Deleting rows from the table
Time for delete_big_many_keys (128): 470 wallclock secs ( 0.03 usr 0.01 sys + 0.00 cusr 0.00 csys = 0.04 CPU)

Deleting everything from table
Time for delete_all_many_keys (1): 470 wallclock secs ( 0.03 usr 0.01 sys + 0.00 cusr 0.00 csys = 0.04 CPU)

Inserting 100000 rows with multiple values
Time for multiple_value_insert (100000): 4 wallclock secs ( 0.61 usr 0.03 sys + 0.00 cusr 0.00 csys = 0.64 CPU)

Time for drop table(1): 0 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)

Total time: 3045 wallclock secs (811.47 usr 83.16 sys + 0.00 cusr 0.00 csys = 894.63 CPU)
Now, I expected libfreevec to be faster than the original, but not by much. Why? The improvement comes strictly from 2 functions, memcpy and memset (mostly memcpy). After inspecting the MySQL code, I found that it calls memcpy() many times but with very small sizes, so nearly all of the improvement comes from the optimized scalar code and very little from the Altivec code. Still, it's better than nothing.
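
To put the three totals side by side, here is a small sketch that turns the "Total time" lines above into relative gains over the first run. It assumes that the first, unlabeled run is the scalar build; the names totals and gain_percent are made up for this illustration.

```python
# Totals copied from the "Total time" lines of the three runs above:
# (wallclock seconds, CPU seconds).
# Assumption: the first, unlabeled run is the scalar (original) build.
totals = {
    "scalar": (3062, 913.34),
    "libfreevec": (3017, 912.14),
    "libmotovec": (3045, 894.63),
}


def gain_percent(baseline, value):
    """Relative improvement over the baseline, in percent (positive = faster)."""
    return 100.0 * (baseline - value) / baseline


base_wall, base_cpu = totals["scalar"]
for name, (wall, cpu) in totals.items():
    print(f"{name:11s} wallclock {gain_percent(base_wall, wall):5.2f}%  "
          f"CPU {gain_percent(base_cpu, cpu):5.2f}%")
```

On wallclock time both libraries end up within about 2% of the first run. These are single runs, so part of that difference may well be noise.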

I intend to do more tests, with more functions vectorized, and will post the results as usual.


Posted: Sat Apr 30, 2005 4:05 am

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348
... Mind you, these are just proof-of-concept results. The speed increase is minimal, but it's just one function, and it proves that libfreevec can be used for real work. As more and more functions get in (esp. the str* functions, of which there are many), the performance gain will go up.

That.


Posted: Thu May 05, 2005 5:05 am

Joined: Tue Dec 14, 2004 3:52 am
Posts: 42
Location: Italy - from Mexico
That's very good news! And you're a hard-working coder, no doubt... Keep it up!


Posted: Sun May 08, 2005 11:59 pm

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348
Ok, I was curious why most of my benchmarks of memcpy() showed a significant speed increase (I'm focusing on memcpy() because it is the most common routine and most of the others just copy its mechanisms), yet when I linked it with MySQL I got very little speed gain, if any, from that particular function. So I did a custom profile of memcpy() use in MySQL. On every call to memcpy() I wrote the following to a file:

* the result of __builtin_return_address(0) (will explain below)
* src word-alignment (src % sizeof(uint32_t))
* src quad-word-alignment (src % 16)
* dst word-alignment (dst % sizeof(uint32_t))
* dst quad-word-alignment (dst % 16)
* length (in bytes)
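
As a rough illustration of what one such record contains, the following sketch computes the same fields from plain integer addresses. It is not the instrumentation used for the profile (that has to live inside the C memcpy() itself); profile_record and the example addresses are invented for this note.

```python
WORD = 4    # sizeof(uint32_t)
QUAD = 16   # Altivec vector size in bytes


def profile_record(ret_addr, src, dst, length):
    """Build one profile record from integer addresses (hypothetical helper)."""
    return (
        ret_addr,      # stands in for __builtin_return_address(0)
        src % WORD,    # src word-alignment
        src % QUAD,    # src quad-word-alignment
        dst % WORD,    # dst word-alignment
        dst % QUAD,    # dst quad-word-alignment
        length,        # length in bytes
    )


# Made-up addresses: src is 3 bytes past a 16-byte boundary, dst is aligned.
print(profile_record(0x10411FA0, 0x30001003, 0x30002010, 7))
```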

The actual datafiles can be found here:

http://people.debian.org/~markos/powerpc/mysql-profile/

I ran some simple scripts: munge-stats.pl (provided by a friend of mine, Pantelis Antoniou, who gave me some good advice on optimizing) and stats.pl, which is based on the first one but presents results of a different kind.

I ran the scripts on a sample of 10M (10^7) calls to memcpy. The original set is ~100M calls, but it's a 2GB text file and I ran out of memory while processing it :-)
Code:
Same word-alignment stats:
dst4 -> count
0 -> 4180106
1 -> 191
2 -> 50575
3 -> 173

Different word-alignment stats:
dst4 -> count
0 -> 3692531
1 -> 468329
2 -> 979331
3 -> 628764

Same 16b-alignment stats:
dst16 -> count
0 -> 1385930
1 -> 21
2 -> 8136
3 -> 3
4 -> 21
5 -> 9
6 -> 1032
7 -> 3
8 -> 279855
9 -> 17
10 -> 8483
11 -> 31
12 -> 60
13 -> 0
14 -> 165
15 -> 3

Different 16b-alignment stats:
dst16 -> count
0 -> 3458045
1 -> 2697
2 -> 220950
3 -> 1259
4 -> 301838
5 -> 85562
6 -> 441238
7 -> 20405
8 -> 2443586
9 -> 1595
10 -> 258891

memcpy len stats:
len -> count
0 -> 13490
1 -> 230922
2 -> 377697
3 -> 759875
4 -> 359972
5 -> 1264863
6 -> 1461654
7 -> 377148
8 -> 930741
9 -> 167
10 -> 600106
11 -> 305529
12 -> 1018399
13 -> 57
14 -> 90
15 -> 9315
16 -> 1457
17 -> 123
18 -> 107
19 -> 300061
20 -> 69
21 -> 48
22 -> 55
23 -> 29
24 -> 933
25 -> 66
26 -> 42
27 -> 63
28 -> 44
29 -> 20
30 -> 59
31 -> 27
32 -> 1251
33 -> 145
34 -> 8
35 -> 598
36 -> 2023
37 -> 5903
38 -> 39
39 -> 58316
40 -> 754005
41 -> 70912
42 -> 5
43 -> 394837
44 -> 8
45 -> 10
46 -> 10
47 -> 12
48 -> 540
49 -> 3
50 -> 9
51 -> 6
52 -> 4
53 -> 1
55 -> 4
56 -> 151
57 -> 5
58 -> 15
59 -> 1
60 -> 336
61 -> 12
62 -> 535
63 -> 13
64 -> 5512
65 -> 5
66 -> 52965
67 -> 4
68 -> 17968
69 -> 182
70 -> 12
72 -> 3746
73 -> 1
74 -> 4
75 -> 18004
76 -> 4
78 -> 180068
80 -> 1599
81 -> 400000
84 -> 194
88 -> 72
91 -> 2
93 -> 5
94 -> 2
96 -> 271
102 -> 2
104 -> 82
108 -> 180
110 -> 1
112 -> 86
114 -> 1
120 -> 261
123 -> 3
124 -> 2
127 -> 2
128 -> 221
131 -> 1
132 -> 189
135 -> 2
136 -> 95
137 -> 4
138 -> 3
140 -> 12
144 -> 308
146 -> 1
152 -> 145
153 -> 49
156 -> 1868
157 -> 22
160 -> 1162
168 -> 1701
172 -> 2
174 -> 2
179 -> 11
180 -> 4
189 -> 1
191 -> 1
195 -> 4
196 -> 2
197 -> 4
228 -> 1
240 -> 11
243 -> 1
245 -> 1
256 -> 1
263 -> 4
274 -> 4
289 -> 4
318 -> 48
322 -> 27
324 -> 1464
328 -> 959
333 -> 42
336 -> 2405
413 -> 2
480 -> 2
492 -> 1
495 -> 1
512 -> 34
581 -> 2
602 -> 2
663 -> 117
672 -> 4828
680 -> 29
685 -> 4
16384 -> 7
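
A statement like "N% of the calls are below X bytes" can be read off a histogram like the one above with a few lines of code. This is only a sketch: fraction_below is a made-up helper, and the sample dictionary holds a handful of rows from the table, not the whole thing.

```python
def fraction_below(hist, limit):
    """Share of calls whose length is strictly below `limit`.

    `hist` maps length in bytes -> number of calls, like the table above.
    """
    total = sum(hist.values())
    if total == 0:
        return 0.0
    small = sum(count for length, count in hist.items() if length < limit)
    return small / total


# A few rows copied from the table above, NOT the full histogram, so the
# printed share only describes this subset.
sample = {5: 1264863, 6: 1461654, 12: 1018399, 81: 400000, 672: 4828, 16384: 7}
print(f"{100 * fraction_below(sample, 100):.2f}% of the sampled rows are below 100 bytes")
```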
And the results from munge-stats.pl. The fields represent:
* return address
* number of calls from this address
* std deviation of the size; if it is 0, the number in parentheses is the constant size used by those calls.
Code:
key nr std-dev
0x101b3e18 1 0 (31) *
0x101b3f1c 1 0 (50) *
0x101ba0cc 1 0 (50) *
0x101e4d0c 1 0 (256) *
0x102151a4 1 0 (36) *
0x10215448 1 0 (4) *
0x102cc250 1 0 (512) *
0x102e3d2c 1 0 (16384) *
0x102f5b04 1 0 (512) *
0x1032e560 1 0 (46) *
0x1032f88c 1 0 (32) *
0x1032fba0 1 0 (32) *
0x103dc158 1 0 (16) *
0x1040a608 1 0 (512) *
0x1040ed14 1 0 (20) *
0x10149890 2 0 (1) *
0x10254298 2 3.60555127546399
0x102ee778 2 0 (512) *
0x1030f1fc 2 2.82842712474619
0x1033e1a0 2 3.60555127546399
0x1014bfd8 3 0 (12) *
0x1030e838 3 2.23606797749979
0x103e03b4 3 0 (3) *
0x103e053c 3 2.23606797749979
0xfea2b64 3 0 (128) *
0x102587d0 4 0 (6) *
0x10411328 4 0 (7) *
0x1041ca54 4 0 (80) *
0x1041cbf4 4 14
0xf883460 4 0 (36) *
0xfeef598 4 0 (15) *
0xff47028 4 0 (7) *
0xff5c7ac 4 0 (180) *
0x1025a2a0 5 3.16227766016838
0x10358358 6 16921.3224365
0x103dcd1c 6 0 (16) *
0x101525a0 8 0 (128) *
0x102576cc 8 4
0xff610f8 10 1.73205080756888
0x1013f3bc 11 0 (240) *
0x10402a88 11 9.4339811320566
0x104302d4 12 67.3424086293325
0xfc0896c 12 0 (140) *
0xfee3af8 12 321.158839205774
0xfc06264 15 0 (128) *
0x103dc0c8 17 4.69041575982343
0x10414ae8 18 2
0xfed775c 19 21.3072757526625
0xfea2ac4 24 0 (128) *
0x10189474 29 105.047608254543
0x10189498 29 0 (0) *
0x1018bbdc 29 4.35889894354067
0x101e1f28 29 5.3851648071345
0x103cc330 29 10.0498756211209
0x103cc4f4 29 0 (512) *
0x103ccbd0 29 0 (680) *
0x103ccbe8 29 5.3851648071345
0x103ccbfc 29 3.87298334620742
0x103ccc14 29 2
0xfeb4594 30 8.94427190999916
0x10431bc8 34 1
0x1043206c 34 4.58257569495584
0x101cef8c 36 4.24264068711928
0x1018a2bc 57 155.692003648229
0x10431bfc 83 3.87298334620742
0x1034f7c8 90 2.82842712474619
0xfc08780 95 0 (128) *
0x103dc238 132 0 (1) *
0x1034ea30 152 0 (7) *
0xff183b0 155 7.48331477354788
0x104111f8 177 12.2474487139159
0x10419794 180 14.142135623731
0x10411640 249 6.40312423743285
0x10443068 311 1
0x10442a38 418 2
0x10416d80 438 0 (8) *
0xfed7a20 557 23.3452350598575
0x10148554 945 8.24621125123532
0x103e06c4 1964 1.73205080756888
0x103d49cc 1967 1.73205080756888
0x103e02d0 1967 18.9208879284245
0x103e0b10 4945 646.069655687372
0x103e0b24 4945 13.3790881602597
0x103e0b58 4945 330.607017469382
0x103e0b68 4945 13.3790881602597
0x103e0b7c 4945 13.3790881602597
0x103e0b90 4945 13.3790881602597
0x103e0a10 11098 10.3923048454133
0x103e0a20 11098 10.3923048454133
0x103e0a30 11098 85.9825563704639
0x103e08e8 13262 10.0498756211209
0x103e0908 13262 87.4356906531881
0x103e091c 13262 10.0498756211209
0x10411f24 77062 39.2683078321437
0x1014d74c 153712 5.09901951359278
0x10432d54 153728 5.09901951359278
0x101344d0 376855 8.88819441731559
0x101a5810 376855 43
0x103d69b0 376855 39.9874980462644
0x100bdc5c 376856 8.88819441731559
0x103d68f8 376882 40
0x10167914 376885 73.5934779718964
0x1041201c 376904 7
0x103e90bc 606919 10.0995049383621
0x103d4ac0 608879 10.1980390271856
0x103d4828 678822 10.4403065089106
0x10147760 754516 13.8924439894498
0x1040ff78 1200011 41.5090351610345
0x10411fa0 3014882 4.69041575982343
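
For readers who want to reproduce this kind of per-call-site summary, here is a minimal sketch. It is not munge-stats.pl: the formula that script uses is not shown in this thread, so the sketch assumes the population standard deviation and its numbers need not match the column above. size_stats and the sample records are invented.

```python
from collections import defaultdict
from statistics import pstdev


def size_stats(records):
    """Group (return_address, length) pairs by call site.

    Returns {return_address: (calls, std_dev, constant_size_or_None)}.
    Assumption: population standard deviation; the formula used by
    munge-stats.pl is not shown in the thread, so values may differ.
    """
    sizes = defaultdict(list)
    for ret_addr, length in records:
        sizes[ret_addr].append(length)

    stats = {}
    for ret_addr, lengths in sizes.items():
        dev = pstdev(lengths)
        constant = lengths[0] if dev == 0 else None
        stats[ret_addr] = (len(lengths), dev, constant)
    return stats


# Invented records: one call site that always copies 8 bytes, one that varies.
records = [(0x1000, 8), (0x1000, 8), (0x2000, 4), (0x2000, 8)]
for addr, (calls, dev, constant) in sorted(size_stats(records).items()):
    mark = f"({constant}) *" if constant is not None else ""
    print(f"{addr:#x} {calls} {dev:g} {mark}")
```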
So, these results show why the Altivec version of memcpy() does not really benefit MySQL: the sizes used are too small. Still, I see some other information here that might help and still allow me to get some optimizations:

* Most calls were made with aligned addresses. Specifically, the majority of calls had the destination pointer (dst) aligned to either a word boundary (4 bytes) or a quad-word boundary (16 bytes), the latter being what Altivec uses.
* Also, lots of calls were made with very small sizes, and very few calls went as high as 16384 bytes (the largest size seen in this sample). 99% of the calls were below 100 bytes, which tells me that I could use this to my benefit by writing optimized code that offers max speed for these cases.
* For the munge-stats.pl results, I need to explain something first. As Pantelis told me, each distinct __builtin_return_address value corresponds to a different call site of memcpy(). The last column is the standard deviation of the size used in the calls from that site. If std-dev is 0, the call was always made with a constant size. Such constant-size calls could be handled by gcc itself, by inlining faster code that copies preset sizes (e.g. small sizes and/or sizes % 16 == 0). This was an important observation, and one that might be usable as a generic optimization in gcc. It is something I will try to look into in the future, unless someone beats me to it :-)

So my plan is to rewrite the code to take advantage of these results. Namely, I am considering doing the following:
* Assume word alignment and do the copy as usual. Use Altivec if the size permits.
* If the alignment was not on a 4-byte boundary, then copy the initial remaining bytes; otherwise return.
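
The plan above is only an outline, so the following is not the libfreevec code. It is a sketch of the usual head/words/tail split that such a copy routine relies on, looking only at the destination address and ignoring both the source alignment and the Altivec path. split_copy is a hypothetical name.

```python
WORD = 4  # bytes per 32-bit word


def split_copy(dst, length, word=WORD):
    """Split a copy of `length` bytes to address `dst` into three parts.

    Returns (head_bytes, whole_words, tail_bytes):
      head_bytes  - single bytes copied until dst reaches a word boundary
      whole_words - word-sized copies once dst is aligned
      tail_bytes  - leftover single bytes at the end
    """
    head = min((-dst) % word, length)
    rest = length - head
    return head, rest // word, rest % word


# Aligned destination: no head bytes, the common case in the stats above.
print(split_copy(0x1000, 10))
# Destination 1 byte past a word boundary: 3 head bytes first.
print(split_copy(0x1001, 10))
```

With sizes as small as the ones in the profile, the head and tail parts can easily make up most of a call, which is one reason why the fast path for aligned, tiny copies matters more here than the vector loop.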

Now, this might or might not offer any improvement; it remains to be seen. Of course, these conclusions apply to most of the other glibc functions I'm working on (mem*, etc).

Last, I tried to do some profiling of MySQL with gprof to actually measure the time that memcpy takes overall, but in none of the gprof runs I tried could I get any glibc function to show up in the summary. Any gprof gurus around?

So, conclusion? MySQL won't benefit much from vectorization/optimization of the glibc functions. It will benefit, but not much. As people told me, MySQL is mostly I/O bound. Given these results I don't expect more than 10-20% max. It will probably benefit more from the aforementioned gcc optimizations, or from vectorization of the hashing techniques used in MySQL itself. But this is another topic, which I will discuss in another post.

What does this mean? I need to find some other memory-bound application that will show that the Altivec optimizations in glibc are worth the effort. Any ideas?

Konstantinos


PowerDeveloper.org: Copyright © 2004-2012, Genesi USA, Inc. The Power Architecture and Power.org wordmarks and the Power and Power.org logos and related marks are trademarks and service marks licensed by Power.org.