PostPosted: Sat Feb 09, 2008 5:53 am 

Hi. For those of you who don't know about it, SIMDx86 is a library of matrix/vector/image/plane/quaternion/etc. functions with plain C, SSE, 3DNow!, and (from now on) AltiVec optimized routines, offering a nice common API and high performance on all the platforms that support it. I've sent the patch against version 0.4 to the upstream developer, and I'll keep sending patches as I finish the optimizations (some routines haven't been implemented yet).
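To give a feel for the API, here is a minimal sketch. SIMDx86Matrix_VectorMultiplyOf is quoted verbatim later in this thread, but the header name and the struct layouts here are my assumptions, so check the tarball below for the real definitions:
Code:
/* Sketch only: the header name, struct layouts, and
   SIMDx86Matrix_ToIdentity are inferred from the naming pattern;
   see the tarball for the real definitions. */
#include <SIMDx86/matrix.h>   /* assumed header name */

int main(void)
{
    SIMDx86Matrix mat;                              /* float m[16], 16-byte aligned */
    SIMDx86Vector in = { 1.0f, 2.0f, 3.0f, 1.0f };  /* assumed x, y, z, w layout */
    SIMDx86Vector out;

    SIMDx86Matrix_ToIdentity(&mat);                  /* one of the routines benchmarked below */
    SIMDx86Matrix_VectorMultiplyOf(&out, &in, &mat); /* out = in * mat, with w = 1.0f */
    return 0;
}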

Why SIMDx86? Well, it was easy to do and makes a nice proof of concept, and it includes pretty much the same routines as Mesa, Ogre3D, ODE, Blender, and other 3D software. That means my next step is optimizing most or all of the aforementioned software (Mesa and Blender are top priority).

Some benchmarks of matrix/vector operations (Google Charts graphs will follow shortly):
Code:
vector:
altivec: time Cross(): 0.260000 sec
scalar : time Cross(): 0.380000 sec
altivec: time CrossOf(): 0.210000 sec
scalar : time CrossOf(): 0.240000 sec
altivec: time Diff(): 0.160000 sec
scalar : time Diff(): 0.330000 sec
altivec: time DiffOf(): 0.120000 sec
scalar : time DiffOf(): 0.200000 sec
scalar : time Dot(): 0.160000 sec
altivec: time Dot(): 0.270000 sec
scalar : time Dot4(): 0.200000 sec
altivec: time Dot4(): 0.280000 sec
altivec: time Length(): 0.670000 sec
scalar : time Length(): 2.230000 sec
scalar : time LengthSq(): 0.160000 sec
altivec: time LengthSq(): 0.280000 sec
altivec: time Normalize(): 0.470000 sec
scalar : time Normalize(): 2.680000 sec
altivec: time NormalizeOf(): 0.410000 sec
scalar : time NormalizeOf(): 2.780000 sec
scalar : time Scale(): 0.270000 sec
altivec: time Scale(): 0.490000 sec
scalar : time ScaleOf(): 0.160000 sec
altivec: time ScaleOf(): 0.440000 sec
altivec: time Sum(): 0.210000 sec
scalar : time Sum(): 0.410000 sec
altivec: time SumOf(): 0.140000 sec
scalar : time SumOf(): 0.220000 sec

matrix:
scalar : time Determinant(): 0.520000 sec
altivec: time Determinant(): 0.610000 sec
altivec: time Diff(): 0.310000 sec
scalar : time Diff(): 0.870000 sec
altivec: time DiffOf(): 0.220000 sec
scalar : time DiffOf(): 0.830000 sec
altivec: time Multiply(): 0.400000 sec
scalar : time Multiply(): 4.140000 sec
altivec: time MultiplyOf(): 0.370000 sec
scalar : time MultiplyOf(): 2.620000 sec
altivec: time Scale(): 0.490000 sec
scalar : time Scale(): 0.860000 sec
altivec: time ScaleOf(): 0.520000 sec
scalar : time ScaleOf(): 0.740000 sec
altivec: time Sum(): 0.360000 sec
scalar : time Sum(): 1.120000 sec
altivec: time SumOf(): 0.210000 sec
scalar : time SumOf(): 0.820000 sec
altivec: time ToIdentity(): 0.130000 sec
scalar : time ToIdentity(): 0.700000 sec
altivec: time ToTranslate(): 0.280000 sec
scalar : time ToTranslate(): 0.900000 sec
altivec: time ToTranslateOf(): 0.120000 sec
scalar : time ToTranslateOf(): 0.770000 sec
altivec: time Transpose(): 0.180000 sec
scalar : time Transpose(): 1.840000 sec
altivec: time TransposeOf(): 0.200000 sec
scalar : time TransposeOf(): 0.550000 sec
altivec: time Vector4Multiply(): 0.240000 sec
scalar : time Vector4Multiply(): 0.660000 sec
altivec: time Vector4MultiplyOf(): 0.210000 sec
scalar : time Vector4MultiplyOf(): 0.670000 sec
altivec: time VectorMultiply(): 0.240000 sec
scalar : time VectorMultiply(): 0.570000 sec
altivec: time VectorMultiplyOf(): 0.260000 sec
scalar : time VectorMultiplyOf(): 0.510000 sec
Numbers are in seconds; smaller is better :-D

(I'm not happy with the Determinant function; I must have done something terribly wrong for it to be slower than the original.)
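My current guess, and it is only a guess: AltiVec has no single horizontal-add instruction, so reductions like Dot() above, and the cofactor sums inside Determinant(), have to collapse four lanes with a shuffle/add ladder plus a store, which eats most of the SIMD advantage. A minimal sketch of what I mean:
Code:
#include <altivec.h>

/* Sketch only: sum the four lanes of v into one scalar. */
static float horizontal_sum(vector float v)
{
    float out[4] __attribute__((aligned(16)));
    v = vec_add(v, vec_sld(v, v, 8)); /* lanes become v0+v2, v1+v3, v2+v0, v3+v1 */
    v = vec_add(v, vec_sld(v, v, 4)); /* every lane now holds v0+v1+v2+v3 */
    vec_st(v, 0, out);                /* no direct vector-to-scalar move */
    return out[0];
}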

For those of you who would like to play with it, the updated code and patch are at:

http://www.freevec.org/old/SIMDx86.tgz
http://www.freevec.org/old/SIMDx86-altivec+.patch.gz

Enjoy

Konstantinos


Last edited by markos on Thu Mar 20, 2008 9:10 am, edited 1 time in total.

PostPosted: Sat Feb 09, 2008 6:01 am 
That is great work, Konstantinos. You are helping make the EFIKA8610 even more of a possibility. Super!

R&B :)

_________________
http://bbrv.blogspot.com


PostPosted: Tue Feb 12, 2008 11:06 am 
Quote:
Numbers are in seconds; smaller is better :-D

(I'm not happy with the Determinant function; I must have done something terribly wrong for it to be slower than the original.)
I looked at your code and noticed a few things that could be improved; they might be real issues or might just be paranoia on my part, but here goes.

Most of the matrix functions (VectorMultiplyOf etc.) work by issuing several vec_madd() operations back to back, but each one is dependent on the result of the last.
Code:
void SIMDx86Matrix_VectorMultiplyOf(SIMDx86Vector* pOut, const SIMDx86Vector* pIn, const SIMDx86Matrix* pMat)
{
    /*
     * Does a normal vector x matrix, but since w = 1.0f, the final
     * column of the matrix is merely added.
     */

    // Load SIMDx86 matrix and vector
    vector float vin1 = vec_ld(0, &pMat->m[0]);
    vector float vin2 = vec_ld(16, &pMat->m[0]);
    vector float vin3 = vec_ld(32, &pMat->m[0]);
    vector float vin4 = vec_ld(48, &pMat->m[0]);
    vector float vvec = vec_ld(0, (float *)pIn);
    vector float vvec1 = vec_splat(vvec, 0);
    vector float vvec2 = vec_splat(vvec, 1);
    vector float vvec3 = vec_splat(vvec, 2);
    vector float v0 = (vector float) vec_splat_u32(0);
    vector float vres1, vres2, vres3, vres;

    // Do the vector x matrix multiplication
    vres1 = vec_madd(vin1, vvec1, v0);
    vres2 = vec_madd(vin2, vvec2, vres1);
    vres3 = vec_madd(vin3, vvec3, vres2);
    vres = vec_add(vres3, vin4);

    // Store back the result
    vec_st(vres, 0, (float *)pOut);
}
vec_madd takes 4 or 5 cycles, and the next one cannot be issued until the previous has finished (vres2 needs vres1 as an input). That is a long time to wait.

I notice, though, that some of the data is NOT dependent: vin1, vin2, vin3, vvec1, vvec2, vvec3 are either loaded from memory or generated by vec_splat in the function "prologue".

You couldn't really "gain" performance by moving the vin* variables around, since a memory load will stall the entire function for perhaps hundreds of cycles, but a vec_splat could be moved. In effect you would have:
Code:
void SIMDx86Matrix_VectorMultiplyOf(SIMDx86Vector* pOut, const SIMDx86Vector* pIn, const SIMDx86Matrix* pMat)
{
    /*
     * Does a normal vector x matrix, but since w = 1.0f, the final
     * column of the matrix is merely added.
     */

    vector float vvec2, vvec3;
    vector float vres1, vres2, vres3, vres;

    // Load SIMDx86 matrix and vector
    vector float vvec = vec_ld(0, (float *)pIn);
    vector float vin1 = vec_ld(0, &pMat->m[0]);
    vector float vin2 = vec_ld(16, &pMat->m[0]);
    vector float vin3 = vec_ld(32, &pMat->m[0]);
    vector float vin4 = vec_ld(48, &pMat->m[0]);
    vector float vvec1 = vec_splat(vvec, 0);
    vector float v0 = (vector float) vec_splat_u32(0);

    // Do the vector x matrix multiplication
    vres1 = vec_madd(vin1, vvec1, v0);
    vvec2 = vec_splat(vvec, 1);
    vres2 = vec_madd(vin2, vvec2, vres1);
    vvec3 = vec_splat(vvec, 2);
    vres3 = vec_madd(vin3, vvec3, vres2);
    vres = vec_add(vres3, vin4);

    // Store back the result
    vec_st(vres, 0, (float *)pOut);
}
I realise this is verging on the realms of micro-optimisation, and I have not run it through SimG4, nor can I actually test the performance here; I only have an Efika and a Pentium M to spare :). But the way I see it, you get a vec_splat for free between each vec_madd, reducing the time spent in the function prologue. Given the high latency of vec_madd and the data dependency, the interleaved vec_splat will always finish before the next vec_madd is ready to issue, and the two execute in completely different units (the splat in the vector permute unit, the multiply-add in the vector FPU). I gain 2 cycles in my head :)

Does that sound sane or insane? :D

I just noticed that with the splats moved out of the prologue, the entire calculation segment now depends directly on the final vec_ld completing, so on a cold cache it may not matter at all.

Anyway, there is probably a lot you can do with a little fancy reordering...

_________________
Matt Sealey


PostPosted: Tue Feb 12, 2008 11:45 am 
Quote:
Does that sound sane or insane? :D

I just noticed that with the splats moved out of the prologue, the entire calculation segment now depends directly on the final vec_ld completing, so on a cold cache it may not matter at all.

Anyway, there is probably a lot you can do with a little fancy reordering...
Actually, the compiler does most of the work for me (output of objdump -S):
Code:
000008c0 <SIMDx86Matrix_VectorMultiplyOf>:
8c0: 94 21 ff f0 stwu r1,-16(r1)
8c4: 10 00 03 8c vspltisw v0,0
8c8: 39 25 00 20 addi r9,r5,32
8cc: 7c 20 20 ce lvx v1,r0,r4
8d0: 38 85 00 10 addi r4,r5,16
8d4: 38 21 00 10 addi r1,r1,16
8d8: 7d a0 28 ce lvx v13,r0,r5
8dc: 38 a5 00 30 addi r5,r5,48
8e0: 7d 80 20 ce lvx v12,r0,r4
8e4: 11 60 0a 8c vspltw v11,v1,0
8e8: 11 41 0a 8c vspltw v10,v1,1
8ec: 11 ad 02 ee vmaddfp v13,v13,v11,v0
8f0: 7c 00 48 ce lvx v0,r0,r9
8f4: 10 22 0a 8c vspltw v1,v1,2
8f8: 7d 60 28 ce lvx v11,r0,r5
8fc: 11 8c 6a ae vmaddfp v12,v12,v10,v13
900: 11 40 60 6e vmaddfp v10,v0,v1,v12
904: 10 0a 58 0a vaddfp v0,v10,v11
908: 7c 00 19 ce stvx v0,r0,r3
90c: 4e 80 00 20 blr
It reorganizes the instructions in a much more efficient way. I'd go as far as to say that gcc has improved a lot at instruction scheduling for powerpc, compared with previous versions at least; I can't say how it compares with x86. :)

Konstantinos


PostPosted: Tue Feb 12, 2008 12:34 pm 
Quote:
Quote:
Does that sound sane or insane? :D

I just noticed that with the splats moved out of the prologue, the entire calculation segment now depends directly on the final vec_ld completing, so on a cold cache it may not matter at all.

Anyway, there is probably a lot you can do with a little fancy reordering...
Actually, the compiler does most of the work for me (output of objdump -S):
Code:
8ec: 11 ad 02 ee vmaddfp v13,v13,v11,v0
8f0: 7c 00 48 ce lvx v0,r0,r9
8f4: 10 22 0a 8c vspltw v1,v1,2
8f8: 7d 60 28 ce lvx v11,r0,r5
8fc: 11 8c 6a ae vmaddfp v12,v12,v10,v13
900: 11 40 60 6e vmaddfp v10,v0,v1,v12
904: 10 0a 58 0a vaddfp v0,v10,v11
908: 7c 00 19 ce stvx v0,r0,r3
90c: 4e 80 00 20 blr
It reorganizes the instructions in a much more efficient way. I'd go as far as to say that gcc has improved a lot at instruction scheduling for powerpc
It does look very efficient (it keeps the processor busy, at least), but I don't like the idea of putting two loads between the first and second vec_madd. It seems to put a lot of trust in the speed and efficiency of the L2 cache; the idea that you can complete those two loads in time for the vec_add, using the vec_madd latency as a kind of "buffer", is really unintuitive...
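If we really don't want to trust the cache, one option, untested on my part and with a control word I am taking from the AltiVec programming docs from memory, would be to hint the load explicitly with vec_dst before calling the routine:
Code:
#include <altivec.h>

/* Sketch: start a data stream prefetch of one 64-byte matrix
   (4 vectors of 16 bytes each) on stream tag 0.
   Control word = (block size << 24) | (block count << 16) | stride. */
static void prefetch_matrix(const float* pMat)
{
    vec_dst(pMat, (4 << 24) | (1 << 16) | 64, 0);
}
Whether the stream engine actually wins anything on a function this short is another question entirely.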

_________________
Matt Sealey


PostPosted: Wed Feb 13, 2008 2:34 pm 
Posting the link again on this thread...

http://jp.youtube.com/watch?v=SZDusxG13QQ

Don't miss the 8610 in action. It could make a perfect host for this work.

R&B :)

_________________
http://bbrv.blogspot.com

