All times are UTC-06:00




Posted: Sat Feb 09, 2008 5:53 am

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348

Hi. For those of you who don't know about it, SIMDx86 is a library of matrix/vector/image/plane/quaternion/etc. functions with C, SSE, 3DNow! and, from now on, AltiVec-optimized routines. It offers a common API across all of these and high performance on every platform it supports. I've sent the patch against version 0.4 to the upstream developer, and I'll keep sending patches as I finish the optimizations (some routines haven't been implemented yet).

Why SIMDx86? Well, it was easy to do, it makes a nice proof of concept, and it includes pretty much the same routines as Mesa, Ogre3D, ODE, Blender and other 3D software, which means my next step is optimizing most or all of the aforementioned software (Mesa and Blender are top priority).

Some benchmarks (Google Charts graphs will follow shortly) from matrix/vector operations:
Code:
vector:
altivec: time Cross(): 0.260000 sec
scalar : time Cross(): 0.380000 sec
altivec: time CrossOf(): 0.210000 sec
scalar : time CrossOf(): 0.240000 sec
altivec: time Diff(): 0.160000 sec
scalar : time Diff(): 0.330000 sec
altivec: time DiffOf(): 0.120000 sec
scalar : time DiffOf(): 0.200000 sec
scalar : time Dot(): 0.160000 sec
altivec: time Dot(): 0.270000 sec
scalar : time Dot4(): 0.200000 sec
altivec: time Dot4(): 0.280000 sec
altivec: time Length(): 0.670000 sec
scalar : time Length(): 2.230000 sec
scalar : time LengthSq(): 0.160000 sec
altivec: time LengthSq(): 0.280000 sec
altivec: time Normalize(): 0.470000 sec
scalar : time Normalize(): 2.680000 sec
altivec: time NormalizeOf(): 0.410000 sec
scalar : time NormalizeOf(): 2.780000 sec
scalar : time Scale(): 0.270000 sec
altivec: time Scale(): 0.490000 sec
scalar : time ScaleOf(): 0.160000 sec
altivec: time ScaleOf(): 0.440000 sec
altivec: time Sum(): 0.210000 sec
scalar : time Sum(): 0.410000 sec
altivec: time SumOf(): 0.140000 sec
scalar : time SumOf(): 0.220000 sec

matrix:
scalar : time Determinant(): 0.520000 sec
altivec: time Determinant(): 0.610000 sec
altivec: time Diff(): 0.310000 sec
scalar : time Diff(): 0.870000 sec
altivec: time DiffOf(): 0.220000 sec
scalar : time DiffOf(): 0.830000 sec
altivec: time Multiply(): 0.400000 sec
scalar : time Multiply(): 4.140000 sec
altivec: time MultiplyOf(): 0.370000 sec
scalar : time MultiplyOf(): 2.620000 sec
altivec: time Scale(): 0.490000 sec
scalar : time Scale(): 0.860000 sec
altivec: time ScaleOf(): 0.520000 sec
scalar : time ScaleOf(): 0.740000 sec
altivec: time Sum(): 0.360000 sec
scalar : time Sum(): 1.120000 sec
altivec: time SumOf(): 0.210000 sec
scalar : time SumOf(): 0.820000 sec
altivec: time ToIdentity(): 0.130000 sec
scalar : time ToIdentity(): 0.700000 sec
altivec: time ToTranslate(): 0.280000 sec
scalar : time ToTranslate(): 0.900000 sec
altivec: time ToTranslateOf(): 0.120000 sec
scalar : time ToTranslateOf(): 0.770000 sec
altivec: time Transpose(): 0.180000 sec
scalar : time Transpose(): 1.840000 sec
altivec: time TransposeOf(): 0.200000 sec
scalar : time TransposeOf(): 0.550000 sec
altivec: time Vector4Multiply(): 0.240000 sec
scalar : time Vector4Multiply(): 0.660000 sec
altivec: time Vector4MultiplyOf(): 0.210000 sec
scalar : time Vector4MultiplyOf(): 0.670000 sec
altivec: time VectorMultiply(): 0.240000 sec
scalar : time VectorMultiply(): 0.570000 sec
altivec: time VectorMultiplyOf(): 0.260000 sec
scalar : time VectorMultiplyOf(): 0.510000 sec
Numbers are in seconds; smaller is better :-D

(I'm not happy with the Determinant function; I must have done something terribly wrong for it to be slower than the original function.)
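
As a rough illustration of how timings like the ones above can be collected, here is a minimal sketch in plain C. It is not the benchmark code shipped with SIMDx86: `Vec4`, `cross_scalar` and `time_cross` are hypothetical names, and it only times a scalar cross product with `clock()`.

```c
#include <time.h>

/* Hypothetical stand-in for SIMDx86Vector; the real type may differ. */
typedef struct { float x, y, z, w; } Vec4;

/* Plain scalar cross product: out = a x b (w is set to 0). */
void cross_scalar(Vec4 *out, const Vec4 *a, const Vec4 *b)
{
    Vec4 r;
    r.x = a->y * b->z - a->z * b->y;
    r.y = a->z * b->x - a->x * b->z;
    r.z = a->x * b->y - a->y * b->x;
    r.w = 0.0f;
    *out = r;
}

/* Times `iterations` calls of cross_scalar and returns CPU seconds. */
double time_cross(long iterations)
{
    Vec4 a = { 1.0f, 0.0f, 0.0f, 0.0f };
    Vec4 b = { 0.0f, 1.0f, 0.0f, 0.0f };
    Vec4 out = { 0.0f, 0.0f, 0.0f, 0.0f };
    volatile float sink = 0.0f;  /* volatile store so the loop is not removed entirely */
    clock_t start = clock();

    for (long i = 0; i < iterations; ++i) {
        cross_scalar(&out, &a, &b);
        sink = out.z;
    }
    (void)sink;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling `time_cross()` with a large iteration count, once for a scalar build and once for an AltiVec build of the routine, would give one pair of lines like those in the table. The iteration count is an arbitrary choice, and an optimizing compiler may still hoist work out of such a simple loop, so treat the result with care.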

For those of you that would like to play with it, the updated code/patch is at:

http://www.freevec.org/old/SIMDx86.tgz
http://www.freevec.org/old/SIMDx86-altivec+.patch.gz

Enjoy

Konstantinos


Last edited by markos on Thu Mar 20, 2008 9:10 am, edited 1 time in total.

Posted: Sat Feb 09, 2008 6:01 am
Genesi

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1422
That is great work, Konstantinos. You are helping make the EFIKA8610 even more of a possibility. Super!

R&B :)

_________________
http://bbrv.blogspot.com


Posted: Tue Feb 12, 2008 11:06 am
Site Admin

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1589
Location: Austin, TX
Quote:
Numbers are in seconds; smaller is better :-D

(I'm not happy with the Determinant function; I must have done something terribly wrong for it to be slower than the original function.)
I looked at your code and noticed a few things that could be improved. They might be real, or they might just be paranoia on my part, but here goes:

Most of the matrix functions (like VectorMultiplyOf etc.) perform several vec_madd() operations back to back. However, each one is dependent on the result of the last.
Code:
void SIMDx86Matrix_VectorMultiplyOf(SIMDx86Vector* pOut, const SIMDx86Vector* pIn, const SIMDx86Matrix* pMat)
{
    /*
        Does a normal vector x matrix, but since w = 1.0f, the final column of the matrix is
        merely added.
    */

    // Load SIMDx86 matrix and vector
    vector float vin1 = vec_ld(0, &pMat->m[0]);
    vector float vin2 = vec_ld(16, &pMat->m[0]);
    vector float vin3 = vec_ld(32, &pMat->m[0]);
    vector float vin4 = vec_ld(48, &pMat->m[0]);
    vector float vvec = vec_ld(0, (float *)pIn);
    vector float vvec1 = vec_splat(vvec, 0);
    vector float vvec2 = vec_splat(vvec, 1);
    vector float vvec3 = vec_splat(vvec, 2);
    vector float v0 = (vector float) vec_splat_u32(0);
    vector float vres1, vres2, vres3, vres;

    // Do the vector x matrix multiplication
    vres1 = vec_madd(vin1, vvec1, v0);
    vres2 = vec_madd(vin2, vvec2, vres1);
    vres3 = vec_madd(vin3, vvec3, vres2);
    vres = vec_add(vres3, vin4);

    // Store back the result
    vec_st(vres, 0, (float *)pOut);
}
vec_madd executes in 4 or 5 cycles, and the next one cannot be executed until it has finished (vres2 needs vres1 to be calculated). This is a long time to wait.
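
For reference, the arithmetic performed by the AltiVec routine above can be written in plain C, which makes the chain of dependent multiply-adds easy to see. This is only a sketch: `RefMatrix`, `RefVector` and `ref_vector_multiply_of` are hypothetical names, and the layout (16 floats, four consecutive floats per vector load) is an assumption read off the vec_ld offsets, not taken from the SIMDx86 headers.

```c
/* Hypothetical stand-ins for the SIMDx86 types; the real definitions may differ. */
typedef struct { float m[16]; } RefMatrix;
typedef struct { float x, y, z, w; } RefVector;

/* Scalar equivalent of the vector code: every lane i runs the same chain. */
void ref_vector_multiply_of(RefVector *out, const RefVector *in, const RefMatrix *mat)
{
    const float *m = mat->m;
    float acc[4];
    int i;

    for (i = 0; i < 4; ++i) {
        float r1 = m[i] * in->x;            /* vres1 = vin1 * vvec1 + 0     */
        float r2 = m[4 + i] * in->y + r1;   /* vres2 = vin2 * vvec2 + vres1 */
        float r3 = m[8 + i] * in->z + r2;   /* vres3 = vin3 * vvec3 + vres2 */
        acc[i] = r3 + m[12 + i];            /* vres  = vres3 + vin4         */
    }

    out->x = acc[0];
    out->y = acc[1];
    out->z = acc[2];
    out->w = acc[3];
}
```

Each step needs the result of the step before it, which is exactly the dependency that serializes the three vec_madd calls and the final vec_add.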

I notice, though, that some of the data is NOT dependent on earlier results: vin1, vin2, vin3, vvec1, vvec2 and vvec3 are either loaded from memory or generated by vec_splat in the function "prologue".

You couldn't really "gain" performance by moving the vin* variables around, since a memory load that misses the cache will stall the entire function for perhaps hundreds of cycles, but a vec_splat could be moved. In effect you would have:
Code:
void SIMDx86Matrix_VectorMultiplyOf(SIMDx86Vector* pOut, const SIMDx86Vector* pIn, const SIMDx86Matrix* pMat)
{
    /*
        Does a normal vector x matrix, but since w = 1.0f, the final column of the matrix is
        merely added.
    */

    vector float vvec2, vvec3;
    vector float vres1, vres2, vres3, vres;

    // Load SIMDx86 matrix and vector
    vector float vvec = vec_ld(0, (float *)pIn);
    vector float vin1 = vec_ld(0, &pMat->m[0]);
    vector float vin2 = vec_ld(16, &pMat->m[0]);
    vector float vin3 = vec_ld(32, &pMat->m[0]);
    vector float vin4 = vec_ld(48, &pMat->m[0]);
    vector float vvec1 = vec_splat(vvec, 0);
    vector float v0 = (vector float) vec_splat_u32(0);

    // Do the vector x matrix multiplication
    vres1 = vec_madd(vin1, vvec1, v0);
    vvec2 = vec_splat(vvec, 1);
    vres2 = vec_madd(vin2, vvec2, vres1);
    vvec3 = vec_splat(vvec, 2);
    vres3 = vec_madd(vin3, vvec3, vres2);
    vres = vec_add(vres3, vin4);

    // Store back the result
    vec_st(vres, 0, (float *)pOut);
}
I realise this is verging on micro-optimisation, and I have not run this through SimG4, nor can I actually test the performance here (I only have an Efika and a Pentium M to spare :) but the way I see it, you get a vec_splat for free between each vec_madd, which shortens the function prologue. Given the high latency of vec_madd and the data dependency, the next vec_madd cannot possibly be ready to issue before the vec_splat has finished, plus they execute in completely different units (vector permute and vector float). I gain 2 cycles in my head :)
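
To make the "2 cycles" estimate concrete, here is a toy model in C. It is not a simulation of any real G4 pipeline: it assumes strictly in-order issue of one instruction per cycle, an assumed 2-cycle latency for the splats and an assumed 4-cycle latency for vec_madd and vec_add, and it ignores the loads entirely. All names are made up for illustration.

```c
#include <stddef.h>

/* One instruction in a toy in-order, single-issue pipeline model. */
typedef struct {
    int latency;  /* cycles from issue until the result is usable (assumed) */
    int dep[2];   /* indices of producer instructions, -1 if unused */
} ToyInsn;

/* Returns the cycle at which the last result becomes available. */
int toy_schedule_cycles(const ToyInsn *prog, size_t n)
{
    int ready[32];      /* ready[i] = cycle when result i is available */
    int next_issue = 0; /* earliest cycle the next instruction may issue */
    int done = 0;
    size_t i;

    for (i = 0; i < n && i < 32; ++i) {
        int issue = next_issue;
        int d;
        for (d = 0; d < 2; ++d) {
            int p = prog[i].dep[d];
            if (p >= 0 && ready[p] > issue)
                issue = ready[p];  /* wait for the operand */
        }
        ready[i] = issue + prog[i].latency;
        next_issue = issue + 1;
        if (ready[i] > done)
            done = ready[i];
    }
    return done;
}

/* Original order: three splats and the zero vector first, then the madd chain. */
int toy_original_cycles(void)
{
    static const ToyInsn prog[] = {
        { 2, { -1, -1 } },  /* 0: vvec1 = splat   */
        { 2, { -1, -1 } },  /* 1: vvec2 = splat   */
        { 2, { -1, -1 } },  /* 2: vvec3 = splat   */
        { 2, { -1, -1 } },  /* 3: v0 = zero splat */
        { 4, {  0,  3 } },  /* 4: vres1 = madd    */
        { 4, {  1,  4 } },  /* 5: vres2 = madd    */
        { 4, {  2,  5 } },  /* 6: vres3 = madd    */
        { 4, {  6, -1 } },  /* 7: vres = add      */
    };
    return toy_schedule_cycles(prog, sizeof prog / sizeof prog[0]);
}

/* Reordered: each remaining splat sits between two dependent madds. */
int toy_reordered_cycles(void)
{
    static const ToyInsn prog[] = {
        { 2, { -1, -1 } },  /* 0: vvec1 = splat   */
        { 2, { -1, -1 } },  /* 1: v0 = zero splat */
        { 4, {  0,  1 } },  /* 2: vres1 = madd    */
        { 2, { -1, -1 } },  /* 3: vvec2 = splat   */
        { 4, {  3,  2 } },  /* 4: vres2 = madd    */
        { 2, { -1, -1 } },  /* 5: vvec3 = splat   */
        { 4, {  5,  4 } },  /* 6: vres3 = madd    */
        { 4, {  6, -1 } },  /* 7: vres = add      */
    };
    return toy_schedule_cycles(prog, sizeof prog / sizeof prog[0]);
}
```

Under these assumptions the original order finishes at cycle 21 and the reordered one at cycle 19. A real processor can issue more than one instruction per cycle and the compiler may reorder the code anyway, so the actual difference may well be smaller or zero.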

Does that sound sane or insane? :D

I just noticed that, without the sequence of splats up front, the entire calculation segment now depends on the completion of the final vec_ld, so it may not matter at all on a cold cache.

Anyway. There is probably a lot you can do with a little fancy reordering..

_________________
Matt Sealey


Posted: Tue Feb 12, 2008 11:45 am

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348
Quote:
Does that sound sane or insane? :D

I just noticed that, without the sequence of splats up front, the entire calculation segment now depends on the completion of the final vec_ld, so it may not matter at all on a cold cache.

Anyway. There is probably a lot you can do with a little fancy reordering..
Actually, the compiler does most of the job for me (output of objdump -S):
Code:
000008c0 <SIMDx86Matrix_VectorMultiplyOf>:
8c0: 94 21 ff f0 stwu r1,-16(r1)
8c4: 10 00 03 8c vspltisw v0,0
8c8: 39 25 00 20 addi r9,r5,32
8cc: 7c 20 20 ce lvx v1,r0,r4
8d0: 38 85 00 10 addi r4,r5,16
8d4: 38 21 00 10 addi r1,r1,16
8d8: 7d a0 28 ce lvx v13,r0,r5
8dc: 38 a5 00 30 addi r5,r5,48
8e0: 7d 80 20 ce lvx v12,r0,r4
8e4: 11 60 0a 8c vspltw v11,v1,0
8e8: 11 41 0a 8c vspltw v10,v1,1
8ec: 11 ad 02 ee vmaddfp v13,v13,v11,v0
8f0: 7c 00 48 ce lvx v0,r0,r9
8f4: 10 22 0a 8c vspltw v1,v1,2
8f8: 7d 60 28 ce lvx v11,r0,r5
8fc: 11 8c 6a ae vmaddfp v12,v12,v10,v13
900: 11 40 60 6e vmaddfp v10,v0,v1,v12
904: 10 0a 58 0a vaddfp v0,v10,v11
908: 7c 00 19 ce stvx v0,r0,r3
90c: 4e 80 00 20 blr
It reorganizes the instructions in a much more efficient way. I'd go as far as to say that gcc's instruction scheduling for PowerPC has improved a lot, at least compared with previous versions; I can't say how it compares with x86. :)

Konstantinos


Posted: Tue Feb 12, 2008 12:34 pm
Site Admin

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1589
Location: Austin, TX
Quote:
Quote:
Does that sound sane or insane? :D

I just noticed that, without the sequence of splats up front, the entire calculation segment now depends on the completion of the final vec_ld, so it may not matter at all on a cold cache.

Anyway. There is probably a lot you can do with a little fancy reordering..
Actually, the compiler does most of the job for me (output of objdump -S):
Code:
8ec: 11 ad 02 ee vmaddfp v13,v13,v11,v0
8f0: 7c 00 48 ce lvx v0,r0,r9
8f4: 10 22 0a 8c vspltw v1,v1,2
8f8: 7d 60 28 ce lvx v11,r0,r5
8fc: 11 8c 6a ae vmaddfp v12,v12,v10,v13
900: 11 40 60 6e vmaddfp v10,v0,v1,v12
904: 10 0a 58 0a vaddfp v0,v10,v11
908: 7c 00 19 ce stvx v0,r0,r3
90c: 4e 80 00 20 blr
It reorganizes the instructions in a much more efficient way. I'd go as far as to say that gcc has improved a lot in instruction scheduling for powerpc
It does look very efficient (keeping the processor busy, at least), but I don't like the idea of putting two loads between the first and second vec_madd. It seems like putting too much trust in the speed and efficiency of the L2 cache. That you could complete those two loads in time for the vec_add, using the vec_madd as a kind of "buffer", is really unintuitive.

_________________
Matt Sealey


Posted: Wed Feb 13, 2008 2:34 pm
Genesi

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1422
Posting link again on this thread...

http://jp.youtube.com/watch?v=SZDusxG13QQ

Don't miss the 8610 in action. It could make a perfect host for this work.

R&B :)

_________________
http://bbrv.blogspot.com


PowerDeveloper.org: Copyright © 2004-2012, Genesi USA, Inc. The Power Architecture and Power.org wordmarks and the Power and Power.org logos and related marks are trademarks and service marks licensed by Power.org.