**Quote:**

numbers in seconds, smaller is better :-D

(I'm not happy with the Determinant function; I must have done something terribly wrong with it for it to be slower than the original function.)

I looked at your code and noticed a few things that could be improved. They might be genuine wins, or they might just be paranoia on my part, but here goes:

Most of the matrix functions (VectorMultiplyOf etc.) work by issuing several vec_madd() operations back to back. However, each one is dependent on the result of the last.

**Code:**

```c
void SIMDx86Matrix_VectorMultiplyOf(SIMDx86Vector* pOut, const SIMDx86Vector* pIn, const SIMDx86Matrix* pMat)
{
	/*
		Does a normal vector x matrix, but since w = 1.0f, the final column of the matrix is
		merely added.
	*/

	// Load SIMDx86 matrix and vector
	vector float vin1 = vec_ld(0, &pMat->m[0]);
	vector float vin2 = vec_ld(16, &pMat->m[0]);
	vector float vin3 = vec_ld(32, &pMat->m[0]);
	vector float vin4 = vec_ld(48, &pMat->m[0]);
	vector float vvec = vec_ld(0, (float *)pIn);

	vector float vvec1 = vec_splat(vvec, 0);
	vector float vvec2 = vec_splat(vvec, 1);
	vector float vvec3 = vec_splat(vvec, 2);
	vector float v0 = (vector float) vec_splat_u32(0);
	vector float vres1, vres2, vres3, vres;

	// Do the vector x matrix multiplication
	vres1 = vec_madd(vin1, vvec1, v0);
	vres2 = vec_madd(vin2, vvec2, vres1);
	vres3 = vec_madd(vin3, vvec3, vres2);
	vres = vec_add(vres3, vin4);

	// Store back the result
	vec_st(vres, 0, (float *)pOut);
}
```

vec_madd executes in 4 or 5 cycles, and the next one cannot be issued until the previous has finished (vres2 needs vres1 to calculate). That is a long time to wait.

I notice, though, that some of the data is NOT dependent - vin1, vin2, vin3, vvec1, vvec2, vvec3 - it is either generated by vec_splat in the function "prologue" or loaded from memory.

You couldn't really "gain" performance by moving the vin* variables around, since a memory load miss would stall the entire function for perhaps hundreds of cycles anyway, but the vec_splat calls can be moved. In effect you would have:

**Code:**

```c
void SIMDx86Matrix_VectorMultiplyOf(SIMDx86Vector* pOut, const SIMDx86Vector* pIn, const SIMDx86Matrix* pMat)
{
	/*
		Does a normal vector x matrix, but since w = 1.0f, the final column of the matrix is
		merely added.
	*/
	vector float vvec2, vvec3;
	vector float vres1, vres2, vres3, vres;

	// Load SIMDx86 matrix and vector
	vector float vvec = vec_ld(0, (float *)pIn);
	vector float vin1 = vec_ld(0, &pMat->m[0]);
	vector float vin2 = vec_ld(16, &pMat->m[0]);
	vector float vin3 = vec_ld(32, &pMat->m[0]);
	vector float vin4 = vec_ld(48, &pMat->m[0]);
	vector float vvec1 = vec_splat(vvec, 0);
	vector float v0 = (vector float) vec_splat_u32(0);

	// Do the vector x matrix multiplication, interleaving the splats
	vres1 = vec_madd(vin1, vvec1, v0);
	vvec2 = vec_splat(vvec, 1);
	vres2 = vec_madd(vin2, vvec2, vres1);
	vvec3 = vec_splat(vvec, 2);
	vres3 = vec_madd(vin3, vvec3, vres2);
	vres = vec_add(vres3, vin4);

	// Store back the result
	vec_st(vres, 0, (float *)pOut);
}
```

I realise this is verging on the realms of a micro-optimisation, and I have not run this through SimG4, nor can I actually test the performance here (I only have an Efika and a Pentium M to spare :) But the way I see it, you get each vec_splat for free between the vec_madds, reducing the time spent in the function prologue. Given the high latency of vec_madd and the data dependency, there is no way the next vec_madd could issue before the vec_splat finishes, plus the two instructions execute in completely different units (vector permute vs. vector float). I gain 2 cycles in my head :)
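To put a rough number on that, here is a back-of-the-envelope schedule. The latencies are assumptions on my part (4-cycle vec_madd, 2-cycle vec_splat) and it ignores issue-width and queue restrictions, so take it as a sketch rather than a SimG4 trace:

```
original:                         reordered:
  splat vvec1, vvec2, vvec3        splat vvec1
  madd  vres1     (waits on splats)  madd  vres1
  madd  vres2     +4 cycles          splat vvec2   (hidden under vres1's latency)
  madd  vres3     +4 cycles          madd  vres2   +4 cycles
                                     splat vvec3   (hidden under vres2's latency)
                                     madd  vres3   +4 cycles
```

The madd chain is the same length either way; the win is that two of the three splats disappear into latency slots that were otherwise empty.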

Does that sound sane or insane? :D

I just noticed that, without the up-front sequence of splats, the entire calculation segment is now dependent on the completion of the final vec_ld, so it may not matter at all on a cold cache.

Anyway. There is probably a lot you can do with a little fancy reordering..