Quote:
numbers in seconds, smaller is better :-D
(I'm not happy with the Determinant function; I must have done something terribly wrong for it to be slower than the original function.)
I looked at your code and noticed a few things that could be improved. This might be true, or it might just be paranoia on my part, but here goes:
Most of the matrix functions (like VectorMultiplyOf etc.) perform several vec_madd() operations back-to-back; however, each one is dependent on the result of the last.
Code:
void SIMDx86Matrix_VectorMultiplyOf(SIMDx86Vector* pOut, const SIMDx86Vector* pIn, const SIMDx86Matrix* pMat)
{
    /*
        Does a normal vector x matrix, but since w = 1.0f, the final column of the matrix is
        merely added.
    */

    // Load SIMDx86 matrix and vector
    vector float vin1 = vec_ld(0, &pMat->m[0]);
    vector float vin2 = vec_ld(16, &pMat->m[0]);
    vector float vin3 = vec_ld(32, &pMat->m[0]);
    vector float vin4 = vec_ld(48, &pMat->m[0]);
    vector float vvec = vec_ld(0, (float *)pIn);

    vector float vvec1 = vec_splat(vvec, 0);
    vector float vvec2 = vec_splat(vvec, 1);
    vector float vvec3 = vec_splat(vvec, 2);
    vector float v0 = (vector float) vec_splat_u32(0);
    vector float vres1, vres2, vres3, vres;

    // Do the vector x matrix multiplication
    vres1 = vec_madd(vin1, vvec1, v0);
    vres2 = vec_madd(vin2, vvec2, vres1);
    vres3 = vec_madd(vin3, vvec3, vres2);
    vres = vec_add(vres3, vin4);

    // Store back the result
    vec_st(vres, 0, (float *)pOut);
}
vec_madd executes in 4 or 5 cycles, and the next one cannot start until the previous one has finished (vres2 needs vres1 as an input). That is a long time to wait.
I notice, though, that some of the data is NOT dependent (vin1, vin2, vin3, vvec1, vvec2, vvec3): it is either loaded from memory or generated by vec_splat in the function "prologue".
You couldn't really "gain" performance by moving the vin* loads around, since a cache miss on a memory load can stall the entire function for perhaps hundreds of cycles, but a vec_splat could be moved. In effect you would have:
Code:
void SIMDx86Matrix_VectorMultiplyOf(SIMDx86Vector* pOut, const SIMDx86Vector* pIn, const SIMDx86Matrix* pMat)
{
    /*
        Does a normal vector x matrix, but since w = 1.0f, the final column of the matrix is
        merely added.
    */
    vector float vvec2, vvec3;
    vector float vres1, vres2, vres3, vres;

    // Load SIMDx86 matrix and vector
    vector float vvec = vec_ld(0, (float *)pIn);
    vector float vin1 = vec_ld(0, &pMat->m[0]);
    vector float vin2 = vec_ld(16, &pMat->m[0]);
    vector float vin3 = vec_ld(32, &pMat->m[0]);
    vector float vin4 = vec_ld(48, &pMat->m[0]);
    vector float vvec1 = vec_splat(vvec, 0);
    vector float v0 = (vector float) vec_splat_u32(0);

    // Do the vector x matrix multiplication
    vres1 = vec_madd(vin1, vvec1, v0);
    vvec2 = vec_splat(vvec, 1);
    vres2 = vec_madd(vin2, vvec2, vres1);
    vvec3 = vec_splat(vvec, 2);
    vres3 = vec_madd(vin3, vvec3, vres2);
    vres = vec_add(vres3, vin4);

    // Store back the result
    vec_st(vres, 0, (float *)pOut);
}
I realise this is verging on the realms of a micro-optimisation, and I have not run this through SimG4, nor can I actually test the performance here (I only have an Efika and a Pentium M to spare :), but the way I see it you get each vec_splat for free between the vec_madds, reducing the time spent in the function prologue. Given the high latency of vec_madd and the data dependency, the next vec_madd cannot possibly be ready to issue before the interleaved vec_splat has finished, and besides, they execute in completely different units (vector permute for the splat, vector float for the madd). I gain 2 cycles in my head :)
Does that sound sane or insane? :D
I just noticed that, without the sequence of splats in the prologue, the entire calculation segment is now dependent on the completion of the final vec_ld, so it may not matter at all with a cold cache.
Anyway, there is probably a lot you can do with a little fancy reordering...