The NEON registers are 128 bits wide (Q registers), but they can also be accessed as 64-bit registers (D registers).
See: http://infocenter.arm.com/help/index.js ... 03s02.html
The NEON unit can view the same register bank as:
* sixteen 128-bit quadword registers, Q0-Q15
* thirty-two 64-bit doubleword registers, D0-D31.
A 128-bit Q register can hold 2x64-bit, 4x32-bit, 8x16-bit or 16x8-bit elements.
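To make the overlapping views concrete, here is a minimal portable C sketch that models one Q register as a union. It is not NEON code: `qreg_model`, `qreg_u16_ramp` and `qreg_d_half_u16` are hypothetical names made up for illustration, and real code would use the vector types from `arm_neon.h` instead.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Portable model of ONE 128-bit Q register (illustration only, not NEON code).
   All members overlay the same 16 bytes, like the views listed above. */
typedef union {
    uint64_t d[2];    /* two 64-bit D halves, e.g. Q0 = D0 (low) + D1 (high) */
    uint32_t u32[4];  /* 4 x 32-bit lanes */
    uint16_t u16[8];  /* 8 x 16-bit lanes */
    uint8_t  u8[16];  /* 16 x 8-bit lanes */
} qreg_model;

/* Hypothetical helper: fill the eight 16-bit lanes with 0..7. */
static qreg_model qreg_u16_ramp(void) {
    qreg_model q;
    memset(&q, 0, sizeof q);
    for (int i = 0; i < 8; ++i)
        q.u16[i] = (uint16_t)i;
    return q;
}

/* Hypothetical helper: copy the four 16-bit lanes held by one D half
   (0 = low, 1 = high) into out. */
static void qreg_d_half_u16(const qreg_model *q, int half, uint16_t out[4]) {
    assert(half == 0 || half == 1);
    memcpy(out, &q->d[half], sizeof q->d[half]);
}
```

Reading the low D half of the ramp gives lanes 0..3 and the high half gives lanes 4..7, which is the sense in which one Q register is just two adjacent D registers.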
Also see: http://infocenter.arm.com/help/index.js ... IIFHA.html
The example there performs an 8x16-bit integer add.
I assume the same also works with 4x32-bit floats.
Or: http://infocenter.arm.com/help/index.js ... 03s03.html
For example, VADD.I16 q0, q1, q2 indicates an operation on 16-bit integer elements stored in 128-bit Q registers. This means that the operation is on eight 16-bit lanes in parallel.
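As a rough illustration of what "eight 16-bit lanes in parallel" means, the following scalar C sketch models the result of such an add. `vadd_i16_model` is a hypothetical name. The loop only reproduces the lane-wise semantics (each lane wraps around independently, with no carry into its neighbour), not the parallel execution. With NEON intrinsics the corresponding call would be `vaddq_u16` from `arm_neon.h`.

```c
#include <assert.h>
#include <stdint.h>

#define LANES_I16 8  /* 128 bits / 16 bits per element */

/* Scalar model of VADD.I16 qd, qn, qm: eight independent 16-bit adds.
   Each lane wraps modulo 2^16; nothing carries over into the next lane. */
static void vadd_i16_model(uint16_t dst[LANES_I16],
                           const uint16_t a[LANES_I16],
                           const uint16_t b[LANES_I16]) {
    assert(dst && a && b);
    for (int i = 0; i < LANES_I16; ++i)
        dst[i] = (uint16_t)(a[i] + b[i]);
}
```

For instance, a lane holding 65535 plus a lane holding 1 yields 0 in that lane, while the other seven lanes are unaffected.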
I have not verified any of the above with tests yet,
so I may well be misinterpreting the docs. If you have different information, please let me know!
It all depends on how one looks at it. I read this differently: http://infocenter.arm.com/help/topic/co ... dgcfe.html
Rather, it has a register file of 64-bit registers, which it can also map to 128-bit registers, but only as a matter of convenience: an instruction on a q-word will sometimes take double the cycles. It depends on the instruction: http://infocenter.arm.com/help/index.js ... 06s06.html
E.g. for fp32 addition, vadd takes one cycle only for a d-word, not for a q-word, so a 128-bit vadd takes two cycles. Some other instructions take one cycle even for a q-word, e.g. integer addition.
Sure - real-world performance will not be even close to the theoretical maximum, but there may still be some room for improvement :wink:
There is, but I think the limit is 1.6 GFLOPS, not 3.2 :)
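The factor of two between those figures can be sketched with a little arithmetic. This is only a back-of-the-envelope model: `peak_gflops` is a hypothetical helper, it counts one floating-point operation per lane per instruction, and the clock value is an input that has to be supplied (none is taken from this thread).

```c
#include <assert.h>

/* Peak throughput in GFLOPS for a stream of identical vector instructions.
   clock_mhz : core clock in MHz (caller-supplied assumption)
   lanes     : elements processed per instruction (4 for a q-word of fp32)
   cycles    : cycles each instruction occupies the pipeline */
static double peak_gflops(double clock_mhz, int lanes, int cycles) {
    assert(lanes > 0 && cycles > 0);
    return clock_mhz * (double)lanes / (double)cycles / 1000.0;
}
```

With any fixed clock, `peak_gflops(clk, 4, 2)` is half of `peak_gflops(clk, 4, 1)` and equal to `peak_gflops(clk, 2, 1)`. In other words, if a q-word fp32 vadd takes two cycles, it gains nothing over two d-word vadds, and the peak is half of what a naive four-lanes-per-cycle estimate suggests.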