The registers are 128 bits wide (Q registers) but can also be accessed as 64-bit registers (D registers).
http://infocenter.arm.com/help/index.js ... 03s02.html
The NEON unit can view the same register bank as:
* sixteen 128-bit quadword registers, Q0-Q15
* thirty-two 64-bit doubleword registers, D0-D31.
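To make the two views a bit more concrete, here is a small toy model in plain C (not real NEON code). It assumes the usual ARMv7 aliasing, where Qn overlaps D(2n) as its low half and D(2n+1) as its high half. The names `neon_bank` and `d_index_of_q` are made up for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* Toy model of the NEON register bank: the same 256 bytes seen two ways.
   Hypothetical type, for illustration only. */
typedef union {
    uint64_t d[32];     /* D0-D31, 64 bits each */
    uint64_t q[16][2];  /* Q0-Q15, 128 bits each, stored as [low, high] halves */
} neon_bank;

/* Index of the D register that holds the low or high half of Qn
   (assumed mapping: Qn = D(2n) low, D(2n+1) high). */
static int d_index_of_q(int qn, int high_half) {
    return 2 * qn + (high_half ? 1 : 0);
}

/* Write both halves of Qn through the D view. */
static void set_q_via_d(neon_bank *bank, int qn, uint64_t low, uint64_t high) {
    bank->d[d_index_of_q(qn, 0)] = low;
    bank->d[d_index_of_q(qn, 1)] = high;
}
```

So in this model, writing D2 and D3 changes what you read back from Q1, since both views share the same storage.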
A 128-bit Q register can hold 2x 64-bit, 4x 32-bit, 8x 16-bit or 16x 8-bit elements.
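The lane counts are just 128 divided by the element width. A rough sketch in portable C, using a made-up union `q_reg_view` and a helper macro `LANES` (neither is part of any ARM header):

```c
#include <stddef.h>
#include <stdint.h>

/* One 128-bit Q register viewed with different element widths.
   Hypothetical type, for illustration only. */
typedef union {
    uint64_t u64[2];   /*  2 x 64-bit lanes */
    uint32_t u32[4];   /*  4 x 32-bit lanes */
    uint16_t u16[8];   /*  8 x 16-bit lanes */
    uint8_t  u8[16];   /* 16 x  8-bit lanes */
} q_reg_view;

/* Number of lanes in a given view, computed from the array sizes. */
#define LANES(field) \
    (sizeof(((q_reg_view *)0)->field) / sizeof(((q_reg_view *)0)->field[0]))
```

Every member covers the same 16 bytes, so `LANES(u16)` gives 8 and `LANES(u8)` gives 16.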
http://infocenter.arm.com/help/index.js ... IIFHA.html
In the example below it performs an 8x 16-bit integer add.
I guess this also works with 4x 32-bit float.
http://infocenter.arm.com/help/index.js ... 03s03.html
For example, VADD.I16 q0, q1, q2 indicates an operation on 16-bit integer elements stored in 128-bit Q registers. This means that the operation is on eight 16-bit lanes in parallel.
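In C this could look roughly like the sketch below. It uses the `arm_neon.h` intrinsics `vld1q_s16`, `vaddq_s16`, `vst1q_s16` (and the `_f32` counterparts for the 4x 32-bit float case) when the compiler targets NEON, and falls back to a plain loop otherwise so it still builds on other machines. The function names `add_i16x8` and `add_f32x4` are my own, and I have not measured this on hardware.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* out[i] = a[i] + b[i] for eight 16-bit lanes; the NEON path should map
   to a single VADD.I16 on Q registers. Overflow wraps around. */
static void add_i16x8(const int16_t a[8], const int16_t b[8], int16_t out[8]) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    int16x8_t va = vld1q_s16(a);          /* load 8 lanes into a Q register */
    int16x8_t vb = vld1q_s16(b);
    vst1q_s16(out, vaddq_s16(va, vb));    /* lane-wise add, store 8 lanes */
#else
    /* Scalar fallback: same result, one lane at a time. */
    for (int i = 0; i < 8; ++i) {
        out[i] = (int16_t)((uint16_t)a[i] + (uint16_t)b[i]);
    }
#endif
}

/* Same idea for four 32-bit float lanes (VADD.F32 on Q registers). */
static void add_f32x4(const float a[4], const float b[4], float out[4]) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t va = vld1q_f32(a);
    float32x4_t vb = vld1q_f32(b);
    vst1q_f32(out, vaddq_f32(va, vb));
#else
    for (int i = 0; i < 4; ++i) {
        out[i] = a[i] + b[i];
    }
#endif
}
```

Whether the compiler really emits one instruction per call depends on the flags (for example `-mfpu=neon` on 32-bit ARM), so it is worth checking the generated assembly.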
I have not verified the above with tests so far.
So of course I may be misinterpreting the docs. If you have different info, please let me know!
The benchmark results seem quite logical actually.
Sure, real-world performance will not be even close to the theoretical maximum, but there may still be some room for improvement.