All times are UTC-06:00




Posted: Tue Mar 09, 2010 3:31 pm
Joined: Wed Oct 13, 2004 7:26 am
Posts: 348
Quote:
> the iMX515 can do 1.6GFLOPS

The NEON Pipeline has 4 Single Precision FP Multiply Units and 4 Accumulators ... it could handle 4 Floats/Cycle.

So shouldn't this be 3.2 GFLOPS or am I missing something?
NEON's registers are 64 bits wide, so while I may issue a vaddq_f32 instruction (which operates on 4x32-bit floats), it does the addition 64 bits at a time, not 128 bits at a time like true 128-bit SIMD units (such as AltiVec or SSE) do. The benchmark results actually seem quite logical.


Posted: Tue Mar 09, 2010 5:33 pm
Joined: Tue Mar 09, 2010 10:41 am
Posts: 19
The Registers are 128 Bits (Q Registers) wide but may be handled as 64 Bit Registers (D Registers) as well.

see:
http://infocenter.arm.com/help/index.js ... 03s02.html
Quote:
The NEON unit can view the same register bank as:
* sixteen 128-bit quadword registers, Q0-Q15
* thirty-two 64-bit doubleword registers, D0-D31.

The 128 Bit Q Registers can hold 2x 64 Bit, 4x32 Bit, 8x16 Bit or 16x8 Bit elements.
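
The element counts above are just the register width divided by the element width. A minimal Python sketch of that arithmetic (the helper name `lanes` is made up for illustration and is not ARM terminology):

```python
def lanes(register_bits: int, element_bits: int) -> int:
    """Number of equally sized elements that fit in one SIMD register."""
    if register_bits % element_bits != 0:
        raise ValueError("element width must divide register width")
    return register_bits // element_bits

# Q registers are 128 bits wide, D registers are 64 bits wide.
q_layout = {bits: lanes(128, bits) for bits in (64, 32, 16, 8)}
d_layout = {bits: lanes(64, bits) for bits in (64, 32, 16, 8)}
print(q_layout)  # {64: 2, 32: 4, 16: 8, 8: 16}
print(d_layout)  # {64: 1, 32: 2, 16: 4, 8: 8}
```

This only describes how a register is partitioned. It says nothing about how many of those lanes the hardware processes per cycle.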

Also see:
http://infocenter.arm.com/help/index.js ... IIFHA.html

Here it performs 8x16 Bit integer add.
I guess this also works with 4x32 Bit float.


Or:
http://infocenter.arm.com/help/index.js ... 03s03.html
Quote:
For example, VADD.I16 q0, q1, q2 indicates an operation on 16-bit integer elements stored in 128-bit Q registers. This means that the operation is on eight 16-bit lanes in parallel.
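
To make "eight 16-bit lanes in parallel" concrete, here is a rough software model in Python of what a lane-wise 16-bit add computes on two 128-bit values. It is only an illustration of the semantics (independent lanes, wrap-around modulo 2**16), not of how the hardware executes it; the function name `vadd_i16` is invented for this sketch.

```python
import struct

def vadd_i16(a: bytes, b: bytes) -> bytes:
    """Model a lane-wise 16-bit integer add on two 128-bit values.

    Each operand is 16 bytes, viewed as eight little-endian 16-bit lanes.
    Lanes are added independently and wrap modulo 2**16, so no carry
    crosses a lane boundary.
    """
    if len(a) != 16 or len(b) != 16:
        raise ValueError("operands must be 128 bits (16 bytes)")
    xs = struct.unpack("<8H", a)
    ys = struct.unpack("<8H", b)
    return struct.pack("<8H", *((x + y) & 0xFFFF for x, y in zip(xs, ys)))

q1 = struct.pack("<8H", 1, 2, 3, 4, 5, 6, 7, 0xFFFF)
q2 = struct.pack("<8H", 10, 20, 30, 40, 50, 60, 70, 1)
q0 = vadd_i16(q1, q2)
print(struct.unpack("<8H", q0))  # (11, 22, 33, 44, 55, 66, 77, 0)
```

The last lane shows the wrap-around: 0xFFFF + 1 becomes 0 and the neighbouring lane is unaffected.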

I did not verify the above with tests so far.
So of course I may misinterpret the docs - if you have different info, please let me know!

Quote:
The benchmark results seem quite logical actually.
Sure - real world performance will not be even close to the theoretical maximum but there may still be some room for improvement :wink:


Posted: Tue Mar 09, 2010 6:36 pm
Joined: Wed Oct 13, 2004 7:26 am
Posts: 348
Quote:
The Registers are 128 Bits (Q Registers) wide but may be handled as 64 Bit Registers (D Registers) as well.

see:
http://infocenter.arm.com/help/index.js ... 03s02.html
Quote:
The NEON unit can view the same register bank as:
* sixteen 128-bit quadword registers, Q0-Q15
* thirty-two 64-bit doubleword registers, D0-D31.

The 128 Bit Q Registers can hold 2x 64 Bit, 4x32 Bit, 8x16 Bit or 16x8 Bit elements.

Also see:
http://infocenter.arm.com/help/index.js ... IIFHA.html

Here it performs 8x16 Bit integer add.
I guess this also works with 4x32 Bit float.


Or:
http://infocenter.arm.com/help/index.js ... 03s03.html
Quote:
For example, VADD.I16 q0, q1, q2 indicates an operation on 16-bit integer elements stored in 128-bit Q registers. This means that the operation is on eight 16-bit lanes in parallel.

I did not verify the above with tests so far.
So of course I may misinterpret the docs - if you have different info, please let me know!
It all depends on how one looks at it, I read this differently:

http://infocenter.arm.com/help/topic/co ... dgcfe.html

Rather, it has a register file of 64-bit registers which it can also map to 128-bit registers, but only as a matter of convenience: it will sometimes take double the cycles to perform an instruction on a q-word. It depends on the instruction:

http://infocenter.arm.com/help/index.js ... 06s06.html

E.g. for fp32 addition, vadd takes one cycle only for a d-word, not a q-word, so a 128-bit vadd takes two cycles. Some others take one cycle even for a q-word, e.g. integer addition.
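
A quick back-of-the-envelope check of what those timings imply, as a Python sketch. It takes the cycle counts from this post at face value (1 cycle for a d-word fp32 vadd, 2 cycles for a q-word one) and ignores latency, loads/stores and dual issue; `issue_cycles` is a made-up helper name.

```python
def issue_cycles(n_floats: int, lanes_per_instr: int, cycles_per_instr: int) -> int:
    """Cycles spent issuing vadd instructions to cover n_floats fp32 additions."""
    instructions = -(-n_floats // lanes_per_instr)  # ceiling division
    return instructions * cycles_per_instr

N = 1024  # arbitrary example size
d_word = issue_cycles(N, lanes_per_instr=2, cycles_per_instr=1)  # 64-bit vadd
q_word = issue_cycles(N, lanes_per_instr=4, cycles_per_instr=2)  # 128-bit vadd
print(d_word, q_word)  # 512 512
```

Under these assumptions the q-word form halves the instruction count but not the cycle count, which is why it works out to 2 floats per cycle either way.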
Quote:
Sure - real world performance will not be even close to the theoretical maximum but there may still be some room for improvement :wink:
There is, but I think the limit is 1.6GFLOPS not 3.2 :)


Posted: Wed Mar 10, 2010 4:38 am
Joined: Tue Mar 09, 2010 10:41 am
Posts: 19
Quote:
It depends on the instruction:
Ah OK I see - you're right.

Well, at least most ALU instructions seem to be single cycle and 1.6 GFLOPS is also not too bad for a low power CPU :)


Posted: Wed Mar 10, 2010 4:41 am
Joined: Wed Oct 13, 2004 7:26 am
Posts: 348
Quote:
Quote:
It depends on the instruction:
Ah OK I see - you're right.

Well, at least most ALU instructions seem to be single cycle and 1.6 GFLOPS is also not too bad for a low power CPU :)
Here is a quote from a guy inside ARM (I asked him yesterday, as I wanted to be sure myself) :)
Quote:
The dual-issue on A8 is limited to
- one NEON ALU op
- one NEON load/store/permute instr (e.g. vld/vst/vmov/vext sort of thing)

So it's 8 ops/cycle if you count adds of 8-bit values.
For F32 it would be 2 ops/cycle because of single-issue of ALU ops.
I hope this clarifies things a bit.
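
For reference, this is how the ops/cycle figure maps onto the two GFLOPS numbers discussed in this thread. A small Python sketch; the 800 MHz clock is an assumption on my part (it is not stated anywhere in the thread), and `peak_gflops` is just an illustrative helper.

```python
CLOCK_GHZ = 0.8  # assumption: 800 MHz core clock; not stated in the thread

def peak_gflops(ops_per_cycle: float, clock_ghz: float = CLOCK_GHZ) -> float:
    """Theoretical peak = floating-point ops per cycle * clock in GHz."""
    return ops_per_cycle * clock_ghz

print(peak_gflops(2))  # 1.6  (2 F32 ops/cycle, as in the quote above)
print(peak_gflops(4))  # 3.2  (if 4 F32 ops/cycle were possible)
```

So with that clock, the 1.6 vs 3.2 GFLOPS question comes down to whether the unit retires 2 or 4 single-precision ops per cycle.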


Posted: Wed Mar 10, 2010 9:52 am
Joined: Tue Mar 09, 2010 10:41 am
Posts: 19
Indeed, it does.
Thanks for that info!


Posted: Fri Jun 17, 2011 2:28 am
Joined: Fri Jun 03, 2011 9:14 am
Posts: 22
Quote:
Quote:
Quote:
I didn't expect NEON was so good ...
Well, that's progress. You can't always be the best. Modern machines have to be better than older ones. By the way, great to see this progress, Konstantinos!
Altivec is still king though, check these results on the G4:

Scalar:
$ ./bench_gemm
eigen cpu 2.65264s 0.809565 GFLOPS (13.283s)
eigen real 2.6532s 0.809394 GFLOPS (13.2863s)

Altivec:
$ ./bench_gemm
eigen cpu 1.17936s 1.82088 GFLOPS (5.90097s)
eigen real 1.17959s 1.82054 GFLOPS (5.90304s)

But keep in mind that PowerPC support is much better and more mature than ARM support (esp. wrt NEON) and that the PowerPC is slightly faster at 1 GHz. Theoretically the G4 can do 4 GFLOPS at fp math and the iMX515 can do 1.6 GFLOPS.
So it looks like a dual-core A9 will be able to reach G4 performance levels at a fraction of its power consumption. Great!
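
To put the quoted G4 numbers in perspective, a short Python sketch that works out the AltiVec speedup and how close the run gets to the quoted theoretical peak. All inputs are the "eigen cpu" figures and the 4 GFLOPS figure from the quoted post; nothing here is newly measured.

```python
scalar_gflops = 0.809565   # "eigen cpu", scalar build
altivec_gflops = 1.82088   # "eigen cpu", AltiVec build
g4_peak_gflops = 4.0       # theoretical figure quoted in the post

speedup = altivec_gflops / scalar_gflops
efficiency = altivec_gflops / g4_peak_gflops
print(f"AltiVec speedup over scalar: {speedup:.2f}x")    # 2.25x
print(f"Fraction of theoretical peak: {efficiency:.1%}")  # 45.5%
```

So the AltiVec build is a bit over twice as fast as scalar, and sits at a little under half of the quoted theoretical peak.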


PowerDeveloper.org: Copyright © 2004-2012, Genesi USA, Inc. The Power Architecture and Power.org wordmarks and the Power and Power.org logos and related marks are trademarks and service marks licensed by Power.org.