All times are UTC-06:00




 Post subject: AltiVec Advantages
PostPosted: Thu Oct 14, 2004 11:30 am 
Offline

Joined: Fri Sep 24, 2004 1:39 am
Posts: 5
I just want to share a few bullet points on some of the AltiVec advantages, as requested by R&B.

1. Computational Performance
AltiVec is a 128-bit vector execution unit that operates concurrently with the existing integer and floating-point units.
The AltiVec unit accelerates computationally intensive applications such as image and signal processing.

2. Simultaneous Execution
The AltiVec unit executes up to 4 floating-point, 8 16-bit, or 16 8-bit operations in a single clock cycle.

3. Computational Efficiency
The AltiVec unit works with a set of SIMD instructions that dramatically increase the computational efficiency of a PowerPC processor.

4. DSP Performance
Extra transistors dedicated to AltiVec are available for handling DSP tasks, with minimal programming effort.

5. Memory Efficiency (MMU)
The Memory Management Unit lets the processor work with virtual addresses, providing memory protection for embedded real-time applications. Without such protection, a simple memory access violation can cause a complex set of system-level symptoms that can be hard to diagnose.

6. Programming Efficiency & Productivity
AltiVec technology represents a leap in simplifying the programming required to achieve high performance.
Whereas previous DSP-based systems required handcrafted assembly language for optimal performance, easy-to-use extensions to the C language provide a direct mapping to AltiVec instructions.
This permits developers to program more productively in a higher-level language, even for critical sections of code.
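
As a minimal sketch of what those C extensions look like (an illustration added for this thread, not code from any poster): the function below adds two float arrays four elements at a time with the standard `vec_ld`, `vec_add` and `vec_st` operations from `<altivec.h>`. The function name `add_floats`, the multiple-of-4 length requirement and the plain scalar fallback for non-AltiVec builds are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ALTIVEC__)
#include <altivec.h>
#endif

/* dst[i] = a[i] + b[i] for i in [0, n).
   Assumptions of this sketch: n is a multiple of 4, and on the AltiVec
   path all three pointers are 16-byte aligned (vec_ld/vec_st need that). */
void add_floats(float *dst, const float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);
#if defined(__ALTIVEC__)
    assert((((uintptr_t)dst | (uintptr_t)a | (uintptr_t)b) % 16) == 0);
    for (size_t i = 0; i < n; i += 4) {
        vector float va = vec_ld(0, a + i);   /* load 4 floats */
        vector float vb = vec_ld(0, b + i);   /* load 4 floats */
        vec_st(vec_add(va, vb), 0, dst + i);  /* 4 additions, one store */
    }
#else
    /* Scalar fallback so the example also builds without AltiVec. */
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
#endif
}
```

Each `vec_` call in the AltiVec branch corresponds to one vector instruction, which is the "direct mapping" mentioned above.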

Do not hesitate to post any of your ideas... 8-)

liquidbit


PostPosted: Thu Oct 14, 2004 11:32 am 
Offline

Joined: Sun Oct 03, 2004 6:42 pm
Posts: 0
A really simple description:


Imagine a computation as a train ride.
Each computation needs one carriage and one track.

The Altivec Train Co. has 4 tracks and 4 carriages.
It can thus operate 4 rides (i.e. computations) while the other companies can only do one.

Sometimes you can have 4 people per carriage, in which case it's 16 at once rather than 1.


PostPosted: Thu Oct 14, 2004 11:33 am 
Offline

Joined: Fri Sep 24, 2004 1:39 am
Posts: 5
nice approach.. we could make some sketches relating to this :-)


PostPosted: Thu Oct 14, 2004 11:35 am 
Offline
Genesi

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1422
Thanks to both of you. With your illustrative skills, liquidbit, that might end up being a very cool idea... :-)

R&B


PostPosted: Thu Oct 14, 2004 11:36 am 
Offline

Joined: Tue Oct 12, 2004 7:42 am
Posts: 0
1. 16-way SIMD
- Parallel processing
2. 128-bit wide data paths
- Fast data movement
- latency of moving 128-bit data with AltiVec = latency of moving 32-bit data with PPC scalar
- Table with big entries
3. 32 “separate” 128-bit registers
- code running on the vector registers rather than on memory
- no interference between AltiVec and PPC scalar part
4. vperm capability
- 16-way parallel table lookups
- The magic box for many exotic algorithms
5. Cache management
- Pre-fetch, lock-in, control cache with software
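
To make the "16-way parallel table lookups" point concrete, here is a portable C model of what a single vperm does. It is a sketch for illustration only (the name `vperm_model` is made up): the two source vectors are treated as one 32-byte table, and the low 5 bits of each control byte pick the table entry for that output byte. On real hardware all 16 lookups happen in one instruction, not in a loop.

```c
#include <assert.h>
#include <stdint.h>

/* Portable model of vperm: a and b together form a 32-entry byte table
   (a = entries 0..15, b = entries 16..31). For every output byte, the
   low 5 bits of the matching control byte select the table entry. */
void vperm_model(uint8_t out[16], const uint8_t a[16],
                 const uint8_t b[16], const uint8_t ctl[16])
{
    uint8_t tmp[16];  /* temporary, so out may alias a, b or ctl */

    assert(out && a && b && ctl);
    for (int i = 0; i < 16; i++) {
        unsigned idx = ctl[i] & 0x1Fu;
        tmp[i] = (idx < 16) ? a[idx] : b[idx - 16];
    }
    for (int i = 0; i < 16; i++)
        out[i] = tmp[i];
}
```

With the control vector holding 16 arbitrary indices, this is a 16-way lookup into a 32-entry table; with a fixed control vector it becomes a byte shuffle, which is why the same instruction also handles alignment and data reordering.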


PostPosted: Thu Oct 14, 2004 11:37 am 
Offline

Joined: Fri Sep 24, 2004 1:39 am
Posts: 111
Hello Bo,

one thing that always bothered me with Altivec and G4 is the (lack of) bandwidth to memory.

A 1000+ MHz 128-bit unit is fed through a <200 MHz 64-bit bus (133 MHz in the Pegasos II). Even with the new MPC86xx generation, the internal (and therefore probably less complex to route) MPX bus stayed at 64 bits to feed two 128-bit AltiVec units. Although the caches can do much good here, at some point they might become starved of data.

Competing CPUs like the Athlon64 now access their memory with 128 bits; others already have QDR800 external buses, or even DDR1250 if we look at the G5.

What is your opinion on that matter? Or is this no real issue, something that can always be circumvented by clever coding? How is the quality of currently available C compilers regarding the support of AltiVec in comparison to the support of SSE2?
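
A back-of-the-envelope version of that mismatch, as a small sketch. The assumptions are stated loudly: one transfer per bus clock, one 128-bit operand consumed per core clock, caches ignored, and 1 MB taken as 10^6 bytes. The helper name `peak_mb_per_s` is made up for the example.

```c
#include <assert.h>

/* Peak transfer rate in MB/s for a data path that is width_bits wide
   and moves one item per clock at mhz MHz (assumption: no wait states,
   no burst overhead, 1 MB = 10^6 bytes). */
unsigned long peak_mb_per_s(unsigned width_bits, unsigned mhz)
{
    assert(width_bits % 8 == 0);
    return (unsigned long)(width_bits / 8) * mhz;
}
```

Under these assumptions `peak_mb_per_s(64, 133)` gives 1064 MB/s for the bus, while `peak_mb_per_s(128, 1000)` gives 16000 MB/s for a vector unit that wants one operand per cycle, a ratio of roughly 15 to 1.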


PostPosted: Thu Oct 14, 2004 11:38 am 
Offline

Joined: Tue Oct 12, 2004 7:42 am
Posts: 0
We have done a very extensive study on the impact of a relatively slow memory system on the CPU. Keep in mind that the G4's LSU is pipelined, with 32 bytes per burst; this means that after a fixed latency from memory to cache, AltiVec reads a vector (16 bytes) in 3 cycles. For sequential access, the difference between accessing L1 cache and accessing memory is negligible, since the pre-fetch engine has already loaded the desired data into cache. Even for random access, by pipelining (partially unrolling) the code, we see a 16x memory access improvement. Please feel free to contact me and I'll send you the report. The study was conducted on a Pegasos.

I haven't done any comparison between SSE2 and AltiVec, but you can find some information on www.altivec.org. The current C compilers for AltiVec are very efficient, even ones like gcc. Some specialised compilers such as GHS are even better. We submit EEMBC benchmarks with GHS, but many customers also require us to provide gcc benchmarks for their internal "fair comparison".
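
One possible reading of "pipelining (partial unrolling)" for random access, as a hedged sketch: it is not the code from the study and makes no claim about the 16x figure. The loop is unrolled by four so that four independent loads are in flight before any result is needed, instead of one load whose latency must be waited out each iteration. The function name `gather_sum` and the unroll factor are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sums table[idx[i]] for i in [0, n), unrolled by 4.
   Assumption of this sketch: n is a multiple of 4. */
uint32_t gather_sum(const uint8_t *table, const uint32_t *idx, size_t n)
{
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;

    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        /* Four loads with no dependency on each other: the load/store
           unit can overlap their latencies. */
        uint8_t v0 = table[idx[i]];
        uint8_t v1 = table[idx[i + 1]];
        uint8_t v2 = table[idx[i + 2]];
        uint8_t v3 = table[idx[i + 3]];
        /* Separate accumulators keep the additions independent too. */
        s0 += v0;
        s1 += v1;
        s2 += v2;
        s3 += v3;
    }
    return s0 + s1 + s2 + s3;
}
```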


PostPosted: Thu Oct 14, 2004 11:39 am 
Offline
Site Admin

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1589
Location: Austin, TX
Addressing the issue of "starving" the processor: this is of course an issue on every processor model that has slow main memory.

This is why the PowerPC ISA has cache handling instructions which allow software developers to "hint" at data in main memory which should be ready to be used in code later.

These instructions (dcbt, dcbz, dcbf, dcba etc.) allow you to manually control the activity of the cache by "informing" the core that you are about to perform a load, or store, or want to make sure data is flushed to memory, at the best times in your code.
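
A minimal sketch of such a hint from C, assuming a GCC-compatible compiler: `__builtin_prefetch` is the compiler's generic prefetch builtin, and on PowerPC it would be expected to turn into a dcbt-style touch (worth confirming in the generated assembly). The 32-byte line size, the prefetch distance of four lines and the name `sum_with_hints` are assumptions for the example; the hint never changes the result, only when the data arrives in cache.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__)
#define PREFETCH_READ(p) __builtin_prefetch((p), 0)  /* 0 = for reading */
#else
#define PREFETCH_READ(p) ((void)0)                   /* no-op elsewhere */
#endif

#define LINE_FLOATS 8  /* assumes 32-byte cache lines and 4-byte floats */
#define AHEAD_LINES 4  /* hypothetical prefetch distance, tune by measuring */

/* Sums n floats, touching the cache line AHEAD_LINES ahead of the one
   currently being read. */
float sum_with_hints(const float *x, size_t n)
{
    float s = 0.0f;

    assert(x != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (i % LINE_FLOATS == 0 && i + AHEAD_LINES * LINE_FLOATS < n)
            PREFETCH_READ(&x[i + AHEAD_LINES * LINE_FLOATS]);
        s += x[i];
    }
    return s;
}
```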

(If you look at any MMX or SSE code, you will see that x86 has these same cache hinting instructions, and they are used a lot, even inserted at choice points by some compilers.)

Then the G4 has "data streams" which allow for much more fine-grained hinting of the data you need. These are part of the suite of instructions which make AltiVec so powerful.

Apple have always had some good documentation on this:

http://developer.apple.com/hardware/ve/caches.html

Basically you can tell the core to prefetch data that you need, and set itself up for storing data you have used, and have it happen independently of instruction execution. You simply "start and forget". You can have 4 of these streams at once.
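
A hedged sketch of starting one such stream from C. It assumes the control-word layout given in the AltiVec programming documentation (block size in 16-byte vectors in bits 24..28, block count in bits 16..23, signed byte stride in the low 16 bits; a size of 32 or a count of 256 is encoded as 0), so check the manual before relying on it. The helper names `dst_control` and `start_read_stream` are made up; `vec_dst` is the real AltiVec operation and is only compiled when AltiVec is enabled.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ALTIVEC__)
#include <altivec.h>
#endif

/* Builds a data-stream control word (layout assumed as described above). */
uint32_t dst_control(unsigned block_vectors, unsigned block_count,
                     int stride_bytes)
{
    assert(block_vectors >= 1 && block_vectors <= 32);
    assert(block_count >= 1 && block_count <= 256);
    assert(stride_bytes >= -32768 && stride_bytes <= 32767);
    return ((uint32_t)(block_vectors & 0x1Fu) << 24) |
           ((uint32_t)(block_count & 0xFFu) << 16) |
           ((uint32_t)stride_bytes & 0xFFFFu);
}

/* Starts prefetch stream 0 at p ("start and forget").
   Without AltiVec this is a no-op, so the caller's code stays portable. */
void start_read_stream(const float *p, unsigned block_vectors,
                       unsigned block_count, int stride_bytes)
{
    uint32_t ctl = dst_control(block_vectors, block_count, stride_bytes);
#if defined(__ALTIVEC__)
    vec_dst(p, (int)ctl, 0);  /* the stream tag must be a literal 0..3 */
#else
    (void)p;
    (void)ctl;
#endif
}
```

For example, `dst_control(2, 4, 32)` describes four blocks of two vectors (32 bytes) each, 32 bytes apart, which is 128 contiguous bytes.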

There is plenty of opportunity for good code to never be starved of data, when you can hide the high latencies of main memory behind pipelining, cache allocation, and data streams.

Neko


Last edited by Neko on Tue Oct 26, 2004 10:56 am, edited 1 time in total.

PostPosted: Thu Oct 14, 2004 11:40 am 
Offline

Joined: Mon Oct 11, 2004 12:49 am
Posts: 35
The basic idea behind AltiVec (and other so-called multimedia extensions) is the SIMD principle: Single Instruction, Multiple Data. This idea was pioneered by the old Cray vector supercomputers. With a single machine instruction, a number of identical, parallel processing units are put to work on many pieces of data all at once.

Modern SIMD differs quite significantly from the old vector computers, but the basic idea remains the same. You (as a programmer) have to organize your data as homogeneous blocks (i.e. all data items in the whole group are meant to go through exactly the same steps of processing) and then you can perform calculations on such groups with vector instructions.

The benefit of making this "data parallelism" explicit comes from the fact that the hardware will work in parallel, improving the overall throughput as measured in calculations per second.

How much parallelism you get depends on the kind of data you operate on. The more bits each data item has, the fewer items you can handle in parallel. As mentioned, the AltiVec computational units and the corresponding vector data registers are 128 bits wide. This means a single vector instruction can operate on 4, 8, or 16 data items, which are each 32, 16, or 8 bits wide.
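
The lane arithmetic can be sketched in portable C as follows. This is a model for illustration, not AltiVec code: `lanes_for` and `add_u8x16` are made-up names, and the second function mimics a modulo add on sixteen 8-bit lanes to show that each lane wraps on its own, with no carry into its neighbour.

```c
#include <assert.h>
#include <stdint.h>

#define VECTOR_BITS 128u

/* Number of data items that fit in one 128-bit vector register. */
unsigned lanes_for(unsigned item_bits)
{
    assert(item_bits == 8 || item_bits == 16 || item_bits == 32);
    return VECTOR_BITS / item_bits;
}

/* Model of a modulo add on 16 unsigned 8-bit lanes: every lane is
   computed independently and wraps around at 256. */
void add_u8x16(uint8_t out[16], const uint8_t a[16], const uint8_t b[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = (uint8_t)(a[i] + b[i]);
}
```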

Specific strengths of AltiVec include, but are not limited to:

- the most comprehensive and complete instruction set (compared to other SIMD extensions). Anything you can think of, you can write in AltiVec in a fairly straightforward manner. Even the things that seem missing can be synthesized with fewer instructions than one would think. (No, there is no double precision floating point support. But PowerPC has had a proper FPU to begin with, so there was no need to replace a stack based 'x87 FPU.)

- the most powerful instruction set. There are many instructions in AltiVec which perform two computations (on each data item) with only a single instruction (pretty much like the PowerPC's fused floating point multiply-add). A single vector instruction can read up to three distinct source operands and combine them to form a single result.

- the most powerful hardware (compared to other SIMD extensions). All vector execution units are true 128 bits wide and fully pipelined. Furthermore, there are four independent execution units (with different responsibilities), and the G4 can sustain a throughput of up to three vector instructions per clock cycle over an indefinite amount of time (this would have to be two computational instructions going to different vector units plus one vector load or store).

- easiest to program for. The C/C++ interface to AltiVec is really all you ever need to know and use. Yes, it is somewhat low level, but it is much easier to write and maintain than assembly, and can realistically get you more than 90% of the performance of perfect machine code, in a fraction of the development time. Better yet, vector code is fairly portable between compilers, because the high level interface was standardized by Motorola, not by some compiler vendor.

- clean architecture. The pipelines of the G4+ processor are actually understandable ... maybe not right away, but if you do care to know, everything is documented very well, right there in the MPC7450 User Manual.
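
To illustrate the "two computations with a single instruction" and "three source operands" points from the list above, here is a small sketch built on `vec_madd`, the AltiVec multiply-add operation. The function name `scale_and_offset`, the multiple-of-4 length requirement and the scalar fallback for non-AltiVec builds are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ALTIVEC__)
#include <altivec.h>
#endif

/* dst[i] = x[i] * scale[i] + offset[i] for i in [0, n).
   Assumptions of this sketch: n is a multiple of 4, and on the AltiVec
   path all four pointers are 16-byte aligned. */
void scale_and_offset(float *dst, const float *x, const float *scale,
                      const float *offset, size_t n)
{
    assert(n % 4 == 0);
#if defined(__ALTIVEC__)
    assert((((uintptr_t)dst | (uintptr_t)x | (uintptr_t)scale |
             (uintptr_t)offset) % 16) == 0);
    for (size_t i = 0; i < n; i += 4) {
        vector float vx = vec_ld(0, x + i);
        vector float vs = vec_ld(0, scale + i);
        vector float vo = vec_ld(0, offset + i);
        /* One instruction, three sources: 4 multiplies and 4 adds. */
        vec_st(vec_madd(vx, vs, vo), 0, dst + i);
    }
#else
    /* Scalar fallback; it may round the product before the add, so the
       last bit can differ from the fused vector result. */
    for (size_t i = 0; i < n; i++)
        dst[i] = x[i] * scale[i] + offset[i];
#endif
}
```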

I really hesitate to do a detailed comparison to the SSE(1,2,3) implementation in the Pentium 4. Even if I try to be as fair as I can, I will necessarily look like an evil SSE basher. Even with three times the core clock frequency (3.6GHz P4 vs. 1.2GHz G4+), SSE is nowhere near a clear cut victory over AltiVec.

The only serious drawback that G4+ processors have is their comparatively low memory bandwidth. Improving cache utilisation is not an easy task, but it is as possible (or impossible) for vector code as for conventional scalar code. But there is one thing you should consider: memory bandwidth would seem a lot more adequate if AltiVec did not have that much raw processing power. ;-)

I recommend that you start playing with AltiVec. Yes, there is a bit of a learning curve. But Apple has at least two nice tutorials online which should make your first steps quite comfortable on any platform. And then you will see for yourself that a speedup factor of two or three is peanuts for AltiVec, and does not take a lot of work to realize. I am not talking about three man months of expert work to gain a 40% speedup (Intel actually has a whitepaper on that, focusing on MMX ;-)).

In case you really catch the optimization bug, you might have the time and inclination to reach much higher speedups than that. My personal record, starting from modestly tuned scalar code, is a factor of 38.5 for the calculation of the cross correlation coefficient of two 8 bit signals. But if you start from bad code, then only the sky is the limit. :-D

In any case, if you want to play with AltiVec and have specific questions, feel free to ask me! Do not be afraid to ask newbie questions; everyone was a newbie once (and I still remember those times). I won't be on ppczone every minute, but I will check back regularly.


PostPosted: Thu Oct 14, 2004 11:41 am 
Offline
Genesi

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1422
Thanks to all of you for taking the time to write so carefully and thoroughly. We would encourage you to download Bill Dunnigan's presentation from SNDF Europe to understand, not only from a technical perspective but as a complement to the advantages of AltiVec, how Freescale intends to develop and market these advantages with the advances planned for future CPU and microelectronic releases. Bill Dunnigan is VP and General Manager of the Computing Platform Division, NCSG, Freescale. The title of the presentation downloadable from the Freescale website is "High-Performance PowerPC® Processors from Freescale Semiconductor Session P1302." Of special interest to us is the fact that only two companies were featured on their own slides - Tundra and Genesi (#46). :-)

By design, Freescale is ensuring pin-to-pin compatibility for each successive generation of CPU. This helps us find purpose for the CPU in hardware designed to do something better. Speed to market is the key and we are finally getting in a position to leverage that. The G4 and the PegasosPPC is 90% a terrific story and 10% not G5, not 64 bit, not XXX certified yet, not available in polka dots. It will not matter if we can fully leverage AltiVec. There are many customers where the 90% will provide 100% of their solution. We think there is a market for the low power, low cost, small footprint this package can offer and we are about to really attack it. AltiVec can open up those opportunities even further and add shelf life to successful products for years to come.

R&B :-)


PowerDeveloper.org: Copyright © 2004-2012, Genesi USA, Inc. The Power Architecture and Power.org wordmarks and the Power and Power.org logos and related marks are trademarks and service marks licensed by Power.org.
All other names and trademarks used are property of their respective owners. Privacy Policy