The basic idea behind AltiVec (and other so-called multimedia extensions) is the SIMD principle: Single Instruction, Multiple Data. This idea was pioneered by the old Cray vector supercomputers. With a single machine instruction, a number of identical, parallel processing units are put to work on many pieces of data all at once.
Modern SIMD differs quite significantly from the old vector computers, but the basic idea remains the same. You (as a programmer) have to organize your data into homogeneous blocks (i.e. all data items in a group go through exactly the same processing steps), and then you can perform calculations on such groups with vector instructions.
The benefit of making this "data parallelism" explicit comes from the fact that the hardware will work in parallel, improving the overall throughput as measured in calculations per second.
How much parallelism you get depends on the kind of data you operate on. The more bits each data item has, the fewer items you can handle in parallel. As mentioned, the AltiVec computational units and the corresponding vector data registers are 128 bits wide. This means a single vector instruction can operate on 4, 8, or 16 data items that are 32, 16, or 8 bits wide, respectively.
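The arithmetic behind those numbers is just the register width divided by the element width. Here is a minimal sketch in plain C; the names VECTOR_BITS and lanes_per_vector are made up for illustration and are not part of the AltiVec interface.

```c
#include <assert.h>
#include <stddef.h>

/* Width of one AltiVec register in bits, as stated in the text. */
enum { VECTOR_BITS = 128 };

/* Hypothetical helper: number of data items (lanes) that fit into one
   vector register for a given element width in bits. */
size_t lanes_per_vector(size_t element_bits)
{
    /* Only element widths that divide the register evenly make sense. */
    assert(element_bits > 0 && VECTOR_BITS % element_bits == 0);
    return VECTOR_BITS / element_bits;
}
```

Calling lanes_per_vector with 32, 16, and 8 gives the three lane counts mentioned above.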
Specific strengths of AltiVec include, but are not limited to:
- the most comprehensive and complete instruction set (compared to other SIMD extensions). Anything you can think of, you can write in AltiVec in a fairly straightforward manner. Even the things that seem missing can be synthesized with fewer instructions than one would think. (No, there is no double precision floating point support. But PowerPC has had a proper FPU to begin with, so there was no need to replace a stack based 'x87 FPU.)
- the most powerful instruction set. There are many instructions in AltiVec which perform two computations (on each data item) with only a single instruction (pretty much like the PowerPC's fused floating point multiply-add). A single vector instruction can read up to three distinct source operands and combine them to form a single result.
- the most powerful hardware (compared to other SIMD extensions). All vector execution units are a true 128 bits wide and fully pipelined. Furthermore, there are four independent execution units (with different responsibilities), and the G4 can sustain a throughput of up to three vector instructions per clock cycle over an indefinite amount of time (this would have to be two computational instructions going to different vector units plus one vector load or store).
- easiest to program for. The C/C++ interface to AltiVec is really all you ever need to know and use. Yes, it is somewhat low level, but it is much easier to write and maintain than assembly, and can realistically get you to better than 90% of the performance of perfect machine code, in a fraction of the development time. Better yet, vector code is fairly portable between compilers, because the high level interface was standardized by Motorola, not by some compiler vendor.
- clean architecture. The pipelines of the G4+ processor are actually understandable ... maybe not right away, but if you do care to know, everything is documented very well, right there in the MPC7450 User Manual.
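To give a taste of the C interface and of the three-operand instructions mentioned in the list, here is a small sketch built around vec_madd, the vector fused multiply-add (a * b + c on four floats at once). The function name madd_arrays is hypothetical. I am assuming a compiler that defines __ALTIVEC__ when AltiVec is enabled (GCC does this with -maltivec), and that all four arrays are 16-byte aligned on the vector path, because vec_ld and vec_st ignore the low four address bits. Without AltiVec, the plain scalar loop does all the work, so the sketch still compiles elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#ifdef __ALTIVEC__
#include <altivec.h>
#endif

/* Hypothetical helper: out[i] = a[i] * b[i] + c[i] for n floats.
   Assumption: on the AltiVec path all arrays are 16-byte aligned. */
void madd_arrays(const float *a, const float *b, const float *c,
                 float *out, size_t n)
{
    size_t i = 0;
    assert(n == 0 || (a && b && c && out));
#ifdef __ALTIVEC__
    /* Four floats per iteration: three source operands, one result. */
    for (; i + 4 <= n; i += 4) {
        vector float va = vec_ld(0, a + i);
        vector float vb = vec_ld(0, b + i);
        vector float vc = vec_ld(0, c + i);
        vec_st(vec_madd(va, vb, vc), 0, out + i);
    }
#endif
    /* Scalar loop: handles the leftover elements, or the whole array
       when AltiVec is not available. */
    for (; i < n; i++)
        out[i] = a[i] * b[i] + c[i];
}
```

The scalar tail is a deliberate choice: real arrays are rarely an exact multiple of four elements long, and this keeps the vector loop free of special cases.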
I really hesitate to do a detailed comparison to the SSE(1,2,3) implementation in the Pentium 4. Even if I try to be as fair as I can, I will necessarily look like an evil SSE basher. Even with three times the core clock frequency (3.6GHz P4 vs. 1.2GHz G4+), SSE is nowhere near a clear-cut victory over AltiVec.
The only serious drawback that G4+ processors have is their comparatively low memory bandwidth. Improving cache utilisation is not an easy task, but it is as possible (or impossible) for vector code as for conventional scalar code. But there is one thing you should consider: memory bandwidth would seem a lot more adequate if AltiVec did not have that much raw processing power.
I recommend that you start playing with AltiVec. Yes, there is a bit of a learning curve. But Apple has at least two nice tutorials online which should make your first steps quite comfortable on any platform. And then you will see for yourself that a speedup factor of two or three is peanuts for AltiVec, and does not take a lot of work to realize. I am not talking about three man months of expert work to gain a 40% speedup (Intel actually has a whitepaper on that, focusing on MMX;-)).
In case you really catch the optimization bug, you might have the time and inclination to reach much higher speedups than that. My personal record, starting from modestly tuned scalar code, is a factor of 38.5 for the calculation of the cross correlation coefficient of two 8 bit signals. But if you start from bad code, then only the sky is the limit.
In any case, if you want to play with AltiVec and have specific questions, feel free to ask me! Do not be afraid to ask newbie questions; everyone was a newbie once (and I still remember those times). I won't be on ppczone every minute, but I will check back regularly.