Quote:
1.Is the term "PPC-optimization" considered the same thing as optimizing code to use Altivec instructions?
You can optimize code in plenty of ways. Anything that makes it faster is valid.
However, in a lot of cases code is simply not written in the best way for the architecture, or for any architecture at that. There is still code in the world that does memory copies a single byte at a time. Most processors will tolerate that happily, but architectures like Power do better if you transfer 32-bit values at a time, and architectures like ARM or MIPS will throw alignment exceptions on misaligned accesses that have to be fixed up by kernel handlers, which causes huge performance losses.
AltiVec behaves in a similar way: it prefers 16-byte-aligned data, which is no big surprise; most 128-bit vector units do. A big difference, however, is that a unit like SSE on Intel will simply stall and load the data anyway if it is not aligned; you get slower code, but it works correctly.
If you throw AltiVec an unaligned address, rather than raise an exception (it was designed to be an unintrusive extension, so no exceptions), it just chops off the low 4 bits of the address and loads from THAT. Say you wanted to load part of a string:
SIERRAECHOALPHALEMURECHOYANKEE
.. loading from the address of the first E, what you would actually get in your register is the first S onwards. You then need to use the permute unit to shuffle the data around until you have the right values. Aligning your data removes the need for the permute unit, which is less desirable on a G5 than on a G4, as described below. Even so, it is almost always faster to permute the data into place yourself than to let the processor do it for you and silently slow down; it is certainly more deterministic, and it makes the optimizations in your data management easier to see. Obviously, keeping your data 16-byte aligned is faster still. Care in managing your data is the key.
Other tricks include taking advantage of the cache line size and of the general optimizations in host bridges: most of them will burst-transfer a whole cache line into RAM or out of it if you do sequential reads OR sequential writes. If you read, write, read, write, there is nothing for them to optimize. Batching this sort of thing, and related tricks like unrolling loops, works wonders.
There is also data streaming (technically part of AltiVec, but not dependent on it), which allows fine-grained control over cache management in large data sets: by 'touching' data in advance you can ensure it is in the cache before you even reach the code that accesses it, reducing stalls. On processors without it there is still cache management functionality like dcbz, which will zero a cache line (32 bytes on a G4) in one cycle (I think). This is an easy optimization for something like 'bzero', or memset with a 0 argument.
So anything that 'aligns' code to meet architecture behaviour is good. Power runs generic code very well, but when you play to its design, it excels.
Then you can do things like manually scheduling code: Freescale and IBM both have Power processor simulators (SimG4+ for the G4, "Mambo" for the G5 and Cell) so you can watch your code run and see whether there are any pipeline bubbles. Clever reorganisation of certain tasks can ensure the processor is never sitting idle waiting for something else to happen.
Quote:
2.How does this relate to VMX optimizations (or VMX128 for that matter)? Would the work being done to optimize for Altivec instructions be easily carried over to these instruction sets (if it isn't directly compatible already) considering both the Cell and the Power ISA v.2.03 specification includes VMX?
VMX128 is essentially AltiVec with the register file extended to 128 vector registers (a few instructions were dropped to make room for the larger register fields in the encoding). However, instruction scheduling is a little different on the G4 and G5, due to the different numbers of units handling different kinds of instruction. The G5 also has higher latencies and more restrictions on how many permute instructions it can pipeline. The code scheduling tools mentioned above mean you can look for these cases and simply code around them.
Cell also lacks out-of-order execution, so the comments about pipeline bubbles and using an architecture simulator are very relevant on that chip. Standard Power processors will happily reorder non-dependent instructions in the pipeline to keep the units busy; Cell will just stop and wait.
Quote:
3.I'm not quite sure what you meant in the last part of your post (quoted above) regarding optimizing the Linux kernel for Altivec. From what I read, it sounds to me like you said doing these optimizations aren't worth it... so I must be missing something there.
The Linux kernel guys say it isn't worth it; Linux handles AltiVec in a rather strange way. When a process starts, AltiVec is initially disabled for it. The first AltiVec instruction it runs causes a processor exception, at which point Linux enables the unit, restarts that instruction, and from then on saves the AltiVec registers at every context switch, and so on.
Since AltiVec state is only saved, restored, and properly managed in userland, that is the best place for optimization. Some AltiVec kernel code does exist, however: some RAID6 code, for example, and there is TCP/IP checksumming code around. As I mentioned before, such code has to disable kernel preemption, which can be considered a performance hit, and in the end a lot of the functions that CAN be optimized (memory copies etc.) also get called on tiny and/or unaligned amounts of data, which AltiVec does not handle well.
I see I already answered the last part
Quote:
I would have to wonder why running software on an architecture that it isn't optimized for is worth it if that were the case. Is this because most of the instructions used in the kernel just wouldn't benefit from Altivec, so there is no need?
I think there are some great benefits to optimization in the kernel; systems like FreeBSD and NetBSD (and MacOS X and QNX Neutrino..) enable AltiVec in kernel space, so they are prime candidates for it, in my opinion. Linux is a little trickier and will take a lot of work. The first step is to profile the kernel while it does a task you think is slow, and see if anything shows up: RAID6, TCP/IP checksums, any kind of encryption that is vectorizable. However, the easy candidates (as above, again) are not suitable in the long run.