Power Developer
https://powerdeveloper.org/forums/

Altivec Questions
https://powerdeveloper.org/forums/viewtopic.php?f=23&t=839

Author:  joshuapurcell [ Thu Oct 05, 2006 10:29 pm ]
Post subject:  Altivec Questions

@Neko, thanks for the great response to my questions in the OSW Uncovered thread (located here). I'm trying to understand how each of these projects (Glibc optimization and Freevec) are related, and your post went a long way in helping with that. I figured I'd make a new post in a more appropriate location, although we may need a "Beginners" Altivec section of the forum to be more appropriate :D.
Quote:
Unfortunately most of the interesting optimizations may be CPU-specific (for instance be really good on a G4 but terrible on a G5 due to implementation differences) or host-bridge specific... Those that aren't, probably want to use AltiVec which isn't truly possible in the Linux kernel without losing your performance gains to the context switch model and even then only after disabling kernel preemption which effectively monopolises the kernel to that task..
Sorry about these questions... what you said above brought up more things I'm unsure about.

1. Is the term "PPC-optimization" considered the same thing as optimizing code to use Altivec instructions?

2. How does this relate to VMX optimizations (or VMX128 for that matter)? Would the work being done to optimize for Altivec instructions be easily carried over to these instruction sets (if it isn't directly compatible already) considering both the Cell and the Power ISA v.2.03 specification include VMX?

3. I'm not quite sure what you meant in the last part of your post (quoted above) regarding optimizing the Linux kernel for Altivec. From what I read, it sounds to me like you said doing these optimizations isn't worth it... so I must be missing something there. I would have to wonder why running software on an architecture that it isn't optimized for is worth it if that were the case. Is this because most of the instructions used in the kernel just wouldn't benefit from Altivec, so there is no need?

Author:  Neko [ Fri Oct 06, 2006 12:33 pm ]
Post subject:  Re: Altivec Questions

Quote:
1. Is the term "PPC-optimization" considered the same thing as optimizing code to use Altivec instructions?
You can optimize code plenty of ways. Anything that makes it faster is valid :)

However, in a lot of cases code is simply not written in the best way for the architecture, or for any architecture for that matter. There is still code in the world that does memory copies one byte at a time. Most processors will support this happily, but architectures like Power do better if you transfer 32-bit values at a time. Word accesses on unaligned addresses have their own cost: architectures like ARM or MIPS will throw alignment exceptions, which are fixed up by kernel handlers, and that causes huge performance losses.
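To make the byte-versus-word point concrete, here is a minimal sketch in portable C (not code from glibc or from anyone in this thread). The function names `copy_bytes` and `copy_words` are made up for illustration. The word version takes byte steps until the destination is 4-byte aligned, then moves 32 bits per iteration, then finishes the tail in bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Naive copy: one load and one store per byte. */
void copy_bytes(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Word copy: byte head, 32-bit body, byte tail.
 * Illustrative only; a tuned memcpy does far more than this. */
void copy_words(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t i = 0;

    assert(n == 0 || (dst != NULL && src != NULL));

    /* Head: advance until the destination is 4-byte aligned. */
    while (i < n && ((uintptr_t)(dst + i) & 3u) != 0) {
        dst[i] = src[i];
        i++;
    }

    /* Body: one 32-bit transfer per iteration. A fixed-size memcpy is the
     * portable way to express a word access that may be unaligned on the
     * source side; compilers usually lower it to a plain load and store
     * where the target allows that. */
    for (; i + 4 <= n; i += 4) {
        uint32_t w;
        memcpy(&w, src + i, sizeof w);
        memcpy(dst + i, &w, sizeof w);
    }

    /* Tail: whatever is left, fewer than 4 bytes. */
    for (; i < n; i++)
        dst[i] = src[i];
}
```

Both functions produce identical results; only the number of memory operations differs. Whether the word version is actually faster on a given CPU is something to measure rather than assume.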

AltiVec operates in a similar way; it prefers 16-byte aligned data, which is no big surprise, as most 128-bit vector units do. However, a big difference is that units like SSE on Intel, given the unaligned load instructions, will simply stall and load data if it is not aligned; you get slower code but it works perfectly.

If you throw AltiVec an unaligned address, rather than throw an exception (it was designed to be an unintrusive extension, so no exceptions) it just chops off the low 4 bits of the address and loads from THAT. If you wanted to load in part of a string:

SIERRAECHOALPHALEMURECHOYANKEE

If you load from the address of the first E, what you actually get in your register is the first S and onwards. Then you need to use the permute unit to mix the data around until you have the right values. Aligning your data removes the need to use the permute unit, and using it is less desirable on a G5 than on a G4, as described below. However, it is almost always faster to permute data the way you want (and certainly more deterministic, and easier to see optimizations in your data management) than to let the processor do it for you and have it silently slow down. Obviously keeping your data 16-byte aligned is even faster. Care in managing your data is the key :D
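The load behaviour described above can be modelled in plain C without an AltiVec machine. This is an emulation sketch, not real vector code: `vec16`, `load_aligned_down`, `load_unaligned` and `demo` are hypothetical names, and the byte loop stands in for what real code would do with `vec_ld`, `vec_lvsl` and `vec_perm`. It assumes the string starts on a 16-byte boundary.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for a 128-bit vector register. */
typedef struct { unsigned char b[16]; } vec16;

/* The string from the post, padded out to two full 16-byte blocks. */
static _Alignas(16) const unsigned char demo[32] =
    "SIERRAECHOALPHALEMURECHOYANKEE";

/* Models the addressing of an AltiVec vector load: the low 4 bits of the
 * address are ignored, so the load always starts on a 16-byte boundary. */
vec16 load_aligned_down(const unsigned char *p)
{
    vec16 v;
    const unsigned char *base =
        (const unsigned char *)((uintptr_t)p & ~(uintptr_t)15);
    memcpy(v.b, base, 16);
    return v;
}

/* Models the usual realignment idiom: two aligned loads, then a permute
 * that selects 16 consecutive bytes starting at the misalignment offset.
 * Loading at p + 15 means an already aligned p never touches a second
 * block. */
vec16 load_unaligned(const unsigned char *p)
{
    assert(p != NULL);

    unsigned shift = (unsigned)((uintptr_t)p & 15);
    vec16 lo = load_aligned_down(p);
    vec16 hi = load_aligned_down(p + 15);
    vec16 out;

    for (unsigned i = 0; i < 16; i++) {
        unsigned k = shift + i;
        out.b[i] = (k < 16) ? lo.b[k] : hi.b[k - 16];
    }
    return out;
}
```

With this model, `load_aligned_down(demo + 2)` returns the 16 bytes beginning at the first S, while `load_unaligned(demo + 2)` returns the 16 bytes beginning at the first E, at the cost of a second load and the permute step.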

Other tricks are taking advantage of cache line size and the general optimizations in host bridges; most of them will burst transfer a cache line into RAM or out of it if you do sequential reads OR sequential writes. If you alternate read, write, read, write, there is nothing for the host bridge to optimize. Batching this sort of thing, and related tasks like unrolling loops, works wonders.
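A rough sketch of what batching and unrolling can look like in C follows. The function name `add_const_batched` and the constant `LINE_WORDS` are made up for illustration, and the 32-byte line size is an assumption that matches a G4 but not every CPU. Each iteration reads a whole line's worth of words, then writes a whole line's worth, instead of alternating. A compiler is free to reorder these accesses, so treat this as a statement of intent and check the generated code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 8 x 32-bit words = 32 bytes; ASSUMED cache line size, adjust per CPU. */
enum { LINE_WORDS = 8 };

/* dst[i] = src[i] + k, processed one assumed cache line at a time. */
void add_const_batched(uint32_t *dst, const uint32_t *src, size_t n, uint32_t k)
{
    size_t i = 0;

    assert(n == 0 || (dst != NULL && src != NULL));

    for (; i + LINE_WORDS <= n; i += LINE_WORDS) {
        /* Read phase: eight sequential loads. */
        uint32_t w0 = src[i + 0], w1 = src[i + 1];
        uint32_t w2 = src[i + 2], w3 = src[i + 3];
        uint32_t w4 = src[i + 4], w5 = src[i + 5];
        uint32_t w6 = src[i + 6], w7 = src[i + 7];

        /* Write phase: eight sequential stores. */
        dst[i + 0] = w0 + k;  dst[i + 1] = w1 + k;
        dst[i + 2] = w2 + k;  dst[i + 3] = w3 + k;
        dst[i + 4] = w4 + k;  dst[i + 5] = w5 + k;
        dst[i + 6] = w6 + k;  dst[i + 7] = w7 + k;
    }

    /* Remainder: fewer than LINE_WORDS elements left. */
    for (; i < n; i++)
        dst[i] = src[i] + k;
}
```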

There is also data streaming (technically part of AltiVec, but not dependent on it), which allows fine-grained control over cache management in large data sets: by 'touching' data in advance you can ensure it is in the cache before you even get to the code that accesses it, reducing stalls. In processors without it there is still cache management functionality like dcbz, which will zero a cache line (32 bytes on a G4, 128 bytes on a G5) in one cycle (I think). This is an easy optimization for something like 'bzero', or memset with a 0 argument.
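The idea of touching data ahead of use can be sketched with the GCC/Clang builtin `__builtin_prefetch`, which on PowerPC is typically compiled to a cache touch instruction. This is not the AltiVec data stream interface itself, just the same idea in portable form. The function name `sum_with_prefetch` and the distance `PREFETCH_AHEAD` are hypothetical, and the distance is a guess that would need tuning on real hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* How many elements to run ahead of the current position.
 * ASSUMPTION: a placeholder value, to be tuned by measurement. */
enum { PREFETCH_AHEAD = 64 };

/* Sums an array while hinting upcoming data into the cache. */
uint64_t sum_with_prefetch(const uint32_t *data, size_t n)
{
    uint64_t total = 0;

    assert(n == 0 || data != NULL);

    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__)
        /* One hint per 8 elements (32 bytes), not one per element.
         * Arguments: address, 0 = read access, 0 = low temporal locality. */
        if ((i & 7) == 0 && i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0, 0);
#endif
        total += data[i];
    }
    return total;
}
```

The prefetch is only a hint, so the result is the same with or without it; the only thing that can change is how often the loop stalls waiting on memory.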

So anything that 'aligns' code to meet the architecture's behaviour is good. Power runs generic code very well, but when you play to its design, it excels.

Then you can do things like manually schedule code; Freescale and IBM both have Power processor simulators (SimG4+ for G4 and "Mambo" for G5 and Cell) so you can watch code and see if there are any pipeline bubbles. Clever reorganisation of certain tasks can make sure that the processor is never idle waiting on something else to happen.
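One common form of manual scheduling can be shown in C: splitting a single dependency chain into several independent ones so that the processor has other work to issue while one result is still in flight. This is a generic sketch, not something taken from the simulators mentioned above. The names `dot_simple` and `dot_unrolled` are made up, and the choice of four accumulators is an assumption rather than a tuned value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Single accumulator: every add depends on the previous one. */
int64_t dot_simple(const int32_t *a, const int32_t *b, size_t n)
{
    int64_t s = 0;

    assert(n == 0 || (a != NULL && b != NULL));
    for (size_t i = 0; i < n; i++)
        s += (int64_t)a[i] * b[i];
    return s;
}

/* Four accumulators: four independent chains that can overlap in the
 * pipeline. Integer arithmetic, so the result equals dot_simple as long
 * as nothing overflows. */
int64_t dot_unrolled(const int32_t *a, const int32_t *b, size_t n)
{
    int64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    assert(n == 0 || (a != NULL && b != NULL));

    for (; i + 4 <= n; i += 4) {
        s0 += (int64_t)a[i + 0] * b[i + 0];
        s1 += (int64_t)a[i + 1] * b[i + 1];
        s2 += (int64_t)a[i + 2] * b[i + 2];
        s3 += (int64_t)a[i + 3] * b[i + 3];
    }

    /* Remainder: fewer than 4 elements left. */
    for (; i < n; i++)
        s0 += (int64_t)a[i] * b[i];

    return (s0 + s1) + (s2 + s3);
}
```

On an out-of-order core the hardware may find some of this overlap by itself; on an in-order core the layout of the source matters more.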
Quote:
2. How does this relate to VMX optimizations (or VMX128 for that matter)? Would the work being done to optimize for Altivec instructions be easily carried over to these instruction sets (if it isn't directly compatible already) considering both the Cell and the Power ISA v.2.03 specification include VMX?
VMX is exactly the same as AltiVec; it is IBM's name for it. (VMX128, strictly speaking, is the variant in the Xbox 360 CPU with 128 vector registers and some instruction changes.) However, the instruction scheduling is a little different on G4 and G5, due to the different numbers of units for handling different things. G5 also has greater latencies and more restrictions on how many permute instructions it can pipeline. The code scheduling tools mentioned above mean you can look for these cases and simply code around them.

Cell also lacks out-of-order instruction scheduling, so the comments about pipeline bubbles and using the architecture simulator are very relevant on those chips. Standard Power processors will happily reorder non-dependent instructions in the pipeline to keep the units running. Cell will just stop and wait.
Quote:
3. I'm not quite sure what you meant in the last part of your post (quoted above) regarding optimizing the Linux kernel for Altivec. From what I read, it sounds to me like you said doing these optimizations isn't worth it... so I must be missing something there.
The Linux kernel guys say it isn't worth it; Linux handles AltiVec in a very strange way. When you run a process, it initially has AltiVec disabled. The first AltiVec instruction causes a processor exception, at which point Linux enables AltiVec for that process, runs the instruction again and continues. From then on it saves the AltiVec registers when it does a context switch, and so on.
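The lazy-enable scheme described above can be modelled as a toy state machine. This is a simplified illustration, not Linux kernel code: every name here (`task`, `vec_state`, `on_vector_unavailable`, `run_vector_instruction`, `context_switch_out`) is hypothetical, and the real kernel tracks considerably more state.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* 32 vector registers of 128 bits each. */
typedef struct { unsigned char vr[32][16]; } vec_state;

typedef struct {
    bool      vec_enabled;  /* has this task ever used the vector unit? */
    vec_state saved;        /* vector registers saved at context switch */
    unsigned  vec_saves;    /* number of vector saves, for illustration */
} task;

/* Exception handler: the first vector instruction in a task lands here.
 * The unit is enabled and the instruction is then re-executed. */
void on_vector_unavailable(task *t)
{
    t->vec_enabled = true;
}

/* Returns true if the instruction had to take the exception first. */
bool run_vector_instruction(task *t)
{
    assert(t != NULL);
    if (!t->vec_enabled) {
        on_vector_unavailable(t);
        return true;
    }
    return false;
}

/* Integer state is always saved (not modelled here). Vector state is
 * saved only for tasks that have actually used the vector unit. */
void context_switch_out(task *t, const vec_state *live)
{
    assert(t != NULL && live != NULL);
    if (t->vec_enabled) {
        memcpy(&t->saved, live, sizeof t->saved);
        t->vec_saves++;
    }
}
```

The point of the scheme is visible in `context_switch_out`: a task that never touches AltiVec never pays for saving 512 bytes of vector registers.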

Since AltiVec is not saved or restored or even managed properly except in userland, this is the best place for optimization. However, some AltiVec kernel code does exist: some RAID6 code for example, and there is TCP/IP checksumming code around. They do, as I mentioned before, disable kernel preemption, which can be considered a performance hit, and in the end a lot of the functions that CAN be optimized (memory copies etc.) also get used for tiny and/or unaligned amounts of data, which AltiVec does not handle well.
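The trade-off in that last sentence usually turns into a dispatch: take the vector path only when the buffer is big enough and aligned, and fall back to scalar code otherwise. The sketch below is portable C with stubs, not kernel code. `vec_section_begin` and `vec_section_end` are hypothetical stand-ins for what a kernel of that era would do around vector code (roughly `preempt_disable()` plus `enable_kernel_altivec()`, then `preempt_enable()`), `VEC_MIN_BYTES` is a made-up threshold, and the "vector" loop is a scalar placeholder.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: hypothetical break-even size; find the real one by profiling. */
enum { VEC_MIN_BYTES = 256 };

/* Counts how often the vector section was entered, for illustration. */
static unsigned vec_sections;

/* Stub for: disable preemption and enable the vector unit. */
static void vec_section_begin(void) { vec_sections++; }

/* Stub for: re-enable preemption. */
static void vec_section_end(void) { }

static void xor_scalar(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] ^= src[i];
}

/* Placeholder for a real vector loop; n must be a multiple of 16. */
static void xor_blocks16(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert((n & 15) == 0);
    for (size_t i = 0; i < n; i += 16)
        for (size_t j = 0; j < 16; j++)
            dst[i + j] ^= src[i + j];
}

/* dst ^= src, choosing a path by size and alignment. */
void xor_buffer(unsigned char *dst, const unsigned char *src, size_t n)
{
    bool aligned = (((uintptr_t)dst | (uintptr_t)src) & 15) == 0;

    /* Tiny or unaligned: the setup cost is not worth paying. */
    if (n < VEC_MIN_BYTES || !aligned) {
        xor_scalar(dst, src, n);
        return;
    }

    size_t body = n & ~(size_t)15;  /* whole 16-byte blocks */

    vec_section_begin();
    xor_blocks16(dst, src, body);
    vec_section_end();

    xor_scalar(dst + body, src + body, n - body);  /* leftover bytes */
}
```

If most callers pass small or unaligned buffers, nearly every call takes the scalar branch and the vector path buys nothing, which is the concern raised above.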

I see I already answered the last part :)
Quote:
I would have to wonder why running software on an architecture that it isn't optimized for is worth it if that were the case. Is this because most of the instructions used in the kernel just wouldn't benefit from Altivec, so there is no need?
I think there are some great benefits to optimization in the kernel; systems like FreeBSD and NetBSD (and MacOS X and QNX Neutrino..) enable it in kernel space so they are prime candidates for it in my opinion. Linux is a little trickier and will take a lot of work. The first step is to profile the kernel doing a task you think is slow.. and see if there is anything there. RAID6, TCP/IP checksums, any kind of encryption which is vectorizable. However the easy candidates (as above, again) are not suitable in the long run.

Author:  lu_zero [ Fri Oct 06, 2006 7:52 pm ]
Post subject:  Re: Altivec Questions

Quote:
@Neko, thanks for the great response to my questions in the OSW Uncovered thread (located here). I'm trying to understand how each of these projects (Glibc optimization and Freevec) are related, and your post went a long way in helping with that. I figured I'd make a new post in a more appropriate location, although we may need a "Beginners" Altivec section of the forum to be more appropriate :D.
Quote:
Unfortunately most of the interesting optimizations may be CPU-specific (for instance be really good on a G4 but terrible on a G5 due to implementation differences) or host-bridge specific... Those that aren't, probably want to use AltiVec which isn't truly possible in the Linux kernel without losing your performance gains to the context switch model and even then only after disabling kernel preemption which effectively monopolises the kernel to that task..
Sorry about these questions... what you said above brought up more things I'm unsure about.
I'll try to explain a bit.
Quote:
1. Is the term "PPC-optimization" considered the same thing as optimizing code to use Altivec instructions?
Not only that, I think: the PPC has plenty of features you can and should use before you even start to think about AltiVec...
E.g. the memcpy optimizations you can find in glibc with the ppc addon enabled (Gentoo ships it out of the box) are just pure asm with well-tuned scheduling, tailored on a per-CPU basis (every instruction may have a different cost and may only be executed together with certain other ones). In fact they are available for the G5/POWER4/POWER5 but not for the G4, since it is a different beast.
Quote:
2. How does this relate to VMX optimizations (or VMX128 for that matter)? Would the work being done to optimize for Altivec instructions be easily carried over to these instruction sets (if it isn't directly compatible already) considering both the Cell and the Power ISA v.2.03 specification include VMX?
VMX == altivec....
Quote:
3. I'm not quite sure what you meant in the last part of your post (quoted above) regarding optimizing the Linux kernel for Altivec. From what I read, it sounds to me like you said doing these optimizations isn't worth it... so I must be missing something there. I would have to wonder why running software on an architecture that it isn't optimized for is worth it if that were the case. Is this because most of the instructions used in the kernel just wouldn't benefit from Altivec, so there is no need?
In short: you have to spend N to save and restore the integer registers and some special-purpose ones. If you enable the FPU and/or AltiVec in the kernel, you have to spend more than twice that on every user/kernel and kernel/kernel transition. To make it worthwhile you would need at least 60% of the code to use floats/AltiVec, and that isn't possible.

Enabling it just in certain local areas is possible with some tricks, at the cost of duplicating lots of code (you have to manually save and enable AltiVec, do your stuff, then manually disable and restore).

Shorter: do you really need it that bad?

Author:  bbrv [ Mon Oct 09, 2006 3:45 am ]
Post subject: 

A big thank you to Matt and Luca. Those are great answers.

And, congratulations! You just won another t-shirt. 8)

You are a POWER POSTER! :D

R&B :)

Author:  bbrv [ Thu Oct 12, 2006 8:31 am ]
Post subject: 

Removing "sticky" status, but still IMPORTANT!!! 8)

All times are UTC-06:00