All times are UTC-06:00




 Post subject:
PostPosted: Fri Dec 28, 2007 11:35 am 
Offline
Genesi

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1443
Gunnar, we are going to see if we can ship you an 8610 v2.0 board instead of the 8641D v1.0 we were going to send to you.

R&B :)

_________________
http://bbrv.blogspot.com


Top
   
 Post subject:
PostPosted: Fri Dec 28, 2007 3:12 pm 
Offline

Joined: Thu Feb 16, 2006 8:10 pm
Posts: 98
Quote:
Gunnar, we are going to see if we can ship you an 8610 v2.0 board instead of the 8641D v1.0 we were going to send to you.

R&B :)
I don't really know what to make of the 8610; it doesn't seem as well specced as the 8641D for the average user, project, or product.

There was a code patch for it just the other day, so progress is being made, but again it all seems really raw right now.

Freescale MPC8610 HPCD ALSA SoC fabric driver
http://patchwork.ozlabs.org/linuxppc/pa ... e&id=15777

http://patchwork.ozlabs.org/linuxppc/li ... ilter=none

http://www.nabble.com/-PATCH--ASoC-driv ... #a14430593
"...
> Do you ever anticipate having other dma users on the system, such as
> memcpy offload? You'll probably need allocation support for channels
> when that day comes (I ended up writing a simple library for dma channel
> management for that very reason on my platform).

Yes, I plan on updating this driver to work with the standard Freescale "Elo"
device driver, but that will have to wait until that code is in the kernel and
stabilized. The SSI is limited in which DMA channels it can use, anyway.
...
"


Top
   
 Post subject:
PostPosted: Sat Dec 29, 2007 5:45 am 
Offline
Site Admin

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1594
Location: Austin, TX
Quote:
I don't really know what to make of the 8610; it doesn't seem as well specced as the 8641D for the average user, project, or product.
The difference is that the 8641D is less highly integrated in some regards but more so in others. If you just want simple graphics and built-in audio (AC97 or I2S), then you get both on-chip with the 8610.

There are no serial controllers or graphics in the 8641D, so you need PCI Express (which is a power hog) and separate audio controllers, which is far more expensive.

High-bandwidth communications is the key market for the 8641D, whereas credible multimedia performance is the key market for the 8610 (imaging, audio, kiosks...), but networking is something you need to add separately. This is good; after all, if you want to do networking apps on wireless, why have 3 gigabit ethernet ports in the chip wasting die space and adding cost? :)

It should be useful for just as much as the Efika is, and hopefully the board could be just as small.

_________________
Matt Sealey


Top
   
 Post subject:
PostPosted: Mon Dec 31, 2007 5:03 am 
Offline

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
Quote:
Gunnar, we are going to see if we can ship you an 8610 v2.0 board instead of the 8641D v1.0 we were going to send to you.

R&B :)
WOW! Cool! Great!

I believe very much in the future of the 8610.

The 8610 has enough CPU power for every task I can think of.
It has much improved memory throughput compared to earlier G4-Apple or Pegasos systems.

I think that this new G4 is truly perfect to build a great desktop system, a laptop, or a very powerful thin client around.

With the improved memory throughput of the 86xx-G4 generation, I think that such a system will rock. I see the combination of GFX and CPU core in one chip as a big improvement both for cost-effectiveness and for performance.

While a system with a separate GFX card has certain performance advantages, the connection to the GFX card is often a bottleneck. A combined system does not have this bottleneck and in fact often requires less copying of memory.
I think for many tasks the 8610 will have much superior performance to what, for example, Apple's iBooks/PowerBooks had.

Cheers
Gunnar


Top
   
 Post subject:
PostPosted: Mon Dec 31, 2007 7:14 am 
Offline

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348
Quote:
With the improved memory throughput of the 86xx-G4 generation, I think that such a system will rock. I see the combination of GFX and CPU core in one chip as a big improvement both for cost-effectiveness and for performance.
I completely agree with Gunnar; this system will be a very nice addition for PowerPC developers everywhere. Not everyone needs a Power6 workhorse, and the PS3, though nice, is quite limited in expansion, and 64-bit is not necessary anyway.

Out of curiosity, do we know what kind of gfx subsystem the 8610 uses? E.g. does it include 3D support, and if so, is such a driver available for Linux (and/or other systems as well, if necessary)? And is it open source? (please, please let it be open source; closed-source drivers are such a waste of time for everyone :-)

Konstantinos


Top
   
 Post subject:
PostPosted: Wed Jan 02, 2008 4:11 am 
Offline
Site Admin

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1594
Location: Austin, TX
Quote:
Out of curiosity, do we know what kind of gfx subsystem the 8610 uses? E.g. does it include 3D support, and if so, is such a driver available for Linux (and/or other systems as well, if necessary)? And is it open source? (please, please let it be open source; closed-source drivers are such a waste of time for everyone :-)
It's the same display unit as in the MPC5121E but with slightly higher specs (probably down to the bus bandwidth available). It's capable of 3 alpha-blended planes and is 2D only; there is no acceleration. If you ditch the alpha-blended planes you can do higher resolutions (up to 1280 horizontal rather than 1024 horizontal).

The intention is that you would use AltiVec - I would expect that it supports a 16-bit packed pixel mode to match the AltiVec intrinsic type, as well as full RGBA.

If you are just using a Linux console (for a server) or doing photo viewing, kiosk work (even with complex animation or DVD video) or something like an OLPC (where the display is relatively simple and being able to power manage is more important than having your windows rubberized) then this is absolutely suitable.

No 3D. There is a Linux framebuffer driver floating around for the MPC8610 DIU and MPC5121E DIU somewhere. If you needed 3D then the chip has PCI Express so you can add it as an option..

_________________
Matt Sealey


Top
   
 Post subject:
PostPosted: Thu Jan 03, 2008 6:39 am 
Offline
Site Admin

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1594
Location: Austin, TX
Gunnar :)

Going back to the memcpy improvements, I guess there are 3 routines needed to complete the full gamut of processors.

1) an integer-only highly optimized memcpy for inside the Linux kernel where floating point and AltiVec are not allowed (or would incur penalties for state management) - and for the low end embeddeds (400 series)

2) a floating point version for the non-AltiVec league of processors (44x, e300, G3)

3) AltiVec optimized (by the way, do datastreams give better performance than the standard dcb* ops?)

Obviously the last two could be manipulated inside glibc by the ports method at runtime to allow the relevant CPU to find the right routine where performance can be further gained, depending on chip features? Is that right?
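
A minimal sketch of what that runtime selection could look like, assuming a hypothetical feature bitmask and three stand-in routines that simply forward to the standard memcpy. None of these names come from glibc; a real implementation would read the CPU features from the kernel rather than take them as an argument.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical feature bits; a real implementation would obtain them
   from the kernel instead of defining them here. */
enum cpu_features {
    CPU_HAS_FPU     = 1u << 0,
    CPU_HAS_ALTIVEC = 1u << 1
};

typedef void *(*memcpy_fn)(void *dst, const void *src, size_t n);

/* Stand-ins for the three tuned routines; each forwards to memcpy so
   the sketch stays portable. */
void *memcpy_integer(void *dst, const void *src, size_t n) { return memcpy(dst, src, n); }
void *memcpy_fpu(void *dst, const void *src, size_t n)     { return memcpy(dst, src, n); }
void *memcpy_altivec(void *dst, const void *src, size_t n) { return memcpy(dst, src, n); }

/* Pick the most capable routine the CPU supports. */
memcpy_fn select_memcpy(unsigned features)
{
    if (features & CPU_HAS_ALTIVEC)
        return memcpy_altivec;
    if (features & CPU_HAS_FPU)
        return memcpy_fpu;
    return memcpy_integer;
}

/* Select once per call here for clarity; a library would cache the choice. */
void *dispatch_memcpy(unsigned features, void *dst, const void *src, size_t n)
{
    memcpy_fn fn = select_memcpy(features);
    assert(fn != NULL);
    return fn(dst, src, n);
}
```

The integer routine is the fallback, which matches the idea that it has to work everywhere, including inside the kernel.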

Is it possible we could roll together some patches so we can build an Efika Linux test bed (maybe against SuSE or Debian or some other packagable system so libraries can easily be reinstalled) and bear out these performance improvements in real use and not just in theory and benchmark graphs? :)

Then, I wonder.. what other performance improvements could you make? memcpy() is the obvious one, but libfreevec managed to make some - while silly like swab() - definite improvements simply based on the algorithm and vectorisation. Are there other parts of glibc which are generally rather deficient, or are optimized differently on x86 (i.e. there is an SSE version or an assembly version for x86 but only C code for every other arch)?

_________________
Matt Sealey


Top
   
 Post subject:
PostPosted: Thu Jan 03, 2008 8:30 am 
Offline

Joined: Mon Aug 21, 2006 2:57 pm
Posts: 38
Location: Austin, TX, USA
Quote:
Going back to the memcpy improvements, I guess there are 3 routines needed to complete the full gamut of processors.

1) an integer-only highly optimized memcpy for inside the Linux kernel where floating point and AltiVec are not allowed (or would incur penalties for state management) - and for the low end embeddeds (400 series)

2) a floating point version for the non-AltiVec league of processors (44x, e300, G3)

3) AltiVec optimized (by the way, do datastreams give better performance than the standard dcb* ops?)
I think you're forgetting that it's not just the presence or lack of FP/VMX that makes the loops different, but the implementation of the memory hierarchy between the different products. As Gunnar has described already, some benefit from prefetching and/or dcbz:ing, others do not (or need different considerations when using the tricks).

As an example of a complex case: the current in-kernel 64-bit memcpy is optimized for POWER4, where you have 3 different load/store queues through the caches, so it keeps three streams of loads/stores going at any given time.
Quote:
Obviously the last two could be manipulated inside glibc by the ports method at runtime to allow the relevant CPU to find the right routine where performance can be further gained, depending on chip features? Is that right?
Believe it or not, but glibc already has features for this in 2.7! It adds per-platform search paths to the library path. Unfortunately it seems like the freescale parts have not yet been quite brought up to current on reporting the platform (see the 'platform' fields under the cpu table in arch/powerpc/kernel/cputable.c).

See http://penguinppc.org/dev/glibc/glibc-p ... addon.html for more info (it has since then been merged into mainline glibc).
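
As a rough illustration of the per-platform search path idea, here is a simplified sketch that builds a platform-specific library path from a platform string. The function name and the directory layout are assumptions for illustration only; the real dynamic loader has more rules than this.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "<libdir>/<platform>/<soname>", e.g. "/lib/power4/libc.so.6".
   Returns 0 on success, -1 if the result does not fit in buf.
   Hypothetical helper, not part of glibc. */
int platform_lib_path(char *buf, size_t buflen,
                      const char *libdir, const char *platform,
                      const char *soname)
{
    int n;

    assert(buf != NULL && libdir != NULL && platform != NULL && soname != NULL);
    n = snprintf(buf, buflen, "%s/%s/%s", libdir, platform, soname);
    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}
```

A loader following this scheme would try the platform-specific path first and fall back to the generic library directory if nothing is found there.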
Quote:
Is it possible we could roll together some patches so we can build an Efika Linux test bed (maybe against SuSE or Debian or some other packagable system so libraries can easily be reinstalled) and bear out these performance improvements in real use and not just in theory and benchmark graphs? :)
Any distro shipping with glibc 2.7 can use the above (you need to tweak it for your platform and enable support for it, but that's not too bad). Besides that, if you want a built-from-scratch, Gentoo is obviously one of the easier ones (but you'll really want distcc if you're building natively on efika :)


Top
   
 Post subject:
PostPosted: Thu Jan 03, 2008 8:50 am 
Offline

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348
Quote:
Then, I wonder.. what other performance improvements could you make? memcpy() is the obvious one, but libfreevec managed to make some - while silly like swab() - definite improvements simply based on the algorithm and vectorisation. Are there other parts of glibc which are generally rather deficient, or are optimized differently on x86 (i.e. there is an SSE version or an assembly version for x86 but only C code for every other arch)?
With regard to libfreevec, I can only say "stay tuned". The 1.0 version is due Real Soon Now :-)

Konstantinos


Top
   
 Post subject:
PostPosted: Thu Jan 03, 2008 10:10 am 
Offline

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
Hi Neko,
Quote:
Gunnar :)

Going back to the memcpy improvements, I guess there are 3 routines needed to complete the full gamut of processors.

1) an integer-only highly optimized memcpy for inside the Linux kernel where floating point and AltiVec are not allowed (or would incur penalties for state management) - and for the low end embeddeds (400 series)

2) a floating point version for the non-AltiVec league of processors (44x, e300, G3)

3) AltiVec optimized (by the way, do datastreams give better performance than the standard dcb* ops?)

I might be wrong but I feel a good working integer version should be enough for Linux software, shouldn't it?
An optimized integer version for each CPU is easy to integrate into Linux and Glibc.
But I fear that the FPU or AltiVec routines are difficult to integrate into Linux.
With proper prefetching I get the glib routine on EFIKA 100% faster. The same is true for PEGASOS. I'm not sure that using AltiVec will give a big improvement over good integer code.
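
To make "integer code with prefetching" concrete, here is a minimal sketch, not the routine measured above. It assumes a 32-byte cache line and a prefetch distance of four lines (both values are guesses that would need tuning per CPU), and it relies on the GCC/Clang extensions `__builtin_prefetch` and `may_alias`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values; the right line size and distance depend on the CPU. */
#define CACHE_LINE     32u
#define PREFETCH_AHEAD (4u * CACHE_LINE)

/* Word type that may alias any object (GCC/Clang extension). */
typedef unsigned long __attribute__((may_alias)) word_t;

void *memcpy_prefetch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(n == 0 || (dst != NULL && src != NULL));

    /* Byte copy until the destination is word aligned. */
    while (n > 0 && ((uintptr_t)d % sizeof(word_t)) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Word loop, only if the source ended up word aligned too. */
    if (((uintptr_t)s % sizeof(word_t)) == 0) {
        while (n >= CACHE_LINE) {
            /* Hint a line a few iterations ahead while it is in range. */
            if (n > PREFETCH_AHEAD)
                __builtin_prefetch(s + PREFETCH_AHEAD, 0, 0);
            for (size_t i = 0; i < CACHE_LINE / sizeof(word_t); i++)
                ((word_t *)d)[i] = ((const word_t *)s)[i];
            d += CACHE_LINE;
            s += CACHE_LINE;
            n -= CACHE_LINE;
        }
    }

    /* Remaining tail, or the whole copy when the source is misaligned. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

The misaligned-source case falls back to a byte loop here purely to keep the sketch short; a tuned routine would handle it with shifted word loads.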

It's a bit odd that both Linux and glibc have to have memcpy functions; it would be better if functions this important for performance lived in only one place.

Cheers
Gunnar


Top
   
 Post subject:
PostPosted: Thu Jan 03, 2008 2:00 pm 
Offline

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348
Quote:
With proper prefetching I get the glib routine on EFIKA 100% faster. The same is true for PEGASOS. I'm not sure that using Altivec will give a big improvement over good integer code.
In some cases it will, in some it won't make a difference. Memory copying might benefit a bit but not that much, unless one works only with big chunks of in-cache data. On the other hand, other functions (like memchr()), do get a substantial improvement just because Altivec helps in scanning streams of data. But then again only for bigger chunks.

On this topic, I noticed that some of the software I checked while testing/optimizing is really poorly written in some respects. It makes numerous calls to memcpy/memcmp/etc. with size arguments of 0 or 1, paying the overhead of a function call just to return instantly. I'm talking about MILLIONS of calls (I actually did histograms of the arguments passed to the function, with a simple printf() inside my function) where the programmer erroneously does not check the size of the buffer before checking/copying/etc., in order to eliminate branching, without realizing that he can't escape branching all the time. Sometimes it's not about the kernel/library/etc. Some programs really are THAT bad, and no matter what optimizations anyone tries, it won't make a difference.
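
A histogram like that can be collected with a thin wrapper. The sketch below is an assumption about how one might do it, not the instrumentation actually used; the wrapper name and bucket edges are arbitrary, and it counts into an array instead of calling printf().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SIZE_BUCKETS 5   /* 0, 1, 2-15, 16-255, 256+ (arbitrary edges) */

unsigned long size_histogram[SIZE_BUCKETS];

/* Map a copy size to its histogram bucket. */
int size_bucket(size_t n)
{
    if (n == 0)  return 0;
    if (n == 1)  return 1;
    if (n < 16)  return 2;
    if (n < 256) return 3;
    return 4;
}

/* Hypothetical wrapper: count the request, then do the real copy. */
void *counted_memcpy(void *dst, const void *src, size_t n)
{
    int bucket = size_bucket(n);

    assert(bucket >= 0 && bucket < SIZE_BUCKETS);
    size_histogram[bucket]++;
    return memcpy(dst, src, n);
}
```

If the first two buckets dominate, the caller would likely gain more from skipping the call than from any tuning inside the copy routine.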
Quote:
It's a bit odd that both Linux and glibc have to have memcpy functions; it would be better if functions this important for performance lived in only one place.
The Linux kernel (and, I think, most other OS kernels) already includes a minimal C library. Right now this is tuned not so much for performance (unless it's for x86) as for simplicity, small size and portability. I agree with you that it would favour performance if the most common routines were included in the kernel and used from there, as opposed to the current situation where some are in glibc, some in gcc, and many are duplicated in each library such as glib, Qt, ACE, etc. Even mplayer has its own memcpy routines. But that would be contested by people who favour userspace over kernel space. And of course there's the matter of security: if one of these functions were found to have a vulnerability, an attacker would gain access to kernel space, with devastating results.

Konstantinos


Top
   
 Post subject:
PostPosted: Thu Jan 03, 2008 3:56 pm 
Offline
Site Admin

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1594
Location: Austin, TX
Quote:
I think you're forgetting that it's not just the presence or lack of FP/VMX that makes the loops different, but the implementation of the memory hierarchy between the different products. As Gunnar has described already, some benefit from prefetching and/or dcbz:ing, others do not (or need different considerations when using the tricks).
Not forgetting, just simplifying it. There are certain tricks you can do, regardless of the CPU differences, which speed it up. Gunnar's best case routine for e300/G3/G4 sped up G5 and Cell better than the current implementation anyway.

Since you can't build a dual 32/64-bit kernel, even with a unified powerpc architecture kernel, having different patches for each major architecture would work. In the worst case, the e300 needs the performance boost more than the G5 does. You would live a long life missing a couple of MB/s out of the many GB/s already achievable.

As an example of a complex case: the current in-kernel 64-bit memcpy is optimized for POWER4, where you have 3 different load/store queues through the caches, so it keeps three streams of loads/stores going at any given time.
Quote:
Quote:
Obviously the last two could be manipulated inside glibc by the ports method at runtime to allow the relevant CPU to find the right routine where performance can be further gained, depending on chip features? Is that right?
Believe it or not, but glibc already has features for this in 2.7! It adds per-platform search paths to the library path.
Indeed, it's glibc-ports. Like I said. We've had this in the back of our heads for ages; since Konstantinos decided to do libfreevec in fact, almost 2 years ago now since his first release..
Quote:
Any distro shipping with glibc 2.7 can use the above
Name one. Last I checked all the main distros ship 2.6 - SuSE 10.3, Fedora Core 8, Debian 4.0 is shipping with glibc 2.3.6 (argh!).
Quote:
Gentoo is obviously one of the easier ones (but you'll really want distcc if you're building natively on efika :)
Yeah we should all run out and spend 8 days compiling an Efika with all the apps and benchmarks we need just so we can test the performance characteristics of an improved memcpy :)

I think it'd be easier if we could ship patches against the current *distributions* and not the mainline code, so we can all see benefits this year and not in 2009.

_________________
Matt Sealey


Top
   
 Post subject:
PostPosted: Thu Jan 03, 2008 4:11 pm 
Offline
Site Admin

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1594
Location: Austin, TX
Quote:
I might be wrong but I feel a good working integer version should be enough for Linux software, shouldn't it?
It should. I figure the best case of the most common denominator should be used - if the e300 improvements also help Cell, then it's a candidate for it.

However optimizing for 128-byte cache lines and 64-bit processors is going to destroy performance on the weaker processors. The balance between how much you prefetch and how much work you do while a prefetch is happening is VERY complicated. But, there are far more e300-class PowerPCs out there than there are POWER5 and Cell... so, please the masses, and not the people who already have spent thousands of dollars on brute force.
Quote:
But I fear that The FPU or Altivec routines are difficult to integrate into Linux.
Well, like I said, you can't use the FPU or AltiVec stuff in the kernel. The context saving required would mean unless you did huge copies, any performance improvement would be swamped.

But, in glibc, this is already done for you (actually the kernel will do it, and the function prologue from the compiler), you don't need to protect any task or do anything weird except copy memory.

I think there has to be a couple of solutions - a userspace to userspace copy in an application could gain some 10% from using FPU or AltiVec registers, you said. I think it's worth that 10% if you have those.

I do think AltiVec is worth it, though only if it uses data streams (dstt) rather than standard cache management (dcbt), since then you can separate the prefetching out. A task may use prefetch stream 0 or 1 to prefetch data for an algorithm of its own (perhaps MPEG decoding or so), while glibc would use a higher stream number. This is the recommended way in the AltiVec Programming Environments Manual (userspace counts up from 0, kernel/libc counts down from 3) to keep the system software and userspace from stealing each other's streams.
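
A small sketch of the stream convention, under a few assumptions: the names `LIBC_COPY_STREAM`, `dst_control_word` and `start_copy_prefetch` are made up for illustration, the block geometry is arbitrary, and the AltiVec call is only compiled when the compiler defines `__ALTIVEC__`. Elsewhere the prefetch is a no-op so the code still builds.

```c
#include <assert.h>
#include <stdint.h>

#ifdef __ALTIVEC__
#include <altivec.h>
#endif

/* Library code counts down from stream 3, applications count up from 0.
   The macro name is hypothetical. */
#define LIBC_COPY_STREAM 3

/* Pack a data-stream control word: block size in 16-byte vectors (1..31),
   number of blocks (1..255), and signed byte stride between blocks. */
uint32_t dst_control_word(unsigned block_vectors, unsigned block_count, int stride)
{
    assert(block_vectors >= 1 && block_vectors <= 31);
    assert(block_count >= 1 && block_count <= 255);
    assert(stride >= -32768 && stride <= 32767);
    return ((uint32_t)block_vectors << 24) |
           ((uint32_t)block_count << 16) |
           ((uint32_t)stride & 0xFFFFu);
}

/* Start a transient prefetch stream over src; a no-op without AltiVec. */
void start_copy_prefetch(const unsigned char *src)
{
#ifdef __ALTIVEC__
    /* 2 vectors (32 bytes) per block, 8 blocks, contiguous. */
    vec_dstt(src, (int)dst_control_word(2, 8, 32), LIBC_COPY_STREAM);
#else
    (void)src;
#endif
}
```

An application following the same convention would use stream 0 or 1 for its own data, so the two prefetch streams would not replace each other.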

dcb* interact with each other and certain usages can cancel each other out, so you get one chance, and hope the task using glibc memcpy does not do any cache prefetching using dcb*. You also have to hope the kernel doesn't use it.

I think glibc, the kernel and then a task all using only standard prefetching with dcb* may bring about a net performance loss compared to what is expected, where using data streams will not. The problem is, I have never seen a comprehensive use of data streams benchmarked. Most benchmarks are very tiny and do not interact with a lot of cache prefetching of either kind.
Quote:
With proper prefetching I get the glib routine on EFIKA 100% faster. The same is true for PEGASOS. I'm not sure that using Altivec will give a big improvement over good integer code.
I think it is down to pipeline bubbles and use of the LSU; you can only do so many things at once. AltiVec helps memcpy because you can use two instructions to do two load/store ops that fill an entire cache line, so in theory the bus has more opportunity to perform bursts. If the data is cache aligned, with prefetches and correct interleaving, you will get by far the best performance (see libmotovec: the code is disgusting to read, but it leaves absolutely no dead cycles).
Quote:
It's a bit odd that both Linux and glibc have to have memcpy functions; it would be better if functions this important for performance lived in only one place.
Yes you would have thought that it would work better if glibc memcpy() used Linux's memcpy() syscall (this is how anyone would have designed it..) but then, you do not get to use FPU, AltiVec in kernelspace, or perhaps cannot utilize DMA directly from userspace. libc memcpy operates on virtual addresses, kernel memcpy operates on physical, no? You have to trade off some offloads for convenience.

_________________
Matt Sealey


Top
   
 Post subject:
PostPosted: Fri Jan 04, 2008 6:38 am 
Offline

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
I agree with you Matt.

We should produce proper integer versions first to get included in Linux, and in addition we could create a library of example code (AltiVec etc.) to be used for special performance cases.

E.g., on CELL:
Using the PPC core I get 4 GB/s with scalar code
and 4.3 GB/s with AltiVec code (both COLD CACHE).
Using the SPUs and DMA I get up to 24 GB/s (COLD CACHE) memory throughput, but this SPU routine is only useful for big copies.

On smaller copies and HOT CACHE, I get 24 GB/s with scalar code and 34 GB/s with AltiVec PPC code.


But maybe my performance figures are not the max?
Markos, you have CELL/PS3 as well don't you?
What do you get with your Altivec copy?


Top
   
 Post subject:
PostPosted: Fri Jan 04, 2008 7:54 am 
Offline

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
I have one question:

For which of these three cases would you optimize the memcpy:

a) SRC and DST data are both NOT in cache.
b) SRC is in cache DST is NOT
c) Both SRC and DST are in cache (no memory traffic needed)

What do you think is most common and which would you optimize for?
I'm asking because maximum optimization for cases (a) and (b) will reduce performance for case (c), and vice versa.
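
One rough way to compare these cases on a given machine is to time repeated copies of a small buffer (mostly case c) against a buffer much larger than the caches (closer to case a). The sketch below assumes the standard library memcpy and the coarse `clock()` timer; the function name is made up, and which buffer sizes count as "small" or "large" depends on the cache sizes of the CPU under test.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Copy `bytes` bytes `reps` times and return MB/s, 0.0 if the timer is
   too coarse to measure the run, or -1.0 if allocation fails. */
double copy_throughput_mb_s(size_t bytes, unsigned reps)
{
    unsigned char *src, *dst;
    volatile unsigned char sink = 0;
    double seconds, result = 0.0;
    clock_t start, end;

    assert(bytes > 0 && reps > 0);
    src = malloc(bytes);
    dst = malloc(bytes);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 0xA5, bytes);
    memset(dst, 0, bytes);      /* touch both buffers before timing */

    start = clock();
    for (unsigned r = 0; r < reps; r++) {
        memcpy(dst, src, bytes);
        sink ^= dst[bytes - 1]; /* read the result so the copy is used */
    }
    end = clock();

    seconds = (double)(end - start) / CLOCKS_PER_SEC;
    if (seconds > 0.0)
        result = ((double)bytes * reps) / (1024.0 * 1024.0) / seconds;
    free(src);
    free(dst);
    return result;
}
```

Calling it with something like a 4 KB buffer and many repetitions, then with a buffer of tens of megabytes and few repetitions, gives two numbers that can be compared between candidate routines by swapping out the memcpy call.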


Cheers
Gunnar


Top
   
PowerDeveloper.org: Copyright © 2004-2012, Genesi USA, Inc. The Power Architecture and Power.org wordmarks and the Power and Power.org logos and related marks are trademarks and service marks licensed by Power.org.
All other names and trademarks used are property of their respective owners. Privacy Policy
Powered by phpBB® Forum Software © phpBB Group