Power Developer • Possible benefits - optimization for PowerPC

All times are UTC-06:00

Page 3 of 5

[ 73 posts ]

tarbos

Posted: Tue Dec 18, 2007 5:50 pm

Joined: Fri Sep 24, 2004 1:39 am
Posts: 113

>As usual Arno is talking crap, reading from the PDF and discarding the reality..

Neko, you should try to improve your social skills as a Genesi "Manager".
Actually I was helping to correct the misrepresentations about bus widths:

RAM bus is 32 Bit DDR264, XL Bus is 64 Bit SDR132 (sic!), so bandwidths correspond nicely.

My example of 845 MB/s is true for a generic 60x bus with unlimited pipelining at 132 MHz.
Of course, if you have bus contention on the shared XLB or cannot do proper pipelining, your mileage may vary for sustained throughput.

Freescale's Tsi107 test case is a good example here.
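
The bus figures in this post can be checked with a few lines of arithmetic. This is only a sketch: the helper name `peak_mb_per_s` is made up for illustration, and the ratio of 4 data beats in every 5 bus clocks is an assumed reading of "unlimited pipelining" that happens to land near 845 MB/s. It is not something stated in the post.

```python
def peak_mb_per_s(width_bits, clock_mhz, transfers_per_clock=1, efficiency=1.0):
    """Peak bandwidth in MB/s (1 MB = 10**6 bytes) for a parallel bus."""
    bytes_per_transfer = width_bits / 8
    return bytes_per_transfer * clock_mhz * transfers_per_clock * efficiency

# Raw peaks for the two buses named in the post.
ram_ddr = peak_mb_per_s(32, 132, transfers_per_clock=2)  # 32-bit DDR264
xlb_sdr = peak_mb_per_s(64, 132)                         # 64-bit SDR132

# ASSUMPTION: 4 data beats out of every 5 bus clocks on the 64-bit bus.
xlb_burst = peak_mb_per_s(64, 132, efficiency=4 / 5)

print(ram_ddr, xlb_sdr, round(xlb_burst, 1))
```

Both raw peaks come out equal, which is the sense in which the bandwidths "correspond"; these are paper figures only and say nothing about sustained throughput.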

Neko

Posted: Wed Dec 19, 2007 4:08 am

Site Admin

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1594
Location: Austin, TX

Quote:

>As usual Arno is talking crap, reading from the PDF and discarding the reality..

Neko, you should try to improve your social skills as a Genesi "Manager".

Then please accept a pat on the back and a lollipop for being wrong :)

Quote:

Actually I was helping to correct the misrepresentations about bus widths:

RAM bus is 32 Bit DDR264, XL Bus is 64 Bit SDR132 (sic!), so bandwidths correspond nicely.

Even if they were right, it would be unattainable.

The SDRAM controller has a 32-bit data path and runs at 133MHz. It may be connected to a DDR chip, but that just means it transfers on the high and low levels of MEM_CLK, giving an effective, theoretical data rate of 266MHz if you could transfer 32 bits of data on every clock.

Each RAM chip is connected via the single data bus, and only 16 lines are connected to each chip. Chips are picked via a chip select line. You can't pick both chips at once. You can therefore only transfer 16 bits of data per clock level, or 32 bits per clock cycle, under the most ideal pretend circumstances where there is no added latency or wait states involved.

This is a slightly naive explanation, but it reflects reality.
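
The two readings of the memory bus can be put side by side as arithmetic. A minimal sketch, assuming the 133MHz MEM_CLK and the bits-per-clock-level figures given in this post; the function name `mb_per_s` is hypothetical, and latency and wait states are ignored.

```python
MEM_CLK_MHZ = 133      # memory clock quoted in the post
LEVELS_PER_CYCLE = 2   # DDR: data moves on both the high and low level of the clock

def mb_per_s(bits_per_level):
    """Theoretical MB/s when `bits_per_level` bits move on every clock level."""
    return bits_per_level / 8 * MEM_CLK_MHZ * LEVELS_PER_CYCLE

full_width = mb_per_s(32)  # 32 bits on every clock level
one_chip = mb_per_s(16)    # 16 bits per level, i.e. 32 bits per full clock cycle

print(full_width, one_chip)
```

Under these assumptions the second figure is exactly half the first, which is the gap the post is describing.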

Quote:

My example of 845 MB/s is true for a generic 60x bus with unlimited pipelining at 132 MHz.

Arno, how can I express this to you: this bus does not exist, and certainly not on the MPC5200B :)

Quote:

Freescale's Tsi107 test case is a good example here.

It's a good example of how even a 64-bit memory bus at 100MHz, and even the promise of a 133MHz version, doesn't reach your figure, yes.

_________________
Matt Sealey

gunnar

Posted: Wed Dec 19, 2007 1:18 pm

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161

Quote:

Actually I was helping to correct the misrepresentations about bus widths:

RAM bus is 32 Bit DDR264, XL Bus is 64 Bit SDR132 (sic!), so bandwidths correspond nicely.

Hmm, I have to admit that I would interpret the 5200 manual the same way as Arno did.
From reading the manual I would think that the 5200 has an internal bus of 64bit x 133MHz => ~800MB/sec and that it has an external bus of max 32bit x 133MHz x 2 (DDR) => ~800MB/sec.

Maybe it's a language thing :-) for us non-native English speakers, and a misinterpretation caused by this ...
So please don't be too harsh with us.

Matt, did I understand you correctly that using SDR-RAM or DDR-RAM doesn't make a difference, as you cannot get more than 133MHz x 32bit anyway?

Cheers
Gunnar

Neko

Posted: Thu Dec 20, 2007 2:26 pm

Site Admin

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1594
Location: Austin, TX

Quote:

From reading the manual I would think that the 5200 has an internal bus of 64bit x 133MHz => ~800MB/sec and that it has an external bus of max 32bit x 133MHz x 2 (DDR) => ~800MB/sec.

There is a ridiculous number of ways to connect RAM up to the MPC5200B, be it DDR, standard non-DDR SDRAM, 32-bit at a time, 16-bit, 8-bit, in different rank combinations, depending on the exact decisions made by the board designer.

What the manual says is overridden completely and absolutely by holistic system design.

Quote:

Matt, did I understand you correctly that using SDR-RAM or DDR-RAM doesn't make a difference, as you cannot get more than 133MHz x 32bit anyway?

I am saying the Efika is not designed that way. Arno reads the manual and draws conclusions; I am presenting a holistic view of the *system design and implementation* and not the theory of processor architecture.

There is not 800MB/s to gain from the MPC5200B on the Efika due to the way the RAM is connected up, and the capabilities of the RAM itself. Even if the connection were more advanced (64-bit 133MHz to match the internal XLBus) you would STILL not reach that theoretical maximum. Take a look at the theory and practice in the document Arno posted: an 800MB/s theoretical bus which transfers at 600-640MB/s.

Then, take a leap of sense and imagine.. if you had a memcpy() routine that was relatively unoptimized, you might say that you are doing something very very wrong if you only get 200MB/s out of a potential 600-800MB/s. You may optimize like you did and get 400MB/s as a ceiling; then you wonder where that other performance is hiding.

Well, I am here to tell you, that extra performance never existed. It is a theoretical, internal bus limit, and cannot be exploited due to the limitation of the RAM, the connection to the memory controller, before you even start to talk about latencies and bus protocol.

Gunnar, take your optimisation tests; they have already shown Linux memcpy going from ~300MB/s to ~700MB/s compared to libmotovec or a 64-bit 4-per-loop copy (using floating point registers I expect). You ran this same code on the Efika, and it only got a 10% increase over the already optimized routines using dcbz.

Don't you think if the performance was there to be exploited, that the dcbz/dcbt optimisation and using the widest register width possible would have given the same bandwidth?

Let's do an experiment, to see who's right. Gunnar, grab a Pegasos and set the ICTC throttle to 1, and disable the L2 cache. This brings it down to the level of a G2 core, mostly. Now try doing the memory bandwidth optimizations you did on the Efika...

_________________
Matt Sealey

gunnar

Posted: Fri Dec 21, 2007 3:11 am

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161

Quote:

Let's do an experiment, to see who's right.

Please don't get me wrong, I never question that you are right. I only wanted to understand.

I agree with you that you have to read the manual with a grain of salt. The manual is showing options, but I think because of the pin limitations not all options are possible to use at the same time. I have not read this in detail for the 5200 CPU, but I know from experience with the Coldfire CPU that if you use feature XYZ (e.g. PCI) then you cannot use feature ABC ...

So it's quite possible that for the 5200 CPU there were options to have either PCI, IDE, or a 32bit memory bus, but not all of them at the same time.

Am I close?

And I agree with you that 350 MB/sec is a very good value for this small CPU anyway. At least it's more than we had on both the PegasosONE and the AmigaONE.

Cheers
Gunnar

Neko

Posted: Fri Dec 21, 2007 6:37 am

Site Admin

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1594
Location: Austin, TX

Quote:

So it's quite possible that for the 5200 CPU there were options to have either PCI, IDE, or a 32bit memory bus, but not all of them at the same time.

Am I close?

No :)

The SDRAM controller isn't multiplexed with anything else. It's just a limitation of how many modules you need to use and how many ranks you have of them, and the general connection.

Each RAM chip has a 16-bit data bus, and you can only transfer data from one RAM chip at a time, that's the basic gist of it. What you get out of it is exactly what the SDRAM memory bus is, a 32-bit, 133MHz thing, and nothing more.

_________________
Matt Sealey

gunnar

Posted: Fri Dec 21, 2007 8:35 am

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161

Quote:

So it's quite possible that for the 5200 CPU there were options to have either PCI, IDE, or a 32bit memory bus, but not all of them at the same time.

Am I close?

No :)

I fear I should give up as I'm not getting it :/

Quote:

The SDRAM controller isn't multiplexed with anything else. It's just a limitation of how many modules you need to use and how many ranks you have of them, and the general connection.

Each RAM chip has a 16-bit data bus, and you can only transfer data from one RAM chip at a time, that's the basic gist of it. What you get out of it is exactly what the SDRAM memory bus is, a 32-bit, 133MHz thing, and nothing more.

I think my main question is:
What if you (in theory) connect these two memory chips to the same CS, and connect their 32 data lines to different lines on the 5200? Can't you use 32bit access then?

I fear I'm too slow a thinker to fully get this without a picture. :-)

Cheers
Gunnar

Neko

Posted: Fri Dec 21, 2007 12:49 pm

Site Admin

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1594
Location: Austin, TX

Quote:

I think my main question is:
What if you (in theory) connect these two memory chips to the same CS, and connect their 32 data lines to different lines on the 5200? Can't you use 32bit access then?

In theory, yes. In practice, for most reasonable configurations of RAM and on the Efika, I don't think so.. or at least it doesn't look that way.

But what would I know about SDRAM design? I just know that it seems counterintuitive, given the information I have, that you can only attain 350MB/s on a memory controller supposedly capable of transferring "800MB/s", and that there is some special design decision to be made that will gain further performance..

Effectively if you had the right chip, the right configuration, and the right circumstances, you could put on a chip selection that would do it, but given the limitations of the controller (there are.. TWO configurations capable of giving a 128MB memory space with 2 chips :) the Efika isn't going to produce it.

_________________
Matt Sealey

tarbos

Posted: Fri Dec 21, 2007 4:17 pm

Joined: Fri Sep 24, 2004 1:39 am
Posts: 113

>I don't think so.. or at least it doesn't look that way.

Can you elaborate on that, or more precisely...do you or do you not have exact information from bplan on how the RAM is connected and addressed?

It is standard practice for a DIMM to form a wider bus out of several narrow chips.
This is the only thing that makes sense to me and I cannot believe (yet) EFIKA is a 16 Bit cripple for no apparent reason.

E.g. the Lite5200B docs say it forms a 32 Bit bus out of two 16 Bit chips and uses chip select for _another_ pair.

Last but not least, Gunnar's 350MB/s sounds too good for a 16 Bit bus imho. :o)

Neko

Posted: Fri Dec 21, 2007 4:30 pm

Site Admin

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1594
Location: Austin, TX

Quote:

Can you elaborate on that, or more precisely...do you or do you not have exact information from bplan on how the RAM is connected and addressed?

I have the information.

Quote:

It is standard practice for a DIMM to form a wider bus out of several narrow chips.
This is the only thing that makes sense to me and I cannot believe (yet) EFIKA is a 16 Bit cripple for no apparent reason.

There must be a reason for it. I just don't think it could have been made better; it is just a fact of the SDRAM controller's specification and how SDRAM interleaving works.

Whatever the reason, it is a very good one. Can you ever remember a board design from phase5 or bplan that did not have world-class performance? Think back to when we all would have sold our parents for a Blizzard or Cyberstorm card..

Quote:

Last but not least, Gunnar's 350MB/s sounds too good for a 16 Bit bus imho. :o)

But it does sound good enough for a bus which transfers 32bits on every clock. In fact it sounds just about perfect after you take into account the latencies, tenures etc.

The Lite5200B documentation describes a 2 CS layout which has to - by design - implement 8-bit transfers from each chip. You effectively halve the bandwidth possible from each module and gain it back - although I doubt 100% - by having multiple chips.

However this is not a dual channel design. As far as I can tell from this, you can only have one SDRAM bank transferring at once, and the SDRAM controller buffers it until it has enough so it can pump it back to the XLBus for the transfer width it requested.

I'll just repeat again, I don't think the 800MB/s value is realistic to expect from the SDRAM controller. You have to factor in the chips, their bank size, and the compatibility matrix of the controller. Then the latencies, bus width, buffering, priorities.. remember also there is no L2 cache on this thing so these latencies are NEVER hidden from you as you would expect on a higher class of chip like the G3 or G4. You are looking at the raw, streaming performance of the SDRAM.
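
As a rough check on the "just about perfect" remark, the measured figure can be compared with the two ceilings discussed in this thread. A sketch only: the 32-bits-per-clock ceiling assumes a 133MHz memory clock as described earlier in the thread, the 800MB/s value is the figure quoted by other posters, and the function name `fraction_of_ceiling` is hypothetical.

```python
def fraction_of_ceiling(measured_mb_s, ceiling_mb_s):
    """Share of a theoretical ceiling that a measured throughput reaches."""
    return measured_mb_s / ceiling_mb_s

measured = 350.0              # Gunnar's measured MB/s, as quoted in the thread
ceiling_32bit = 32 / 8 * 133  # ASSUMPTION: 32 bits per 133MHz clock cycle
ceiling_quoted = 800.0        # the "800MB/s" figure argued over in the thread

for label, ceiling in (("32 bits per clock", ceiling_32bit),
                       ("quoted 800MB/s", ceiling_quoted)):
    share = fraction_of_ceiling(measured, ceiling)
    print(f"{label}: {share:.0%} of ceiling")
```

Against the 32-bits-per-clock ceiling the measured value is roughly two thirds of the limit; against the 800MB/s figure it is well under half, which is why that figure makes the result look like an unexplained shortfall.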

_________________
Matt Sealey

gunnar

Posted: Thu Dec 27, 2007 7:03 am

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161

Quote:

Again, very educational read, thanks to all who participate here.
May I go off topic? Does anybody here know to what extent these CPU-specific optimizations exist in MorphOS 2.0 beta?

Sorry, but I don't know if memcpy in MOS is PPC optimized.
I know that in earlier versions of MOS memcpy was slow, but this was partly on purpose. I've been told that memcpy in OS4 is well optimized.

Maybe you should ask one of the MOS guys the question to get the details for MOS 2.0

Cheers
Gunnar

gunnar

Posted: Thu Dec 27, 2007 10:11 am

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161

Quote:

Gunnar, that is great stuff. ....
Good work!

R&B :)

Raquel and Bill,
If anyone deserves thanks or credit then it is Freescale and Genesi for providing PPC hardware to developers!

There is no doubt that major performance improvements can be achieved simply by software optimization.

Here is one comparison showing memcpy on different PPC models.
Memcpy is used a lot in Linux/GNU software.
Every disk read, filesystem cache hit, network packet sent or received, and video/gfx operation involves one or more memcopies. Improving memcpy performance has a surprisingly high impact on total system performance.
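
For readers who want to see how a copy-throughput figure of this kind is obtained, here is a minimal timing sketch. It is not the benchmark used in this thread and does not exercise the PPC routines discussed here: it times a bulk `bytearray` copy on whatever machine runs it, and the function name, buffer size, and repeat count are arbitrary choices for illustration.

```python
import time

def copy_throughput_mb_s(size_bytes=16 * 1024 * 1024, repeats=5):
    """Best-of-N throughput in MB/s for a bulk copy between two buffers."""
    src = bytearray(size_bytes)
    dst = bytearray(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # same-length slice assignment: a bulk copy of the buffer
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    # Guard against a zero reading from a coarse timer.
    return size_bytes / max(best, 1e-9) / 1e6

if __name__ == "__main__":
    print(f"{copy_throughput_mb_s():.0f} MB/s")
```

Taking the best of several runs reduces noise from other activity on the machine; a buffer much larger than the caches is needed if the aim is to measure memory rather than cache bandwidth.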

Again many thanks for supporting developers with PPC hardware and allowing people to optimize software for PPC !!

Cheers

Chain-Q

Posted: Thu Dec 27, 2007 1:09 pm

Joined: Mon Jan 30, 2006 7:24 am
Posts: 43
Location: Budapest, Hungary

Quote:

Maybe you should ask one of the MOS guys the question to get the details for MOS 2.0

MorphOS 2.0 also has an optimized memory copy, which uses Altivec where available. It should also perform better than previous versions on non-Altivec systems. But I can't provide exact details ATM. On G4 Altivec systems, it can use all the bandwidth the bus provides, at least.

jcmarcos

Posted: Fri Dec 28, 2007 2:12 am

Joined: Mon Jan 08, 2007 3:40 am
Posts: 195
Location: Pinto, Madrid, Spain

Quote:

MorphOS 2.0 also has an optimized memory copy...

Thank you very much for this information, Karoly. Every non-MorphOS user here should note that it is extremely hard to find information from current developments in 2.0. This is why we value a lot every bit of information we can get.
MorphOS should always perform better than Linux (at the cost of, obviously, doing fewer things). Those AltiVec-optimized functions really put the hardware situation to shame: there is no more AltiVec (G4/G5) hardware available today.
Perhaps it makes more sense to have small computers like Efika (no news about MPC5121 based Efika2?), but it's a pity that specific optimizations come out now for CPUs we can no longer enjoy...

gunnar

Posted: Fri Dec 28, 2007 8:59 am

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161

Quote:

MorphOS should always perform better than Linux

Memcopy was a bit crippled on MOS 1.4

But you are right.
Because Linux is designed as a multi-user system and is in this respect very paranoid about security, it has to copy a lot of data between user and kernel space, where an Amiga-like OS would not need to do this amount of extra copying.

So by design an Amiga-like OS has a speed advantage.

Quote:

Those AltiVec-optimized functions really put the hardware situation to shame

Well, using Altivec does not make such a big difference for a normal memcpy. What really slowed memcpy down on MOS was that the prefetch instructions were turned off.

Does anyone know if DCBT & DCBST are still disabled on MOS 2.0?
In MOS 1.4 these instructions were NOOPed and thereby the memory performance was very limited.

Cheers
Gunnar
