Quote:
From reading the manual I would think that the 5200 has an internal bus of 64bit x 133MHz => ~800MB/sec and that it has an external bus of at most 32bit x 133MHz x 2 (DDR) => ~800MB/sec.
There is a ridiculous number of ways to connect RAM to the MPC5200B: DDR or standard non-DDR SDRAM; 32, 16 or 8 bits at a time; in different rank combinations. Which one you get depends on the exact decisions made by the board designer.
What the manual says is overridden completely and absolutely by holistic system design.
Quote:
Matt, did I understand you correctly that using SDR RAM or DDR RAM doesn't make a difference, as you cannot get more than 133MHz x 32bit anyway?
I am saying the Efika is not designed that way. Arno reads the manual and draws conclusions; I am presenting a holistic view of the *system design and implementation*, not the theory of processor architecture.
There is not 800MB/s to gain from the MPC5200B on the Efika, because of the way the RAM is connected up and the capabilities of the RAM itself. Even if the connection were more advanced (64-bit 133MHz to match the internal XLBus) you would STILL not reach that theoretical maximum. Take a look at the theory and practice in the document Arno posted: an 800MB/s theoretical bus which transfers at 600-640MB/s.
Then take a leap of sense and imagine: if you had a memcpy() routine that was relatively unoptimized, you might say that you are doing something very, very wrong if you only get 200MB/s out of a potential 600-800MB/s. You may optimize like you did and get 400MB/s as a ceiling; then you wonder where that other performance is hiding.
Well, I am here to tell you that the extra performance never existed. It is a theoretical, internal bus limit, and it cannot be exploited because of the limitations of the RAM and of its connection to the memory controller, before you even start to talk about latencies and bus protocol.
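To put those numbers in proportion, here is a minimal sketch in C that expresses a measured copy rate as a percentage of a quoted peak. The function name is made up for illustration, and the figures fed to it are only the ones already mentioned in this thread, not new measurements.

```c
#include <assert.h>

/* Hypothetical helper: what percentage of a quoted peak rate does a
 * measured rate reach? Both arguments are in MB/s. */
static double bus_utilisation_percent(double measured_mb_s, double peak_mb_s)
{
    assert(peak_mb_s > 0.0);
    return 100.0 * measured_mb_s / peak_mb_s;
}
```

Called with the thread's figures, bus_utilisation_percent(200.0, 800.0) gives 25 and bus_utilisation_percent(400.0, 800.0) gives 50, while the 600-640MB/s real-world transfer rate of that 800MB/s bus works out to 75-80. The point is that the last of these, not 100, is the ceiling any copy routine is measured against, and that is before the RAM and its connection take their share.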
Gunnar, take your optimisation tests: they have already shown Linux memcpy going from ~300MB/s to ~700MB/s compared to libmotovec or a 64-bit 4-per-loop copy (using floating-point registers, I expect). You ran this same code on the Efika, and it only got a 10% increase over the already optimized routines using dcbz.
Don't you think that, if the performance were there to be exploited, the dcbz/dcbt optimisation and using the widest register width possible would have given the same bandwidth?
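For anyone who wants to repeat this kind of comparison, here is a rough, portable C sketch of a "64-bit 4-per-loop" copy timed against the libc memcpy. It is not Gunnar's code: every name in it is hypothetical, it uses integer rather than floating-point registers, and the GCC/Clang __builtin_prefetch hint is only a loose stand-in for dcbt. There is no portable equivalent of dcbz, so that part is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical copy loop: four 64-bit words per iteration.
 * Assumes 8-byte aligned, non-overlapping buffers and a length
 * that is a multiple of 32 bytes. */
static void copy64x4(uint64_t *dst, const uint64_t *src, size_t bytes)
{
    assert(bytes % 32 == 0);
    size_t words = bytes / sizeof(uint64_t);
    for (size_t i = 0; i < words; i += 4) {
#if defined(__GNUC__)
        /* Read-prefetch hint a few iterations ahead (rough dcbt analogue). */
        if (i + 16 < words)
            __builtin_prefetch(src + i + 16, 0, 0);
#endif
        uint64_t a = src[i], b = src[i + 1], c = src[i + 2], d = src[i + 3];
        dst[i] = a;
        dst[i + 1] = b;
        dst[i + 2] = c;
        dst[i + 3] = d;
    }
}

/* Common signature so both routines can go through the same timer. */
typedef void (*copy_fn)(void *dst, const void *src, size_t bytes);

static void copy_libc(void *dst, const void *src, size_t bytes)
{
    memcpy(dst, src, bytes);
}

static void copy_wide(void *dst, const void *src, size_t bytes)
{
    copy64x4((uint64_t *)dst, (const uint64_t *)src, bytes);
}

/* Copy `bytes` bytes `passes` times and return MB/s (1 MB = 10^6 bytes),
 * measured in CPU time via clock(). Returns 0 if the run was too short
 * for the clock to register. */
static double measure_mb_per_s(copy_fn fn, void *dst, const void *src,
                               size_t bytes, int passes)
{
    clock_t start = clock();
    for (int p = 0; p < passes; p++)
        fn(dst, src, bytes);
    clock_t end = clock();
    double seconds = (double)(end - start) / CLOCKS_PER_SEC;
    return seconds > 0.0 ? ((double)bytes * passes / 1e6) / seconds : 0.0;
}
```

To use it, allocate two buffers much larger than the caches and call measure_mb_per_s(copy_libc, ...) and measure_mb_per_s(copy_wide, ...) with the same arguments. An optimizing compiler may turn the loop into something quite different, so check the generated code before trusting the numbers. If both routines land on roughly the same figure, you are looking at the memory path and not at the copy loop.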
Let's do an experiment, to see who's right. Gunnar, grab a Pegasos and set the ICTC throttle to 1, and disable the L2 cache. This brings it down to the level of a G2 core, mostly. Now try doing the memory bandwidth optimizations you did on the Efika...