Here is the code of the (really simple) e300 memcpy.
The code is in a style very close to the routine used by the Linux kernel. Please ignore the numeric labels (these are needed for Linux).
My cooking recipe for copy performance is quite simple.
We align the DST so that the CPU can write out a full cache line per loop iteration. The DCBZ instruction is used to clear the DST cache line (this prevents the CPU from reading the DST before overwriting it).
The PPC has a copy-back cache, so even if we copy just 1 byte from $100 to $200, the CPU will always write out a whole cache line (32 bytes).
If we write the first byte to $200, the CPU will read the 32 bytes at $200 into its cache, overwrite the one byte, and later write the cache line back out to memory.
This behavior is needed if we only write single bytes, but since we are going to write the whole cache line anyway, reading the DST in is unnecessary. The DCBZ instruction is the PPC way of telling the CPU that it does not need to read the DST in because we will overwrite all of it. Per 32-byte line this turns three memory transfers (SRC read, DST read, DST write-back) into two, which is where the 33% saving in memory accesses comes from.
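The idea in C terms (a minimal sketch using GCC inline asm; copy_lines is a made-up name, and dst, src and the length are assumed to be 32-byte aligned):
Code:
#include <stddef.h>
#include <stdint.h>

static void copy_lines(uint32_t *dst, const uint32_t *src, size_t lines)
{
    while (lines--) {
        /* allocate + zero the DST line in the cache: no DST read from RAM */
        __asm__ volatile ("dcbz 0,%0" : : "r"(dst) : "memory");
        for (int i = 0; i < 8; i++)   /* 8 x 4 bytes = one 32-byte line */
            dst[i] = src[i];          /* overwrite every zeroed byte */
        dst += 8;
        src += 8;
    }
}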
To avoid read bubbles we prefetch the SRC using the DCBT instruction.
A little voodoo comes in picking the best prefetch distances for SRC and DST.
The e300 does not seem to like prefetching far in advance; even prefetching just 2 lines ahead will hurt performance. So we align our prefetch pointer to the next SRC cache line, as sketched below.
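In C terms the DCBT target can be sketched like this (dcbt_target is a made-up name; the -4 mirrors the routine keeping its SRC pointer pre-decremented for the update-form loads):
Code:
#include <stdint.h>

/* lands at byte 28 of SRC line 0 when SRC is 32-byte aligned,
   and at byte 28 of SRC line 1 otherwise - i.e. 0-1 lines ahead */
static const char *dcbt_target(const char *src)
{
    uint32_t to_next = (uint32_t)(-(uintptr_t)src) & 31; /* bytes to next line boundary */
    return src - 4 + to_next + 32;
}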
The main loop copies with a stride of 4 registers.
Using fewer or more than 4 registers seems to hurt performance.
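In C the stride looks roughly like this (a sketch, copy_line_4reg is a made-up name; the real loop has to stay in asm so the schedule stays under our control):
Code:
#include <stdint.h>

/* one cache line moved as two groups of 4 words - mirrors r7-r10 */
static void copy_line_4reg(uint32_t *restrict d, const uint32_t *restrict s)
{
    uint32_t a = s[0], b = s[1], c = s[2], e = s[3]; /* load 4 words */
    d[0] = a; d[1] = b; d[2] = c; d[3] = e;          /* store 4 words */
    a = s[4]; b = s[5]; c = s[6]; e = s[7];
    d[4] = a; d[5] = b; d[6] = c; d[7] = e;
}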
I'm sure this routine can be improved further.
Code:
#define L1_CACHE_SHIFT 5
#define MAX_COPY_PREFETCH 4
#define L1_CACHE_BYTES (1 << L1_CACHE_SHIFT)
CACHELINE_BYTES = L1_CACHE_BYTES
LG_CACHELINE_BYTES = L1_CACHE_SHIFT
CACHELINE_MASK = (L1_CACHE_BYTES-1)
/*
* Memcpy optimized for PPC e300
*
* This relatively simple memcpy does the following to optimize performance:
*
* For sizes > 32 bytes:
* DST is aligned to a 32bit boundary - using 8bit copies
* DST is aligned to a cache line boundary (32 bytes) - using aligned 32bit copies
* The main copy loop processes one cache line (32 bytes) per iteration
* The DST cache line is cleared using DCBZ
* Clearing the aligned DST cache line is very important for performance:
* it prevents the CPU from fetching the DST line from memory - this saves 33% of the memory accesses.
* To optimize SRC read performance the SRC is prefetched using DCBT
*
* The trick for getting good performance is to use a well-matched prefetch distance
* for the SRC reads and for the DST clearing.
* Typically you DCBZ the DST 0 or 1 cache lines ahead
* Typically you DCBT the SRC 2 - 4 cache lines ahead
* On the e300, prefetching the SRC too far ahead is slower than not prefetching at all.
*
* We use DCBZ DST[0] and DCBT SRC[0-1] depending on the SRC alignment
*
*/
.align 7
/* parameters r3=DST, r4=SRC, r5=size */
/* returns r3=0 */
.global memcpy_e300
memcpy_e300:
dcbt 0,r4 /* Prefetch SRC cache line 32byte */
neg r0,r3 /* DST alignment */
addi r4,r4,-4
andi. r0,r0,CACHELINE_MASK /* # of bytes away from cache line boundary */
addi r6,r3,-4
cmplw cr1,r5,r0 /* is this more than total to do? */
beq .Lcachelinealigned
blt cr1,.Lcopyrest /* if not much to do */
andi. r8,r0,3 /* get it word-aligned first */
mtctr r8
beq+ .Ldstwordaligned
.Laligntoword:
70: lbz r9,4(r4) /* we copy bytes (8bit) 0-3 */
71: stb r9,4(r6) /* to get the DST 32bit aligned */
addi r4,r4,1
addi r6,r6,1
bdnz .Laligntoword
.Ldstwordaligned:
subf r5,r0,r5
srwi. r0,r0,2
mtctr r0
beq .Lcachelinealigned
.Laligntocacheline:
72: lwzu r9,4(r4) /* do copy 32bit words (0-7) */
73: stwu r9,4(r6) /* to get DST cache line aligned (32byte) */
bdnz .Laligntocacheline
.Lcachelinealigned:
srwi. r0,r5,LG_CACHELINE_BYTES /* # complete cachelines */
clrlwi r5,r5,32-LG_CACHELINE_BYTES /* remaining bytes (0-31) */
li r11,4 /* DCBZ offset: r6+4 = start of the current DST line (DST[0]) */
beq .Lcopyrest
addi r3,r4,4 /* find out which SRC cache line to prefetch: r3 = real SRC (r4 is pre-biased by -4) */
neg r3,r3
andi. r3,r3,31 /* bytes from SRC to the next cache line boundary */
addi r3,r3,32 /* dcbt r3,r4 then touches SRC line 0 or 1, depending on alignment */
mtctr r0
.align 7
.Lloop: /* the main body of the cacheline loop */
dcbt r3,r4 /* SRC cache line prefetch */
dcbz r11,r6 /* clear DST cache line */
lwz r7, 0x04(r4) /* copy using a 4 register stride for best performance on e300 */
lwz r8, 0x08(r4)
lwz r9, 0x0c(r4)
lwz r10,0x10(r4)
stw r7, 0x04(r6)
stw r8, 0x08(r6)
stw r9, 0x0c(r6)
stw r10,0x10(r6)
lwz r7, 0x14(r4)
lwz r8, 0x18(r4)
lwz r9, 0x1c(r4)
lwzu r10,0x20(r4)
stw r7, 0x14(r6)
stw r8, 0x18(r6)
stw r9, 0x1c(r6)
stwu r10,0x20(r6)
bdnz .Lloop
.Lcopyrest:
srwi. r0,r5,2
mtctr r0
beq .Llastbytes
.Lcopywords:
30: lwzu r0,4(r4) /* we copy remaining words (0-7) */
31: stwu r0,4(r6)
bdnz .Lcopywords
.Llastbytes:
andi. r0,r5,3
mtctr r0
beq+ .Lend
.Lcopybytes:
40: lbz r0,4(r4) /* we copy remaining bytes (0-3) */
41: stb r0,4(r6)
addi r4,r4,1
addi r6,r6,1
bdnz .Lcopybytes
.Lend: li r3,0 /* done: return 0 (Linux); glibc would need the original DST instead */
blr
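For testing you can call it from C like this (a minimal sketch; the prototype is an assumption, and note that as written the routine returns 0, Linux-style, not DST):
Code:
#include <stddef.h>
#include <stdio.h>
#include <string.h>

void *memcpy_e300(void *dst, const void *src, size_t n); /* assumed prototype */

int main(void)
{
    static char src[4096], dst[4096];
    for (size_t i = 0; i < sizeof(src); i++)
        src[i] = (char)i;
    memcpy_e300(dst, src, sizeof src);
    puts(memcmp(dst, src, sizeof dst) == 0 ? "ok" : "MISMATCH");
    return 0;
}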
I'm looking forward to your replies / ideas