All times are UTC-06:00




[ 15 posts ]
Posted: Fri Apr 04, 2008 2:18 am

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
Hi,

I received the 54455 development system, and here are my first impressions:

Excellent quality of the dev-system
The case is high quality.
A first glance at the board and layout gives the same impression: very high-quality work, no cheap parts.

Setting up the system was easy
Setting up the system with cross-compiler was easy and simple.
The system comes with good documentation and cross-compiler for MS-Windows on CD.
I chose to download the compiler for Linux.

GCC for ColdFire works well but has room for improvement
Compiling for ColdFire was easy.
But I did look at the ASM that GCC was producing, and it was obvious that GCC struggles a bit to create efficient code for ColdFire.
Some may say this is no news.
C is a high-level language - you cannot expect the code created by GCC to be small or fast.
This might be true.

The code that GCC produced was not terribly bad.
But it was clear that GCC loses quite some performance.
For example the Linux memcpy (which is C code for ColdFire) runs at 50% of what the dev-board could do.
If GCC knew how to use a "movem.l" instruction correctly it would be twice as good.

I think it would be a nice task to improve GCC for ColdFire.
It was obvious that the created code was both bigger and slower than needed.


The 54455 Coldfire CPU
The 54455 is a V4e Coldfire.
The V4 Coldfire is a very efficient CPU.
The V4 is much more efficient than all previous Coldfire CPUs.
It can execute up to 2 integer instructions plus 1 branch per clock.
TWO+ONE is even better than some Desktop CPUs.

The 54455 is the smallest of the V4e cores.
It's the "embedded" version of the fast embedded CPUs and has only a 16-bit data bus.
Because of this the memory performance is of course a bit limited.
All other V4 cores have a 32-bit memory interface, so if we want to use a ColdFire for a device that needs memory performance, any other V4 will do better.


Memory-Benchmark
One of the things I like to start with is evaluating the memory performance as it has such a big impact on the overall performance of many tasks.
The 54455 is of course not a memory wonder (with its 16-bit bus),
and I have only just started benchmarking, so here are some early results:

CF 54455 (16bit bus)
Read: 126 MB/sec
Write: 196 MB/sec
Copy: 121 MB/sec

For comparison G3-600 MHz (64Bit Bus) under MorphOS
Read: 228 MB/sec
Write: 87 MB/sec
Copy: 116 MB/sec

As mentioned I'm just starting the evaluation.
Much more detailed results will follow.

Cheers


Posted: Fri Apr 04, 2008 4:03 am
Site Admin

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1589
Location: Austin, TX
Quote:
If GCC knew how to use a "movem.l" instruction correctly it would be twice as good.
This is really an instruction for hand-coded assembly and very specific circumstances :)

movem.l will reorder the way it saves the registers and restores them. GCC won't generate it because the actual operation can be unpredictable depending on the register ordering - and instructions that operate on multiple registers are hard to fit into most instruction schedulers (see PowerPC "string" instructions which more often than not reduce performance) especially if you are using the stack to move the data back and forth.

So, it's best to use it by hand with full knowledge of what weirdness it can create, and not try and shoehorn it into gcc. Fixing glibc so it has far more optimized ColdFire routines taking full advantage of the architecture would be a much more noble goal, and take all the guesswork out of which compiler to choose (although, then you also have to port the changes to every other libc you may use :)
Quote:
For comparison G3-600 MHz (64Bit Bus) under MorphOS
Read: 228 MB/sec
Write: 87 MB/sec
Copy: 116 MB/sec
I am curious, is this a Pegasos I or Pegasos II? These seem very, very low.. (the 87MB/s write speed specifically). Is this that weird "new firmwares break my bandwidth" bug?

I'm also curious to know what the real-world code performance is on ColdFire - maybe generating a fractal (I know this is not something most people do in the real world, but it is something we know takes quite a time) or encoding an MP3 or AAC audio clip (something we may all do all day every day)?

_________________
Matt Sealey


Posted: Fri Apr 04, 2008 9:53 am

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
Quote:
Quote:
If GCC knew how to use a "movem.l" instruction correctly it would be twice as good.
This is really an instruction for hand-coded assembly and very specific circumstances :)
The point is that Motorola does recommend it for memory copies.
Quote:
movem.l will reorder the way it saves the registers and restores them. GCC won't generate it because the actual operation can be unpredictable depending on the register ordering - and instructions that operate on multiple registers are hard to fit into most instruction schedulers (see PowerPC "string" instructions which more often that not reduce performance) especially if you are using the stack to move the data back and forth.
Can you explain this?
What do you mean by reorder?

Quote:
Fixing glibc so it has far more optimized ColdFire routines taking full advantage of the architecture would be a much more noble goal,
Speeding up memcpy in Linux and glibc would be simple.

Quote:
I am curious, is this a Pegasos I or Pegasos II? These seem very, very low..
It's a Peg 1.
That the Peg 1 and Peg 2 G3 are very slow is no news, is it?

Quote:
Is this that weird "new firmwares break my bandwidth" bug?
Was there a new firmware released for Peg1?

Quote:
I'm also curious to know what the real-world code
I'm doing some real-world tests right now.
You can do the fractal if you want.

From my current test the Coldfire looks surprisingly fast.
Even the slow 266 MHz V4e with its 16-bit bus is a lot faster than UAE (without JIT) on my dual 2.2 GHz dual-core machine.


Posted: Fri Apr 04, 2008 5:40 pm

Joined: Tue Jun 14, 2005 8:30 pm
Posts: 78
Location: Germany
Quote:
Excellent quality of the dev-system
The case is high quality.
The first glance on the board and layout give the same impression. Very quality work. No cheap parts.
We want pictures please.

_________________
..:: www.djbase.de ::..


Posted: Fri Apr 04, 2008 6:25 pm
Site Admin

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1589
Location: Austin, TX
Quote:
Quote:
Excellent quality of the dev-system
The case is high quality.
The first glance on the board and layout give the same impression. Very quality work. No cheap parts.
We want pictures please.
Plenty of pictures at the Freescale website.

_________________
Matt Sealey


Posted: Fri Apr 04, 2008 6:40 pm
Site Admin

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1589
Location: Austin, TX
Quote:
Quote:
Quote:
If GCC knew how to use a "movem.l" instruction correctly it would be twice as good.
This is really an instruction for hand-coded assembly and very specific circumstances :)
The point is that Motorola does recommend it for memory copies.
There are many different ways of doing memory copies under many different circumstances. Just because you know one fast way to do it doesn't mean you can get a compiler to output code, generically, to do it all the time, without some quirks happening.

gcc's internal inline memcpy probably has serious ABI and scheduler restrictions to meet, and in the real world, you call memcpy() and it does its thing. The internal gcc stuff is not going to be used as much as libc memcpy().
Quote:
Can you explain this?
What do you mean by reorder?
MOVEM.L D0-D3,-(A7)

This looks like it would store registers D0 thru D3 in that order, to the stack.

In actual fact, on a 680x0 at least, it saves them off in reverse order (D3, D2, D1, D0) which means all your data is in the wrong order.

Since you can come up with some crazy combinations of data/address registers, ranges etc., and I even forget which takes precedence (address or data in the ranges) then this can cause some funny problems with backward stacks or arguments for a function in the wrong registers (but in 3-argument functions, the middle one would be correct!).

It is, of course, an awesome instruction, if used correctly, I just think there is very little scope for autogenerating it.
Quote:
Its Peg 1.
That the Peg 1 and Peg 2 G3 are very slow is no news, isn't it?
The Pegasos II should be getting very similar write speed to the read speed, and at least twice as fast as those figures in my experience. Remember the G3 has a theoretical bandwidth of 800MB/s at 100MHz bus, on a 60x bus.
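
The 800MB/s figure is just bus width times bus clock. Here is a minimal sketch of that arithmetic, assuming one full 64-bit transfer per 100 MHz bus clock and decimal megabytes. The function name `peak_bandwidth_mb_s` is made up for illustration, and the measured values are the G3 numbers posted earlier in this thread; whether that machine really runs its bus at 100 MHz is an assumption.

```python
def peak_bandwidth_mb_s(bus_width_bits: int, bus_clock_mhz: float) -> float:
    """Theoretical peak in MB/s, assuming one full-width transfer per bus clock."""
    return bus_width_bits / 8 * bus_clock_mhz


# Assumed configuration: 64-bit 60x bus clocked at 100 MHz.
g3_peak = peak_bandwidth_mb_s(64, 100)

# G3-600 results posted earlier in this thread (MB/sec).
measured = {"read": 228, "write": 87, "copy": 116}

for name, mb_s in measured.items():
    share = 100 * mb_s / g3_peak
    print(f"{name}: {mb_s} MB/s is {share:.1f}% of the {g3_peak:.0f} MB/s peak")
```

On those assumptions even the read figure is well under a third of the theoretical peak, which is why the numbers look low.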

Pegasos I, it can be expected, because the Mai chip and the SDR RAM don't help.

To answer your question, there was never any new firmware for the Pegasos I, but there was an update for the Pegasos II which saw some people's benchmarks change - yours included. That is what I was talking about.
Quote:
Even the slow 266Mhz-V4e-16bit is a lot faster than UAE (without Jit) on my Dual 2.2-GHz DueCore.
That's not too surprising considering UAE intends to be cycle-accurate in chipset terms.

_________________
Matt Sealey


Posted: Sat Apr 05, 2008 2:15 am

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
Quote:
The internal gcc stuff is not going to be used as much as libc memcpy()
We have room for improvement on Coldfire.
The glibc is slow too.
Quote:
Quote:
Can you explain this?
What do you mean by reorder?
MOVEM.L D0-D3,-(A7)
This looks like it would store registers D0 thru D3 in that order, to the stack.
The -(An) addressing mode of course has to work that way.
Stack operation is LIFO (last in first out)

BTW it was always documented that it works this way:
"If the effective address is specified by the predecrement mode, the order of storing is from address register 7 to address register 0, then from data register 7 to data register 0."

BTW you would never use -(An) addressing mode for memcpy on CF but always the (An) addressing mode.

Quote:
Quote:
Its Peg 1.
That the Peg 1 and Peg 2 G3 are very slow is no news, isn't it?
The Pegasos II should be getting very similar write speed to the read speed, and at least twice faster than those figures in my experience. Remember the G3 has a theoretical bandwidth of 800MB/s at 100MHz bus, on a 60x bus.
I think theory and practice are sometimes two different things. :-)

On MorphOS even the 1000 MHz G4 on Peg II
was hardly getting over 200 MB/sec memory read performance on a 64bit bus.

It would be interesting to know how the ColdFire will perform when connected to 32-bit memory.
Can a 266 MHz ColdFire with 32-bit memory reach the same memory performance as the Peg2/G4 under MOS?
Quote:
Quote:
Even the slow 266Mhz-V4e-16bit is a lot faster than UAE (without Jit) on my Dual 2.2-GHz DueCore.
That's not too surprising considering UAE intends to be cycle-accurate in chipset terms.
No no, I was not referring to Chipset emulation but to 68k-CPU emulation which was set to "run fast as possible" mode. :-)


Posted: Sat Apr 05, 2008 7:08 am
Genesi

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1422
BTW, have you noticed the FPGA on the board yet...;-) You can access it from the set of pins on the top center of the board.

R&B :)

_________________
http://bbrv.blogspot.com


Posted: Sat Apr 05, 2008 7:35 am
Site Admin

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1589
Location: Austin, TX
Quote:
The -(an) addressing mode has of course to work that ways.
Well, it doesn't, why does it matter what the register order is to a stack dump of registers? Why can't you load registers in any order?

It's not exactly intuitive and it certainly does not act the same as multiple move.l instructions - sometimes you cannot just replace 4 move.l with a movem.l without reordering your data in the first place.
Quote:
Stack operation is LIFO (last in first out)
There is no ABI specification that says that register dumping to stack has to be in any particular order :)

The point is that several move.l instructions would work and consolidating them to movem.l without a modest restructuring of code doesn't work in some - some might say many - circumstances. For single instruction store/restore of stack, it works and I am sure it's generated in this circumstance.

For a memory copy of data from one place to another, register availability, instruction scheduling, the choice of post/pre-increment/decrement mode, and the addressing modes in use and their impact on performance all have to be weighed by the compiler's code generation subsystem. It may simply not generate better code - like using the string subset on PowerPC. GCC 4 on practically every platform would be a great example of a better code generator doing a bad job.
Quote:
BTW you would never use -(An) addressing mode for memcpy on CF but always the (An) addressing mode.
Why always the focus on memory copies? :D
Quote:
On MorphOS even the 1000 MHz G4 on Peg II
was hardly getting over 200 MB/sec memory read performance on a 64bit bus.
You yourself have benchmarked higher.

I don't think memory bandwidth is the be-all and end-all of system optimization. In the end, more likely to improve overall system performance especially on low-speed CPUs is DMA offload, and calculation offload (for instance TCP/IP checksumming or any kind of encryption). The ColdFire chip has plenty of DMA provision. I would suggest perhaps implementing a Linux DMA Engine (drivers/dma - like IOAT and Freescale's PowerQUICC) and seeing how this improves performance.

I'm looking into the same thing for MPC5200B right now, we might finally use BestComm for something :)

Remember one of the benefits of the original Amiga was the DMA access independent of the 68000 CPU. Since the 68000 only got to access the bus every other clock, it was important to let the Blitter, Paula audio etc. be able to do their work without having to be in constant coordination with the CPU.

It would be much better to hand off the CPU to do some other work like calculating a fractal, and also preserve the performance of I/O peripherals by not having them wait on the CPU which is busy doing memory copies.
Quote:
Quote:
Quote:
Even the slow 266Mhz-V4e-16bit is a lot faster than UAE (without Jit) on my Dual 2.2-GHz DueCore.
That's not too surprising considering UAE intends to be cycle-accurate in chipset terms.
No no, I was not referring to Chipset emulation but to 68k-CPU emulation which was set to "run fast as possible" mode. :-)

However memory access for chip-ram might still be restricted; the entire emulation is focussed on keeping a cycle-accurate Amiga chipset in there, regardless of the CPU emulation. This includes timers like EClock, CIA emulation, Paula interrupts etc. - I am not sure how much this would affect the system in a pretend RTG driver with no other chipset access, but remember "fast as possible" emulation is a concession to the Amiga Forever crowd and not the original Amiga gaming roots of UAE. What was your UAE configuration? How much fast-ram and what OS did you run and what CPU are you emulating? :)

God is in the details.. :D

Perhaps it would be better to run a more proficient emulator, which can run a similar OS - unless you are running Linux under UAE (that is a scary thought) as well as on the ColdFire board, it may not be a fair test (it may favour the Amiga emulation, if Linux really is as bad on ColdFire as you say :)

_________________
Matt Sealey


Posted: Wed Apr 09, 2008 12:05 am

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
Quote:
Quote:
The -(an) addressing mode has of course to work that ways.
Well, it doesn't, why does it matter what the register order is to a stack dump of registers? Why can't you load registers in any order?
Of course the order in which the registers are put on the stack matters.
Please mind that there are three ways of putting variables on the stack.

A) Single moves.
B) C-style with link and framepointer.
C) Without framepointer

The CPU is designed to produce the correct result independent of which of the three ways you choose.
The 68k ensures that the stack access is always working correctly.

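Example 1:
Single moves
Code:
move.l D3,-(SP)      ; push D3 first so D0 ends up at the lowest address
move.l D2,-(SP)
move.l D1,-(SP)
move.l D0,-(SP)
...
move.l (SP)+,D0      ; pop in the opposite order
move.l (SP)+,D1
move.l (SP)+,D2
move.l (SP)+,D3
is the same as:

Example 2:
Link and movem
Code:
link FP,#-16         ; save FP and reserve 16 bytes
movem.l D0-D3,(SP)   ; D0 goes to the lowest address
...
movem.l (SP),D0-D3
unlk FP              ; release the frame
And the same as:

Example 3:
Just movem
Code:
movem.l D0-D3,-(SP)
...
movem.l (SP)+,D0-D3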
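
To check that the three save sequences above really leave the same picture in memory, here is a toy model in Python. It is only a sketch, not an emulator: it assumes the documented 68k ordering (predecrement mode stores the highest-numbered register first, the plain (SP) form stores the lowest-numbered register at the lowest address), and `STACK_TOP` and all function names are invented for this illustration.

```python
STACK_TOP = 0x1000                    # hypothetical initial stack pointer
D_REGS = ["D0", "D1", "D2", "D3"]


def single_moves(regs):
    """Example 1: move.l D3,-(SP) down to move.l D0,-(SP)."""
    sp, mem = STACK_TOP, {}
    for name in ["D3", "D2", "D1", "D0"]:
        sp -= 4
        mem[sp] = regs[name]
    return sp, mem


def link_and_movem(regs, fp=0):
    """Example 2: link FP,#-16 followed by movem.l D0-D3,(SP)."""
    sp, mem = STACK_TOP, {}
    sp -= 4
    mem[sp] = fp                      # link pushes the old frame pointer
    sp -= 16                          # ...and reserves 16 bytes
    for i, name in enumerate(D_REGS): # ascending registers, ascending addresses
        mem[sp + 4 * i] = regs[name]
    return sp, mem


def movem_predecrement(regs):
    """Example 3: movem.l D0-D3,-(SP)."""
    sp, mem = STACK_TOP, {}
    for name in reversed(D_REGS):     # predecrement mode stores D3 first
        sp -= 4
        mem[sp] = regs[name]
    return sp, mem


def saved_block(sp, mem):
    """The four longwords starting at SP, lowest address first."""
    return [mem[sp + 4 * i] for i in range(4)]


regs = {"D0": 10, "D1": 11, "D2": 12, "D3": 13}
layouts = [saved_block(*save(regs))
           for save in (single_moves, link_and_movem, movem_predecrement)]
print(layouts)
```

In this model all three variants end with D0 at the address SP points to and D3 at the highest address of the block, so the matching restore sequences read back the right registers.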
Quote:
It's not exactly intuitive and it certainly does not act the same as multiple move.l instructions - sometimes you cannot just replace 4 move.l with a movem.l without reordering your data in the first place.
The CPU behaves 100% correctly.
This is the concept of how the stack works.

I can understand that if you just look at movem.l without considering the concept of the stack this might look unintuitive at first for you.

Quote:
Quote:
On MorphOS even the 1000 MHz G4 on Peg II
was hardly getting over 200 MB/sec memory read performance on a 64bit bus.
You yourself have benchmarked higher.
No, I did not.
On MorphOS the G4-Peg2 has a maximum memory read performance
of about 220-260 MB/sec depending on your memory and firmware versions.

On Linux and using ASM Cache prefetch tricks you could get the memory read performance to close to 700 MB/sec.

As the ASM cache instructions do NOT work on MorphOS you can not reach this value on MorphOS.

If you write C code (without ASM cache instructions) your performance is limited to 260 MB/sec both on Linux and MorphOS.


Quote:
I don't think memory bandwidth is the be-all and end-all of system optimization.
Memory performance is a very good indicator for system speed.
One of the limiting factors for running "bigger" applications is the memory performance. It's not necessarily the copy performance, but the speed at which your CPU can read or write data is very important.

Especially for running Linux applications (which are often bigger) the memory performance is very important.


Quote:
Remember one of the benefits of the original Amiga was the DMA access independant of the 68000 CPU. Since the 68000 only got to access the bus every other clock, it was important to let the Blitter, Paula audio etc. be able to do their work without having to be in constant coordination with the CPU.

It would be much better to hand off the CPU to do some other work like calculating a fractal, and also preserve the performance of I/O peripherals by not having them wait on the CPU which is busy doing memory copies.
I fully agree with you.
The major strength of the AMIGA design was the possibility to off-load jobs to other engines. The key was that you could do this off-loading on AMIGA quickly and for a very low overhead.

The SuperAGA chipset revives this concept again.
Powerful chipset which can be fully used for a very low overhead.
Quote:
Quote:
Quote:
That's not too surprising considering UAE intends to be cycle-accurate in chipset terms.
No no, I was not referring to Chipset emulation but to 68k-CPU emulation which was set to "run fast as possible" mode. :-)
However memory access for chip-ram might still be restricted; the entire emulation is focussed on keeping a cycle-accurate Amiga chipset in there, regardless of the CPU emulation.
Don't worry, there are no chip ram accesses involved in this test.

I believe that the test is a very good way to show
the maximum JIT performance of the systems used.



Quote:
Perhaps it would be better to run a more proficient emulator, which can run a similar OS - unless you are running Linux under UAE (that is a scary though) as well as on the ColdFire board, it may not be a fair test (it may favour the Amiga emulation, if Linux really is as bad on Coldfire as you say :)
The test runs fully independent of the OS.
As there are no OS calls during the test, the performance of the underlying OS is not important for the test results, with one exception only.
The exception is the memory latency benchmark.
Enabling the MMU has a noticeable negative impact on the memory latency results of course.


The test gives a very good indication of the performance of the different emulations (JIT on x86, JIT on PPC, CF native). I'll summarize the results for you and I'll post the test source later.



Cheers
Gunnar


Posted: Wed Apr 09, 2008 6:16 am
Site Admin

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1589
Location: Austin, TX
Quote:
Quote:
Well, it doesn't, why does it matter what the register order is to a stack dump of registers? Why can't you load registers in any order?
Of course does the order in which the register are put on the stack matter.
I think you misunderstand me here. I of course know how a stack works. You did notice I write a lot of Forth code :D

movem.l is great for stack usage, where you just load or store a bunch of registers in a certain predicated order to the stack with a decrement or copying memory from lowest address to highest address with an increment.

However if you want to move registers in a RANDOM order - perhaps you have no control over the specifics of the data or need to swap things around - then there is no benefit to movem.l. Unless you want to throw 4 or 5 registers, in reverse order, to a certain, steadily inc/decrementing location, movem.l really doesn't optimize anything at all.

I would say that unless you are dealing with stack-like structures or straight memory copies, its usefulness sort of runs out. Especially if you are dealing with C code, if you load data into 4 "int"s, you have no idea which registers these will use. So copying them all to the stack in a certain order using movem.l is not an optimization, since you do not know the order of your input data.

It's impossible to reliably specify in C/C++, even with pragmas and attributes, the register allocation to restrict it to an order favorable to movem.l - in this situation it would be detrimental to register allocation and may reduce the performance of all the code around it - and how would you automatically determine when register allocation SHOULD favor movem.l for a certain function?
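
To make the ordering constraint concrete, here is a rough sketch of the check a code generator would have to pass before folding several move.l stores into one movem.l. It assumes the non-stack (An) form, where movem.l writes the selected registers in ascending register number to ascending, consecutive longwords. The helper `can_merge_into_movem` and its numbering (D0-D7 as 0-7, A0-A7 as 8-15) are invented for this illustration and say nothing about how GCC actually decides.

```python
def can_merge_into_movem(stores):
    """stores: (register_number, byte_offset) pairs, one per move.l store,
    all relative to the same base address register.

    Returns True if one movem.l to that base would leave the same memory
    layout as the individual stores.
    """
    if len(stores) < 2:
        return False                  # nothing to merge
    by_offset = sorted(stores, key=lambda store: store[1])
    regs = [reg for reg, _ in by_offset]
    offsets = [off for _, off in by_offset]
    # The target longwords must be back to back...
    contiguous = all(b - a == 4 for a, b in zip(offsets, offsets[1:]))
    # ...and register numbers must rise together with the addresses.
    ascending = all(a < b for a, b in zip(regs, regs[1:]))
    return contiguous and ascending


# D0..D3 stored to consecutive longwords: mergeable.
print(can_merge_into_movem([(0, 0), (1, 4), (2, 8), (3, 12)]))
# Same data, but the register allocator put the first value in D1 and the
# second in D0: the layout no longer matches what movem.l would write.
print(can_merge_into_movem([(1, 0), (0, 4), (2, 8), (3, 12)]))
```

The second case is the situation described above: the data is fine, but the registers the compiler happened to pick are in the wrong order for a single movem.l.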

My point is, simply noticing that GCC does not generate an instruction where you would consider it a better usage, is not perhaps a flaw in GCC, but a choice of the language you used to describe it - C, rather than assembly. Since the complexity of determining whether data should be kept in order for movem.l optimization is.. well, complex, and probably far beyond the dependency generation and register transfer languages compilers use.

GCC would probably never output move16 on a 68040+ either, because its actual use is really quite hard to squeeze into most generic code. You will never see GCC generate a ror.l either, because the C language does not have any intrinsic to describe it. This is not a flaw in the compiler and does not need "fixing". Lucky for you, neither of these instructions exists on ColdFire. However that does not mean that every instruction in ColdFire is something which can be scheduled by a C compiler in large quantities of code.
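
For reference, without a rotate operator in the language a rotate has to be spelled out with two shifts and an OR. The sketch below shows that spelling in Python, chosen only so it runs as-is; the C version has the same shape. The name `ror32` is made up here, and whether a given compiler folds this pattern into a single rotate instruction on a target that has one depends on the compiler, which this thread does not establish.

```python
MASK32 = 0xFFFFFFFF


def ror32(value: int, count: int) -> int:
    """Rotate a 32-bit value right by count bits using shifts, OR and a mask."""
    count %= 32                       # rotating by 32 is a full turn
    value &= MASK32
    return ((value >> count) | (value << (32 - count))) & MASK32


print(hex(ror32(0x12345678, 8)))
```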

_________________
Matt Sealey


Posted: Wed Apr 09, 2008 6:22 am
Site Admin

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1589
Location: Austin, TX
Second reply to get to this curiosity;
Quote:
On MorphOS the G4-Peg2 has a maximum memory read performance of about 220-260 MB/sec depending on your memory and firmware versions.

On Linux and using ASM Cache prefetch tricks you could get the memory read performance to close to 700 MB/sec.

As the ASM cache instructions do NOT work on MorphOS you can not reach this value on MorphOS.
Why don't the cache instructions work on MorphOS? dcbt/dcbz surely aren't "killed" by something inside the OS?

_________________
Matt Sealey


Posted: Wed Apr 09, 2008 6:51 am

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
Quote:
movem.l is great for stack usage, where you just load or store a bunch of registers in a certain predicated order to the stack with a decrement or copying memory from lowest address to highest address with an increment.
As MOVEM is designed for this we can agree MOVEM works as designed.


Last edited by gunnar on Wed Apr 09, 2008 8:39 am, edited 1 time in total.

Posted: Wed Apr 09, 2008 6:57 am

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
Quote:
Second reply to get to this curiosity;

Why don't the cache instructions work on MorphOS? dcbt/dcbz surely aren't "killed" by something inside the OS?
Interesting question. Maybe the MorphOS-Team can come back with the answer.


Posted: Wed Apr 09, 2008 10:28 am
Site Admin

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1589
Location: Austin, TX
Quote:
As MOVEM is designed for this we can agree MOVEM works as designed.
Can we also agree that just because it works as designed, that does not mean that it is designed to work everywhere for every situation? Even the ones you mentioned, and specifically for generated (compiled) code?

_________________
Matt Sealey


PowerDeveloper.org: Copyright © 2004-2012, Genesi USA, Inc.