Power Developer https://powerdeveloper.org/forums/ |
|
benchmark results & questions https://powerdeveloper.org/forums/viewtopic.php?f=61&t=1532 |
Page 1 of 1 |
Author: | gunnar [ Thu Apr 10, 2008 6:05 am ] |
Post subject: | benchmark results & questions |
Hello, we did some general benchmarking of the V4m 54455 dev board to get a better understanding of the overall performance of the V4m. The results give some information, but at the same time they lead to some more questions. Here are the results: General System Benchmark result Code:
-------------------------------------------
Processor & Memory Performance Bench v4.20
-------------------------------------------
Stop all programs before the test.
Do not use the computer during the test.
The test will run some minutes, please be patient.
Total memory required = 4.2 MB.
Calibration loops: 8
-------------------------------------------
Comparing different CPU functions:
Results are in million instructions per sec.
Higher value is faster.

CPU-Benchmark     2MB    16KB     4KB     1KB
-------------------------------------------
addi            251.3   251.3   251.3   251.3
shift           293.2   293.2   251.3   251.3
mix             439.8   439.8   439.8   439.8
mul              67.7    67.7    67.7    65.2
bra-un           41.9    42.9    40.9    41.9
bra-pre         117.3   117.3   117.3   109.9
bsr              13.3    13.4    13.3    13.2
nop              45.1    44.0    44.0    44.0
-------------------------------------------
Measuring memory latency:
Result is million random accesses per sec.
Higher value is faster.

Memory Latency    2MB    16KB     4KB     1KB
-------------------------------------------
random read       1.0
-------------------------------------------
Measuring memory throughput:
Results are in MB/sec. Higher value is faster.

Memory 2 Memory  Alignment 0-0
                  2MB    16KB     4KB     1KB
-------------------------------------------
glibc memcpy     67.7    67.7    66.4    56.7
read 8           81.8    74.9    74.9    69.0
read 16          95.1    95.1    95.1    88.0
read 32         121.3   121.3   117.3   109.9
read 32x4       121.3   121.3   121.3   106.6
read 32x4B      121.3   121.3   117.3   106.6
write 8          13.1    13.2    13.2    13.0
write 16         26.1    26.1    26.1    25.5
write 32         51.0    51.0    51.0    48.9
write 32x4       51.7    51.0    51.0    48.9
write 32x4B     185.2   185.2   175.9   153.0
copy 8           23.3    23.3    23.3    22.0
copy 32          67.7    67.7    67.7    57.7
copy 32x4        65.2    65.2    65.2    55.8
copy 32x4B      117.3   121.3   117.3    90.2
-------------------------------------------
Cache 2 Cache    Alignment 0-0
                  2MB    16KB     4KB     1KB
-------------------------------------------
glibc memcpy     67.7   106.6   106.6   103.5
read 8           74.9    92.6   140.7   135.3
read 16          92.6   125.7   219.9   219.9
read 32         121.3   175.9   586.4   502.6
read 32x4       121.3   185.2   703.7   703.7
read 32x4B      121.3   185.2   879.6   879.6
write 8          13.2    13.3    13.3    13.3
write 16         26.1    26.7    26.7    26.7
write 32         51.0    53.3    53.3    52.5
write 32x4       51.0    53.3    53.3    53.3
write 32x4B     185.2   207.0   207.0   207.0
copy 8           23.1    23.8    26.7    26.7
copy 32          67.7   100.5   106.6   103.5
copy 32x4        65.2   103.5   106.6   103.5
copy 32x4B      117.3   439.8   390.9   390.9
-------------------------------------------
Those interested can find the source of the bench here: bench.c, bench68k_test.s. Linux CF executable: benchcf

Quick analysis:
addi and shift show that the V4m can in general issue one integer instruction per clock.
mix shows that under some circumstances the V4 is able to execute two integer instructions per clock.
mul shows that the V4 needs 4 clocks for a normal integer multiplication.
bra-un and bra-pre show that correctly predicted branches are quite fast.

Overall the V4m seems to be a nice embedded CPU. At first glance, clock for clock, the instruction unit of the V4 is more powerful than a 68040's.

Latency: random read = 1.0, i.e. about 260 clocks for one random memory read. That seems quite slow to me. I wonder where this latency is caused?

Memory throughput: the read performance looks good to me. The write performance is very low. 
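Out of curiosity, the "million instructions per sec" figures above can be approximated with a plain C timing loop. This is only an illustrative sketch, not the actual bench.c: the function name, iteration count, and use of eight unrolled adds are my own invention, and clock() is far cruder than what a real bench would use.

```c
#include <stdint.h>
#include <time.h>

#define OPS_PER_ITER 8
#define ITERS 1000000L

/* Time a long run of independent integer additions and report millions
 * of operations per second. The volatile qualifier keeps the compiler
 * from folding the loop away, at the cost of extra loads/stores. */
static double mips_of_adds(void) {
    volatile uint32_t a = 1, b = 2, c = 3, d = 4;
    clock_t t0 = clock();
    for (long i = 0; i < ITERS; i++) {
        a += 1; b += 1; c += 1; d += 1;   /* 8 adds per iteration */
        a += 1; b += 1; c += 1; d += 1;
    }
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    if (secs <= 0.0)                      /* guard against clock granularity */
        secs = 1e-9;
    return (double)ITERS * OPS_PER_ITER / secs / 1e6;
}
```

Dividing the reported Mops/s into the CPU clock (266 MHz for the V4m board) then gives the clocks-per-instruction estimate used in the analysis above.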
It looks suspicious that the write performance when working on small cacheable blocks is the same as the normal memory write performance. Normally the CPU should get much faster in this test. I wonder how the cache is set up. Could it be that the cache is not running in "copy-back" but in "write-through" mode? Does someone know which mode it is in? Could someone explain the reason to use "write-through"? Cheers Gunnar |
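One way to probe this question from userspace: with a copy-back cache, repeatedly rewriting a buffer that fits in the data cache should be far faster than streaming writes over a large buffer, while with write-through both are limited by memory write bandwidth, matching the flat small-block numbers in the post. A hedged sketch (buffer sizes and the function name are invented; this is not the original bench):

```c
#include <string.h>
#include <time.h>

/* Write `total` bytes by repeatedly memset-ing a buffer of `size` bytes,
 * and return the achieved write rate in MB/s. If `size` fits in the data
 * cache and the cache is copy-back, this should be much faster than a
 * buffer far larger than the cache; under write-through both cases go
 * out to memory on every store. */
static double write_mb_per_s(unsigned char *buf, size_t size, size_t total) {
    clock_t t0 = clock();
    for (size_t done = 0; done < total; done += size)
        memset(buf, 0x55, size);              /* pure write traffic */
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    if (secs <= 0.0)                          /* guard clock granularity */
        secs = 1e-9;
    return (double)total / secs / 1e6;
}
```

Calling this once with a 1 KB block and once with a 2 MB block, over the same total byte count, reproduces the comparison between the "1KB" and "2MB" columns above.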
Author: | weiljw [ Thu Apr 10, 2008 6:46 am ] |
Post subject: | |
Gunnar, I haven't looked at the code yet, but in the bottom part of your post you are asking about cache setup. What init code did you use? In what environment are you running these tests? Linux, or bare metal? -JWW |
Author: | weiljw [ Thu Apr 10, 2008 6:50 am ] |
Post subject: | |
Just tried to grab the files...Links do not appear to work. FYI -JWW |
Author: | gunnar [ Thu Apr 10, 2008 8:04 am ] |
Post subject: | |
Quote: Just tried to grab the files...Links do not appear to work. FYI
Sorry, my fault: a typo in the URL. I have corrected the links and I have added a download link to the compiled Linux CF executable. I executed the test on the Linux that is installed on the CF dev boards. John, many thanks for looking into this! I'm very curious to understand the setup of the memory and cache here. Cheers Gunnar |
Author: | gunnar [ Thu Apr 10, 2008 8:55 am ] |
Post subject: | |
Understanding the ColdFire Cache. Thinking about the data cache of the CF brings me to another question. The V4 handbook describes copy-back cache behavior as follows: Quote: CFV4ebook - 8.7.4.3 Copyback Mode (Data Cache Only) ... If a byte, word, longword, or line write access misses in the cache, the required cache line is read from memory, thereby updating the cache. ... Kind regards, Gunnar |
Author: | Neko [ Sat Apr 12, 2008 4:20 am ] |
Post subject: | |
Quote: It says that even a line write access which misses in the cache causes the required cache line to be read from memory. Is this a typing error in the manual, or does it mean that a CF will fetch a line from memory even if it's aware that it will overwrite it completely in the next step? Or am I just misunderstanding this?
As I understand it, a line write access may be made even if only a longword of that cache line changed. In the event that the other 3 longwords in the cache line have changed, the cache subsystem has to merge the two together in order to stay coherent. It's all down to the way most caches work; they are rarely designed to load or store less than a cache line (this is the whole point of organising it in 16-, 32- or 64-byte chunks in the first place). Also, a processor usually has no idea what real memory is; everything is interfaced behind at least one level of cache (and turning the caches off simply means every cache access is a miss :) and some kind of platform bus. So for the processor to ever work on memory at all, it needs to make sure the data in the cache matches that in real memory, at any cost, simply for coherency's sake. |
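The fill-then-merge behaviour being debated can be made concrete with a toy model. The sketch below is a single-line copy-back cache, with names and structure invented purely for illustration: a byte write that misses first fills the whole 16-byte line from memory, then merges the byte and only marks the line dirty; memory is not updated until the line is evicted.

```c
#include <stdint.h>
#include <string.h>

#define LINE 16

/* Toy one-line copy-back cache (illustrative only, not the CF hardware). */
struct toy_cache {
    uint32_t tag;          /* base address of the cached line */
    int valid, dirty;
    uint8_t data[LINE];
};

static void write_byte(struct toy_cache *c, uint8_t *mem,
                       uint32_t addr, uint8_t val) {
    uint32_t line = addr & ~(uint32_t)(LINE - 1);
    if (!c->valid || c->tag != line) {
        if (c->valid && c->dirty)            /* evict: copy line back */
            memcpy(mem + c->tag, c->data, LINE);
        memcpy(c->data, mem + line, LINE);   /* line fill on write miss */
        c->tag = line;
        c->valid = 1;
        c->dirty = 0;
    }
    c->data[addr - line] = val;              /* merge the byte into the line */
    c->dirty = 1;                            /* memory is now stale */
}
```

After a single byte write, memory still holds the old value; only when a write to a different line forces an eviction does the modified line get burst back out, which is the coherency behaviour described above.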
Author: | gunnar [ Sat Apr 12, 2008 6:45 am ] |
Post subject: | |
Quote:
As I understand it, a line write access may be made even if only a longword of that cache line changed.
The point is that the ColdFire V4 can distinguish byte/word/longword writes from a full line write. The ColdFire has an optimization for "full line writes" when using write-through cache mode. A "full line write" is a write that will overwrite all 16 bytes of the cache line anyway. To create a full line write, Freescale recommends using the MOVEM instruction.

As far as I understand, the ColdFire cache operates like this:

Write to memory address in write-through mode:
----------------------------------------------
(In write-through the data is written to memory directly; it is NOT fetched into the on-chip cache first.)
Byte write: direct byte memory write of the data.
Word write: direct word memory write of the data.
Longword write: direct longword memory write of the data.
Line write (generated by the MOVEM instruction): the 16 bytes are burst out.

If you run with DDR2 memory, any access that does not burst is rather slow. Some clarification on how the ColdFire DDR2 memory interface works would be appreciated. It's clear that using line writes greatly improves performance.

Write to memory address in copy-back mode:
----------------------------------------------
Copy-back always bursts a whole cache line in and out. So if you alter one byte of memory which is not yet cached, the cache line gets burst in and altered in the on-chip cache. The CPU will burst the content of the altered cache line back out if it needs the cache line for something else. Copy-back is in most cases MUCH more efficient than write-through. I'm a bit puzzled that the Linux here does not operate in copy-back mode. It would be nice to learn if there is a reason for this.

The question I had is how the CPU handles it when it recognizes a "full line write", for example one created by a MOVEM instruction. Reading the line in completely, only to then overwrite it completely, is of course not efficient. I would say this could be regarded as a bug. 
The question that we have is: is this a misprint or a misunderstanding of the manual, or does the V4 have a deficiency here? |
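The "full line write" pattern can be sketched in C: copy 16 bytes (one V4 cache line) per iteration as four 32-bit words. On ColdFire, a compiler or hand-written asm can express the four loads and four stores as MOVEM pairs, which is what supposedly lets the cache recognise a full-line write and burst it out without a preceding line fill. This is my own illustrative sketch, assuming 4-byte-aligned buffers and a length that is a multiple of 16:

```c
#include <stddef.h>
#include <stdint.h>

/* Copy `len` bytes, 16 at a time, as four 32-bit words per iteration.
 * Assumes: dst/src are 4-byte aligned, len is a multiple of 16.
 * On ColdFire the 4-word group maps naturally onto MOVEM. */
static void copy_lines(uint32_t *dst, const uint32_t *src, size_t len) {
    for (size_t i = 0; i < len / 4; i += 4) {
        /* load the whole line first, then store it (MOVEM-like pattern) */
        uint32_t a = src[i], b = src[i + 1], c = src[i + 2], d = src[i + 3];
        dst[i] = a; dst[i + 1] = b; dst[i + 2] = c; dst[i + 3] = d;
    }
}
```

This corresponds to the "32x4B" rows of the benchmark, which are the fastest write and copy variants in the tables above.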
Author: | Neko [ Sat Apr 12, 2008 7:19 am ] |
Post subject: | |
Quote: (In write through the data will be written to the memory directly - Data will NOT be fetched to on chip cache first)
Right, but the data in the cache is always valid at this point, during a write-through cache operation. The processor writes it to the cache and then the cache subsystem immediately pushes it back out to memory. Copy-back may wait. In the event that it waits and in the meantime data has been changed, it *must* refresh the cache line with the contents of memory in order to maintain coherency. The cache will absolutely prioritize the coherency of data in the cache (since all memory access, read or write, has to live in the cache at some time), but it may not be too concerned with the coherency of data in memory. The relevant snippet (because it's pretty much the same thing) from the PowerPC 32-bit Programming Environments manual:

Quote: 5.2.4.1.1 Pages Designated as Write-Through
When a page is designated as write-through, store operations update the data in the cache and also update the data in main memory. The processor writes to the cache and through to main memory. Load operations use the data in the cache, if it is present. In write-back mode, the processor is required only to update data in the cache. The processor may (but is not required to) update main memory. Load and store operations use the data in the cache, if it is present. The data in main memory does not necessarily stay consistent with that same location's data in the cache. Many implementations automatically update main memory in response to a memory access by another device (for example, a snoop hit). In addition, dcbst and dcbf can explicitly force an update of main memory. The write-through attribute is meaningless for locations designated as caching-inhibited.

The actual operation really is irrelevant; I think the PowerPC explanation is a lot clearer, though. If it's a bug to do this, then this method of cache management has been broken for 18 years in the m68k line. Do you really think this is true? REALLY? |
Author: | gunnar [ Sat Apr 12, 2008 7:37 am ] |
Post subject: | |
Neko, please do NOT mix PowerPC with CF here. ColdFire and PowerPC are not the same! The behavior of "write-through" on the CF is different from what your post describes. Please be so kind and let's refer to CF documentation in this discussion, to prevent confusion. Quote:
If it's a bug to do this, then this method of cache management has been broken for 18 years in the m68k line. Do you really think this is true? REALLY?
Yes. To avoid this behavior Motorola added the MOVE16 instruction to the 68k instruction set. The CF does not support MOVE16 anymore, but the ColdFire compensates for this by supporting burst recognition on the MOVEM instruction. I think losing MOVE16 is not a loss, as MOVEM is more powerful on the ColdFire now. My question is only whether this condition is handled correctly by the ColdFire's MOVEM, or whether this is an oversight in the current V4. |
Author: | Neko [ Sat Apr 12, 2008 10:18 am ] |
Post subject: | |
Quote: But please do NOT mix PowerPC with CF here. Coldfire and PowerPC is not the same!
A cache is a cache is a cache. The basic operation of a write-through cache and a write-back cache was invented decades ago, and it hasn't changed. Write-back caches have ALWAYS had this caveat of requiring a little more bandwidth on a miss; because that case is rare enough compared to the massive speed gains of not having to write through, it speeds everything up. The actual implementation as logic may be slightly different, but the fundamental operation (something you might see taught in Computer Science courses) is pretty much identical, unless you want to study in depth the benefits of the many different ways you can implement coherence protocols. At the high level most people work at, it is identical. Try not to get too close to the metal. It won't help your code. Quote: My question only is if this condition is handled by the movem of the Coldfire correctly or if this is an oversight in the current V4.
Does this really, really impact your plans to emulate the 68000 opcodes not supported on ColdFire? I think there is far more to do than test memory bandwidth and niggle over cache handling. I really do not think that the figures you got from the simple benchmark are that bad. They are certainly far higher than the ones you would get on an original m68k processor, which shows you have a lot of room to soak up any overheads. Wouldn't it be good to start trying to emulate a certain set of opcodes now? Perhaps, and I am not joking, movem.l is a good start: try implementing a handler which reimplements the extra addressing modes (predecrement etc.) and see how much performance you can get out of it. |
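The core of such a handler is mask decoding. For the 68000's predecrement form, `movem.l <regs>,-(An)`, the 16-bit register mask is reversed relative to the other modes: bit 0 names A7 and bit 15 names D0. A hypothetical sketch of the store step, with the register file modeled as a plain array (regs[0..7] = D0-D7, regs[8..15] = A0-A7) and memory as a word array; all names here are invented for illustration, not from any real emulator:

```c
#include <stdint.h>

/* Emulate the store phase of 68000 "movem.l <regs>,-(An)".
 * mask: bit 0 = A7 ... bit 15 = D0 (reversed order, predecrement form).
 * sp:   current value of the address register being decremented.
 * mem:  word-addressed backing store starting at guest address `base`.
 * Returns the updated address register value. */
static uint32_t movem_predec(const uint32_t regs[16], uint16_t mask,
                             uint32_t sp, uint32_t mem[], uint32_t base) {
    for (int bit = 0; bit < 16; bit++) {
        if (mask & (1u << bit)) {
            int reg = 15 - bit;          /* undo the reversed mask order */
            sp -= 4;                     /* predecrement before each store */
            mem[(sp - base) / 4] = regs[reg];
        }
    }
    return sp;
}
```

Because the mask is walked from bit 0 (A7) upward while the address decrements, D0 ends up at the lowest address, matching the 68000's memory layout for this instruction.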
Author: | Tcheko [ Sat Apr 12, 2008 11:15 am ] |
Post subject: | |
Quote: Quote: My question only is if this condition is handled by the movem of the Coldfire correctly or if this is an oversight in the current V4.
Does this really, really impact your plans to emulate the 68000 opcodes not supported on ColdFire? I think there is far more to do than test memory bandwidth and niggle over cache handling. I really do not think that the figures you got from the simple benchmark are that bad. They are certainly far higher than the ones you would get on an original m68k processor. I think this shows you have a lot of room to soak up any overheads. Wouldn't it be good to start trying to emulate a certain set of opcodes now? Perhaps, and I am not joking, movem.l is a good start, try implementing a handler which reimplements the extra addressing modes (decrements etc.) and see how much performance you can get out of it. Did you have a look at this? http://www.microapl.co.uk/Porting/ColdF ... 8KLib.html Czk. |
Author: | jcmarcos [ Mon Apr 14, 2008 3:27 am ] |
Post subject: | |
Matt, Gunnar, although you might think that your discussion is getting hard, I like it a lot. I struggle to understand most things, but this is top-notch technical debate! |
Author: | gunnar [ Mon Apr 14, 2008 4:53 am ] |
Post subject: | |
Quote: I struggle to understand most things...
Hi Juan, please ask if you have any questions. |
Page 1 of 1 | All times are UTC-06:00 |