i.MX515 Project
Profiling the i.MX515 GL stack

in category Graphics & 3D
proposed by blu on 30th August 2010 (accepted on 1st September 2010)
Project Summary

Profiling the i.MX515 GL stack to find weak points and optimization potential with regard to NTSH-JASS and other application support. Evaluation of the prospects for further development of commercial-grade game content-creation tools and pipeline components.


My firm belief is that the current generation of SoC silicon utilized in the handheld, nettop, smartbook and other low-power embedded markets hosts potent GPUs. The i.MX515 in particular has a state-of-the-art GPU in the form of the AMD Z430 "Yamato DX". The practical deliverables of this project are:

  • to identify bottlenecks and points of contention in the software GL stack
  • to provide an OSS codebase suitable for quick evaluation and/or facilitation of the advanced features in the GL software stack
  • to demonstrate that the Z430 is at least on par with other popular SoC GPUs of the same timeframe (e.g. PowerVR SGX 53x)
  • to provide a sample content-creation graphics tool - a GLSL shader editor/compositor (final stage).

Project Blog Entries

  A picture is worth a thousand pixels
posted by blu on 7th April 2011

I have some news from the graphics trenches, but as I am under fire I will be very brief and use a picture instead.

Original Flickr location

What is seen in the shot above is Maverick Meerkat, more-or-less stock EfikaMX edition, with a more-or-less stock 31.14.20-efikamx kernel from the gitorious repo, running two EGL windows side-by-side, one hosting an ES2 context (left) and the other a VG context (right). What can be guessed from the picture is that the left window is entirely a product of the z430, and the right - of the z160, both GPUs found in the imx515, and hence in my netbook (which produced the screenshot today). What is definitely not possible to tell from the picture are the fps figures of each window - ~12 fps for the bumpy ES2 loop, and ~15 fps for the feline VG loop. Now, while I cannot claim to know the theoretical ceiling of the VG loop on the z160, I have a good idea about that of the bumpy ball - it's ~12.5 fps of rendition alone, assuming zero swap time, on the z430 (the app is effectively a special-case shader torture test for SoC GPUs). In other words, the ES2 window gives a sustained fps close to its _zero_swap_time_ ceiling, while the VG window draws a few kilobytes' worth of vector paths at 15 fps (I'm not kidding about the size of the assets - check the tiger demo sources).

Or, IOW, what we have here are two EGL contexts, of 3D and 2D nature, respectively, outputting simultaneously at some fairly low swapbuffer overheads.

Here the alert reader would say, 'Wait, while the stock imx EXA driver shipping currently with the Maverick EfikaMX is no slouch itself when it comes to visualizing ES contexts, it fails at running a windowed VG context. So what the heck is that screenshot showing?..'

/martin out

ps: sorry about the apparent screen tearing seen on the shot, but it's rather difficult to grab steady pictures of volatile-content windows under a stacking X wm.

pps: I was recently promoted in rank, from a PowerDeveloper to a full-time position at Genesi. Imagine the firepower I have at my disposal now ; )
  It's a bird! It's a plane! No, it's a pixel!
posted by blu on 17th October 2010

In the world of computer graphics, GPUs are these little black boxes that we always talk to over layers and layers of software abstraction. In this regard GPUs are nothing like CPUs - the latter are exposed down to their clockwork - instructions, clocks, and even errata. Not so with GPUs - they're locked boxes keeping a wealth of arcane IP (intellectual property). So getting close and intimate with a GPU is not unlike reading a black box.

So it's a bit of fortune's smile in the world of GPUs (let alone SoC GPUs) when we get something like the famous (or notorious, depending how you look at it) AMD_performance_monitor extension in an actually usable form. Or a form close to usable, anyway.

AMD_performance_monitor from the GL stack of our Z430 offers some 807 performance counters, grouped in 14 groups by category, or sub-system, in the GPU. A great many of those counters carry little useful information to anybody but a driver programmer, yet a select few of the remaining ones can be of great use to us - for profiling purposes, or plain black-box reading. Unfortunately for us, we also have to figure out which those counters are. For that we have to rely on the string names of the individual counters and their respective groups.
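Those string queries go through the extension's own entry points. Below is a minimal enumeration sketch - the entry points would be resolved via eglGetProcAddress at runtime, and the typedefs/enums (normally supplied by GLES2/gl2ext.h) are repeated here only to keep the sketch self-contained; it assumes a current EGL + GLES2 context:

```c
#include <stdio.h>

/* minimal GL typedefs/enums, normally from GLES2/gl2.h + gl2ext.h */
typedef unsigned int GLuint;
typedef unsigned int GLenum;
typedef int          GLint;
typedef int          GLsizei;
typedef char         GLchar;

#define GL_COUNTER_TYPE_AMD   0x8BC0
#define GL_UNSIGNED_INT64_AMD 0x8BC2
#define GL_PERCENTAGE_AMD     0x8BC3

/* extension entry points - resolve these with eglGetProcAddress before use */
static void (*glGetPerfMonitorGroupsAMD)(GLint*, GLsizei, GLuint*);
static void (*glGetPerfMonitorCountersAMD)(GLuint, GLint*, GLint*, GLsizei, GLuint*);
static void (*glGetPerfMonitorGroupStringAMD)(GLuint, GLsizei, GLsizei*, GLchar*);
static void (*glGetPerfMonitorCounterStringAMD)(GLuint, GLuint, GLsizei, GLsizei*, GLchar*);
static void (*glGetPerfMonitorCounterInfoAMD)(GLuint, GLuint, GLenum, void*);

/* pure helper: printable name for a counter-type enum */
const char* counter_type_name(GLenum type)
{
    switch (type) {
    case GL_UNSIGNED_INT64_AMD: return "UNSIGNED_INT64_AMD";
    case GL_PERCENTAGE_AMD:     return "PERCENTAGE_AMD";
    default:                    return "OTHER";
    }
}

void dump_perf_groups(void)
{
    GLint num_groups = 0;
    glGetPerfMonitorGroupsAMD(&num_groups, 0, 0); /* query count first */

    GLuint groups[64];
    glGetPerfMonitorGroupsAMD(&num_groups, 64, groups);

    for (GLint g = 0; g < num_groups; ++g) {
        GLchar name[256];
        glGetPerfMonitorGroupStringAMD(groups[g], sizeof(name), 0, name);

        GLint num_counters = 0, max_active = 0;
        GLuint counters[1024];
        glGetPerfMonitorCountersAMD(groups[g], &num_counters, &max_active,
                                    1024, counters);
        printf("group %u: %s, counters: %d, max active: %d\n",
               groups[g], name, num_counters, max_active);

        for (GLint c = 0; c < num_counters; ++c) {
            GLenum type;
            glGetPerfMonitorCounterStringAMD(groups[g], counters[c],
                                             sizeof(name), 0, name);
            glGetPerfMonitorCounterInfoAMD(groups[g], counters[c],
                                           GL_COUNTER_TYPE_AMD, &type);
            printf("  counter %u: %s, type: %s\n",
                   counters[c], name, counter_type_name(type));
        }
    }
}
```

This is what produces human-readable listings like the RB_SAMPLES one below; the helper names are mine, the GL entry points and enums are the extension's own.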

Here I'll take the opportunity to express my gratitude to the authors of the current iMX515 GL stack, for not deliberately obfuscating the names of the AMD_performance_monitor counters. *shakes fist in the general direction of a few desktop GL stacks*.

Ok, enough ado. Let's get to business.

The last, 14th group of performance counters available in the 2010.07.11 stack is dubbed 'RB_SAMPLES', which stands for 'Render-Block Samples'. Here's its human-readable query:

group 13: RB_SAMPLES, id: 13
number of counters: 4
max number of active counters: 4
counter 0: RB_TOTAL_SAMPLES, id: 0, type: UNSIGNED_INT64_AMD, range: [0, 2^64 - 1]
counter 1: RB_ZPASS_SAMPLES, id: 1, type: UNSIGNED_INT64_AMD, range: [0, 2^64 - 1]
counter 2: RB_ZFAIL_SAMPLES, id: 2, type: UNSIGNED_INT64_AMD, range: [0, 2^64 - 1]
counter 3: RB_SFAIL_SAMPLES, id: 3, type: UNSIGNED_INT64_AMD, range: [0, 2^64 - 1]

Remember last time when we meant to count some pixels for the sake of measuring fillrate workloads, and we had to use basic shapes to make their pixel area easily computable? Well, the above counters give us the exact pixel stats from the POV of the GPU. Hello there, definitive pixel stats!

So, in order of appearance:

  • RB_TOTAL_SAMPLES: pixels output by the rasterizer block, for the duration of the monitor session
  • RB_ZPASS_SAMPLES: of the above, pixels that passed the depth-test
  • RB_ZFAIL_SAMPLES: the complement of the above
  • RB_SFAIL_SAMPLES: pixels that failed the stencil test

In a scenario where we have both depth- and stencil testing in place, our visible pixels would be RB_ZPASS_SAMPLES - RB_SFAIL_SAMPLES. Without stencil testing, our visible pixels would be RB_ZPASS_SAMPLES, the sum of RB_ZPASS_SAMPLES and RB_ZFAIL_SAMPLES should equal RB_TOTAL_SAMPLES, and RB_SFAIL_SAMPLES should stay at a constant zero.
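The bookkeeping above, spelled out as a pair of tiny helpers (a sketch, names mine):

```c
#include <stdint.h>

/* Visible pixels per the paragraph above: depth-passed samples minus
 * stencil-failed ones (the latter is zero when stencil testing is off). */
uint64_t visible_samples(uint64_t zpass, uint64_t sfail)
{
    return zpass - sfail;
}

/* Sanity invariant without stencil testing: every rasterized sample either
 * passes or fails the depth test, so the two must sum to the total. */
int rb_samples_consistent(uint64_t total, uint64_t zpass, uint64_t zfail)
{
    return zpass + zfail == total;
}
```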

So, let's see what our new favorite pixel-counting facility would show under some basic scenarios. First, the little draw-call analysis primer from last time - the one that does N full-viewport draws per frame, in a viewport of, say, 256x256:

tracing AMD_performance_monitor counters:
group id: 13, counter id: 0
group id: 13, counter id: 1
group id: 13, counter id: 2
group id: 13, counter id: 3
total frames rendered: 1000
draw calls per frame: 4
traced AMD_performance_monitor counters:
group id: 13, counter id: 0, value: 262144000
group id: 13, counter id: 1, value: 262144000
group id: 13, counter id: 2, value: 0
group id: 13, counter id: 3, value: 0

What do you know - it works as advertised: 256^2 * 4 * 1000 = 262144000.
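That check, as a one-line helper (a sketch, name mine):

```c
#include <stdint.h>

/* Expected RB_TOTAL_SAMPLES for the overdraw primer: every draw call covers
 * the full viewport, and nothing is depth- or stencil-tested away. */
uint64_t expected_overdraw_samples(uint64_t vp_w, uint64_t vp_h,
                                   uint64_t draws_per_frame, uint64_t frames)
{
    return vp_w * vp_h * draws_per_frame * frames;
}
```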

Let's now try something a tad more complex - our original complex-shader iMX515 test - the one that draws a rotating sphere with some fancy and overly-expensive bump mapping:

tracing AMD_performance_monitor counters:
group id: 13, counter id: 0
group id: 13, counter id: 1
group id: 13, counter id: 2
group id: 13, counter id: 3
total frames rendered: 100
traced AMD_performance_monitor counters:
group id: 13, counter id: 0, value: 46766800
group id: 13, counter id: 1, value: 46766800
group id: 13, counter id: 2, value: 0
group id: 13, counter id: 3, value: 0

For a viewport of 512x512, an orthographically-projected unit-radius sphere spanning NDC space should have a pixel area of R^2 * Pi = 256^2 * Pi = 205887 pixels, over 100 frames - 20588700 pixels. Even if we consider that's an approximate number due to the non-trivial shape and the employed edge rasterization rules, that's still 2.27 times less than what our pixel counter reports. Something is well off here. But how come the counter works down to the single pixel in the first test, and is so off in this second test? The differences between those two tests are negligible outside of the shaders - both primers do neither depth- nor stencil-testing. But one of them also does not clear the color buffer (the viewport overdraw test), while the spinning-sphere one clears its color buffer due to the pursued animation effect. Could it be?.. Hmm.

For a viewport of 512x512, a clear-viewport operation covers exactly 262144 pixels. If we add those to our expected sphere area, we get 468031, over 100 frames - 46803100 pixels. That's suspiciously close to what the performance counter reports as total pixels coming out of the rasterizer (and passing depth-testing) - 46766800. Time to return to our easy-to-calculate-pixel-area viewport overdraw test.
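The estimate above, as a helper (assuming, per the hypothesis, that the clear op is counted like a full-viewport draw; name mine):

```c
/* Estimated per-frame samples for the spinning-sphere primer: the
 * clear-buffer op (vp_w * vp_h), plus the orthographically-projected
 * sphere's disc area (pi * r^2). */
double estimated_sphere_frame_samples(double vp_w, double vp_h, double radius)
{
    const double pi = 3.14159265358979323846;
    return vp_w * vp_h + pi * radius * radius;
}
```

For 512x512 and r = 256, 100 frames of this land within a tenth of a percent of the 46766800 the counter reports.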

Let's tweak it to draw 1/4 of the 512x512 viewport, but this time clear the color buffer too.

tracing AMD_performance_monitor counters:
group id: 13, counter id: 0
group id: 13, counter id: 1
group id: 13, counter id: 2
group id: 13, counter id: 3
total frames rendered: 1000
draw calls per frame: 4
traced AMD_performance_monitor counters:
group id: 13, counter id: 0, value: 524288000
group id: 13, counter id: 1, value: 524288000
group id: 13, counter id: 2, value: 0
group id: 13, counter id: 3, value: 0

Surprise surprise. 512x512 + 256x256 * 4 = 524288 pixels, over 1000 frames - 524288000 pixels.

So what did we just observe? That the pixel counters from the RB_SAMPLES group also count our clear-buffer operation toward the total count, and also toward the depth-test-passed samples.

Wild, unsubstantiated speculation:

Z430 does not have dedicated buffer-clear logic. I would go even further and suggest that the part does not have dedicated ROPs either, but everything is done in the shader ALUs. Something that is not unheard of among mobile SoC GPUs.

/wild, unsubstantiated speculation

So, did our superficial touch to the AMD_performance_monitor extension on the iMX515 pay off? I'd say yes. Imagine what it would be if we actually knew how to use even a tenth of the remaining 803 performance counters ; )

For example code on using the AMD_performance_monitor extension, an exhaustive query of all counters, and perhaps a stepping stone for running some tests on your own - see the test-es primer.
  Draw calls: the cost of living
posted by blu on 12th October 2010

Usually the first thing people try to keep in check in any draw scenario is the number of draw calls. And for a good reason - draw calls are the dirty, CPU-side plumbing of every beautiful GPU pipeline. They are the 'CPU wildcard' factor in any graphics-pipeline timing statistic. That is even more true for small embedded systems, where CPU performance is not abundant.

So how expensive exactly are draw calls on our iMX515, with the GL ES software stack dated 2010.07.11, and the kernel from gitorious? We are about to find out.

For the purpose we need a sound methodology. Let's devise one:

1. Choose what kind of GPU work we are going to measure - per-pixel, per-vertex, or something else. We here choose per-pixel, for reasons that will become obvious below.
2. Draw something of easily-calculable pixel area. We choose a full-screen rectangle.
3. Draw the same thing in a designated 'discard' mode, where the GPU pixel work is brought to negligible or none. For instance, cull the original primitive by inverting the polygon winding - that qualifies as 'none', and this is what we will do.

The time difference between (2) and (3) is then our actual GPU workload. Everything else outside of this time represents costs likely associated with the CPU's dirty job of carrying out our draw calls.

So, let's do a test run first. Let's start with a relatively dense vertex grid for our pixel rectangle, a screen overdraw factor of 1 (i.e. we cover the screen active area just once), and use a single draw call per frame. Additionally, for purity of the test, we will use a 'pass-through' shader (pass vertex coords unmodified, output a fixed color), and a frame-skip ratio of 1:255 (eglSwapBuffers vs. glFinish). So:

  • viewport of 512x512 pixels
  • viewport-spanning grid mesh (we choose indexed triangle list)

    • number of vertices: 2145
    • number of indices: 12288 (that's 4,096 triangles - two per grid cell, on a grid of 32x64)

  • elapsed time for 1000 frames, drawing ON: 6.4s
  • elapsed time for 1000 frames, drawing OFF (culling active): 4.5s
  • effective GPU pixel-munching time: 1.9s
  • effective fillrate: 137,970,526 pixels/s
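For reference, the viewport-spanning grid from the bullets above (2145 vertices and 12288 indices for a 32x64-cell grid) can be generated along these lines (a sketch, names mine; vertices span NDC [-1, 1] so a pass-through shader covers the viewport):

```c
#include <stdint.h>

typedef struct { float x, y; } vert2;

unsigned grid_vertex_count(unsigned cols, unsigned rows)
{
    return (cols + 1) * (rows + 1);
}

unsigned grid_index_count(unsigned cols, unsigned rows)
{
    return cols * rows * 6; /* 2 triangles * 3 indices per cell */
}

/* Fill caller-provided arrays sized by the two helpers above with an
 * indexed triangle list: two CCW triangles per grid cell. */
void build_grid(unsigned cols, unsigned rows, vert2* verts, uint16_t* idx)
{
    for (unsigned r = 0; r <= rows; ++r)
        for (unsigned c = 0; c <= cols; ++c) {
            verts[r * (cols + 1) + c].x = -1.f + 2.f * c / cols;
            verts[r * (cols + 1) + c].y = -1.f + 2.f * r / rows;
        }

    for (unsigned r = 0; r < rows; ++r)
        for (unsigned c = 0; c < cols; ++c) {
            const uint16_t v0 = (uint16_t)(r * (cols + 1) + c); /* bottom-left */
            const uint16_t v1 = (uint16_t)(v0 + 1);             /* bottom-right */
            const uint16_t v2 = (uint16_t)(v0 + cols + 1);      /* top-left */
            const uint16_t v3 = (uint16_t)(v2 + 1);             /* top-right */
            *idx++ = v0; *idx++ = v1; *idx++ = v2; /* triangle 1, CCW */
            *idx++ = v1; *idx++ = v3; *idx++ = v2; /* triangle 2, CCW */
        }
}
```

Inverting the winding for the 'drawing OFF' runs then only takes a glFrontFace flip, with the mesh untouched.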

Hmm.. We currently waste some precious unified-shader time on vertex jiggling. Let's knock that mesh complexity down a bit. Let's switch to a grid of 2x2 cells (8 triangles).

  • number of vertices: 9
  • number of indices: 24
  • elapsed time for 1000 frames, drawing ON: 5.43s
  • elapsed time for 1000 frames, drawing OFF (culling active): 3.81s
  • effective GPU pixel-munching time: 1.62s
  • effective fillrate: 161,817,284 pixels/s

That's not far from the theoretical 166 Mpix/s the iMX515 is rated at. Taking into account that the timing above is app-based (through the 'time' utility) and the app does some housekeeping, etc., I think we can assume our result is within reasonable error of the theoretical maximum. Also, by now it should be clear why we chose to measure pixel workloads - because we can verify them against the specs.

So, at this stage we have a proven way to separate the GPU workload time from the time of the other workloads in our drawing pipeline. Now let's track down that CPU workload.

Clearly, in the case of a-few-'dumb'-pixels-worth of GPU work, the draw-call costs are not pretty - that's 1.62s of GPU work vs. 3.81s of 'non-GPU' work, or 0.425:1 in favor of the CPU. Ouch. We really need to try to increase the amount of work we pass down per draw call. The simplest way for that would be through, yep, you guessed it right - increasing the resolution. So,

  • viewport of 1024x768 pixels
  • viewport-spanning grid mesh (same indexed triangle list)

    • number of vertices: 9
    • number of indices: 24

  • elapsed time for 1000 frames, drawing ON: 14.03s
  • elapsed time for 1000 frames, drawing OFF (culling active): 9.26s
  • effective GPU pixel-munching time: 4.77s
  • effective fillrate: 164,870,440 pixels/s

Now the ratio is 0.515:1 - slightly better, but still not good. Let's see what might be the issue. Let's do more than 1 draw call per frame, as that would give us some idea if something else might be taking place in our frame.

  • viewport of 1024x768 pixels
  • viewport-spanning grid mesh (same indexed triangle list), drawn 4 times (i.e. 4 draw calls per frame)

    • number of vertices: 9
    • number of indices: 24

  • elapsed time for 1000 frames, drawing ON: 28.37s
  • elapsed time for 1000 frames, drawing OFF (culling active): 9.39s
  • effective GPU pixel-munching time: 18.98s
  • effective fillrate: 165,739,093 pixels/s

Aha! The picture changed drastically - the GPU-pixel-workload vs other-stuff ratio is now 2.021:1 in favor of the GPU! And that happened after we increased the number of draw calls from 1 to 4 per frame.

That goes to show that we have other expenditures in our frame. That would indicate that we should not try to bluntly decrease the number of our draw calls, but instead find the 'sweet spot' where a light GPU workload can be spread across a few draw calls, and that would still be 'for free', or pretty cheap, in the timing of our frame. Unfortunately, here we leave the area of synthetic tests and hypothesizing, and step into real-world workloads. Or in other words, our small investigation ends here. We may return to it in the future, perhaps with some real-world data to analyze.
  While on the subject of memcpy..
posted by blu on 23rd September 2010

As we mentioned memcpy in the previous post, perhaps now is the right moment for a diversion from the core subject of the discussion and talk a bit about memcpy. IIRC, the ancient Sumerians had a saying along the lines of 'even if a man lived a good life, one day he'd have to face memcpy'.

The fastest (non-builtin) memcpy I've come across yet on the iMX515 is the one from Android's bionic libraries - I guess Google got fed up with the stock (e)glibc version, which is, well, a last resort for moving data around on a Cortex platform. In contrast, Android's version uses NEON loads/stores, empirically-tuned read-prefetch patterns and all that jazz - overall a very reasonable memcpy effort.

Unfortunately, even that well-designed routine is not quite optimal under certain conditions - namely, when copying relatively small amounts of data that fit in the L1 d-cache (32KB on the iMX515 - the maximal amount supported by the Cortex-A8), particularly when the destination of those data happens to already reside in L2 (combined cache, 256KB on the iMX515). Under such conditions the performance you'd normally get from Android's memcpy is as if data were being moved around L2 alone, without seeing much help from L1. But why? Did we not deliberately specify that our data fit in L1? Yes, we did, and yet that does not spare us those L1 write-misses.. Wait, what write misses?

The answer is simple (and I was blissfully oblivious to it until last week) - the Cortex-A8's L1 d-cache does not operate in a write-allocate fashion. What's the issue with that? Well, it's a tad counter-intuitive, and most CPUs don't do it, so some of us may not have encountered it before. The A8's L1 d-cache works in a write-back, but not write-allocate mode: memory writes do not stall the CPU pipeline when the location is already cached, but writes to an uncached location do not cause a cache line to be allocated from memory either. Conversely, a write-allocate cache would keep a line covering the location of your write-miss, in anticipation of more accesses around it. The A8 does no such thing for L1 - it treats our writes as 'to cache if lucky, but not my problem otherwise'.

As a result, for memory locations that were not in L1 beforehand, and which are accessed in a write-only streaming manner (which is exactly the case for the destination of memcpy), the A8 serves us a constant sequence of L1 write-misses. Oh goodies, we just lost our L1 for writing!

Luckily, the solution to that is equally simple - we need to revert to 'manual control' and instruct the CPU to cache those locations we are trying to write to. For the Android's bionic memcpy, that could be achieved through a blunt 'prefetch destination' at the start of the main copy loop:

--- really_tiny_libc/memcpy.S 2010-09-01 18:15:32.000000000 -0500
+++ ../../really_tiny_libc/memcpy.S 2010-09-19 11:31:01.000000000 -0500
@@ -98,6 +98,7 @@

1: /* The main loop copies 64 bytes at a time */
+ pld [r0]
vld1.8 {d0 - d3}, [r1]!
vld1.8 {d4 - d7}, [r1]!

The results speak for themselves (red is the original, blue is the tweaked version):

Android memcpy: measured read+write bandwidth by darkblu, on Flickr

Android memcpy: inferred one-way bandwidth by darkblu, on Flickr

Please note that the second chart is hypothetical, trying to give a naive answer to the question 'Good, but what if the access was one-way, and not read+write?' - to which the chart bluntly doubles the actual measured results ; )

So why not just patch Android's memcpy in the manner described above and enjoy eternal memcpy bliss on the A8? Well, first, that patch is detrimental in scenarios where the destination is not already in L2 - there our impromptu prefetch quickly becomes prohibitively expensive (as seen on the charts), as it's not operating from L2 anymore. Second, it's hardly worth the effort, as we don't that often get to copy data to a destination that is already in L2 - perhaps when packing scattered data into a single container, but other than that - not much. And lastly, we really should try to be good citizens and refrain from relying too much on memcpy, for our own sake.
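For readers without the NEON assembly at hand, the idea of the patch can be mimicked in portable C - a sketch only (name mine, __builtin_prefetch is a GCC/Clang builtin, and the measured numbers above come from the NEON assembly version, not from this):

```c
#include <stddef.h>
#include <stdint.h>

/* A portable-C analogue of the 'pld [r0]' patch above: prefetch the
 * *destination* ahead of the stores, so that a non-write-allocate L1
 * still ends up holding the lines we are about to write. */
void* memcpy_pf_dst(void* restrict dst, const void* restrict src, size_t n)
{
    uint8_t*       d = (uint8_t*)dst;
    const uint8_t* s = (const uint8_t*)src;

    while (n >= 64) {
        __builtin_prefetch(d, 1);    /* 1 = prefetch with intent to write */
        for (int i = 0; i < 64; ++i) /* the asm does this as NEON loads/stores */
            d[i] = s[i];
        d += 64; s += 64; n -= 64;
    }
    while (n--) /* byte-wise tail */
        *d++ = *s++;
    return dst;
}
```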
  Profiling the i.MX515 GL stack
posted by blu on 5th September 2010

My experience with the EfikaMX GLES has been very exciting so far, to say the least. From the remarkably feature-complete ES2 and EGL software stacks, to the very intriguing GPU - the machine the size of a DVD case has been firmly holding my curiosity, and making me excited about what the future will deliver on the platform, once the few present hurdles are in the past.

Speaking of hurdles, the #1 issue in the current GL EfikaMX pipeline is the abnormally-slow path that an already-complete GL frame has to traverse to finally get on the screen. Or IOW, what comes after an eglSwapBuffers() call (sans the drawing part, which on a tiler also takes place then). From the OProfile sessions*, it appears that huge amounts of memory are getting moved around (think libc's memcpy) for each frame displayed. Moreover, the amount of memcpy workload appears proportional to the active framebuffer size. Now, while there are multiple possible explanations why such a thing could happen, this is something that really has no place in a well-tuned graphics pipeline.

To get a better idea of what we lose in the current situation, a simple modification to any rendering loop can be made, so that eglSwapBuffers is (largely) replaced with glFinish - the blocking GL(ES) call that causes the GL pipeline to be executed to framebuffer completion, but takes no further steps - darn useful debug functionality for situations like this one! In a multi-buffering setup (which is what most GL software ever uses), a hypothetical swapFramebuffers could look like:

void swapFramebuffers(const bool skip_show)
{
    if (skip_show)
        glFinish();
    else
        eglSwapBuffers(display, surface);
}

Depending on how often skip_show is set (that is, most of the time) we observe a very curious result from the EfikaMX GL: for sufficiently 'light-draw' frames, the framerate could double, or even triple with this modification, just because we don't take the final step and show the drawn frame on screen. But even for heavy-draw frames (say, fancy shaders), that little trick can add a few frames/s to the framerate. How come? It's really simple.
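The 1:255 frame-skip ratio used in these tests then boils down to showing only every 256th frame - a sketch (render_frame and swapFramebuffers stand in for the app's own routines):

```c
/* 1 shown : 255 skipped - only every 256th frame goes through
 * eglSwapBuffers; the rest end at glFinish. */
static int should_show(unsigned frame)
{
    return frame % 256 == 0;
}

/* In the render loop (the GL calls themselves are the app's own):
 *
 *   for (unsigned frame = 0; frame < total_frames; ++frame) {
 *       render_frame();
 *       swapFramebuffers(!should_show(frame)); // skip_show most of the time
 *   }
 */
```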

In scenarios where frames are light, the relative time where the GPU actually does work for us, vis-a-vis the 'housekeeping' work the system (read: mostly CPU) does, is in favor of the latter. By saying 'no, thanks' to the final housekeeping step of showing the frame on screen, we speed up the entire pipeline. Now, the amount of the speedup is proportional to the time said portion of housekeeping work takes. And since on the current EfikaMX software stack this is abnormally big, so is the amount of speedup we obtain.

So, if you want to see how fast your GL frames actually are on the EfikaMX today, employ the above trick. Hey, you might be surprised how fast you could be on the platform - I was! ; )

ps: ntsh-jass has already been modified to allow for a user-specified frame-skip factor.

* OProfile 0.9.6 /w kernel support, kernel from efikamx-10.07.11 branch at gitorious.org.