Sunday, January 13, 2013

Consumer/Producer approach for synchronizing buffer access using EGL fences

Motivation

At work I had the chance to do some application development related to displaying video content on a 3D surface in Android. For what it's worth, the use case was a little more involved than simply texturing one of the faces of a cube with video. Texture streaming is nothing new; it has been done before, usually through proprietary vendor extensions. More recently, the Android team exposed the feature to application developers starting with the Android 4.0 (ICS) release.

However, soon enough I found myself reading the native C++ code when a screen-tearing problem showed up in some of the frames being rendered. The main issue was that access to the native buffer needed to be properly synchronized between the video decoder and the 3D client app; keep in mind that both of these components run in their own processes and operate asynchronously.

So, in this post I discuss some of the things I learned about this particular use case, specifically some of the EGL extensions that are required to efficiently stream video frames onto a 3D surface. More importantly, I go over a fairly new approach to synchronization (at the GPU driver level) using EGL sync objects that helped with the tearing problem mentioned above (code snippets are provided at the bottom).

Why screen tearing

Screen tearing is a common problem in graphics, and in my case it was due to the content of the buffer getting overwritten by the video decoder (at the wrong time) while the content was still being read. This problem is usually solved by adding some type of synchronization to the buffer, where write access is granted only after making sure that the content of the buffer has already been consumed (i.e., displayed on the main screen).

Going back to the main use case, I knew that the buffers were getting overwritten; the only remaining question was, what type of synchronization should be used in this case?

EGLImage extensions

But before answering that question, let's talk about EGLImages, as they represent an important building block when displaying video content as OpenGL ES textures. The reason the Khronos group came up with the idea of the EGLImage was to be able to share buffers across rendering APIs (OpenVG, OpenGL ES, and OpenMAX) without the need for extra copies. For example, consider a UI into which both 3D and video content can be written from the same rendering context (think of a YouTube widget). Without this common data type, the application would have to rely on a copy to move the data around (usually done through the glTexImage2D() call). For video frames that the app needs to show quickly, a lot of important resources can be wasted, hindering performance (see Figure 1).


Figure 1. Data copies involve wasting important resources such as CPU cycles and memory bandwidth. Figure taken from [2].
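For reference, a minimal sketch of that copy path is shown below. The names videoTexture, width, height, and frame_pixels are hypothetical placeholders; the point is that every new frame pays for a full upload into the texture.

//Sketch only: the copy path referred to above. Each decoded frame is
//uploaded into an ordinary GLES texture, which costs a full copy per frame.
//videoTexture, width, height, and frame_pixels are hypothetical names.
glBindTexture(GL_TEXTURE_2D, videoTexture);
glTexImage2D(GL_TEXTURE_2D,      //target
             0,                  //mip level
             GL_RGBA,            //internal format
             width, height,      //frame dimensions
             0,                  //border (must be 0 in GLES)
             GL_RGBA,            //format of the client data
             GL_UNSIGNED_BYTE,   //type of the client data
             frame_pixels);      //CPU-side pixels coming from the decoder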

With a buffer shared across APIs, the application can now reuse the EGLImage both as the destination of the decode and as the source for an OpenGL ES texture, without copying any data (see Figure 2).


Figure 2. An EGLImage surface used as a double-purpose buffer. Figure taken from [2].
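As a rough sketch of how this zero-copy path can be wired up: the nativeBuffer handle below is a placeholder (on Android it would typically be an ANativeWindowBuffer coming from the media stack), and the KHR/OES entry points are usually looked up via eglGetProcAddress().

//Sketch only: create an EGLImage from a native buffer and expose it as a
//GLES texture. nativeBuffer is a hypothetical handle to the decoder's buffer.
EGLint attribs[] = { EGL_IMAGE_PRESERVED_KHR, EGL_TRUE, EGL_NONE };
EGLImageKHR image = eglCreateImageKHR(dpy,
                                      EGL_NO_CONTEXT,
                                      EGL_NATIVE_BUFFER_ANDROID,
                                      (EGLClientBuffer)nativeBuffer,
                                      attribs);

//Bind the EGLImage to a texture object (GL_TEXTURE_2D via GL_OES_EGL_image;
//video decoders often use GL_TEXTURE_EXTERNAL_OES instead). From this point
//on the texture and the decoder share the same storage, so no glTexImage2D()
//copy is required.
GLuint tex;
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);
glEGLImageTargetTexture2DOES(GL_TEXTURE_2D, (GLeglImageOES)image);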

For the specifications of these extensions, see the Khronos registry entries for EGL_KHR_image_base, EGL_KHR_image, GL_OES_EGL_image, and (on Android) EGL_ANDROID_image_native_buffer.

In summary, EGLImages provide a common surface that can be shared between rendering APIs. The feature proved so powerful in terms of performance and flexibility (more recently allowing YUV content in addition to RGB) that it became an important building block for many rendering engines (Android and WebKit, to name a few). Follow this link to read more about the concept of DirectTextures using EGLImages in Android, and to understand why they perform so well in Android's rendering pipeline.

EGL sync objects

A disadvantage of EGLImages for developers is synchronization at the application level, since any update made to the buffer is also reflected immediately on the other side, in the OpenGL ES texture. Since applications usually run at 60 fps, we must guarantee that the texture remains free of glitches or artifacts for as long as 16 milliseconds, which is roughly the interval at which the display refreshes with new content.

Access to the buffers could potentially be handled at the application level, but that's far from ideal, since it would place too much of a burden on developers. Thankfully, the Khronos group has also made available another extension that takes care of inter-API fencing and signaling at the driver level. The main idea is to insert a 'fence' command into the client API command stream, in this case right before eglSwapBuffers() is called, and then have this sync object tested for completion once the entire frame has been rendered. Since the fence command was inserted as the second-to-last command in the stream, the completion event isn't signaled until all the commands leading up to the fence have completed.

To put things in perspective using the video-as-a-texture use case, it's now fairly easy to imagine a separate thread for the video decoder continuously polling for when to start decoding new frames (in effect synchronizing access to the buffer). The 3D client app, on the other hand, is busy constructing the scene by texturing the geometry whenever new video content is available, and finally placing a fence right before eglSwapBuffers() is called. Because the 3D client app is the one placing the fence, it can take as much time as it needs to display the frame without any glitches or tearing. After all, the fence object guarantees that the content of the buffer remains intact until a signal is sent across indicating when the buffer may be updated.

Figure 3. Consumer/Producer approach to synchronizing buffer access using EGL fence sync objects. Figure taken from [1].

For maximum performance, employing a queue of EGLImages makes it possible to have the video decoder and the 3D client app working in parallel without blocking each other.
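A minimal sketch of what such a queue might look like is shown below. The slot layout and the decode_into()/draw_with() helpers are hypothetical, and dpy and surface are assumed to be the shared EGL display and window surface. Each slot pairs an EGLImage with the fence inserted after the frame that last sampled it, so the decoder reuses a slot only once that fence has signaled.

#include <EGL/egl.h>
#include <EGL/eglext.h>

#define NUM_SLOTS 3

//Hypothetical ring of shared buffers used by the decoder (producer) and the
//3D client (consumer).
typedef struct {
    EGLImageKHR image;   //zero-copy buffer shared with the decoder
    EGLSyncKHR  fence;   //fence inserted after the frame that last sampled it
} Slot;

static Slot slots[NUM_SLOTS];

//Producer side: block (if needed) until the GPU is done with the slot, then
//let the decoder write the next frame into it.
void produce_frame(int i) {
    if (slots[i].fence != EGL_NO_SYNC_KHR) {
        eglClientWaitSyncKHR(dpy, slots[i].fence,
                             EGL_SYNC_FLUSH_COMMANDS_BIT_KHR,
                             EGL_FOREVER_KHR);
        eglDestroySyncKHR(dpy, slots[i].fence);
        slots[i].fence = EGL_NO_SYNC_KHR;
    }
    decode_into(slots[i].image);   //hypothetical decoder call
}

//Consumer side: draw with the slot's texture, fence it, then swap.
void consume_frame(int i) {
    draw_with(slots[i].image);     //hypothetical draw call
    slots[i].fence = eglCreateSyncKHR(dpy, EGL_SYNC_FENCE_KHR, NULL);
    eglSwapBuffers(dpy, surface);
}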

For the specification of the EGL fence sync extension, see the Khronos registry entry for EGL_KHR_fence_sync.

Code snippets:

//Shared state between the decoder thread and the 3D client (simplified; a
//real implementation would use proper atomics or a mutex/condition variable)
volatile bool cpu_access = false;
EGLDisplay dpy;
EGLSyncKHR fence;

void *media_server(void* _d) {
   //Process other tasks until signaled by the sync listener
   if( cpu_access ) {
       updatePixels();     //write the next decoded frame into the shared buffer
       cpu_access = false;
    }
   return NULL;
}

void *sync_listener_callback(void) {
   EGLint value = 0;

   //Blocks the calling thread until the sync object <fence> is signaled,
   //or until <timeout> nanoseconds have passed.
   EGLint result = eglClientWaitSyncKHR(dpy,
                                        fence,
                                        EGL_SYNC_FLUSH_COMMANDS_BIT_KHR,
                                        EGL_FOREVER_KHR);
   if (result == EGL_FALSE) {
       printf("EGL FENCE: error waiting for fence: %#x\n", eglGetError());
       return NULL;
   }

   result = eglGetSyncAttribKHR(dpy, fence, EGL_SYNC_STATUS_KHR, &value);
   if (value == EGL_SIGNALED_KHR) {
       //The GPU is done with the buffer; the decoder may write to it again
       cpu_access = true;
   }
   eglDestroySyncKHR(dpy, fence);
   return NULL;
}

void *compositor(void) {
   glVarious();   //draw the scene, sampling from the EGLImage-backed texture

   //By inserting a sync object just before eglSwapBuffers is called, it is
   //possible to wait on that fence, allowing a calling thread to determine
   //when the GPU has finished writing to an EGLImage render target.
   fence = eglCreateSyncKHR(dpy, EGL_SYNC_FENCE_KHR, NULL);

   if (fence == EGL_NO_SYNC_KHR) {
       printf("EGL FENCE: error creating fence: %#x\n", eglGetError());
   }
   eglSwapBuffers(dpy, surface);
   return NULL;
}

References:

[1] Imagination Technologies Ltd., "EGLImage, NPOT & Stencils - PowerVR Performance Recommendations". Available online.
[2] Neil Trevett, "Khronos Mobile Graphics and Media Ecosystem". Available online.
[3] The Android Open Source Project, "Unit tests for Graphics". Available online.

Tuesday, April 19, 2011

Mipmapping and NPOT textures inside a pixel shader

In this post, I discuss my findings on image filtering, specifically how to do trilinear interpolation inside a pixel shader for non-power-of-two (NPOT) textures. The catch is that the OpenGL ES 2.0 specification restricts the texture parameters that can be used when the texture dimensions are not powers of two: it only allows CLAMP_TO_EDGE for the wrap mode, and only GL_NEAREST or GL_LINEAR for the minification filter. In other words, for NPOT textures the GLES 2.0 spec doesn't require support for trilinear filtering! GPU vendors may, however, fully support the feature through an extension (i.e., GL_OES_texture_npot), although they are not required to.
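As a quick sketch, the presence of that extension can be checked at runtime by scanning the extension string; has_full_npot_support() is just an illustrative helper name.

#include <string.h>
#include <GLES2/gl2.h>

//Returns non-zero if the driver advertises full NPOT support
//(mipmapping and REPEAT-style wrap modes for NPOT textures).
static int has_full_npot_support(void)
{
    const char *ext = (const char *)glGetString(GL_EXTENSIONS);
    return ext != NULL && strstr(ext, "GL_OES_texture_npot") != NULL;
}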

In the next sections I explain some of the theory and math behind trilinear filtering, followed by my own implementation using the OpenGL ES Shading Language (GLSL). At the end I show some of the screenshots I obtained from the pixel shader.

Texture Mapping

In real-time 3D graphics, objects are modeled in 3D space and images are mapped onto the faces of those objects. This not only adds some realism to the scene, but the GPU can do it quite inexpensively. The image data is loaded and converted into a 2D array whose individual data elements are called texture elements, or texels. When rendering with a 2D texture, a texture coordinate is used as an index into the texture image and is then mapped to the destination image (the screen) by the viewing projection. Texture space is labeled (s, t) and screen space is labeled (x, y).

Figure 1. Pixel mapping to texture-space. Figure taken from [1].

When the texture gets resized (i.e., minified), a visual artifact called aliasing might suddenly appear in the final image. This happens because, as the geometry gets smaller and smaller, the texture coordinates take large jumps when being interpolated from pixel to pixel. Aliasing occurs when so few samples of the original image are preserved that the final image looks jagged or pixelated. There are several texture filtering techniques that smoothly blend or interpolate adjacent samples in order to avoid aliasing. The most common ones are bilinear interpolation, trilinear interpolation (mipmapping), and anisotropic filtering.


Texture Filtering

"Heckbert [12] defined texture filtering as the process of re-sampling the texture image onto the screen grid", Ewins et al [1]. Each screen coordinate (x, y) maps to a texture-space coordinate (s, t) as shown in Figure 1 above. The job of the texture filtering mechanism is then to efficiently determine which texel in the texture map correspond to what pixel in the screen. Since performing texture look-ups involves accessing texture memory this is in turn a time consuming operation", Ewins et al [1]. For this reason mipmaps (a filtering technique) was developed in order to reduce memory accesses since it relies on pre-filtered texture storage.

Mipmaps

The idea behind mipmaps is to generate a pyramid of textures, each level in the pyramid representing a level of detail l that hints to the GPU where to sample from during texture minification. Each level in the pyramid is a scaled-down version of the original texture, halved in both dimensions. For example, if the original texture is 256 x 256 in size, the next level would be 128 x 128, and so forth all the way down to a single 1 x 1 texel. Mipmapping helps with aliasing because, with many levels the GPU can sample from, the pixel-to-texel ratio is better preserved. Moreover, since texture fetches now happen within a smaller map, the GPU can better utilize its cache, improving performance over filtering methods that don't use pre-filtered levels. A clear disadvantage of mipmapping is that it requires extra storage space for all the additional levels (roughly one third more than the original texture).

Figure 2. Mipmap pyramid. Figure taken from [1]. 

The mechanism by which the GPU calculates the level of detail is not important here. What is important is to be able to sample the different levels inside the pixel shader, and to do that we need to calculate this number ourselves. In the next section I go over a previously published approximation that helps determine the level of detail inside the shader. Once the level of detail is known, we take two bilinear samples, one at the computed level and one at the next (coarser) level. Finally, we return the color by doing a third linear interpolation between these two samples, which gives us trilinear interpolation.

Mipmap Level Selection


It's very common in computer graphics to represent a pixel as a square. Building on that, the footprint of a pixel in texture space can be roughly approximated as a parallelogram, see Figure 3. The mapping of the texels (s, t) in texture space with respect to the pixels (x, y) in screen space can then be approximated using partial derivatives, according to [1].

Figure 3. Pixel mapping to texture-space using constant partial derivatives. Figure taken from [1]. 

The lengths of the two edge vectors r1 and r2 can be calculated as

Eq1: |r1| = sqrt( (du/dx)^2 + (dv/dx)^2 )

Eq2: |r2| = sqrt( (du/dy)^2 + (dv/dy)^2 )

We then choose the level of detail based on the maximum compression of an edge in texture space, which corresponds to the maximum length of either side of the parallelogram:

Eq3: d = max( |r1|, |r2| )

We know that a pixel at mipmap level l covers an area of (2^l)^2 = 4^l texels in the base texture. For a parallelogram covering an area A in texture space, the level of detail can therefore be approximated by

Eq4: l = 0.5 * log2(A)

where the area A is approximated by the square of the maximum edge length d from Eq3 above, so that l = 0.5 * log2(d^2) = log2(d).

Implementation

The implementation was written in the OpenGL ES Shading Language, as found in any OpenGL ES 2.0 implementation. It relies, however, on two shader extensions that may or may not be supported by a given hardware implementation. The first of these extensions is GL_OES_standard_derivatives, which gives us the ability to calculate derivatives inside the fragment shader.


#extension GL_OES_standard_derivatives : enable


The second required extension is GL_EXT_shader_texture_lod, which adds texture functions to the Shading Language that give us explicit control of the level of detail used when sampling the mipmap pyramid. In other words, we can explicitly choose which mipmap level to sample from.


#extension GL_EXT_shader_texture_lod : enable 


Both of these extensions can only be used inside fragment shaders.


#extension GL_OES_standard_derivatives : enable
#extension GL_EXT_shader_texture_lod : enable

precision mediump float;

uniform sampler2D colorMap;   //the NPOT texture with a full mip chain
uniform vec2 u_texsize;       //base-level texture dimensions in texels
varying vec2 v_texCoord;

float mipmapLevel(vec2 uv, vec2 textureSize)
{
  //rate of change of the texel coordinates with respect to window space,
  //i.e. approximations of (du/dx, dv/dx) and (du/dy, dv/dy)
  vec2 dx = dFdx(uv * textureSize);
  vec2 dy = dFdy(uv * textureSize);

  //select the LOD based on the maximum compression of an edge in texture
  //space, which corresponds to the maximum squared length of a side:
  //max(dUdx*dUdx + dVdx*dVdx, dUdy*dUdy + dVdy*dVdy)
  float d = max(dot(dx, dx), dot(dy, dy));

  //d is a squared length, so 0.5*log2(d) = log2(sqrt(d)) is the level of detail
  return 0.5 * log2(d);
}

vec4 texture2D_trilinear(sampler2D tex, vec2 uv)
{
    float level = mipmapLevel(uv, u_texsize);

    //sample the nearest level (fract(uv) emulates REPEAT, since NPOT
    //textures are restricted to CLAMP_TO_EDGE)
    vec4 t00 = texture2DLodEXT(tex, fract(uv), level);

    //sample the next (coarser) level
    vec4 t01 = texture2DLodEXT(tex, fract(uv), level + 1.);

    //linearly interpolate between the two levels
    return mix(t00, t01, fract(level));
}

void main()
{
    gl_FragColor = texture2D_trilinear(colorMap, v_texCoord.st);
}

First we start with the mipmapLevel() function. Here we use the derivatives to figure out the rate of change of the texture coordinates (u, v) with respect to screen coordinates. Note that since we are interested in non-power-of-two textures, we need to scale the partial derivatives by the texture dimensions; for this I just use a uniform u_texsize that contains the texture size. We then take the maximum squared length of the two edges of the parallelogram and return half of its base-2 log, which equals the base-2 log of the edge length itself.

After we figure out the level of detail we are ready to start sampling from the mipmap levels. From the application side in GL we set the minification filter to GL_LINEAR_MIPMAP_NEAREST in order to take a bilinear fetch from the closest mip level chosen. After fetching from the two levels we are interested in, we return the color by doing one last linear interpolation, giving us trilinear filtering.
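On the GL side that setup might look roughly like the sketch below; textureId, program, width, and height are hypothetical names. Note that on a strict ES 2.0 implementation glGenerateMipmap() only works for NPOT textures when GL_OES_texture_npot is exposed; otherwise each mip level would have to be uploaded individually with glTexImage2D().

//Sketch of the application-side texture state (hypothetical names).
glBindTexture(GL_TEXTURE_2D, textureId);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE); //only wrap mode guaranteed for NPOT
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);

//Build the mip chain (requires GL_OES_texture_npot for NPOT textures on ES 2.0).
glGenerateMipmap(GL_TEXTURE_2D);

//Pass the base-level dimensions the shader needs to scale the derivatives.
glUseProgram(program);
glUniform2f(glGetUniformLocation(program, "u_texsize"), (GLfloat)width, (GLfloat)height);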

Note that the level of detail is calculated as a real number whose fractional part f is used as the weight factor in the final linear interpolation, see Figure 2 above. For example, a level of detail of 2.4 means we blend the bilinear samples taken from levels 2 and 3 with a weight of 0.4, which produces a smooth blend between the levels.



Results



References

[1] Ewins, J.P., Waller, M.D., White, M., Lister, P.F. "MIP-Map Level Selection for Texture Mapping." IEEE Transactions on Visualization and Computer Graphics, 1998. Available online.
[2] Ewins, J.P., Waller, M.D., White, M., Lister, P.F. "An Implementation of an Anisotropic Texture Filter." Technical Report IWD_172, Centre for VLSI and Computer Graphics, Univ. of Sussex, 1998. Available online.
[3] Munshi, Aaftab; Ginsburg, Dan; Shreiner, Dave. OpenGL ES 2.0 Programming Guide. Addison-Wesley Professional.
[4] Gerasimov, Philipp; Fernando, Randima; Green, Simon. "Shader Model 3.0: Using Vertex Textures." NVIDIA white paper, June 2004. Available online.
[5] Marschner, Steve. "Texture Filtering." CS 4620 Lecture Notes, Fall 2008, Cornell University. Available online.
[6] Guinot, Jerome. "The Art of Texturing Using the OpenGL Shading Language." April 15, 2006. Available online.
[7] Flavell, Andrew. "Run-Time MIP-Map Filtering." December 11, 1998. Available online.