Sunday, July 31, 2011

OpenGL ES geometry instancing

This article compares two instancing approaches available on OpenGL ES capable platforms. Both are best suited to 2D rendering.
The first is software instancing. The idea is to calculate all vertex positions on the CPU.
The second is hardware instancing. I couldn't find any description of it anywhere on the web, so I guess I am its inventor :) (or, more likely, just the first one to describe it). The idea is to pass a modelview matrix for each vertex as an attribute. I will compare the pros and cons of these solutions later. Let's get to the actual code and the comparison.
I made 3 tests that differ only in the number of sprites rendered (100, 1,000 and 10,000) and ran each with 4 techniques: software instancing, hardware instancing, one draw call (DIP) per sprite (the worst possible case), and positions precomputed at app launch (the reference value). Every frame, each sprite is moved to a random position within the screen boundaries.
I'll start with a brief description of each technique in pseudocode. We are interested only in the shader/rendering code, so I will omit all other details; you will find them in the source code.

1. Software Instancing. This technique can be implemented on the fixed-function pipeline (FFP) because all vertices are computed on the CPU.
    void UpdateSpritesPosition()
    {
    //allocate buffer for all sprites data
    //calculate each sprite vertex position and write it to buffer
    //write other vertex attributes to buffer 
    }
    void Render()
    {
    //bind vbo
    glBindBuffer(GL_ARRAY_BUFFER, vertexBuffer);
    const GLuint STRIDE = sizeof(data[0]);

    //send vertex data
    glVertexAttribPointer(ATTRIB_POSITION, 2, GL_FLOAT, 0, STRIDE, NULL);
    glEnableVertexAttribArray(ATTRIB_POSITION);
    glVertexAttribPointer(ATTRIB_COLOR, 4, GL_UNSIGNED_BYTE, 1, STRIDE, (GLvoid*)8);
    glEnableVertexAttribArray(ATTRIB_COLOR);

    //draw
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, indexBuffer);
    glDrawElements(GL_TRIANGLES, QUADS_COUNT * 6, GL_UNSIGNED_SHORT, 0);
    }


    The modelview matrix in this case is always the identity, because the vertices arrive at the GPU already transformed.
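The UpdateSpritesPosition step for this technique can be sketched in plain C roughly as follows. This is a minimal sketch, not the code from the sample project: the `Sprite` and `Vertex` structs and the `write_sprite_quad` and `write_quad_indices` helpers are hypothetical names, and each sprite is assumed to be a rectangle rotated around its centre. The vertex layout (2 position floats at byte 0, 4 colour bytes at byte 8) is chosen to match the glVertexAttribPointer calls in Render().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical vertex layout: position at byte 0, colour at byte 8. */
typedef struct {
    float   x, y;
    uint8_t r, g, b, a;
} Vertex;

/* Hypothetical per-sprite state kept on the CPU side. */
typedef struct {
    float   x, y;            /* centre in screen space */
    float   halfW, halfH;    /* half extents of the rectangle */
    float   rotCos, rotSin;  /* cosine and sine of the rotation angle */
    uint8_t color[4];
} Sprite;

/* Transform the 4 corners of one sprite on the CPU and write them to out[0..3]. */
static void write_sprite_quad(const Sprite *s, Vertex *out)
{
    static const float corner[4][2] = { {-1, -1}, {1, -1}, {1, 1}, {-1, 1} };
    for (int i = 0; i < 4; ++i) {
        const float lx = corner[i][0] * s->halfW;
        const float ly = corner[i][1] * s->halfH;
        /* rotate the local corner, then translate to the sprite centre */
        out[i].x = s->x + lx * s->rotCos - ly * s->rotSin;
        out[i].y = s->y + lx * s->rotSin + ly * s->rotCos;
        out[i].r = s->color[0];
        out[i].g = s->color[1];
        out[i].b = s->color[2];
        out[i].a = s->color[3];
    }
}

/* Two triangles per quad. The indices never change, so this runs once. */
static void write_quad_indices(uint16_t *indices, int quadCount)
{
    /* 16-bit indices can address at most 65536 vertices = 16384 quads */
    assert(quadCount >= 0 && quadCount <= 16384);
    for (int q = 0; q < quadCount; ++q) {
        const uint16_t base = (uint16_t)(q * 4);
        uint16_t *dst = indices + q * 6;
        dst[0] = base; dst[1] = (uint16_t)(base + 1); dst[2] = (uint16_t)(base + 2);
        dst[3] = base; dst[4] = (uint16_t)(base + 2); dst[5] = (uint16_t)(base + 3);
    }
}
```

Calling write_sprite_quad for every sprite each frame and uploading the result is the whole per-frame CPU cost of this technique.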


    2. Hardware Instancing. This technique requires the programmable pipeline. Let's start with the data processing code:
       
      void UpdateSpritesPosition()
      {
      //allocate buffer for all sprites modelview matrix data
      //calculate each sprite modelview matrix and write it to buffer
      //write other vertex attributes to buffer 
      }
      // rendering code is almost the same:
      void Render()
      {
      //bind vbo 
      //send vertex data including modelview matrices
      //draw 
      }
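For the hardware path, the CPU builds one matrix per sprite and copies it to each of the sprite's 4 vertices, since core OpenGL ES 2.0 fetches every attribute once per vertex. The sketch below assumes column-major 4x4 float matrices and a 2D rotate-then-translate transform; `make_modelview_2d` and `write_sprite_matrices` are hypothetical names, not functions from the sample project.

```c
#include <assert.h>
#include <string.h>

enum { VERTICES_PER_SPRITE = 4, FLOATS_PER_MATRIX = 16 };

/* Build a column-major modelview matrix: rotate by (c, s) = (cos, sin),
   then translate to (x, y). */
static void make_modelview_2d(float x, float y, float c, float s, float m[16])
{
    m[0]  = c;    m[1]  = s;    m[2]  = 0.0f; m[3]  = 0.0f;  /* column 0 */
    m[4]  = -s;   m[5]  = c;    m[6]  = 0.0f; m[7]  = 0.0f;  /* column 1 */
    m[8]  = 0.0f; m[9]  = 0.0f; m[10] = 1.0f; m[11] = 0.0f;  /* column 2 */
    m[12] = x;    m[13] = y;    m[14] = 0.0f; m[15] = 1.0f;  /* column 3 */
}

/* Copy one sprite's matrix to each of its vertices in the streamed buffer. */
static void write_sprite_matrices(const float m[16], float *dst, int spriteIndex)
{
    assert(spriteIndex >= 0);
    float *p = dst + (size_t)spriteIndex * VERTICES_PER_SPRITE * FLOATS_PER_MATRIX;
    for (int v = 0; v < VERTICES_PER_SPRITE; ++v) {
        memcpy(p + v * FLOATS_PER_MATRIX, m, FLOATS_PER_MATRIX * sizeof(float));
    }
}
```

With this layout each sprite streams 4 x 64 = 256 bytes of matrix data per frame, compared with 4 x 8 = 32 bytes of positions in the software path, so the CPU does less arithmetic but more copying.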

      I will spend more time explaining how this technique works, because I was unable to find a word about it anywhere. I will begin with the vertex shader code:


      attribute vec4 position;
      attribute vec4 sourceColor;
      attribute mat4 modelview;
      uniform mat4 projection;
      varying vec4 destinationColor;
      
      void main(void)
      {
          destinationColor = sourceColor;
          gl_Position = projection * modelview * position;
      } 
      


      The key aspect of this technique is the line attribute mat4 modelview: the modelview matrix is a vertex attribute. I just realized that it would be a good idea to remove position as an attribute and store only modelview; this would decrease the amount of data sent to the GPU.

      Now let's look at the rendering code:

      // per-vertex modelview matrices
      glBindBuffer(GL_ARRAY_BUFFER, modelviewBuffer);

      // assumption: one tightly packed 4x4 float matrix per vertex
      const GLuint MATRIX_STRIDE = 16 * sizeof(GLfloat);

      // a mat4 attribute takes 4 consecutive slots, one vec4 column each
      glEnableVertexAttribArray(ATTRIB_MODELVIEW + 0);
      glEnableVertexAttribArray(ATTRIB_MODELVIEW + 1);
      glEnableVertexAttribArray(ATTRIB_MODELVIEW + 2);
      glEnableVertexAttribArray(ATTRIB_MODELVIEW + 3);
      glVertexAttribPointer(ATTRIB_MODELVIEW + 0, 4, GL_FLOAT, 0, MATRIX_STRIDE, (GLvoid*)0);
      glVertexAttribPointer(ATTRIB_MODELVIEW + 1, 4, GL_FLOAT, 0, MATRIX_STRIDE, (GLvoid*)16);
      glVertexAttribPointer(ATTRIB_MODELVIEW + 2, 4, GL_FLOAT, 0, MATRIX_STRIDE, (GLvoid*)32);
      glVertexAttribPointer(ATTRIB_MODELVIEW + 3, 4, GL_FLOAT, 0, MATRIX_STRIDE, (GLvoid*)48);

      glUniformMatrix4fv(uniforms[UNIFORM_PROJECTION], 1, 0, projection);

      // per-vertex position and color
      glBindBuffer(GL_ARRAY_BUFFER, vertexBuffer);

      const GLuint STRIDE = sizeof(data[0]);

      glVertexAttribPointer(ATTRIB_POSITION, 2, GL_FLOAT, 0, STRIDE, NULL);
      glEnableVertexAttribArray(ATTRIB_POSITION);
      glVertexAttribPointer(ATTRIB_COLOR, 4, GL_UNSIGNED_BYTE, 1, STRIDE, (GLvoid*)8);
      glEnableVertexAttribArray(ATTRIB_COLOR);

      glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, indexBuffer);
      glDrawElements(GL_TRIANGLES, QUADS_COUNT * 6, GL_UNSIGNED_SHORT, 0);
      
      

      Actually, that's all there is to the rendering code. Pretty simple. Pay attention to how I send the modelview matrix to the shader. The OpenGL ES Shading Language specification says that an attribute can be of type mat4, and such an attribute is laid out as 4 vec4 columns, one after another. That means a mat4 attribute takes 4 consecutive attribute slots: you pass one vec4 to the slot the shader program reports for the attribute and one to each of the 3 slots that follow it.
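The four glVertexAttribPointer calls for ATTRIB_MODELVIEW follow a simple pattern: column i of the matrix goes to slot base + i at byte offset i * 16. The sketch below only illustrates that arithmetic; the `ColumnBinding` struct and the `mat4_attribute_bindings` helper are hypothetical and not part of OpenGL ES, and the matrix is assumed to be 16 tightly packed floats.

```c
#include <assert.h>
#include <stddef.h>

#define MAT4_COLUMNS 4u
#define MAT4_STRIDE  (16 * sizeof(float))  /* bytes per tightly packed matrix */

/* Hypothetical description of one vec4 column of a mat4 attribute. */
typedef struct {
    unsigned location;  /* attribute slot: base + column */
    size_t   offset;    /* byte offset of the column inside one matrix */
} ColumnBinding;

/* Fill out[0..3] with the slot and byte offset of each column. */
static void mat4_attribute_bindings(unsigned baseLocation, ColumnBinding out[MAT4_COLUMNS])
{
    assert(out != NULL);
    for (unsigned col = 0; col < MAT4_COLUMNS; ++col) {
        out[col].location = baseLocation + col;
        out[col].offset   = col * 4 * sizeof(float);
    }
}
```

Each binding b would then be passed to glEnableVertexAttribArray(b.location) and glVertexAttribPointer(b.location, 4, GL_FLOAT, GL_FALSE, MAT4_STRIDE, (const GLvoid*)b.offset), which replaces the eight hand-written calls with one loop.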


      The resulting comparison is: 

      As you can see, the instancing techniques are almost two times slower than the reference and 6 to 10 times faster than a single draw call per sprite! So they are an obvious thing to implement in your game.
      Now let's talk about optimizations for each approach.

      1. Software instancing. I rewrote all the vertex multiplication code with NEON instructions and got about a 10-15% increase in speed. Not bad!
      2. Hardware instancing. All the data was split into 2 VBOs: a static one with the initial vertex positions and colors, and a streamed one with the modelview matrices. This gave a few more FPS. I cannot give the exact numbers simply because I don't remember them.
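For reference, the operation that the software instancing optimization vectorizes is a 4x4 matrix by vec4 multiplication. The sketch below is not the code from the sample project; it shows one common way to write that product with NEON intrinsics from arm_neon.h, with a scalar fallback for builds without NEON. `mat4_mul_vec4` is a hypothetical name and the matrix is assumed to be column-major.

```c
#include <assert.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* out = m * v, where m is a column-major 4x4 matrix. out must not alias v. */
static void mat4_mul_vec4(const float m[16], const float v[4], float out[4])
{
    assert(out != v);
#if HAVE_NEON
    /* accumulate column_i * v[i], four lanes (rows) at a time */
    float32x4_t r = vmulq_n_f32(vld1q_f32(m + 0), v[0]);
    r = vmlaq_n_f32(r, vld1q_f32(m + 4),  v[1]);
    r = vmlaq_n_f32(r, vld1q_f32(m + 8),  v[2]);
    r = vmlaq_n_f32(r, vld1q_f32(m + 12), v[3]);
    vst1q_f32(out, r);
#else
    /* scalar fallback: one row of the result per iteration */
    for (int row = 0; row < 4; ++row) {
        out[row] = m[row]      * v[0]
                 + m[4 + row]  * v[1]
                 + m[8 + row]  * v[2]
                 + m[12 + row] * v[3];
    }
#endif
}
```

The column-major layout is what makes the NEON version short: each column is already a contiguous group of 4 floats, so no shuffling is needed before the multiply-accumulate.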

        Pros & cons:

        1. Software Instancing. 
        Pros: it can be implemented on OpenGL ES 1.1.
        Cons: it consumes a lot of CPU time, which you need for other things in the game.
        CPU utilization:
         


        2. Hardware Instancing.
        Pros: lower CPU processing time, while remaining as fast as software instancing. Can suit more complex geometry.
        Cons: it can be implemented only on OpenGL ES 2.0.
        CPU utilization: 



        Check out sources for more details.

        The rendering technique is switched via a define at the top of Drawing.m. Only one define should be set to 1 at a time; otherwise the behavior is undefined.

        P.S. I'm currently investigating one more instancing technique.

        8 comments:

        1. A few points.

          I haven't seen your neon code but I routinely get 2.5 to 4 times speedup when switching to custom coded asm based Neon solutions, especially if you are dealing with large sets of vertices (which then can be vectorized much more easily)

          Your hardware instancing solution should cost just as much as software one since you have to update matrices with every frame which in turn means you may as well update vertex coordinates on the CPU side and if you do it right it won't be much slower than updating matrices themselves and will offload your GPU.
          With 3d games GPU load ( fill rate) tends to be much heavier than your CPU load so anything you can do on the CPU is a win.

          To summarize, with software instancing and transformation I can easily handle 70 000 vertices on iPhone 3gs ( the old one) and still run at 40 FPS.
          You just have to do it right ...

        2. I mean "With 2d games GPU load ( fill rate) tends to be much heavier ..."

        3. >I haven't seen your neon code
          I used GCC intrinsics for NEON at the time I wrote this article. Later on I discovered the asm approach, which gave me, if I remember right, about 4-6 speed up in matrix multiplication. I want to rewrite this article: add a few more details and switch the code to NEON asm.

          >Your hardware instancing solution should cost just as much as software one
          It costs almost the same; it is 4-5% faster. This code is really limited by the CPU; NEON asm can improve speed and make the code GPU limited.

        4. StiX, what a nice article indeed.

          I am using very similar technique for rendering many 2d geometries which I thought I 'invented' (lol, after reading this article I guess it is not the case :D ) when I wanted to merge all draw calls into one. I didn't think about it as hardware instancing but rather as a cool method for rendering large amount of 'dynamic geometry' using GPU transform.

          Anyway, when working with 2D all you need to send as additional attributes are position (vec2) and part of the rotation matrix (vec2), which is just enough to do the vertex transform in the vertex shader. I think this op should be much faster in the VS than its CPU (NEON) variant because I think the PowerVR GPU architecture can handle it (matrix multiplications) better. But I will definitely try to do some 'measurements' using a CPU transform (with NEON), and maybe it will help, especially on the old Cortex A8 with SGX535...

          Once again, thanks for doing tests and for providing your measurement results.

        5. >Anyway, when working with 2D all you need to send as additional attributes are position (vec2) and part of rotation matrix (vec2) which is just enough to do vertex transform in vertex shader.
          I know, but I store the whole 4x4 matrix on the client side, so I passed it into the shader

          >I think this op should be much faster in VS than its CPU (neon)
          No, it is not. The iPhone GPU (at least the SGX 535) runs at 200 MHz, while the iPhone CPU has one or two cores at 1 GHz, so it will be much faster to do matrix multiplication on the CPU. I don't own a newer device with SGX543MP2/4, but I think NEON will still be faster.
          P.S. Vladimir, this is an old article; unfortunately I hadn't written NEON code at that time to demonstrate its supremacy, and I don't have time for this now

        6. I'm not sure I understand how your hardware instancing works: Are you sending a ModelView matrix for Each vertex, even if it's the same across the whole model? Meaning if your model has 200 vertices, do you have to send 200 copies of the same matrix to the GPU? Can someone please clarify this aspect? Thank you
