Make your geometry pipeline have a light meal

Posted: April 3, 2011 in Optimization Pills

When you work in graphics programming (videogames, scientific visualization, and so on), it can happen that you develop something that looks great and cool, with impressive graphics and amazing models. The screenshots you proudly upload to your website show awesome shaders and natty particle effects, making anyone want to get it.

Alas, your success lasts only until the first nerd downloads it. “Hey, my mouse moves at a sluggish pace, loading a small model takes more than two minutes, and this damned application is killing my paging system…” — the first comment freezes the download page.

“What’s the matter?” you wonder, unable to understand. “It works fine on my quad core; I have 12 GB of RAM and an NVIDIA GTX 590.”

Too often, developers code with a high-end configuration in mind, and the finished product slows down on less powerful hardware.

If you sold your application, almost no one would buy it. You are not in a position to dictate terms to the hardware market, unless you’re the next EA, Ubisoft or Rockstar.

If this is your situation, the time has clearly come for some performance tuning. Be prepared to spend long hours of hard work that will recover those lost frames per second and bring your application back from the land of the dead.

This pill is going to be about tuning the geometry pipeline. You know, each time we render a scene, the graphics card has to perform a series of steps. Its job is to transform the data you send from RAM to the video card through the driver (vertices in object space) into pixels on screen. More data = more transformations = more slowness.

The overall workload of the geometry stage is directly related to the number of primitives processed (usually vertices). It also depends on the number of transformations to apply, the number of lights, and so on. Thus, being efficient in the way we handle these objects is the best way to ensure maximum performance.

Begin by making sure the minimum number of transformations is applied to a given primitive. For example, if you are concatenating several transformations (such as a rotation and a translation) applied to the same object, make sure you compose them into a single matrix, so that, in the end, only one matrix is applied to the vertices. Many modern APIs perform this composition internally, but you will often have to combine hand-computed matrices yourself. For example:

matrix m1, m2;
point vertices[1000], originalVertices[1000];
for (int i = 0; i < 1000; ++i) {
   vertices[i] = m2 * (m1 * originalVertices[i]);
}

needs to perform 2000 point-matrix multiplies: 1000 to compute m1 * originalVertices[i] and another 1000 to multiply each result by m2. Rearranging the code to:

matrix m3 = m2 * m1;
point vertices[1000], originalVertices[1000];
for (int i = 0; i < 1000; ++i) {
   vertices[i] = m3 * originalVertices[i];
}

doubles performance, because we have precomputed the combined matrix by hand. This technique can be applied whenever you have hierarchical transformations.

Another good idea is to use connected primitives whenever possible, because they send the same data using fewer primitive calls than unconnected ones. The most popular such primitive is the triangle strip, which avoids sending redundant vertices that add no information and eat up bus bandwidth. An object stored as triangle strips can require only about 60–70 percent of the primitives of the original, nonconnected object. Any model can easily be converted to triangle strips using one of the stripping libraries available (for example, NVTriStrip).

Indexed primitives head in the same direction. Preindexing the mesh divides the object into geometry (that is, the unique vertices that compose it) and topology (that is, the face loops). Transformations then only need to be applied to the unique vertex list, which is much shorter than the initial, nonindexed version. OpenGL, for example, provides vertex arrays and vertex buffers; the latter are recommended for static (fixed) geometry.

As far as lighting goes, many hardware platforms support only a limited number of hardware light sources. Adding lamps beyond that limit makes performance drop radically, because the extra lamps are computed in software. Be aware of these limits and implement the mechanisms required so the lamp count never exceeds them. It doesn’t make sense, for example, to illuminate each vertex with every possible lamp: only a few of them will be within a reasonable distance to influence the resulting color. A lamp-selection algorithm can therefore be implemented so that the closest lamps are used and the rest, which offer little or no contribution, are discarded, making sure hardware lighting is always used.

Remember also that different kinds of light sources have different associated costs: directional lights (which have no origin and only a direction vector, such as the sun) are usually the cheapest, followed by positional lights (located at a point in space, shining equally in all directions), while spotlights (cones of light with an origin, a direction, and an aperture) are the most expensive to compute. Bear this in mind when designing your illumination model.

The last tips deal with the rasterizer stage, which is where the raw power of the graphics subsystem shows. In an ideal world we would paint each pixel exactly once, achieving zero overdraw. But this is seldom the case: we end up redrawing the same pixel over and over, making the Z-buffer go mad.

How can you rearrange your code in such a way that overdraw is reduced?

For example, you can turn culling on, reducing the number of incoming triangles. Another factor that has an impact on rasterization speed is primitive Z-ordering. Because of the way Z-buffers work, it is more efficient to send primitives front to back than the other way around. If you paint back to front, each new primitive is accepted, must be textured and lit, and updates its Z-buffer position. The result? Lots of overdraw. Conversely, if you start with the closest primitives, subsequent primitives are discarded early in the pipeline because their Z-values never make it into the Z-buffer. The result? Part of the lighting and texturing work is saved, and performance increases. If you use an octree, you can traverse it so that you render front to back; other, more sophisticated techniques can be employed as well.

You can also disable the Z-buffer entirely for certain large objects, increasing performance further. A skybox, for example, will certainly be behind everything else, so it is safe to paint it first, with the Z-buffer disabled.

I hope you’ll think about this pill when developing your next extra-cool graphical application. Remember: don’t let your pipeline get fat!
