Archive for the ‘Optimization Pills’ Category

Recently, I dived into a complicated piece of code whose aim was to compute some mathematical functions employed by more tangled subsystems. The code was very cool and I immediately thought it had been written by one (or more) scientific computing experts because of its sophistication and correctness. You may be wondering why I dived into that code at all, given the old saying “what works should never be changed“. Once again, efficiency cooked our goose!

I didn’t know whether the code was very slow or terribly slow; either way, it could be improved, that much was sure.

To give you a taste of the cost of mathematical computations, these are the average clock cycles required by an IBM Power3 to execute common operations (thanks to CASPUR):

Instruction             32 bit    64 bit
Integer Product         3-4       3-9
Integer Division        21        34
Real Sum and Product    3-4       3-4
Real Division           14-21     18-25
Real Square Root        14-23     22-31


It’s clear we have to avoid expensive operations like divisions and try to reduce them to cheaper ones, like sums and products. I’ll give you an example:

// Matrix normalization

for (int i = 0; i < N; ++i)
     for (int j = 0; j < N; ++j)
          c[i][j] = a[i][j] * b[i][j] / norm;
This code performs N*N products and N*N divisions. Can we do better? Sure:

// Matrix normalization

double rev_norm = 1.0 / norm;

for (int i = 0; i < N; ++i)
     for (int j = 0; j < N; ++j)
          c[i][j] = a[i][j] * b[i][j] * rev_norm;

This time we need 2*N*N products but only one division. Try it on your architecture and measure the results – and take into account potential rounding errors, since multiplying by the reciprocal is not bit-identical to dividing. Once again, efficiency is about trade-offs.

This kind of optimization is called Strength Reduction, because the goal is to replace an expensive operation with a cheaper one, reducing the cost (time and/or memory) of the computation.

Other useful considerations about integer divisions (the same rules apply to modulo calculations):

  • integer division by a power of 2 can be done with a shift operation, which is much faster (see the sketch after this list);
  • integer division by a constant is faster than division by a variable;
  • integer division by a constant is faster if the dividend is unsigned.
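
As a minimal sketch of the first point (function names are mine, purely for illustration), an unsigned division or modulo by a power of two boils down to a shift or a mask, and compilers emit exactly that when the divisor is a compile-time constant:

#include <cstdint>

// Dividing an unsigned integer by a power of two is just a right shift;
// the corresponding modulo is just a bit mask.
inline std::uint32_t div_by_16(std::uint32_t x)
{
    return x >> 4;      // same result as x / 16 for unsigned x
}

inline std::uint32_t mod_16(std::uint32_t x)
{
    return x & 15u;     // same result as x % 16 for unsigned x
}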

Another notable example is the computation of powers:


pow(a,4) <=> a * a * a * a;

If you execute two simple loops like the following:

double a = 1.1, toCompute = 0.0;   // example values, just to make the snippet compile

for (int i = 0; i < 10000000; ++i)
	toCompute = a * a * a * a;

for (int i = 0; i < 10000000; ++i)
	toCompute = pow(a, 4);         // pow comes from <cmath>

the big difference in execution time should be quite evident: I measured something like 0.02 seconds for the former loop and 0.586 seconds for the latter!

The code I told you about at the beginning of this post was full of pows and other similarly “expensive” functions. It was easy to replace them and achieve better results.

I’m not going to show you wagons of tricks; you can search the web or read a good book/manual, like Optimizing software in C++ by Agner Fog (which is free).

Final considerations: sometimes it can be convenient to rewrite some mathematical functions by hand. Something like sin() is coded to minimize the error uniformly over the whole domain of the real function. But if you are only interested in a certain (relatively small) range, or you can accept lower precision, then you can reduce computation time by using:

  • Series,
  • Approximate Formulas,
  • Polynomial and Rational Fitting,
  • Trigonometric Formulas.
Bear in mind the numerical properties:
  • Accuracy,
  • Error accumulation,
  • Error distribution over the function domain.
For example, the library sin(x) must handle the whole [-π,π] range (after argument reduction). But if the range of x is limited, it can be better to use a truncated series with a few terms. For x in [-0.1,0.1] we have:

 

                   Time      Error      Max Gain
Runtime Library    0.41''    -          -
III order          0.19''    8.3e-7     54%
V order            0.24''    1.5e-10    42%
VII order          0.28''    5.0e-11    32%
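
As a sketch of the truncated-series approach measured above (my own code, shown only as an illustration), a VII-order Taylor expansion of sin(x) around zero needs just a handful of multiplications and additions, which is plenty for |x| ≤ 0.1:

#include <cmath>
#include <cstdio>

// Truncated Taylor series: sin(x) ≈ x - x^3/3! + x^5/5! - x^7/7!,
// written in a Horner-like nested form. Only worth using when |x| is small.
inline double sin_taylor7(double x)
{
    const double x2 = x * x;
    return x * (1.0 - x2 / 6.0 * (1.0 - x2 / 20.0 * (1.0 - x2 / 42.0)));
}

int main()
{
    const double x = 0.07;   // inside [-0.1, 0.1]
    std::printf("library: %.15f  series: %.15f\n", std::sin(x), sin_taylor7(x));
}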

 

The code in question had a lot of calls to standard functions even though the variables lay in a very limited range. This led me to use one of the techniques I mentioned above.

I hope this brief survey helps you to consider alternatives, especially when talking about efficiency. Don’t get fed up if I repeat it: efficiency is about trade-offs. Indeed, all software is about trade-offs, but sometimes efficiency is such an implicit requirement that your client expects it even though he/she didn’t say anything about it.

Boss around map insertions

Posted: June 13, 2011 in Optimization Pills

This pill deals with maps, [from the reference] a kind of associative container that stores elements formed by the combination of a key value and a mapped value.

In a map, the key value is generally used to uniquely identify the element, while the mapped value is some sort of value associated with this key. The types of the key and of the mapped value may differ. A typical example of a map is a telephone directory, where the name is the key and the telephone number is the mapped value.

Internally, the elements in the map are sorted from lower to higher key value following a specific strict weak ordering criterion set on construction.

As associative containers, maps are especially designed to be efficient at accessing their elements by key (unlike sequence containers, which are more efficient at accessing elements by their relative or absolute position).

We could spend hours on end talking about maps, but this time we are interested in one of the simplest operations: insertion. In particular, you told me you’re confused by the existence of at least two ways to insert new elements in a map: the subscript operator [] and the method insert.

The good news is they both perform a proper insertion; the bad news is that you should choose carefully between them when efficiency is important.

Part of this discussion is inspired by Meyers’ Effective STL (item 24).

Let’s suppose we have a std::map<K, V> myMap. What about this expression?


myMap[k] = v;

It checks whether k is already in the map. If not, it’s added, along with v as its corresponding value; if k is already in the map, its associated value is updated to v. More precisely, the operator first returns a reference to the value object associated with k, and then v is assigned to the object that reference refers to. Simple, but what happens when k is not in the map?

In this unlucky case, operator[] creates a new value (and a new key too) from scratch using the value type’s default constructor. We first default-construct a new V, then we immediately assign it a new value. Now it should be clear why this approach may degrade performance.

We’d be better off replacing our use of operator[] and assignment with a straightforward call to insert:

myMap.insert(std::map<K,V>::value_type(k,v));

This has precisely the same ultimate effect as the code above, except it typically saves you three function calls: one to create the temporary default-constructed V object, one to destruct that temporary, and one to V’s assignment operator. What if your V object is, for example, a complete 3D model (mesh, textures, materials, etc)?! Just say no to operator[] and save your clock and cache!

On the other hand, the call to insert requires an argument of type std::map<K,V>::value_type (that is, pair<const K,V>), so when we call insert, we must construct and destruct an object of that type. This hurts us when we are performing an update. In that case we can employ the subscript operator, because it uses no pair object, so it costs us neither a pair constructor nor a pair destructor.

Efficiency considerations thus lead us to conclude that insert is preferable to operator[] when adding an element to a map, and both efficiency and aesthetics dictate that operator[] is preferable when updating the value of an element that’s already in the map.

Meyers wrote a useful function providing the best of both worlds, something like an efficient add-or-update:

template< typename MapType, // type of map
          typename KeyArgType, // type of Key
          typename ValueArgtype> // type of Value

typename MapType::iterator efficientAddOrUpdate(MapType& m,
                                                const KeyArgType& k,
                                                const ValueArgtype& v)
{
     typename MapType::iterator Ib = m.lower_bound(k); // find where k is or should be

     if ( Ib != m.end() && // if Ib points to a pair
          !(m.key_comp()(k, Ib->first))) { // whose key is equivalent to k...

          Ib->second = v; // update the pair's value        

          return Ib; // and return an iterator to that pair
     }
     else {
          typedef typename MapType::value_type MVT;

          return m.insert(Ib, MVT(k, v)); // add pair(k, v) to m and return an iterator
                                          // to the new map element

     }
}

The insert branch is interesting because it uses the “hint” form of insert. Ib identifies the correct insertion location for a new element with a key equivalent to k, and the Standard guarantees that if the hint is correct, the insertion will occur in amortized constant, rather than logarithmic, time.

Don’t forget to perform an equivalence test (not equality) and remember that the correct comparison function is available via map::key_comp.
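
A quick usage sketch (assuming the efficientAddOrUpdate template above is in scope; names and values are mine):

#include <map>
#include <string>

int main()
{
    std::map<std::string, int> phoneBook;

    // k is not in the map: lower_bound misses, so the pair is inserted
    // with the hint, and no default-constructed value is involved.
    efficientAddOrUpdate(phoneBook, std::string("Alice"), 5551234);

    // k is already there: only the assignment to Ib->second happens,
    // no temporary pair is built and destroyed.
    efficientAddOrUpdate(phoneBook, std::string("Alice"), 5556789);
}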

Keep in mind what happens behind the scenes of operator[] and insert, and choose your friend according to the context you’re facing!

Talking about graphics programming (such as videogames or scientific visualization), it can happen that you’re developing something that looks great and cool, with impressive graphics and amazing models. The screenshots you proudly uploaded to your website show awesome shaders and natty particle effects, making anyone want to get it.

Alas, your success lasts only until the first nerd downloads it. “Hey, my mouse moves at a sluggish pace, loading a small model takes more than 2 minutes and this damned application is killing my paging system…“ – the first comment freezes the download page.

“What’s the matter?” You’re unable to understand. “It works fine on my quad core; I have 12 GB of RAM and an NVIDIA GTX 590.“

Too often, developers code with a high-end configuration in mind, and the finished product then slows down on less powerful equipment.

If you sold your application, almost no one would buy it. You are not in a position to dictate terms to the hardware market, unless you’re the next EA, Ubisoft or Rockstar.

If this is your situation, time has clearly come for some performance tuning. Be prepared to spend long hours of hard work that will recover those lost frames-per-second and make your application come back from the land of the dead.

This pill is going to be about tuning the geometry pipeline. You know, each time we render a scene, the graphics card has to perform a number of steps. Its job is to transform data (vertices in object space that you send from RAM to your video card, through the proper driver) into pixels (something renderable on screen). More data = more transformations = more slowness.

The overall workload of the geometry stage is directly related to the amount of processed primitives (vertices, usually). It is also dependent on the amount of transformations to be applied, number of lights, and so on. Thus, being efficient in the way we handle these objects is the best way to ensure maximum performance.

Begin by making sure the minimum number of transformations is applied to a given primitive. For example, if you are concatenating several transformation matrices (such as a rotation and a translation) for the same object, make sure you compose them into one matrix so that, in the end, only one matrix is applied to the vertices. Many modern APIs will perform this operation internally, but often you will have to combine hand-computed matrices yourself. For example:

matrix m1, m2;
point vertices[1000];

for (int i = 0; i < 1000; ++i)
   vertices[i] = m2 * (m1 * originalVertices[i]);

needs to perform 2000 point-matrix multiplies: 1000 to compute m1 * originalVertices[i] and 1000 to multiply that result by m2. Rearranging the code to:

matrix m3 = m2 * m1;

for (int i = 0; i < 1000; ++i)
   vertices[i] = m3 * originalVertices[i];

roughly doubles performance, because we have precomputed the combined matrix by hand. This technique can be applied whenever you have hierarchical transformations.

Another good idea is to use connected primitives whenever possible, because they send the same data using fewer primitive calls than unconnected ones. The most popular type of such primitives is the triangle strip, which avoids sending redundant vertices that add no information and take up bus bandwidth. An object stored using triangle strips can require only about 60–70 percent of the primitives of the original, nonconnected object. Any model can easily be converted to triangle strips by using one of the stripping libraries available (for example, NVTriStrip).

Indexed primitives head in the same direction. Pre-indexing the mesh allows us to split the object into geometry (that is, the unique vertices that compose it) and topology (that is, the face loops). Transformations then only need to be applied to the unique vertex list, which will be much shorter than the initial, non-indexed version. For example, OpenGL provides Vertex Arrays and Vertex Buffer Objects; the latter are recommended for static (fixed) geometry.
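
As a minimal sketch of indexed drawing with OpenGL buffer objects (the Vertex struct, the vertices/indices arrays and their counts are placeholders of mine; attribute setup and error handling are omitted), each unique vertex is uploaded and transformed once, no matter how many triangles reference it:

// Upload unique vertices once, then reference them by index.
GLuint vbo, ibo;

glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, numVertices * sizeof(Vertex), vertices, GL_STATIC_DRAW);

glGenBuffers(1, &ibo);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
glBufferData(GL_ELEMENT_ARRAY_BUFFER, numIndices * sizeof(GLuint), indices, GL_STATIC_DRAW);

// ... vertex attribute setup with glVertexAttribPointer goes here ...

// One draw call for the whole mesh; shared vertices are not duplicated.
glDrawElements(GL_TRIANGLES, numIndices, GL_UNSIGNED_INT, 0);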

As far as lighting goes, many hardware platforms can only support a specific number of hardware light sources. Adding more lamps beyond that limit will make performance drop sharply, because any extra lamps will be computed in software. Be aware of these limits and implement the required mechanisms so the lamp count never exceeds them. For example, it doesn’t make sense to illuminate each vertex with all the possible lamps: only a few of them will be within a reasonable distance to influence the resulting color. A lamp-selection algorithm can then be implemented so that closer lamps are used and the rest, which offer little or no contribution, are discarded, thus making sure hardware lighting is always used.

Remember also that different kinds of light sources have different associated costs: directional lights (which have no origin, only a direction vector, such as the sun) are usually the cheapest, followed by positional lights (lights located at a point in space, shining equally in all directions), while spotlights (cones of light with an origin, direction, and aperture) are the most expensive to compute. Bear this in mind when designing your illumination model.

The last tips deal with the rasterizer stage, which is where the raw power of the graphics subsystem shows. In an ideal world, we would paint each pixel exactly once, achieving zero overdraw. But this is seldom the case. We end up redrawing the same pixel over and over again, making the Z-buffer go mad.

How can you rearrange your code in such a way that overdraw is reduced?

For example, you can turn culling on, reducing the number of incoming triangles. Another factor that has an impact on rasterization speed is primitive Z-ordering. Because of the way Z-buffers work, it is more efficient to send primitives front-to-back than the other way around. If you paint back-to-front, each new primitive is accepted, needs to be textured and lit, and updates the Z-buffer. The result? Lots of overdraw. Conversely, if you start with the closest primitive, subsequent primitives will be discarded early in the pipeline because their Z-value will not make it into the Z-buffer. The result? Part of the lighting and texturing is saved, and performance increases. If you use an octree you can traverse it so that you render front-to-back. Other, more sophisticated techniques can also be employed.
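
A minimal sketch of the idea (types and names are mine): compute a per-object distance to the camera each frame and sort the opaque draw list so that the closest objects fill the Z-buffer first:

#include <algorithm>
#include <vector>

struct DrawItem {
    float distanceToCamera;   // recomputed every frame from the object's center
    // ... mesh, material, transform ...
};

void sortFrontToBack(std::vector<DrawItem>& items)
{
    std::sort(items.begin(), items.end(),
              [](const DrawItem& a, const DrawItem& b) {
                  return a.distanceToCamera < b.distanceToCamera;
              });
    // Render in this order: nearer fragments win the depth test first,
    // so farther fragments are rejected before texturing and lighting.
}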

You can also disable the Z-buffer completely for many large objects, thus increasing performance. A skybox, for example, will certainly be behind everything else. Therefore, it is safe to paint it first, with Z-buffer disabled.

I hope you’re going to think about this pill when developing your next extra-cool graphical application. Remember not to make your pipeline get fat!

In this pill, I’m going to deal with a little detail of C (and C++) structs. It becomes memory-evident when you are handling large amounts of data. If you’re sick of bad struct alignment, then sugar this pill and you’ll be on the safe side!

Let’s start.

[from Wikipedia] A struct is a “structured” record type that aggregates a fixed set of labelled objects, possibly of different types, into a single object. A struct declaration consists of a list of fields, each of which can have any type. The total storage required for a struct object is the sum of the storage requirements of all the fields, plus any internal padding.

What is this padding?

Consider what happens when small data members are interspersed with larger members, like this:

struct BadStruct {
   U32   mU1; // 32 bits
   F32   mF2; // 32 bits
   U8    mB3; // 8 bits
   I32   mI4; // 32 bits
   bool  mB5; // 8 bits
   char* mP6; // 32 bits (on a 32-bit architecture)
};

You might imagine that the compiler simply packs the data members as tightly as it can. However, this is not usually the case. Instead, the compiler will typically leave “holes” in the layout (some compilers can be requested not to leave these holes by using a preprocessor directive like #pragma pack, or via command-line options).

Why does the compiler leave “holes” ?

The reason lies in the fact that every data type has a natural alignment, which must be respected in order to permit the CPU to read and write memory effectively. The alignment of a data object refers to whether its memory address is a multiple of its size (which is generally a power of two):

  • An object with one-byte alignment can reside at any memory address;
  • An object with two-byte alignment resides only at even addresses;
  • An object with four-byte alignment resides only at addresses that are multiples of four;
  • A 16-byte-aligned object resides only at addresses that are multiples of 16.

Alignment is important because many modern processors can only read and write properly aligned blocks of data. For example, on a 32-bit architecture, if a program requests that a 32-bit (four-byte) integer be read from the address 0x6A341178, the memory controller will load the data happily because the address is four-byte aligned (its least significant nibble is 0x8). However, if a request is made to load the same data from 0x6A341177, the memory controller now has to read two four-byte blocks: the one at 0x6A341174 and the one at 0x6A341178. Furthermore, it has to mask and shift the two parts of the 32-bit integer and logically OR them together into the destination register of the CPU.

But with some microprocessors you can be unlucky: if you request a read or write of unaligned data, you might just get garbage (the PlayStation 2 is a notable example of this kind of intolerance for unaligned data).

A good rule is that a data type should be aligned to a boundary equal to the width of the data type in bytes: for example, 32-bit values generally have a four-byte alignment requirement, 16-bit values should be two-byte aligned, and 8-bit values can be stored at any address (one-byte aligned).
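
A quick way to query these alignment requirements on your own platform is C++11’s alignof (my own snippet; the values in the comments are typical, not guaranteed):

#include <cstdint>
#include <cstdio>

int main()
{
    // Each fundamental type reports the boundary it wants to sit on.
    std::printf("alignof(std::uint8_t)  = %zu\n", alignof(std::uint8_t));   // typically 1
    std::printf("alignof(std::uint16_t) = %zu\n", alignof(std::uint16_t));  // typically 2
    std::printf("alignof(std::uint32_t) = %zu\n", alignof(std::uint32_t));  // typically 4
    std::printf("alignof(double)        = %zu\n", alignof(double));         // typically 8
    std::printf("alignof(char*)         = %zu\n", alignof(char*));          // pointer size
}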

Let’s turn back to our BadStruct: what’s a better layout? For example, this:

struct BetterStruct {
   U32   mU1; // 32 bits
   F32   mF2; // 32 bits
   I32   mI4; // 32 bits
   char* mP6; // 32 bits
   U8    mB3; // 8 bits
   bool  mB5; // 8 bits
};

In this case, the last two one-byte members fall in the same block, but the size of the structure as a whole is 20 bytes, not 18 as we might expect, because it has been padded by two bytes at the end. This padding is added by the compiler to ensure proper alignment of the structure in an array context. That is, if an array of these structs is defined and the first element of the array is aligned, the padding at the end guarantees that all subsequent elements will also be aligned properly.

A good practice is to add this padding manually, making the wasted space visible and explicit, like this:

struct BetterStructWithExplicitPadding {
   U32   mU1; // 32 bits
   F32   mF2; // 32 bits
   I32   mI4; // 32 bits
   char* mP6; // 32 bits
   U8    mB3; // 8 bits
   bool  mB5; // 8 bits
   U8    _pad[2]; // explicit padding
};
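
To see the padding on your own compiler, a quick check like the following can help (a sketch of mine; it assumes the three structs above are in scope and that U32, F32, I32 and U8 are the usual fixed-width typedefs):

#include <cstdio>

int main()
{
    // On a typical 32-bit target: 24, 20 and 20 bytes respectively.
    std::printf("BadStruct:                       %zu bytes\n", sizeof(BadStruct));
    std::printf("BetterStruct:                    %zu bytes\n", sizeof(BetterStruct));
    std::printf("BetterStructWithExplicitPadding: %zu bytes\n", sizeof(BetterStructWithExplicitPadding));
}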

To sum up, we can have (at least) three possibilities:

  • the compiler leaves “holes” to guarantee correct alignment, saving time (and cache) but wasting memory;
  • the compiler does not leave “holes” (data remain unaligned), saving memory. Two sub-possibilities:
    • the memory controller still reads the data you need correctly, but it may have to load extra blocks and perform bitwise operations, wasting clock cycles and cache space;
    • the memory controller cannot read the data you want at all, returning garbage.

The moral of the story? Keep your data as aligned as you can, adding explicit padding if needed. If you do this, you’ll (hopefully) avoid holes in your layout, saving time and space and preserving program consistency.

Optimization Pills

Posted: February 10, 2011 in Optimization Pills

In this category, I’ll talk about optimization issues (especially in C++), trying to be pragmatic. Most of what I’m going to show I’ve learned on my own; only a small part comes from university. There, achieving good performance is not considered important, so don’t get angry if you hear something like “efficiency is beyond this class” or “the DBMS will deal with optimization“.

That’s ok if efficiency does not matter to your application, if you’re developing yet another Java webapp with some cool functionality and you don’t even know what a cache is or why the CPU has registers.
The bad news is that if you deal with real-time systems, graphics, scientific computing, etc., you can’t waste time. You can’t afford it! You need your code to be more efficient, and you need some tricks and best practices. Later, once you become capable, you’ll be able to work out your own tricks. But keep in mind: you should know all about the CPU, the memory and other useful stuff.

If you wanna step into the light, begin by reading this paper: What every programmer should know about memory.

Furthermore, you can study High Performance Computing by K. Dowd and C. Severance (published by O’Reilly in 1998) and Computer Systems: A Programmer’s Perspective by R. E. Bryant and D. O’Hallaron (published by Prentice Hall in 2003). These are only a couple of titles; search the web for more resources.

Some applications require you to perform a job as soon as possible, meeting some external constraints about memory, time and quality.

No “Java.optimize” import or magic buttons to push here. Get ready to write code that is optimized from the start; if you can’t, at least try to optimize it afterwards.

Let’s start with a simple question. I have these two C snippets for matrix-matrix multiplication:

for(j=0;j<nn;j++){
	for(k=0;k<nn;k++){
		for(i=0;i<nn;i++){
		   c[i][j]=c[i][j]+a[i][k]*b[k][j];
		}
	}
}
for(i=0;i<nn;i++){
	for(k=0;k<nn;k++){
		for(j=0;j<nn;j++){
		   c[i][j]=c[i][j]+a[i][k]*b[k][j];
		}
	}
}

I wonder whether an ideal order exists for accessing the elements and, if the answer is yes, whether it depends on the language. I’ll give you two page-refresh times to think about it! If you’re able to answer these questions, you’re able to figure out which is the faster code.

I’ve made some tests using an Intel Core Duo 2.0 GHz, employing gcc 4.0.1. The matrices are 1024*1024:

loop order i,k,j : 2.29”

loop order j,k,i : 32.2” !

What’s the matter? I’m afraid it’s the fault of the cache. Let’s use cachegrind (a tool of valgrind) to simulate the cache behaviour of the j,k,i loop and of the i,k,j loop.

Yes, it comes down to cache misuse. That’s because in C a matrix is stored by rows (a matrix is an array of arrays), and this is also where the dependency on the language comes in.

If you make the most of the cache, you can take advantage of the Principle of Locality: temporal locality refers to the reuse of specific data within a relatively small time window, while spatial locality refers to the use of data elements stored at relatively close locations. In this context, the stride is the distance between two subsequently accessed elements. If you have unit stride, I have good news for you: you’re using the cache well.
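
A minimal sketch of the difference (array size and names are mine): traversing a row-major C matrix by rows gives unit stride, while traversing it by columns gives a stride of a whole row, which throws away most of each cache line:

const int NN = 1024;
double m[NN][NN];

// Unit stride: consecutive iterations touch consecutive addresses,
// so every loaded cache line is fully used.
double sum_by_rows()
{
    double s = 0.0;
    for (int i = 0; i < NN; ++i)
        for (int j = 0; j < NN; ++j)
            s += m[i][j];
    return s;
}

// Stride of NN doubles: each access lands on a different cache line,
// and lines are often evicted before their neighbours are ever used.
double sum_by_cols()
{
    double s = 0.0;
    for (int j = 0; j < NN; ++j)
        for (int i = 0; i < NN; ++i)
            s += m[i][j];
    return s;
}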

This is a rough list of the data flow when your CPU has to read something:

1. Looking for the data;

2. L1 Cache look up (about 1-5 cycles);

3. L2 Cache look up (about 20 cycles);

4. Reading from the RAM;

5. Copying from the RAM to L2 cache (more than 100 cycles);

6. Copying from L2 cache to L1 cache;

7. Copying to the registers.

If you find the data in the L1 cache, then you’ll jump straight to step 7, saving a lot of time!

There is a programming style called cache-friendly programming, which focuses on making good use of the cache to achieve high performance. This style of programming often hurts code readability or other modifiability qualities, so choose your battles and keep your requirements in mind.

I hope I’ll have time to write about cache-friendly programming (with cache blocking, padding, and other techniques), pipelined computing, out-of-order execution and how to take advantage of inner parallelism, with tricks like loop unrolling, strength reduction, loop merging, loop splitting, function inlining, etc.