Optimization Pills

Posted: February 10, 2011 in Optimization Pills
Tags: , ,

In this category, I’ll talk about optimization issues (especially in C++), trying to be pragmatic. What I’m going to show I’ve learned on my own, only a little part comes from the university. When you are there, it is not important to achieve good performance, then don’t get angry if you hear something like “efficiency is beyond this class” or “the DBMS will deal with optimization“.

That’s ok if efficiency does not matter to your application, if you’re developing another Java webapp with some cool functionalities and you don’t even know what a cache is and why the CPU has some registers.
The bad news is that if you deal with real-time systems, graphics, scientific computing, etc, you can’t waste time. You can’t afford it! You need your code to be more efficient, you need some tricks and best practices. Next, if you become capable, then you’ll be able to work out your own tricks. But keep in mind: you should know all about the CPU, the memory and other useful stuffs.

If you wanna step into the light, begin reading this paper: What every programmer should know about memory.

Furthermore, you can study on High performance computing, by K. Dowd and C. Severance (published by O’Reilly in 1998), Computer Systems A Programmer’s Perspective by R.E. Briant, D. O’Hallaron (published by Prentice-Hall in 2003). These are only a couple of titles. Search the Web for more resources.

Some applications require you to perform a job as soon as possible, meeting some external constraints about memory, time and quality.

No “Java.optimize” import or magic buttons to push here. Get ready to write code that is already optimized and if you can’t, try at least to optimize it next.

Let’s start with a simple question, I have these two C codes for matrix-matrix multiplication:


I wonder if it exists an ideal order for accessing the elements, if the answer is yes, I wonder if it depends on the language. I give you two page-refresh time to think about it! If you’re able to answer these questions, you’re able to figure out what the fastest code is.

I’ve made some tests using an Intel Core Duo 2.0 GHz, employing gcc 4.0.1. The matrices are 1024*1024:

loop order i,k,j : 2.29”

loop order j,k,i : 32.2” !

What’s the matter? I’m afraid it’s the fault of the cache. Let’s use cachegrind (a tool of valgrind) to simulate the cache. This is the j,k,i loop:

and this is the i,k,j loop:

Yes, it depends on the cache misusing. That’s because in C a matrix is stored by rows (because a matrix is a vector of vectors) and this is also a dependency on the language:

If you make the most of the cache, you can avail of the Principle of Locality: temporal locality refers to the reuse of specific data and/or resources within relatively small time durations. Spatial locality refers to the use of data elements within relatively close storage locations. In this context, the stride is the distance between two subsequently accessed elements. If you have unit stride, I have good news for you because you’re caching well.

This is a rough list of the data flow when your CPU has to read something:

1. Looking for the data;

2. L1 Cache look up (about 1-5 cycles);

3. L2 Cache look up (about 20 cycles);

4. Reading from the RAM;

5. Copying from the RAM to L2 cache (more than 100 cycles);

6. Copying from L2 cache to L1 cache;

7. Copying to the registers.

If you find the data in the L1 cache then you’ll jump to 7, saving a lot of time!

There exists a programming field called cache-friendly programming, focusing on the good using of the cache for achieving excellent results with high performance. This style of programming often leads to a bad code readability or other modifiability issues, then choose your battles and keep in mind your requirements.

I hope I’ll have time to write about cache-friendly programming (with cache blocking, padding, and other techinques), pipelined computing, out of order execution and how to take advantage of inner parallelism, with some tricks like loop unrolling, strength reduction, loop merging, loop splitting, function inlining, etc.

  1. Alfredo Di Napoli says:

    Nice post!

  2. ugasoft says:

    better performance if you use cache oblivious algorithms!


    this is Data Oriented programming babe 😉

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s