
C++ Optimization Strategies and Techniques

Pete Isensee
World Opponent Network
3380 146th Place SE, Suite 110
Bellevue, WA 98007
Pete.Isensee@WON.net

Introduction
"More computing sins are committed in the name of efficiency (without necessarily
achieving it) than for any other single reason – including blind stupidity."
– W.A. Wulf

"On the other hand, we cannot ignore efficiency.”


– Jon Bentley
Many software engineers recommend what I call the “procrastination approach” to
optimization. Delay optimization as much as possible, and don’t do it if you can avoid it. I
agree with the basic premise. Optimizing too early or too often is not a good approach to
engineering. Better to have a game that runs than a fast game that crashes. On the other
hand, you’re not likely to write a successful game these days without doing optimization at
some point in the process. Your compiler can help you, but you as a programmer understand
more about your game than the compiler. As Michael Abrash puts it, the best compiler is
“between your ears.” There are many levels of optimization, but I’m going to focus on one in
particular: C++ optimizations. Some of these techniques apply to other languages as well –
like Java – but most are specific to C++. I’ll also cover how to configure your compiler for
maximum C++ efficiency.

Preliminaries
All of the examples are in C++. The code is designed to compile with any standard ANSI
C++-compliant compiler. Some of the more complex techniques involve templates and the
Standard Template Library. I used Microsoft Visual C++ 6.0 for the example programs,
targeting PCs running Microsoft Windows 95/98 or NT.
Except where noted, all benchmarks and profiling were done on a Pentium II – 400MHz Dell
Dimension XPS400 running NT 4.00.1381. Most profiling runs were done with compiler
optimizations disabled to prevent compiler-specific options from influencing the results.
All performance results are relative. If the unoptimized run takes 200 ms and the optimized
run takes 100 ms, the optimized run is reported as 2.00 against the unoptimized run’s 1.00
(i.e. twice as fast). In other words, higher numbers are better.
Most code examples use the following C++ objects for comparison:
• int
• string (standard C++ basic_string<char> class with an average of 32 characters per string)
• complex (standard C++ complex<double> class containing two double values)
• bitmap (bitmap class with expensive default and copy ctor; average of 10000 pixels)
1: General Strategies

1.1: Optimization Strategies that Bomb

• Assuming some operations are faster than others. When it comes to optimizing, never ever
assume anything. Benchmark everything. Use a profiler. Even while I was doing examples
for this paper, some of my “optimizations” turned out to be major duds.
• Reducing code to improve performance. Reducing code might improve performance; it
might not. Increasing the amount of code will often improve performance. Loop unrolling is
a prime example.
• Optimizing as you go. Big mistake. As Donald Knuth said, “premature
optimization is the root of all evil.” Optimization is one of the last steps of a project. Plan for
it, but don’t optimize too soon. If you do, you’ll end up optimizing code that you either don’t
use or that doesn’t need to be streamlined in the first place. However, there are some
efficiency techniques you can use throughout your project from day one. These tips can
make your code more readable and concise. I’ve listed them below in section 3.
• Worrying about performance before concentrating on code correctness. Never! Write the
code without optimizations first. Use the profiler to determine if it needs to be revised. Don’t
ignore performance issues; let performance issues guide your design, data structures, and
algorithms, but don’t let performance affect every aspect of your code. In a typical game,
only a small percentage of the code requires optimization. Usually it’s the inner loops of the
blitting, AI or physics routines.

1.2: Optimization Strategies that Work


• Set design goals for performance levels. How responsive should the game be? Be specific.
Think in terms of concrete millisecond values. Put the values in the specs. What’s the
target frame rate? How will the game deal with Internet latency and bandwidth issues? In
what portions of the game is efficiency critical, and in what portions is efficiency secondary?
• Choose a program architecture appropriate to the problem at hand. When you evaluate
each option, whether it be the choice of a single-threaded vs. multithreaded architecture,
a database vs. a flat file, or licensing an engine vs. writing your own, be sure to consider
efficiency.
• Select the proper data structures. Carefully evaluate whether you should use floating point
or integer math, lists or vectors, hash tables or trees. The right data structures can make
the difference between a great game and a dog. That’s why the id team used BSP trees
instead of Z-buffering in Quake. BSP trees best solved their particular problem.
• Choose the right algorithms. A linear search may be more appropriate than a binary
search. Insertion sort may be faster than quicksort. See the discussion of swapping
algorithms in the next section to see how small algorithm variations can affect performance.
• Be sure to concentrate on perceived performance. Focus on how the game feels, not the
actual numbers. Perhaps increasing object velocity produces a better game than increasing
the frame rate. Try it. Use progress bars, animations or other effects to hide level loading or
other long processes. Be sure the game is responsive to player input, too. If your game
runs at 60 fps but takes half a second to process mouse clicks, it’s time to refocus your
optimization efforts where they will make a difference.
• Understand the costs of common programming operations and algorithms. Is integer
division faster than multiplication or the other way around? Is floating-point math faster than
integer math for your target platform? Knowing the answer can help you make your game
run faster. See Appendix B for the costs of common operations.
• Profile to find bottlenecks. Add your optimizations. Profile to find bottlenecks. Rinse and
repeat. Don’t settle on an optimization without verifying that it actually improves the game.
Always run “before” and “after” benchmarks to evaluate optimizations. Store the results of
each profiling run so you can compare the differences.
• Evaluate near-final code. Don’t evaluate the debugging version, and don’t evaluate the
release version too early – you’ll end up modifying code that you’ll throw out anyway.
• Use your QA department (or person) to create and automate profiling runs. If you don’t
have a QA department, automate the profiling runs yourself. You’ll never regret it.

1.3: Example of Selecting the Proper Algorithm: Swapping


Swapping objects is a common operation, especially during sorting. The standard C++
swapping function is very simple.
template <class T> void MySwap(T& a, T& b)
{
    T t = a;  // copy ctor
    a = b;    // assignment operator
    b = t;    // assignment operator
}             // destructor (element t)

Swapping is so simple that we really only need a single function to handle it, right? Not
necessarily. Often an object can provide its own swapping method that is considerably faster
than calling the object’s constructor, assignment operator (twice), and destructor. In fact, with
STL, there are many specialized swap routines, including string::swap, list::swap, and so forth.
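
Here’s a minimal sketch of such a member swap, assuming a hypothetical Bitmap-like class
that owns a heap-allocated pixel buffer. The swap just exchanges internal pointers and sizes;
no pixels are copied and nothing is allocated, which is why it can beat the generic version.

class FastBitmap
{
private:
    unsigned char* m_pPixels; // owned pixel buffer
    int m_nWidth;
    int m_nHeight;
public:
    // swap in O(1): exchange pointers and sizes, never the pixels
    void swap(FastBitmap& other)
    {
        unsigned char* p = m_pPixels;
        m_pPixels = other.m_pPixels;
        other.m_pPixels = p;
        int w = m_nWidth;  m_nWidth = other.m_nWidth;   other.m_nWidth = w;
        int h = m_nHeight; m_nHeight = other.m_nHeight; other.m_nHeight = h;
    }
};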

Swap Algorithms (relative performance; higher is better)

              int     complex   string   bitmap
 MySwap       1.00    1.00      1.00     1.00
 STL swap     1.00    1.00      6.53     1.00
 T::swap      -       -         7.17     8366.88

As you can see, for ints, complex objects and bitmaps, calling STL swap performs the same
as the MySwap algorithm above. However, for specialized classes like strings, swapping has
been optimized to be 6-7 times faster. For the bitmap object, which has extremely expensive
constructors and assignment operators, the bitmap::swap routine is over 8000 times faster!
(see Swap project for benchmark code)

2: C++ Design Considerations

When you start working on your next game and begin to think about coding conventions,
compilers, libraries, and general C++ issues, there are many factors to consider. In the
following section I weigh some performance issues involved with C++ design considerations.
2.1: STL Containers
Take advantage of STL containers. (See Appendix A for an STL container efficiency table).
Not only is performance good today, it’s only going to get better as STL vendors focus their
efforts on optimization and compiler vendors improve template compilation. There are a
number of other advantages to using STL:
1) It’s a standard. The programmers modifying your code in the future won’t have to decipher
the semantics of yet another linked list class.
2) It’s thin. Some have claimed that template-based containers cause code bloat, but I believe
template-based containers will actually make your code smaller. You won’t have to have an
entirely different set of code for different types of lists. If you’re still concerned about code
bloat, use the STL containers to store pointers instead of object copies.
3) It’s flexible. All containers follow the same conventions, so if you decide that maybe a
deque will give better performance than a list, it’s easy to switch. If you use typedefs, it can
be as easy as changing one line of code (see the sketch after this list). You also get the
advantage of dozens of predefined algorithms for searching and sorting that work for any
STL container.
4) It’s already written. And debugged, and tested. No guarantees, but better than starting from
scratch.
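
A minimal sketch of the typedef technique mentioned in point 3 (Monster, MonsterQueue and
Spawn are hypothetical names, not from any library):

#include <list>
//#include <deque>

struct Monster { /* ... */ };

typedef std::list<Monster> MonsterQueue;    // current choice
//typedef std::deque<Monster> MonsterQueue; // the one-line switch

void Spawn(MonsterQueue& monsters)
{
    monsters.push_back(Monster()); // identical code for either container
}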
The STL is not the be-all end-all library of containers and algorithms. You can get better
performance by writing your own containers. For instance, by definition, the STL list object
must be a doubly-linked list. In cases where a singly-linked list would be fine, you pay a
penalty for using the list object. This table shows the difference between Microsoft’s (actually
Dinkumware’s) implementation of lists and SGI’s implementation of an STL-compatible
singly-linked list called slist.

List Insertion (relative performance; higher is better)

                   int     complex   string   bitmap
 list begin/end    1.00    1.00      1.00     1.00
 slist begin       1.28    1.34      1.32     1.00
 slist end         0.13    0.06      0.13     0.09

Inserting at the beginning of a singly-linked list (slist) is around 30% faster for most objects
than inserting into the standard list. If you needed to insert items at the end of a list, slist is not
the best choice for obvious reasons.
One other drawback of the STL is that it only provides a limited set of container objects. It does
not provide hash tables, for instance. However, there are a number of good extension STL
libraries available. SGI distributes an excellent STL implementation with a number of useful
containers not defined in the standard.
(see SinglyLinkedList project for benchmark code)
2.2: References Instead of Pointers
As a basic design premise, consider using references instead of pointers. A quick example for
comparison:
int x;
void Ptr(const int* p) { x += *p; }
void Ref(const int& p) { x += p; }

The Ptr function and the Ref function generate exactly the same machine language. The
advantages of the Ref function:
• There’s no need to check that the reference is not NULL. References are never NULL.
(Creating a NULL reference is possible, but difficult).
• References don’t require the * dereference operator. Less typing. Cleaner code.
• There’s the opportunity for greater efficiency with references. A major challenge for
compiler writers is producing high-performance code with pointers. Pointers make it
extremely difficult for a compiler to know when different variables refer to the same location,
which prevents the compiler from generating the fastest possible code. Since a variable
reference points to the same location during its entire life, a C++ compiler can do a better
job of optimization than it can with pointer-based code. There’s no guarantee that your
compiler will do a better job at optimization, but it might.

2.3: Two-Phase Construction


An object with one-phase construction is fully “built” with the constructor. An object with
two-phase construction is minimally initialized in the constructor and fully “built” using a class
method. Frequently copied objects with expensive constructors and destructors can be serious
bottlenecks and are great candidates for two-phase construction. Designing your classes to
support two-phase construction, even if internally they use one phase, will make future
optimizations easy.
The following code shows two different objects, OnePhase and TwoPhase, based on a Bitmap
class. They both have the same external interface. Their internals are quite different. The
OnePhase object is fully initialized in the constructor. The code for OnePhase is very simple.
The code for TwoPhase, on the other hand, is more complicated. The TwoPhase constructor
simply initializes a pointer. The TwoPhase methods have to check the pointer and allocate the
Bitmap object if necessary.
class OnePhase
{
private:
    Bitmap m_bMap; // Bitmap is a "one-phase" constructed object
public:
    bool Create(int nWidth, int nHeight)
        { return (m_bMap.Create(nWidth, nHeight)); }
    int GetWidth() const { return (m_bMap.GetWidth()); }
};

class TwoPhase
{
private:
    Bitmap* m_pbMap; // Ptr lends itself to two-phase construction
public:
    TwoPhase() { m_pbMap = NULL; }
    ~TwoPhase() { delete m_pbMap; }
    bool Create(int nWidth, int nHeight)
    {
        if (m_pbMap == NULL)
            m_pbMap = new Bitmap;
        return (m_pbMap->Create(nWidth, nHeight));
    }
    int GetWidth() const
        { return (m_pbMap == NULL ? 0 : m_pbMap->GetWidth()); }
};

What kind of savings can you expect? It depends. If you copy many objects, especially “empty”
objects, the savings can be significant. If you don’t do a lot of copying, two-phase construction
can have a negative impact, because it adds a new level of indirection.
(see TwoPhase project for benchmark code)

2.4: Exception Handling


Exceptions are a great way to deal with unexpected errors. But they’re expensive. Scott
Meyers notes that throwing an exception is about “three orders of magnitude slower” than a
normal return. Programs using exceptions are about “5-10% larger and 5-10% slower.”
There are a couple of options. One is avoiding exception handling altogether. As more and
more core libraries make use of exceptions, this is getting harder and harder to do, but it is an
option nonetheless. Avoiding exceptions means not using try, throw or catch in your code
or in library code. Use operator new with the nothrow specification. Turn off exception
handling in the compiler itself.
I believe the judicious use of exceptions is the best solution. Limit try blocks to a few key
places and make use of the throw() function exception specification to indicate functions that
don’t throw exceptions. That way the compiler won’t add code for unwinding stack objects
unless it really needs to.
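
A hedged sketch of both techniques (the function and variable names here are mine, not from
the paper):

#include <new>    // std::nothrow
#include <cstdio>

// throw() promises the compiler that no exception escapes this function,
// so no stack-unwinding code is needed around calls to it
int AddScores(int a, int b) throw() { return (a + b); }

void Example()
{
    // nothrow new returns NULL on failure instead of throwing bad_alloc
    int* pScores = new (std::nothrow) int[256];
    if (pScores == NULL)
        return; // handle failure without try/catch
    pScores[0] = AddScores(2, 3);
    delete [] pScores;
}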

2.5: Runtime Type Identification


Runtime type identification allows you to programmatically get information about objects and
classes at runtime. For instance, given a pointer to a base class, you can use RTTI to
determine exactly which type of class the object really is. The dynamic_cast operator relies
on RTTI to perform the proper casting of objects.
In order for RTTI to work, a program must store information about every class with one or more
virtual functions. This information is stored in a type_info object. If your project includes
many classes, the overhead can make your program larger. Runtime performance is not
affected by RTTI.
Very few programs need to use RTTI. Don’t use it unless you need it. Simply avoid the
dynamic_cast and typeid operators. Make sure that RTTI is disabled on the compiler
command line as well.
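
A sketch of the usual alternative, assuming a hypothetical Shape hierarchy: replace the
dynamic_cast query with an ordinary virtual call, and RTTI can stay disabled.

class Shape
{
public:
    virtual ~Shape() {}
    virtual double Area() const = 0;
};
class Circle : public Shape { /* ... */ };

// Requires RTTI for the dynamic_cast
double AreaViaRtti(Shape* pShape)
{
    if (Circle* pCircle = dynamic_cast<Circle*>(pShape))
        return (pCircle->Area());
    return (0.0);
}

// Plain virtual dispatch; no RTTI needed
double AreaViaVirtual(Shape* pShape)
{
    return (pShape->Area());
}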
2.6: Stream I/O vs. Printf
C++ stream I/O is a very flexible and safe method of doing input and output. It allows you to
define specific output formats for your own objects. If an object doesn’t support output, the
compiler will tell you. Our old friend printf, on the other hand, is not very safe. If you specify the
wrong number of parameters, or give the wrong order, you crash. You can’t define new output
formats, either. But printf does have a few things going for it: it’s fast, it’s easy to use, and it’s
often easier to read than long lines of << operators. Consider the following two code examples.
// C++ stream io
cout << 'a' << ' ' << 1234 << ' ' << setiosflags(ios::showpoint)
     << setiosflags(ios::fixed) << 1234.5678 << ' ' << setiosflags(ios::hex)
     << &i << ' ' << "abcd" << '\n';

// stdio
printf("%c %d %f %p %s\n", 'a', 1234, 1234.5678, &i, "abcd");

Stream IO vs Printf (relative speed; higher is better)

          cout    printf
 speed    1.00    1.15

Both examples display the same results, but printf does it more efficiently and more readably.
Use the <cstdio> family of functions instead of the <iostream> family when output speed is
critical.
(see StreamIOvsPrintf project for benchmark code)

2.7: Alternative Libraries


As the stream I/O example shows, there’s always the possibility that another library will be
more efficient. It can pay big dividends to consider alternative libraries, whether they’re 3D,
graphics, mathematical or I/O libraries. The libraries that came with your compiler are probably
not the most efficient code available. Evaluate the possibilities. Benchmark results. Report
your findings to the game development community. And don’t forget to wrap and protect your
code so that changing low-level libraries is as painless as possible.

3: C++ Optimizations You Can Do “As You Go”

Defy the software engineering mantra of “optimization procrastination.” These techniques can
be added to your code today! In general, these methods not only make your code more
efficient, but increase readability and maintainability, too.
3.1: Pass Class Parameters by Reference
Passing an object by value requires that the entire object be copied (copy ctor), whereas
passing by reference does not invoke a copy constructor, though you pay a “dereference”
penalty when the object is used within the function. This is an easy tip to forget, especially for
small objects. As you’ll see, even for relatively small objects, the penalty of passing by value
can be stiff. I compared the speed of the following functions:
template <class T> void ByValue(T t) { }
template <class T> void ByReference(const T& t) { }
template <class T> void ByPointer(const T* t) { }

Pass by Reference (relative performance; higher is better)

                 int     complex   string   bitmap
 by value        1.00    1.00      1.00     1.00
 by reference    0.97    1.37      28.14    16682.60
 by pointer      0.97    1.38      28.24    16742.18

For strings, passing by reference is almost 30 times faster! For the bitmap class, it’s thousands
of times faster. What is surprising is that passing a complex object by reference is almost 40%
faster than passing by value. Only ints and smaller objects should be passed by value,
because it’s cheaper to copy them than to take the dereferencing hit within the function.
(see PassByReference project for benchmark code)

3.2: Postpone Variable Declaration as Long as Possible


In C, variables must be declared at the top of a block. It seems natural to use this same
method in C++. However, in C++, declaring a variable can be an expensive operation when
the object has a non-trivial constructor or destructor. C++ allows you to declare a variable
wherever you need to. The only restriction is that the variable must be declared before it’s
used. For maximum efficiency, declare variables in the minimum scope necessary, and only
immediately before they’re used.
// Declare Outside (b is true half the time)
T x;
if (b) x = t;
// Declare Inside (b is true half the time)
if (b) T x = t;

Postpone Declaration (relative performance; higher is better)

                    int     complex   string   bitmap
 Declare Outside    1.00    1.00      1.00     1.00
 Declare Inside     1.02    2.84      1.50     1.01

Without exception, it’s as fast or faster to declare the objects within the scope of the if
statement. The only time where it may make sense to declare an object outside of the scope
where it’s used is in the case of loops. An object declared at the top of a loop is constructed
each time through the loop. If the object naturally changes every time through the loop, declare
it within the loop. If the object is constant throughout the loop, declare it outside the loop.
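
A quick sketch of both loop cases (GetName and Process are hypothetical functions):

// Changes every iteration: declare it inside the loop
for (int i = 0; i < nCount; ++i)
{
    string sName = GetName(i); // constructed fresh each pass, as it should be
    Process(sName);
}

// Constant throughout the loop: construct it once, outside
const string sPrefix("player_");
for (int j = 0; j < nCount; ++j)
    Process(sPrefix + GetName(j));
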
(see PostponeDeclaration project for benchmark code)

3.3: Prefer Initialization over Assignment


Another C holdover is the practice of defining variables first and assigning them later. With
C++, this no longer applies. In fact, it’s to your advantage to initialize a variable at the moment
it’s declared. Initializing an object invokes the object’s copy constructor. That’s it. Defining and
then assigning an object invokes both the default constructor and then the assignment
operator. Why take two steps when one will do?
This recommendation nicely complements postponing declarations. In the ideal case,
postpone your declaration until you can do an initialization.
// Initialization
T x = t; // alternately T x(t); either one invokes copy ctor
// Assignment
T x; // default ctor
x = t; // assignment operator

Prefer Initialization (relative performance; higher is better)

                   int     complex   string   bitmap
 Assignment        1.00    1.00      1.00     1.00
 Initialization    1.04    4.26      1.06     1.00

Initializing a complex value is over four times faster than declaring and assigning. Even for
strings, the gain is 6%. Surprisingly, it makes little difference for the bitmap object. That’s
because the time to default construct a bitmap is minuscule in comparison to the time required
to copy one bitmap to another.
Here’s a real world case from WON, the company where I work. This is code that’s running
today – slightly modified to protect the guilty. It probably looks similar to code in your own
projects. The input strings are copied to slightly different string objects.
void SNCommGPSendNewUser(const SNstring& sUser, const SNstring& sPass,
                         /* 9 more SNstring params ... */ )
{
    string User;
    string Pass;
    User = sUser; // Convert to our format
    Pass = sPass;
    // etc . . .
}

Here’s the code revised to use the initialization technique.


void SNCommGPSendNewUser(const SNstring& sUser, const SNstring& sPass,
                         /* 9 more SNstring params ... */ )
{
    // Convert to our format
    string User = sUser;
    string Pass = sPass;
    // etc . . .
}

Readability improvement: 100%. Lines of code: 50% of original. Speed improvement: just over
3%. Not huge, but certainly nothing to complain about. Triple win.
(see PreferInitialization project for benchmark code)
3.4: Use Constructor Initialization Lists
In any constructor that initializes member objects, it can pay big dividends to set the objects
using an initialization list rather than within the constructor itself. Why? Class member
variables are automatically constructed using their default constructor prior to entry within the
class constructor itself. You can override this behavior by specifying a different member
constructor (usually a copy constructor) in the initialization list. Multiple initializations are
separated with commas (see the sketch after the following code).
template <class T> class CtorInit
{
    T m_Value;
public:
    // no list (shown for comparison; a real class could have only one of
    // these two constructors, since their signatures are identical)
    CtorInit(const T& t) // m_Value default ctor called here automatically
    {
        m_Value = t;     // m_Value assignment operator called
    }
    // with list
    CtorInit(const T& t) : m_Value(t) { } // m_Value copy ctor called
};
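
And a sketch with multiple members, showing the comma-separated list (Player is a
hypothetical class; note that members are initialized in declaration order, not list order):

class Player
{
    string m_sName;
    complex<double> m_Position;
public:
    Player(const string& sName, const complex<double>& pos)
        : m_sName(sName), m_Position(pos) { } // comma-separated initializers
};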

The drawback to using initialization lists is that there’s no way to do error checking on incoming
values. In the “no list” example we could do some validation on t within the CtorInit function. In
the “with list” example, we can’t do any error checking until we’ve actually entered the CtorInit
code, by which time t has already been assigned to m_Value. There’s also a readability
drawback, especially if you’re not used to initialization lists.

Initialization Lists (relative performance; higher is better)

              int     complex   string   bitmap
 no list      1.00    1.00      1.00     1.00
 with list    1.01    3.10      1.10     1.01

Nevertheless, these are good performance gains, particularly for the complex object. This type
of performance can outweigh the drawbacks.
(see InitializationLists project for benchmark code)

3.5: Use Operator+= Instead of Operator+ Alone


One of the great things about C++ is the ability to define your own operators. Rather than
coding strC = strA; strC.append(strB); you can code strC = strA + strB. The notion of
“appending” is conveyed much more simply and precisely using operator +. One thing to keep
in mind, however, is that operator + returns a temporary value that must be both constructed
and destructed, whereas operator += modifies an existing value. In fact, operator + can usually
be defined in terms of operator +=.
T T::operator + (const T& t) const
{
    T result(*this);      // temporary object
    return (result += t);
}

It’s typically more efficient to use += instead of + alone, because we avoid generating a
temporary object. Consider the following functions. They give the same result, but one uses +
alone and the other uses +=.
template <class T> T OperatorAlone(const T& a, const T& b)
{
    T c(a + b);
    return (c);
}

template <class T> T OperatorEquals(const T& a, const T& b)
{
    T c(a);
    c += b;
    return (c);
}

Operator += vs. Operator + (relative performance; higher is better)

                int     complex   string   bitmap
 operator +     1.00    1.00      1.00     1.00
 operator +=    0.93    0.98      1.23     1.37

For intrinsic types, + alone gives better results, but for non-trivial classes, especially classes
with costly construction time, += is the better choice.
(see OperatorEquals project for benchmark code)

3.6: Use Prefix Operators


Objects that provide the concept of increment and decrement often provide the ++ and --
operators. There are two types of incrementing, prefix (++x) and postfix (x++). In the prefix
case, the value is increased and the new value is returned. In the postfix case, the value is
increased, but the old value is returned. Correctly implementing the postfix case requires
saving the original value of the object – in other words, creating a temporary object. The postfix
code can usually be defined in terms of the prefix code.
const T T::operator ++ (int) // postfix
{
    T orig(*this);
    ++(*this);      // call prefix operator
    return (orig);
}
The clear recommendation: avoid postfix operators. In fact, you may want to declare the
postfix operators in the private section of the class so that the compiler will flag incorrect usage
for you automatically.
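
A sketch of that advice (Counter is a hypothetical class): the prefix form is public, and the
postfix form is declared private and left undefined, so any stray x++ fails to compile.

class Counter
{
private:
    int m_n;
    const Counter operator ++ (int); // postfix: private and never defined
public:
    Counter() : m_n(0) { }
    Counter& operator ++ () { ++m_n; return (*this); } // prefix: no temporary
};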

Prefer Prefix (relative performance; higher is better)

               int     complex   bitmap
 x++ (post)    1.00    1.00      1.00
 ++x (pre)     1.01    1.47      176.01

Strings aren’t included in the results because increment doesn’t make sense for strings. (It
doesn’t really make sense for bitmaps, either, but I defined increment to increase the width and
height by one, forcing a reallocation.) Where this recommendation really shines is for mathe-
matical objects like complex. The prefix operator is almost 50% faster for complex objects.
(see PreferPrefix project for benchmark code)

3.7: Use Explicit Constructors


The explicit keyword is a recent addition to C++ and is now part of the language standard. It
solves the following potential problem. Suppose you have the following class:
class pair
{
    double x, y;
public:
    pair() { . . . } // default ctor (needed for "pair p;" below)
    pair(const string& s) { . . . }
    bool operator == (const pair& c) const { . . . }
};

Now suppose you do the following comparison, either purposefully or accidentally:


pair p;
string s;
if (p == s) { . . . }

Your compiler is pretty smart. It knows how to compare two pairs because you told it how in
the pair class. It also knows how to create a pair given a string, so it can easily evaluate
(p == s). The drawback is that we’ve hidden the second pair constructor – it’s implicit. If that
constructor is expensive, it’s difficult to see that it’s being invoked. Worse, if we made a
mistake and we didn’t really want to compare a pair with a string, the compiler won’t tell us.
My advice: make all single-argument constructors (except the copy constructor) explicit.
explicit pair(const string& s) { . . . }

Now the (p == s) line will give a compiler error. If you really want to compare these guys, you
must explicitly call the constructor:
if (p == pair(s)) { . . . }

Using explicit will protect you from stupid mistakes and make it easier for you to pinpoint
potential bottlenecks.

4: C++ Final Optimizations

Your game is up and running. The data structures are ideal, the algorithms sublime, the code
elegant, but the game – well, it’s not quite living up to its potential. Time to get drastic, and with
drastic measures, there are tradeoffs to consider. These optimizations are going to make your
code less modular, harder to understand, and more difficult to maintain. They may cause
unexpected side effects like code bloat. Your compiler may not even be able to handle some of
the more advanced template-based techniques. Proceed with caution. Arm yourself with a
good profiler.

4.1: Inline Functions


The inline keyword is an extremely simple and powerful means of optimizing C++ programs.
In one oft-cited case, the C++ standard library sort ran 7 times faster than qsort on a test of
500,000 elements because the C++ version was inlined. On the other hand, inline functions
are also overused, and the consequences are significant. Inline indicates to the compiler that a
function may be considered for inline expansion. If the compiler chooses to inline a function,
the function is not called, but copied into place. The performance gain comes in avoiding the
function call, stack frame manipulation, and the function return. The gains can be considerable.
Beware! Inline functions are not free. They can increase program size. They can increase
execution time by reducing the caller’s locality of reference. When sizes increase, the caller’s
inner loop may no longer fit in the processor cache, causing unnecessary cache misses and
the consequent performance hit. Inline functions also increase build times – if inline functions
change, the world must be recompiled. Some guidelines:
• Avoid inlining functions until profiling indicates which functions could benefit from inline.
• Consider using your compiler’s option for auto-inlining after profiling both with and without
auto-inlining.
• Only inline functions where the function call overhead is large relative to the function’s
code. In other words, inlining large functions or functions that call other (possibly inlined)
functions is not a good idea.
Sample code and relative performance:
void a() {}
int b(int i) { return (i); }
int c(int i, string& s, const complex<double>& c, const Bitmap& b)
{
    s = b.GetName();
    return (i += static_cast<int>(c.real()));
}

class Eval
{
    int n; string s; complex<double> c; Bitmap b;
public:
    int GetInt() const { return (n); }
    complex<double> GetCp() const { return (c); }
    string GetStr() const { return (s); }
    Bitmap GetBmp() const { return (b); }
};

Inline Functions (relative performance; higher is better)

               a()     b()     c()     GetInt()   GetCp()   GetStr()   GetBmp()
 not inline    1.00    1.00    1.00    1.00       1.00      1.00       1.00
 inline        1.46    1.75    1.00    2.09       2.36      0.97       1.00

The biggest gain is inlining the function that returns the complex value. Inlining that function
more than doubled performance relative to the non-inlined version. Inlining the larger functions or
the functions that returned non-trivial objects (strings, bitmaps) did not improve performance at
all.
Note that using Microsoft’s inline “any suitable” function option did not inline any functions
other than those already marked inline, not even the world’s simplest function, a()!
Clearly, there’s room for improvement in the Visual C++ compiler.
(see Inline project for benchmark code)

4.2: Avoid Temporary Objects: the Return Value Optimization


C++ is a powerful language, but the power behind programmer-defined objects and
guaranteed object initialization and destruction is not without a price. In many cases, the price
you pay is what’s called “hidden temporary objects.” C++ merrily builds and destroys objects in
many non-obvious places. Some of the performance techniques already listed above are
methods of avoiding temporaries. For instance, passing objects by reference instead of by
value avoids temporaries. Using ++x instead of x++ avoids temporaries. Using the explicit
keyword helps avoid hidden temporaries.
Here’s one more tip for avoiding temporaries. It’s called return value optimization. The best
way of showing how it works is through an example. Mult multiplies two complex values and
returns the result.
complex<double> Mult(const complex<double>& a, const complex<double>& b)
{
    complex<double> c;
    double i = (a.real() * b.imag()) + (a.imag() * b.real());
    double r = (a.real() * b.real()) - (a.imag() * b.imag());
    c.imag(i);
    c.real(r);
    return (c);
}

The code is correct, but it could be more efficient. We already know we can improve the
efficiency by initializing the complex value when it’s constructed:
complex<double> Mult(const complex<double>& a, const complex<double>& b)
{
    complex<double> c((a.real() * b.real()) - (a.imag() * b.imag()),
                      (a.real() * b.imag()) + (a.imag() * b.real()));
    return (c);
}

Now let’s take it one step further:


complex<double> Mult(const complex<double>& a, const complex<double>& b)
{
    return (complex<double>((a.real() * b.real()) - (a.imag() * b.imag()),
                            (a.real() * b.imag()) + (a.imag() * b.real())));
}

At this point, the compiler can work a little magic. It can omit creating the temporary object that
holds the function return value, because that object is unnamed. It can construct the object
defined by the return expression inside the memory of the object that is receiving the result.
Return value optimization is another way of saying return constructor arguments instead of
named objects. What kind of gains can you expect from this optimization? Here are some
trivial example functions I used to evaluate potential performance improvements.
template <class T> T Original(const T& tValue)
{
    T tResult; // named object; probably can’t be optimized away
    tResult = tValue;
    return (tResult);
}

template <class T> T Optimized(const T& tValue)
{
    return (T(tValue)); // unnamed object; optimization potential high
}

Return Value Optimization (relative performance; higher is better)

              int     complex   string   bitmap   cMult
 original     1.00    1.00      1.00     1.00     1.00
 optimized    1.00    1.02      2.07     3.41     1.03

The results are favorable. String performance doubled and bitmap performance more than
tripled. Before you go using this trick willy-nilly, be aware of the drawbacks. It’s hard to check
for errors in intermediate results. In the first version of Mult above, we could easily add error
checking on the real and imaginary values. In the final version, all we can do is hope there are
no overflow or underflow errors. The first version is also much easier to read. In the final version,
there could be a non-obvious error. For instance, is the real part the first parameter to the
complex constructor, or is the imaginary part?
One more note. The final version of the C++ standard has made it easier for compilers to opti-
mize away even named objects. The standard says that for functions with a class return type, if
the return statement is the name of a local object, and the object is the same type as the return
type, the compiler can omit creating a temporary object to hold the function return value. That
means that in some rosy future years away, return value optimization will be in the hands of
compiler writers where it should be, and we won’t have to change our code. And pigs could fly.
In the meantime, this tip is worth considering.
(see ReturnValueOpt project for benchmark code)

4.3: Avoid Virtual Functions


This recommendation should really be called “Use Virtual Functions, But Be Aware of the
Cost.” Much has been made about the overhead of virtual functions. A single virtual function in
a class requires that every class object and derived objects contain an extra pointer. For
simple classes, this can as much as double the size of the object. Creating virtual objects costs
more than creating non-virtual objects, because the virtual function table must be initialized.
And it takes slightly longer to call virtual functions, because of the additional level of indirection.
However, when you really need virtual functions, you’re unlikely to develop a faster mechanism
than the virtual function dispatching built into C++. Think about it. If you needed to track an
object’s type, you’d have to store a type flag in the class, adding to the size of the object. To
duplicate virtual functions, you’d probably have a switch statement on the flag, and the virtual
function table is faster – not to mention less error prone and much more flexible – than a
switch statement.
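
Here’s a sketch of that hand-rolled alternative (Entity and the update cases are hypothetical).
The type flag costs space just as a vtable pointer would, and every new type means touching
the switch:

enum EntityType { TYPE_PLAYER, TYPE_MONSTER };

struct Entity
{
    EntityType m_type; // hand-rolled "vtable": a flag per object
};

void Update(Entity* pEntity)
{
    switch (pEntity->m_type) // hand-rolled dispatch
    {
        case TYPE_PLAYER:  /* update player */  break;
        case TYPE_MONSTER: /* update monster */ break;
    }
}
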
The following code shows a worst case scenario, a tiny object with trivial functions. The Virtual
object is 8 bytes, and the NonVirtual object is 4 bytes.
class Virtual // sizeof(Virtual) == 8
{
private:
    int mv;
public:
    Virtual() { mv = 0; }
    virtual ~Virtual() {}
    virtual int foo() const { return (mv); }
};

class NonVirtual // sizeof(NonVirtual) == 4
{
private:
    int mnv;
public:
    NonVirtual() { mnv = 0; }
    ~NonVirtual() {}
    int foo() const { return (mnv); }
};

Virtual Function Overhead (relative performance; higher is better)

               ctor/dtor   foo
 NonVirtual    1.00        1.00
 Virtual       0.73        0.96

Construction/destruction time shows the performance penalty of initializing the virtual function
table. (See the Microsoft-specific tip below for reducing this penalty). Notice that the function
call overhead is very minimal, even though the function itself hardly does anything. For larger
functions, the overhead becomes even less significant.
(see VirtualFunctions project for benchmark code)

4.4: Return Objects Via Reference Parameters


The cleanest way to return an object from a function is returning the object by value.
template <class T> T ByValue()
{
    T t; // get t from file, computation, whatever
    return (t);
}

When you’re writing a function, the above method is preferable about 99.9% of the time. You
should almost never return an object by reference. That’s why this tip is entitled return objects
via reference parameters, not return object by reference.
template <class T> T& ByRef() // RED ALERT!
{
    T t; // get t from file, computation, whatever
    return (t); // DON’T DO THIS!
}

When t goes out of scope, it’s destroyed and its memory is returned to the stack. Any
reference to the object is invalid after the function has returned! A good compiler will warn you
about this type of mistake. You can, however, pass your return value as a non-const reference
function parameter, and in some cases see improved performance.
template <class T> void ByReference(T& byRef)
{
    T t; // get t from file, computation, whatever
    byRef = t;
}

You’ll only see a performance improvement if the object you’re passing in as the desired return
value (byRef) is being reused and hasn’t simply been declared immediately prior to calling
the function. In other words,
T t;
for( . . . )
ByReference(t);

may be faster, because t is being reused every time through the loop, and doesn’t have to be
reconstructed and redestroyed at each iteration, while
T t;
ByReference(t);

is exactly the same as returning the object by value, and even worse, doesn’t lend itself to
possible return value optimizations. There’s another good reason to avoid this suggestion – it
can make your code very hard to read.

Return by Reference (relative performance; higher is better)

             int     complex   string   bitmap
 by value    1.00    1.00      1.00     1.00
 by ref      0.71    2.22      1.43     1.90

The results above show the times spent within ByValue and ByReference. These numbers are
slightly misleading, because they don’t show any construction time for T in the ByReference
case, and T certainly must be constructed somewhere prior to calling ByReference.
Nevertheless, the results show that performance gains may be significant in limited cases.
(see ReturnByValOrRef project for benchmark code)
4.5: Per-class Allocation
One feature of C++ that shows the true power and flexibility of the language is the ability to
overload new and delete at the class level. For some objects, this power can give you
incredible speed improvements. To see what kind of performance improvements I could get, I
derived my own string class from the standard string class and tried various allocation
schemes. I wasn’t too successful, namely because standard new and delete are already pretty
dang fast, and because the string class wasn’t a really good choice. The objects that will see the
most improvement are the objects you use in a specific way that can truly benefit from a
custom allocation scheme.
I used an approach called the “memory pool” method. Using individual pools for memory
allocation is beneficial because it improves locality and you can optimize knowing that all
objects in the pool are the same size.
My new string class looked like this:
template <class Pool> class MString : public string
{
public:
    MString() {}
    virtual ~MString() {}
    void* operator new(size_t nBytes)
    {
        return (GetPool().Allocate(nBytes));
    }
    void operator delete(void* pFree)
    {
        GetPool().Deallocate(pFree, sizeof(MString));
    }
private:
    static Pool& GetPool() { static Pool p; return (p); }
};

I tried some different types of pools. MemPool uses a heap block and a free list. StackPool
uses a chunk of the stack for the pool. HeapPool uses one heap block and no free list.
typedef MString<MemPool> FLStr; // heap block w/ free-list
typedef MString<StackPool<POOL_SIZE> > SStr; // stack-based
typedef MString<HeapPool> HStr; // single heap block

The StackPool object is shown below.


template <int PoolSize> class StackPool
{
private:
    unsigned char m_pPool[PoolSize]; // the pool
    size_t m_nBytesAllocated;
public:
    StackPool() { m_nBytesAllocated = 0; }
    ~StackPool() {}
    void* Allocate(size_t nBytes) throw (std::bad_alloc)
    {
        void* pReturn = m_pPool + m_nBytesAllocated;
        m_nBytesAllocated += nBytes;
        return (pReturn);
    }
    void Deallocate(void* pFree, size_t nBytes) {}
};
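
MemPool isn’t shown in the text. Here’s a minimal sketch of what a free-list pool might look
like, simplified in two respects: it falls back on operator new per block rather than carving
nodes out of one big heap block, and it assumes every allocation is the same size (at least
sizeof(FreeNode)).

class MemPool
{
private:
    struct FreeNode { FreeNode* m_pNext; };
    FreeNode* m_pFreeList; // recycled blocks, all the same size
public:
    MemPool() { m_pFreeList = NULL; }
    void* Allocate(size_t nBytes)
    {
        if (m_pFreeList != NULL) // reuse a freed block if one is available
        {
            void* pReturn = m_pFreeList;
            m_pFreeList = m_pFreeList->m_pNext;
            return (pReturn);
        }
        return (operator new(nBytes)); // otherwise fall back to the heap
    }
    void Deallocate(void* pFree, size_t /*nBytes*/)
    {
        // thread the block onto the free list instead of freeing it
        FreeNode* pNode = static_cast<FreeNode*>(pFree);
        pNode->m_pNext = m_pFreeList;
        m_pFreeList = pNode;
    }
};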

Overload New and Delete (relative performance; higher is better)

                string   Str     FLStr   SStr    HStr
 performance    1.00     0.96    0.78    1.20    0.57

The stack pool gave the best performance. It was faster than the default new and delete
implementations by 20%. It was sorely limited, though. After a certain number of allocations, it
would crash because there was no more room in the pool. StackPool isn’t complex enough to
say “OK everybody – out of the pool.” Profile your code. If you find that new or delete is a
bottleneck for certain specific objects, consider overloading new and delete. However, you
might find your money is better invested in a memory management library that will improve
your performance across the board.
(see OverloadNewDelete project for benchmark code)

4.6: Use STL Allocators


STL containers provide a mechanism similar to overloading new and delete that allows you to
customize the allocation behavior of a container. Every container takes as a template parameter
an “allocator” object. This template parameter defaults to the standard allocator provided by
the STL implementation. In most cases, the standard allocator class uses new and delete
internally. You can always write your own allocators if you think you can do a better job, or, more
likely, you have a container that you use in such a way that lends itself to custom allocation.
The following class is a custom allocator object called MyAlloc. It is based on a standard
allocator object and can be used in place of a standard allocator.
template <class T, class Pool> class MyAlloc
{
public:
    // Standard typedefs and functions omitted for brevity . . .

    // Our specializations
    pointer allocate(size_type nBytes, const void* /*pHint*/)
    {
        if (nBytes < 0) nBytes = 0;
        return ((pointer) GetPool().Allocate(nBytes * sizeof(T)));
    }
    void deallocate(void _FARQ* pFree, size_type nBytes)
    {
        GetPool().Deallocate(pFree, nBytes);
    }
private:
    static Pool& GetPool() { static Pool p; return (p); }
};
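
The “standard typedefs and functions” omitted above are mostly mechanical. Here’s a sketch
of what the standard allocator interface requires; this is my reconstruction of the boilerplate,
not code from the paper or the Microsoft headers.

// inside MyAlloc's public section:
typedef T value_type;
typedef T* pointer;
typedef const T* const_pointer;
typedef T& reference;
typedef const T& const_reference;
typedef size_t size_type;
typedef ptrdiff_t difference_type;

// lets containers get an allocator for a different type (e.g. list nodes)
template <class U> struct rebind { typedef MyAlloc<U, Pool> other; };

void construct(pointer p, const T& t) { new (p) T(t); } // placement new
void destroy(pointer p) { p->~T(); }
size_type max_size() const { return (size_type(-1) / sizeof(T)); }
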
The key functions are allocate and deallocate, which perform the bulk of the work. If you
compare these functions to the overloaded new and delete operators in the previous example,
you’ll see that they’re very similar. Here’s an example of using MyAlloc as the allocator for a
list container. I use the HeapPool object mentioned above as the pool for this allocator.
typedef MyAlloc<int, HeapPool> NHeapPoolAlloc;
list<int, NHeapPoolAlloc> IntList;

To compare efficiency, I evaluated the speed of inserting items into a list using the default
allocator and NHeapPoolAlloc.

STL Allocators (relative performance; higher is better)

                int     complex   string   bitmap
 standard       1.00    1.00      1.00     1.00
 specialized    1.94    0.91      0.16     6.18

The results are all over the place. Strangely enough, there was significant improvement for
ints. On the other hand, string performance plummeted. Just goes to show how important it is
to evaluate your “optimization” once it’s in place.
(see STLAllocators project for benchmark code)

4.7: The Empty Member Optimization


Most programmers are surprised to discover that empty classes still require memory. For
instance, suppose we define an allocator-type class similar to STL allocators. This class
doesn’t contain any data – just a couple of functions.
template <class T> class Allocator // (sizeof(Allocator) == 1)
{
public:
    static T* allocate(int n) { return ((T*) operator new(n * sizeof(T))); }
    static void deallocate(void* p) { operator delete(p); }
};

If we declare this object, it requires one byte of memory (sizeof(Allocator) == 1), because the
C++ standard requires that we be able to address its location in memory. If we declare this
object within another class, the compiler byte-alignment settings come into play. The class
below requires 8 bytes of memory if we’re 4-byte aligned.
template <class T, class Alloc = Allocator<T> >
class ListWithAllocMember // (sizeof(ListWithAllocMember) == 8)
{
private:
    Alloc m_heap;
    Node* m_head;
};
This storage requirement has serious ramifications. This list object requires twice the size it
really needs. Fortunately, the C++ standard provides a workaround. It says that “a base class
subobject of an empty class type may have zero size.” In other words, if we derive our class
from the empty class, the empty class overhead disappears. This is the empty member
optimization. The class below is 4 bytes.
template <class T, class Alloc = Allocator<T> >
class ListWithAllocBase : private Alloc // (sizeof(ListWithAllocBase) == 4)
{
private:
    Node* m_head;
};

Deriving from the empty class is not really an ideal solution. There are some cases when it’s
no solution at all. Here’s a better one. We can declare an internal data member derived from
the empty class.
template <class T, class Alloc = Allocator<T> >
class ListWithEmptyMemberAlloc // (sizeof(ListWithEmptyMemberAlloc) == 4)
{
private:
    struct P : public Alloc
    {
        Node* m_head;
    };
    P m_heap;
};

Now there’s an additional level of indirection within the class itself (i.e. we have to use
m_heap.allocate() notation instead of allocate()), but our list is still only 4 bytes large and we
have all the advantages of the allocator object. A Watcom engineer reported that STL
benchmarks ran 30% faster after their compiler team implemented the empty-base optimization.
(see EmptyMember project for benchmark code)

4.8: Template Metaprogramming


Recently there’s been a growing awareness of an unforeseen template capability – using
templates as pre-compilers. By embedding a call to a template within itself, a programmer can
force a compiler to recursively generate code. A good compiler can optimize away all the
recursion, because all template parameters must be known at compile time. The result?
Massive speed improvements. Here’s a function which recursively generates the factorial of n.
int RecursiveFactorial(int n)
{
    return ((n <= 1) ? 1 : (n * RecursiveFactorial(n - 1)));
}

Now look at a template-based version that does the same thing where n is passed as the
template parameter.
template <int N> class Factorial
{
public:
    // Recursive definition
    enum { GetValue = N * Factorial<N - 1>::GetValue };
};

// Specialization for base case
template <> class Factorial<1>
{
public:
    enum { GetValue = 1 };
};

And some example calls, including a call to a non-recursive version.


const int i = 10;
printf("%d\n", RecursiveFactorial(i));
printf("%d\n", NonrecursiveFactorial(i));
printf("%d\n", Factorial<i>::GetValue);

For this particular case, the template version is 130 times as fast as the non-template version.
The Microsoft compiler optimizes away all of the template recursion, implementing the call as a
single move instruction. Now that’s the kind of optimization I like to see.

Factorial (relative performance; higher is better)

             recursive   non-recursive   template
 factorial   1.00        4.69            131.00

Nice results, but not a terribly useful function. The template parameter must be known at
compile time, so it’s not very flexible. For some things, though, that’s not really a limitation.
Suppose you wanted to have an inline sine table. You could use template metaprogramming to
create a class that computed the sine of a number using a series expansion. I did just that. The
code is complex, so I leave it as an exercise for the reader. (Actually, it’s in the accompanying
TemplateMetaprogramming source file). I compared my template-based sine to the C runtime
version, a table-based function, and a non-template function that used series expansion.

Sine (relative performance; higher is better)

        CRT     table-based   series-exp   template
 sine   1.00    1.84          0.20         0.08


In this case, the template-based version performed miserably, when it should have been
roughly the same speed as the table-based version. It turned out that the compiler had not
optimized the template-based series expansion at all, but had implemented it as a massive set
of embedded function calls. In theory, it should have done the same thing as the factorial
example and generated a single move instruction. In practice, it punted when confronted with
too much complexity. Consider yourself warned. Compilers don’t always do what you want.
Your only friend is a profiler.
One more example. Microsoft provides a set of 3D matrix functions with its Direct3D dev kit in
d3dutils.cpp. One of those functions performs matrix multiplication. It looks like this:
D3DMATRIX MatrixMult(const D3DMATRIX& a, const D3DMATRIX& b)
{
    D3DMATRIX ret = ZeroMatrix();
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            for (int k = 0; k < 4; k++)
                ret(i, j) += a(k, j) * b(i, k);
    return ret;
}

We can replace this function with a template-based version that unrolls the loops, completely
eliminating the loop overhead:
template <int I = 0, int J = 0, int K = 0, int Cnt = 0> class MatMult
{
private:
    enum
    {
        NextCnt = Cnt + 1, // renamed from Cnt; redeclaring the template
                           // parameter's name inside the class is ill-formed
        Nextk = NextCnt % 4,
        Nextj = (NextCnt / 4) % 4,
        Nexti = (NextCnt / 16) % 4,
        go = NextCnt < 64
    };
public:
    static inline void GetValue
        (D3DMATRIX& ret, const D3DMATRIX& a, const D3DMATRIX& b)
    {
        ret(I, J) += a(K, J) * b(I, K);
        MatMult<Nexti, Nextj, Nextk, NextCnt>::GetValue(ret, a, b);
    }
};

// specialization to terminate the loop
template <> class MatMult<0, 0, 0, 64>
{
public:
    static inline void GetValue
        (D3DMATRIX& ret, const D3DMATRIX& a, const D3DMATRIX& b) { }
};

The template-based version is called like this:


D3DMATRIX a, b;
D3DMATRIX result = ZeroMatrix();
MatMult<>::GetValue(result, a, b);

The template could be more flexible by providing a dimension parameter, but I left that out for
simplicity’s sake. The template function calculates the next values of i, j, and k and recursively
calls itself with a new count, which goes from 0 to 64. To terminate the loop, there’s a template
specialization that just returns. With a good compiler, the code generated should be as efficient
as writing MatrixMult like this:
D3DMATRIX MatrixMultUnrolled(const D3DMATRIX& a, const D3DMATRIX& b)
{
    D3DMATRIX ret = ZeroMatrix();
    ret(0,0) = a(0,0)*b(0,0) + a(1,0)*b(0,1) + a(2,0)*b(0,2) + a(3,0)*b(0,3);
    ret(0,1) = a(0,1)*b(0,0) + a(1,1)*b(0,1) + a(2,1)*b(0,2) + a(3,1)*b(0,3);
    . . .
    ret(3,3) = a(0,3)*b(3,0) + a(1,3)*b(3,1) + a(2,3)*b(3,2) + a(3,3)*b(3,3);
    return ret;
}

Matrix Multiplication (relative performance; higher is better)

            MatMult   template   unrolled
 matrix *   1.00      1.24       2.24

Unfortunately, Microsoft’s compiler wasn’t completely up to the task, although we did see a
minor improvement over the existing version. Currently, the best performance gain comes from
rewriting the entire function with all loops unrolled.
Template metaprogramming has its advantages. It can be extremely effective for mathematical
and scientific libraries, and there are definite possibilities for streamlining 3D math. One public
scientific computing library, called Blitz++, is based on the template-metaprogramming
concept. The performance of the library is on par with the best Fortran libraries at maximum
optimization. In other cases, performance increases of 3-20 times that of a commercial C++
linear algebra library have been achieved. However, compiler support for templates is still
immature. Microsoft in particular was slow to implement template functionality, and even VC 6.0
doesn’t support the full C++ standard for templates. As compilers advance, template
metaprogramming may take a larger place in optimization technology.
(see TemplateMeta project for benchmark code)

4.9: Copy-On-Write
One general method of increasing efficiency is called lazy evaluation. With lazy evaluation,
nothing is precomputed; you put off all your processing until the result is really needed. A
complementary method is called “copy-on-write.” With copy-on-write, two or more objects can
share the same data until the moment when one of those objects is changed, at which point
the data is physically copied and changed in one of the objects. C++ lends itself nicely to copy-
on-write, since it can be added without affecting a class interface. Microsoft was able to
change the internals of its own CString class to add copy-on-write functionality. Most
programmers never noticed the difference because the class interface was unchanged.
Copy-on-write requires two things: reference counting and smart pointers. A reference count
indicates the number of objects referring to the same piece of data. A smart pointer points to
an object with a reference count. When the reference count goes to zero, the smart pointer is
“smart” enough to automatically delete the object. A simple RefCount and SmartPtr class
follow. Note that DownRefCount returns true when the reference count goes to zero and it’s
safe to delete the object.
class RefCount // a mixin class
{
private:
    int m_nRef; // reference count
public:
    RefCount() { m_nRef = 0; }
    int GetRefCount() const { return (m_nRef); }
    void UpRefCount() { ++m_nRef; }
    bool DownRefCount()
    {
        if (m_nRef > 0 && --m_nRef == 0)
            return (true); // safe to remove object
        return (false);
    }
};

A SmartPtr acts just like a regular pointer except in two cases: when it’s copied, the reference
count is incremented, and when it’s destroyed the reference count is decremented. If the
reference count goes to zero, the object pointed to is destroyed as well.
template <class T> class SmartPtr // T must be derived from RefCount
{
private:
    T* m_pCountedObj;
public:
    SmartPtr() { m_pCountedObj = NULL; }
    SmartPtr(const SmartPtr<T>& spCopy)
    {
        m_pCountedObj = NULL;
        SmartCopy(spCopy);
    }
    SmartPtr(T* pCopy) { m_pCountedObj = NULL; SmartCopy(pCopy); }
    ~SmartPtr() { Destroy(); }
    SmartPtr<T>& operator = (const SmartPtr<T>& spCopy)
    {
        if (&spCopy == this)
            return (*this);
        return (SmartCopy(spCopy));
    }
    T& operator * () const { return (*m_pCountedObj); }
    T* operator -> () const { return (m_pCountedObj); }
    operator T* () const { return (m_pCountedObj); }
    SmartPtr<T>& SmartCopy(T* pCopy)
    {
        Destroy();
        m_pCountedObj = pCopy;
        if (pCopy != NULL)
            m_pCountedObj->UpRefCount();
        return (*this);
    }
    SmartPtr<T>& SmartCopy(const SmartPtr<T>& spCopy)
        { return (SmartCopy(spCopy.m_pCountedObj)); }
private:
    void Destroy()
    {
        if (m_pCountedObj != NULL && m_pCountedObj->DownRefCount())
        {
            delete m_pCountedObj;
            m_pCountedObj = NULL;
        }
    }
};

We create a new reference-counted bitmap class by inheriting from Bitmap and using the
RefCount mixin class.
class CountedBitmap : public Bitmap, public RefCount { };

Now we can create a “smart” bitmap class. It contains a smart pointer to a reference-counted
bitmap. Whereas copying a regular bitmap requires deleting and reallocating memory,
copying a SmartBitmap is as simple and efficient as copying a pointer. We only need to do
“expensive” operations when the bitmap is actually changed (e.g. Create’d or Destroy’ed).
class SmartBitmap
{
private:
    SmartPtr<CountedBitmap> m_pBitmap;
public:
    SmartBitmap() { m_pBitmap = new CountedBitmap; }
    SmartBitmap(int nWidth, int nHeight, Bitmap::eDepth nDepth)
        { m_pBitmap = new CountedBitmap(nWidth, nHeight, nDepth); }
    virtual ~SmartBitmap() {}
    virtual bool Create(int nWidth, int nHeight, Bitmap::eDepth nDepth)
    {
        // if creating a multiply-referred object, time to copy
        if (m_pBitmap->GetRefCount() > 1)
            m_pBitmap = new CountedBitmap;
        return (m_pBitmap->Create(nWidth, nHeight, nDepth));
    }
    virtual void Destroy()
    {
        // if nuking a multiply-referred object, time to copy
        if (m_pBitmap->GetRefCount() > 1)
            m_pBitmap = new CountedBitmap;
        m_pBitmap->Destroy();
    }
    virtual int GetWidth() const { return (m_pBitmap->GetWidth()); }
    virtual int GetHeight() const { return (m_pBitmap->GetHeight()); }
    // etc. . . .
};

The disadvantage is that we’ve added another level of indirection through the smart pointer.
We’ve also required that the object constructor allocate the reference-counted object so that
the smart pointer can properly delete it. Another important note: the destructor doesn’t do
anything! That’s because the smart pointer will automatically delete the object when nobody
else is referring to it. Cool.
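Here’s a hypothetical usage sketch (Bitmap::eDepth8 is a made-up enumerator; use whatever
your Bitmap class actually defines). Note that the physical copy happens only at the mutating
call:
SmartBitmap bmpScreen(640, 480, Bitmap::eDepth8); // eDepth8 is hypothetical
SmartBitmap bmpCopy(bmpScreen);  // cheap: both objects now share one
                                 // CountedBitmap (ref count == 2)
bmpCopy.Create(320, 240, Bitmap::eDepth8); // write: bmpCopy detaches and
                                           // allocates its own bitmap here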
To evaluate performance, I compared a large set of typical bitmap operations, including
copying. SmartBitmap was six times as fast as the original “dumb” bitmap. Your results will
vary depending on how often objects are copied (slow for dumb objects, fast for smart objects)
and how often copied objects are changed (fast for dumb objects, slow for smart objects).
[Graph: Copy On Write — relative performance: dumb bitmap 1.00, smart bitmap 5.86]

The nice thing about smart objects is that they can be returned by value with no performance
penalty. In the case above, it’s now feasible to use SmartBitmap as a return value.
(see SmartPointer project for benchmark code)

5: Compiler Optimization

A good compiler can have a huge effect on code performance. Most PC compilers are good,
but not great, at optimization. Be aware that sometimes the compiler won't perform
optimizations even though it can. The compiler assigns a higher priority to producing
consistent and correct code than to optimizing performance. Be thankful for small favors.

5.1: Compiler C Language Settings


The following table lists all of the MS Visual C++ 6.0 “C” optimizations for reference. Alternate
methods are given when an optimization can be specified directly in the code. Microsoft default
values for release builds are highlighted.
Name                 Option  Description
Blend                /GB     Optimize for 386 and above
Pentium              /G5     Optimize for Pentium and above
Pentium Pro          /G6     Optimize for Pentium Pro and above
Windows              /GA     Optimize for Windows (specifically access to thread-specific data)
DLL                  /GD     Not currently implemented. Reserved for future use.
Cdecl                /Gd     Caller cleans stack. Slow. Allows variable argument functions. Alternate: __cdecl
Stdcall              /Gz     Callee cleans stack. Fast. No variable argument functions. Alternate: __stdcall
Fastcall             /Gr     Callee cleans stack. Uses registers. Fastest. No variable argument functions.
                             Can't be used with _export. Alternate: __fastcall
String pooling       /Gf     Put duplicate strings in one memory location.
String pooling RO    /GF     Put duplicate strings in one read-only memory location.
Stack probes off     /Gs     Turn off stack checking. Alternate: #pragma check_stack
Func-level linking   /Gy     Linker only includes functions referenced in the OBJ rather than the entire contents
Small                /O1     Same as /Og /Oy /Ob1 /Gs /Gf /Gy /Os (global opts, omit frame ptr, allow inlines,
                             stack probes off, func-level linking, favor code size over speed)
Fast                 /O2     Same as /Og /Oy /Ob1 /Gs /Gf /Gy /Oi /Ot (global opts, omit frame ptr, allow
                             inlines, stack probes off, func-level linking, favor code speed, intrinsic functions)
No aliasing          /Oa     Assume no aliasing occurs within functions. Alternate: #pragma optimize("a")
Intra-func aliasing  /Ow     Assume aliasing occurs across function calls. Alternate: #pragma optimize("w")
Disable all opts     /Od     Turn off all optimizations
Global opts          /Og     Turn on loop, common subexpression and register optimizations.
                             Alternate: #pragma optimize("g")
Intrinsic functions  /Oi     Replace specific functions with inline versions (memcpy, strcpy, strlen, etc.).
                             Alternate: #pragma intrinsic/function
Float consistency    /Op     Increase the precision of floating-point operations at the expense of speed and size
Small code           /Os     Favor code size over speed. Alternate: #pragma optimize("s")
Fast code            /Ot     Favor code speed over size. Alternate: #pragma optimize("t")
Full optimizations   /Ox     Enable the following: /Ob1 /Og /Oi /Ot /Oy /Gs
Omit frame pointer   /Oy     Suppress creation of frame pointers on the call stack. Frees the EBP register for
                             other uses. Alternate: #pragma optimize("y")
Struct packing       /Zpn    Sets the structure member alignment. n = 1, 2, 4, 8 (default), 16. Smaller values
                             generate smaller, slower code. Larger values generate larger, faster code.
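Many of these switches have per-function equivalents in source (the Alternate column), which
is handy when only a handful of hot functions deserve special treatment. A minimal sketch
(BlitSprites is a hypothetical function):
#pragma optimize("t", on)  // favor fast code for the functions below
void BlitSprites()
{
    // . . . inner-loop rendering code . . .
}
#pragma optimize("", on)   // restore the command-line optimization settings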

5.2: Compiler C++ Language Settings


The following table lists all of the Microsoft Visual C++ 6.0 “C++” optimizations for reference.
Alternate methods are given when an optimization can be specified directly in the code.
Microsoft default values for release builds are highlighted.
Name                Option                Description
No Vtable           __declspec(novtable)  Stops the compiler from generating code to initialize the vfptr in
                                          the constructor. Apply to pure interface classes for code size
                                          reduction.
No Throw            __declspec(nothrow)   Stops the compiler from tracking unwindable objects. Apply to
                                          functions that don't throw exceptions for code size reduction.
                                          Same as using the C++ throw() specification.
Disable RTTI        /GR-                  Turn off run-time type information.
Exception handling  /GX                   Turn on exception handling.
Inline expansion    /Ob1                  Allow functions marked inline to be inlined. Alternate: inline,
                                          __forceinline, #pragma inline_depth/inline_recursion
Inline any          /Ob2                  Inline functions deemed appropriate by the compiler. Alternate:
                                          #pragma auto_inline/inline_depth/inline_recursion
Ctor displacement   /vd0                  Disable constructor displacement. Choose this option only if no
                                          class constructors or destructors call virtual functions. Use /vd1
                                          (default) to enable. Alternate: #pragma vtordisp
Best case ptrs      /vmb                  Use best case "pointer to class member" representation. Use this
                                          option if you always define a class before you declare a pointer
                                          to a member of the class. The compiler will issue an error if it
                                          encounters a pointer declaration before the class is defined.
                                          Alternate: #pragma pointers_to_members
Gen. purpose ptrs   /vmg                  Use general purpose "pointer to class member" representation (the
                                          opposite of /vmb). Required if you need to declare a pointer to a
                                          member of a class before defining the class. Requires one of the
                                          following inheritance models: /vmm, /vms, /vmv. Alternate: #pragma
                                          pointers_to_members

5.3: The Ultimate Compiler Settings


Here are the ultimate options for fast programs. Microsoft default values for release builds
are highlighted.
Name                Option                Description
Disable Vtable Init __declspec(novtable)  Stops compiler from generating code to initialize the vfptr in the
                                          constructor. Apply to pure interface classes.
No Throw            __declspec(nothrow)   Stops compiler from tracking unwindable objects. Apply to functions
                                          that don't throw exceptions. Recommend using the Standard C++
                                          exception specification throw() instead.
Pentium Pro         /G6                   Optimize for Pentium Pro and above (program might not run on Pentium)
Windows             /GA                   Optimize for Windows
Fastcall            /Gr                   Fastest calling convention
String pooling RO   /GF                   Merge duplicate strings into one read-only memory location
Disable RTTI        /GR-                  Turn off run-time type information
Stack probes off    /Gs                   Turn off stack checking
Exception handling  /GX-                  Turn off exception handling (assumes the program isn't using
                                          exception handling)
Func-level linking  /Gy                   Only include functions that are referenced
Assume no aliasing  /Oa                   Assume no aliasing occurs within functions
Inline any or       /Ob2 or /Ob1          Inline any function deemed appropriate by the compiler, or turn
inline expansion                          inlining on. Alternates: inline, __forceinline
Global opts         /Og                   Full loop, common subexpression and register optimizations
Intrinsic functions /Oi                   Replace specific functions with inline versions (memcpy, strcpy, etc.)
Fast code           /Ot                   Favor code speed over size (see notes below)
Omit frame pointer  /Oy                   Omit frame pointer
Ctor displacement   /vd0                  Disable constructor displacement
Best case ptrs      /vmb                  Use best case "pointer to class member" representation

Be aware that some of these options can cause your program to fail. See the section below on
unsafe optimizations. There are also some optimizations that you might not choose to use for
your specific game. For instance, if you’re using RTTI or exception handling, don’t turn those
options off.
Optimizing for space can actually be faster than optimizing for speed because programs
optimized for speed are almost always larger, and therefore more likely to cause additional
paging than programs optimized for space. In fact, all Microsoft device drivers and Windows
NT itself are built to minimize space. Try both ways and see which is faster for your game.
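Rolled into a single command line, the table above might look like this sketch. Drop whatever
your game can't tolerate — restore /GX if you use exceptions, or swap /Ot for /Os after you
benchmark both:
cl /c /G6 /GA /Gr /GF /GR- /Gs /GX- /Gy /Oa /Ob2 /Og /Oi /Ot /Oy /vd0 /vmb game.cpp
The source-level __declspec options (novtable, nothrow) aren't command-line switches; apply
them in the code itself as shown in the next two sections.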

5.4: Disable VTable Initialization


The Microsoft-specific __declspec(novtable) option instructs the compiler not to initialize
the virtual function pointer in the constructor of the given object. Normally, this would be a “bad
thing.” However, for abstract classes, there’s no reason to initialize the pointer, because it will
always be properly initialized when a concrete class derived from the object is constructed.
By the way, this option is misnamed. It sounds like it removes the vtable itself, which isn’t at all
true. The option should be called noinitvtable. Now consider the following example objects.
Image is a typical abstract class. Frame is derived from Image. ImageNV and FrameNV are
the same as Image and Frame respectively, except ImageNV uses the novtable option.
class Image
{
public:
    Image()
    {
        // push ebp
        // mov  ebp,esp
        // push ecx
        // mov  dword ptr [ebp-4],ecx
        // mov  eax,dword ptr [this]
        // mov  dword ptr [eax],offset Image::`vftable' (0040b0fc)
    }
    virtual ~Image() = 0;
};
inline Image::~Image() {} // a pure virtual dtor still needs a definition
class Frame : public Image {};

#define NOINITVTABLE __declspec(novtable)
class NOINITVTABLE ImageNV
{
public:
    ImageNV()
    {
        // push ebp
        // mov  ebp,esp
        // push ecx
        // mov  dword ptr [ebp-4],ecx
    }
    virtual ~ImageNV() = 0;
};
inline ImageNV::~ImageNV() {} // a pure virtual dtor still needs a definition
class FrameNV : public ImageNV {};

The ImageNV constructor has two fewer instructions, namely the instructions that initialize the
virtual function table. The optimized constructor is 30% faster. Microsoft’s own ATL class
library uses this compiler option extensively.

[Graph: Disable VTable Initialization — relative performance: Frame ctor 1.00, FrameNV ctor 1.31]

(see NoVTable project for benchmark code)

5.5: Indicate Functions that Don’t Throw Exceptions


The Microsoft-specific __declspec(nothrow) option instructs the compiler not to track
unwindable objects as it normally would in case an exception is thrown and objects must be
unwound on the stack.
A more portable method is to use the Standard C++ exception specification throw(). This
indicates that the specified function will not throw an exception. Here are three example
functions. MayThrow is a typical function. The compiler must assume that it could throw an
exception. NoThrowMS is specified using nothrow. NoThrowStdC is specified using throw().
Calling NoThrowMS or NoThrowStdC is about 1% faster than calling MayThrow.
// compiler assumes this function could throw any exception
int MayThrow(int i) { return (i + 1); }

// MS compiler assumes function cannot throw any exceptions
int __declspec(nothrow) NoThrowMS(int i) { return (i + 1); }

// Any Std C++ compiler assumes function cannot throw any exceptions
int NoThrowStdC(int i) throw() { return (i + 1); }
[Graph: Disable Exception Throwing — relative performance: MayThrow 1.00, NoThrowMS 1.01, NoThrowStdC 1.01]

(see NoThrow project for benchmark code)

5.6: Use the Fastcall Calling Convention


The Microsoft Visual C++ compiler supports the following function calling conventions: cdecl,
stdcall and fastcall (listed above). Fastcall is roughly 2% faster than cdecl on a typical function
call. Use fastcall. Your program will thank you.
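If you can't switch the whole project to /Gr, you can mark individual functions instead.
A minimal sketch of the three conventions on a hypothetical function (under /Gr, __fastcall
becomes the default, so no annotation is needed):
int __cdecl    AddC(int a, int b) { return (a + b); } // args on stack; caller cleans up
int __stdcall  AddS(int a, int b) { return (a + b); } // args on stack; callee cleans up
int __fastcall AddF(int a, int b) { return (a + b); } // first two args in ECX/EDX; fastest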

[Graph: Calling Convention — relative performance: CDecl 1.000, StdCall 1.015, FastCall 1.023]

(see CallingConvention project for benchmark code)

5.7: Warning: Unsafe Optimizations Ahead


Don’t change your optimization settings recklessly. Although most settings would never cause
your program to crash, there are some settings that should be used only when you know your
code is conforming to Microsoft’s recommendations for that particular setting.
The following table lists all potentially risky optimizations.
Name                Option  Notes
Pentium             /G5     Code won't run on 486 or below (use /GB instead)
Pentium Pro         /G6     Code won't run on Pentium or below (use /GB instead)
String pooling      /Gf     If a string is modified, it will be modified for any variable that points to it
String pooling RO   /GF     If a string is modified, memory exception occurs
Stack probes off    /Gs     A stack overflow will crash the program without an overflow error
Exception handling  /GX-    If exception handling is not enabled, an exception may crash the program
Assume no aliasing  /Oa     If there is aliasing in the program, the optimization can cause corrupted data
Inline expansion    /Ob1    Inlines can cause unexpected code bloat and cache misses
Inline any          /Ob2    Inlines can cause unexpected code bloat and cache misses
Intrinsic functions /Oi     Intrinsic functions increase code size
Float consistency   /Op     Resulting floating-point code will be larger and slower
Ctor displacement   /vd0    A virtual function may be passed an incorrect "this" pointer if it is invoked
                            from within a constructor or destructor
Struct packing      /Zpn    Can cause compatibility problems if packing is modified
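To see why /Oa is risky, consider this sketch. With no-aliasing assumed, the compiler may keep
*pB cached in a register across the first store; if a caller passes the same address for both
parameters, the function silently computes the wrong result:
void AddTwice(int* pA, int* pB)
{
    *pA += *pB; // if pA == pB, this store also changes *pB...
    *pA += *pB; // ...but under /Oa the compiler may reuse the stale *pB
}
// AddTwice(&n, &n) may yield 3n under /Oa instead of the correct 4n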

Appendix A: STL Container Efficiency


Container       Stores     Overhead  []     Iterators   Insert                       Erase                      Find   Sort
list            T          8         n/a    Bidirect'l  C                            C                          N      N log N
deque           T          12        C      Random      C at begin or end; else N/2  C at begin or end; else N  N      N log N
vector          T          0         C      Random      C at end; else N             C at end; else N           N      N log N
set             T, Key     12        n/a    Bidirect'l  log N                        log N                      log N  C
multiset        T, Key     12        n/a    Bidirect'l  log N                        d log (N+d)                log N  C
map             Pair, Key  16        log N  Bidirect'l  log N                        log N                      log N  C
multimap        Pair, Key  16        n/a    Bidirect'l  log N                        d log (N+d)                log N  C
stack           T          n/a       n/a    n/a         C                            C                          n/a    n/a
queue           T          n/a       n/a    n/a         C                            C                          n/a    n/a
priority_queue  T          n/a       n/a    n/a         log N                        log N                      n/a    n/a
slist           T          4         n/a    Forward     C                            C                          N      N log N

• C = amortized constant time
• Overhead indicates approximate per-element additional heap space required in bytes.
• Inserting at the end of a vector may cause the vector to be resized; resizing a vector is
O(N). However, the amortized time complexity for vector insertions at the end is constant.
• slist included for comparison only. Singly-linked lists are not standard STL containers.
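One practical consequence of the vector row: if you know the final element count, reserve the
space up front and every push_back stays constant-time. A minimal sketch:
#include <vector>

// Assumes the final element count (10000 here) is known up front
void FillFrameTimes(std::vector<int>& vTimes)
{
    vTimes.reserve(10000); // one allocation up front
    for (int i = 0; i < 10000; ++i)
        vTimes.push_back(i); // amortized constant; no O(N) reallocation now
}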

Appendix B: Relative Costs of Common Programming Operations

The following table was calculated on a PII-400 running NT (a 32-bit OS). It shows
performance relative to a standard 32-bit assignment operation (value 1.00). Larger numbers
indicate better performance. For integer and floating point operations, the fastest relative time
for each operation is highlighted.
Relative Performance of Common Operations (PII-400)
8-bit (char) 16-bit (short) 32-bit (long) 64-bit (__int64) floating-point
Operation signed unsigned signed unsigned signed unsigned signed unsigned float double ldouble
a=b 1.00 1.00 0.64 1.00 1.00 0.87 0.58 0.58 0.87 0.64 0.58
a+b 1.17 0.88 1.17 1.17 1.05 1.00 0.70 0.78 0.88 0.47 0.44
a-b 1.17 0.87 1.17 0.88 1.17 0.87 0.70 0.70 0.91 0.54 0.47
-a 1.00 1.16 1.00 1.10 1.17 1.00 0.87 1.00 1.17 0.51 0.51
++a 0.78 0.88 0.88 0.59 0.82 0.88 0.64 0.63 0.77 0.34 0.30
--a 0.82 1.00 0.58 0.54 0.87 1.00 0.64 0.58 0.82 0.34 0.31
a*b 0.88 1.00 1.04 1.00 1.00 1.00 0.38 0.37 1.00 0.29 0.47
a/b 0.18 0.18 0.18 0.18 0.18 0.18 0.07 0.08 0.22 0.16 0.14
a%b 0.18 0.18 0.18 0.18 0.18 0.18 0.07 0.08 n/a n/a n/a
a == b 1.00 0.78 1.00 0.78 1.00 0.87 0.64 0.64 0.58 0.36 0.26
a != b 1.00 0.87 1.00 0.87 0.88 1.00 0.58 0.50 0.58 0.33 0.30
a>b 1.00 0.78 1.00 0.78 0.87 0.87 0.58 0.54 0.58 0.25 0.34
a<b 1.00 0.88 0.54 0.88 0.88 0.88 0.47 0.50 0.64 0.25 0.25
a >= b 1.00 0.87 1.00 0.87 1.00 0.78 0.54 0.58 0.54 0.37 0.25
a <= b 1.00 0.78 1.00 0.78 1.00 0.87 0.44 0.47 0.54 0.35 0.30
a && b 0.70 0.54 0.70 0.54 0.70 0.70 0.50 0.47 n/a n/a n/a
a || b 0.88 0.87 0.70 0.70 0.78 0.78 0.70 0.70 n/a n/a n/a
!a 0.87 0.88 0.87 0.99 0.78 0.78 0.58 0.58 0.50 0.35 0.35
a >> b 1.00 0.87 1.00 0.88 1.17 1.17 0.54 0.54 n/a n/a n/a
a << b 1.17 0.88 1.00 1.00 1.15 1.17 0.50 0.50 n/a n/a n/a
a&b 0.94 0.87 1.00 1.17 1.00 1.00 0.78 0.70 n/a n/a n/a
a|b 1.17 0.87 1.17 0.88 0.87 1.00 0.70 0.78 n/a n/a n/a
a^b 1.00 1.00 1.00 1.00 0.88 0.87 0.70 0.70 n/a n/a n/a
~a 1.00 1.16 1.00 1.16 1.40 1.00 0.88 0.88 n/a n/a n/a

• There’s very little difference between 8-, 16- and 32-bit operations. In general, signed 32-bit
values give the fastest times. That’s why it makes sense to use int or long as your standard
integer type.
• Operations on unsigned types tend to be slower than the same operation on signed types.
• The slowest operations are division and modulus. In fact, floating-point division is as fast or
faster than integer division.
• Float operations are typically 1.5 to 2 times faster than double operations. If you can afford
the loss of precision, float is the best floating-point type.
Look at the same table for a lower-end machine, a P-133 running Windows 95.
Relative Performance of Common Operations (P-133)
8-bit (char) 16-bit (short) 32-bit (long) 64-bit (__int64) floating-point
Operation signed unsigned signed unsigned signed unsigned signed unsigned float double ldouble
a=b 1.00 0.31 0.63 0.77 1.00 1.00 0.91 0.91 1.00 0.55 0.83
a+b 0.71 0.91 0.71 0.83 1.00 1.00 0.63 0.24 0.71 0.43 0.33
a-b 0.71 0.90 0.71 0.83 1.00 0.71 0.83 0.77 0.71 0.43 0.44
-a 0.91 1.00 0.91 0.91 0.71 1.11 0.77 0.77 0.63 0.50 0.50
++a 0.62 0.90 0.67 0.67 0.91 0.91 0.83 0.83 0.67 0.45 0.45
--a 0.91 0.91 0.67 0.56 0.91 0.91 0.77 0.83 0.67 0.45 0.45
a*b 0.45 0.50 0.45 0.50 0.59 0.59 0.34 0.34 0.71 0.44 0.44
a/b 0.16 0.17 0.16 0.17 0.13 0.19 0.09 0.09 0.23 0.17 0.19
a%b 0.16 0.17 0.16 0.17 0.17 0.20 0.09 0.09 n/a n/a n/a
a == b 0.67 0.55 0.63 0.71 0.91 0.91 0.66 0.67 0.36 0.30 0.25
a != b 0.31 0.77 0.62 0.71 0.91 0.91 0.62 0.63 0.40 0.29 0.29
a>b 0.67 0.77 0.62 0.71 0.91 0.83 0.52 0.53 0.40 0.29 0.30
a<b 0.67 0.77 0.62 0.55 0.91 0.83 0.41 0.36 0.40 0.25 0.30
a >= b 0.67 0.77 0.50 0.71 0.91 0.83 0.50 0.53 0.40 0.29 0.25
a <= b 0.67 0.77 0.62 0.71 0.91 0.83 0.52 0.53 0.42 0.29 0.30
a && b 0.55 0.62 0.50 0.56 0.62 0.50 0.48 0.47 n/a n/a n/a
a || b 0.77 0.83 0.67 0.71 0.62 0.83 0.50 0.38 n/a n/a n/a
!a 0.77 0.63 0.71 0.77 0.91 0.91 0.62 0.62 0.42 0.33 0.33
a >> b 0.59 0.71 0.59 0.71 0.83 0.83 0.67 0.71 n/a n/a n/a
a << b 0.59 0.71 0.54 0.56 0.83 0.83 0.67 0.71 n/a n/a n/a
a&b 0.71 0.91 0.55 0.83 1.00 1.00 0.83 0.83 n/a n/a n/a
a|b 0.71 0.91 0.71 0.83 1.00 1.00 0.62 0.45 n/a n/a n/a
a^b 0.71 0.91 0.71 0.83 1.00 0.71 0.83 0.83 n/a n/a n/a
~a 0.91 1.00 0.91 0.91 0.71 1.11 0.83 0.83 n/a n/a n/a

• On lower-end machines, a 32-bit signed int is clearly the fastest size.
• Operations on unsigned 8- and 16-bit types are typically faster than the same operation on
signed types.
• Integer division: still slow. Floating-point division: slow, but still faster than integer division.
Another thing to be aware of is the cost of converting between types. The baseline case is the
“conversion” from a 32-bit signed int to another 32-bit signed int.
          int32 to int32  int8 to int32  int32 to int8  double to int32  int32 to double  int32 to int64
PII       1.00            0.99           0.92           0.10             0.80             0.74
Pentium   1.00            0.80           0.67           0.20             0.62             0.73

Most conversions are reasonable. But beware the conversion of floating-point to integer! It’s
five to ten times slower than the base case. Interestingly enough, it’s also significantly slower
on the Pentium-II compared to the Pentium.
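If a profiler fingers float-to-int conversions in an inner loop, one common fix is to hoist
the conversion out of the loop and run the body in fixed-point integer math. A hypothetical
sketch (dScale, pnSrc, pnDest and nCount are made-up names):
// Before: one double-to-int conversion per element (the 0.10 case above)
for (int i = 0; i < nCount; ++i)
    pnDest[i] = (int)(dScale * pnSrc[i]);

// After: a single conversion, then pure integer math (fixed point with 8
// fraction bits; assumes non-negative values and 1/256 precision is OK)
int nScale = (int)(dScale * 256.0);
for (int i = 0; i < nCount; ++i)
    pnDest[i] = (nScale * pnSrc[i]) >> 8;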
(see CommonOps project for benchmark code)

Appendix C: Code Tuning and C++ Efficiency Resources

C.1: Profilers and Software Tools


NuMega TrueTime (www.numega.com) From the makers of BoundsChecker. Highly
recommended. Accounts for context switches during program runs, so your results only
include the time that the CPU spent executing your application. Also profiles Java code.
Intel VTune (developer.intel.com) Profile down to the assembly level. Includes a C++ code
coach that recommends code changes that will improve performance.
Rational VisualQuantify (www.rational.com) From the makers of Purify. Similar to TrueTime.
TracePoint HiProf (www.tracepoint.com) No longer distributed, but if you can get your hands
on a copy of version 2.0, it kicks major profiling butt. Easily the best profiling tool I’ve used.
Metrowerks CodeWarrior Analysis Tools Suite (www.metrowerks.com) Metrowerks
acquired TracePoint’s HiProf technology in the spring of 1998 and now uses the same
technology in their new product CATS. Supports both CodeWarrior and Visual C++ 4.2 and up.
PC-Lint (www.gimpel.com) Detects parameters that could be passed by reference instead of
by value, postfix increments that could be made prefix, and other efficiency bottlenecks, as
well as a host of common C++ (and C) programming errors.

C.2: Books
Effective C++ and More Effective C++, Scott Meyers (www.aristeia.com) Superb tips from a
C++ guru, from basic techniques to copy-on-write and multiple dispatch. The chapter on
efficiency in More Effective C++ is a gem.
Code Complete, Steve McConnell (www.construx.com) General purpose code-tuning
strategies, with examples and results.
C++ Gems, Stan Lippman (people.we.mediaone.net/stanlipp) Excellent template essays.
Writing Efficient Programs, Jon Bentley (www.engr.orst.edu) A general set of rules and a
standard methodology for optimizing any program. Very concise. Includes examples.
Optimizing C++, Steve Heller (www.koyote.com/users/stheller) Algorithmic optimizations.
Graphics Programming Black Book Special Edition, Michael Abrash (www.amazon.com)
The bible on x86 assembly optimization, especially for 3D graphics.
Graphics Gems I – V, Glassner, et al. (www.acm.org/tog/GraphicsGems) Efficient algorithms
designed by graphics programmers and researchers.
Inner Loops, Rick Booth (ourworld.compuserve.com/homepages/rbooth) More x86 assembly
optimization.
High Performance Computing, 2nd Edition, Kevin Dowd (www.oreilly.com/catalog/hpc2) A
high-level look at optimization.
C.3: Websites
SGI STL extensions (www.sgi.com/Technology/STL)
SGI Singly-linked lists A non-standard list class. The SGI implementation has half the
overhead and is considerably faster than the standard list class.
SGI Rope strings A highly efficient implementation of long strings. Useful for e-mail
text, word processing, etc.
SGI Hashed containers Hashed containers are not included in the STL. These SGI
hashed containers can be just what the doctor ordered.
Todd Veldhuizen articles (extreme.indiana.edu/~tveldhui)
Todd has been on the leading edge of the use of C++ templates for the sake of
efficiency, especially Template Metaprogramming. He’s also the lead programmer for
Blitz++, a computing library that makes use of many of his techniques.
Nathan Myers articles (www.cantrip.org)
Nathan is a prolific writer on many C++ topics. Be sure to see his article on the Empty
Member optimization.
Guru of the Week (www.cntc.com/resources)
Problems and solutions to many common C++ issues, as presented by one of the lead
programmers of PeerDirect, Inc, including Avoiding Temporaries, Inline, Reference
Counting and Copy On Write.
High Performance Game Programming in C++ (www.ccnet.com/~paulp/HPGP/HPGP.html)
Paul Pedriana’s talk from the 1998 CGDC. A great discussion of C vs. C++ and C++
performance issues.
C++: Efficiency, Alan Clarke (www.ses.com/~clarke/Efficiency.html) A collection of C++ tips.
Object-Oriented System Development (gee.cs.oswego.edu/dl/oosdw3/ch25.html) The
Performance Optimization chapter from Dennis de Champeaux’s book.
High Performance C++ (oscinfo.osc.edu/software/KAI/doc/UserGuide/chapter_3.html) A
discussion of C++ and compiler optimizations from KAI, the leading vendor of optimizing C++
compilers for high-end systems (Cray, SPARC, SGI, etc.)
NULLSTONE Compiler Optimization Categories (www.nullstone.com/htmls/category.htm) A
good list and description of C/C++ compiler optimization techniques.
Optimization of Computer Programs in C (www.ontek.com/mikey/Optimization.html) A
dated but useful paper by Michael Lee on C-specific optimizations
Code Optimization – The Why’s and How’s (seka.nacs.net/~heller/optimize) A collection of
pages by Jettero Heller discussing code optimization
Performance Engineering: A Practical Approach to Performance Improvement
(www.rational.com/sitewide/support/whitepapers/dynamic.jtmpl?doc_key=307) A discussion of
profiling and bottlenecks by Rational Software
Maui High Performance Computing Center: Performance Tuning
(www.mhpcc.edu/training/workshop/html/performance) Performance tuning on IBM UNIX
systems. Some details aren’t relevant, but most of the examples are platform independent.

C.4: Author
Email the author at: Pete.Isensee@WON.net or PKIsensee@msn.com
Author’s homepage: www.tantalon.com
Appendix D: C++ Optimization Summary

D.1: Design Considerations


• Take advantage of STL containers
• Consider using references instead of pointers
• Consider two-phase construction
• Limit exception handling
• Avoid RTTI
• Prefer stdio to iostream
• Evaluate alternative libraries
• Do choose proper data structures, algorithms and program architecture

D.2: General Recommendations


• Use signed integers
• Use integers appropriate to the OS and platform
• Use consistent integer sizes to avoid costly conversions
• Use smaller floating-point representations if you can afford the loss of precision
• Avoid converting between floating-point and integers, and vice versa
• Don’t optimize as you go
• Automate profiling

D.3: “As You Go” Optimizations


• Pass class parameters by reference
• Postpone variable declaration as long as possible
• Prefer initialization over assignment
• Use constructor initialization lists
• Prefer op= operators over op alone (e.g. += instead of +)
• Prefer prefix operators
• Use explicit constructors
• Use throw() to indicate functions that don’t throw exceptions
• Use the fastcall calling convention (Microsoft compiler only)
• Use the novtable option for abstract classes (Microsoft compiler only)

D.4: Final Optimizations


• Don’t assume anything. Profile, optimize, profile, optimize, profile, . . .
• Consider inline functions
• Use the return value optimization
• Avoid virtual functions
• Return objects via reference parameters
• Consider per-class allocation
• Consider STL container allocators
• Empty member optimization
• Template metaprogramming
• Consider copy on write techniques

Acknowledgments

Special thanks to Brian Fiete, Melanie McClaire, Brian Ruud, the WON Viper and Titan teams,
the HyperBole X-Files engineering team, my favorite gurus Steve McConnell and Scott
Meyers, and my favorite girls Kristi and Ali.
