Detecting and Eliminating Potential Violation of Sequential Consistency for Concurrent C/C++ Programs

Yuelu Duan, Xiaobing Feng, Lei Wang, Chao Zhang
Key Laboratory of Computer System and Architecture,
Institute of Computing Technology, Chinese Academy of Sciences
Beijing, China
Email: {duanyuelu,fxb,wlei,zhangchao}@ict.ac.cn
Pen-Chung Yew
Department of Computer Science and Engineering,
University of Minnesota at Twin-Cities
Minneapolis, USA
Email: yew@cs.umn.edu
Abstract: When a concurrent shared-memory program written with a sequential consistency (SC) model is run on a machine implemented with a relaxed consistency (RC) model, it can cause SC violations that are very hard to debug. To avoid such violations, programmers need to provide explicit synchronizations or insert fence instructions.

In this paper, we propose a scheme to detect and eliminate potential SC violations by combining Shasha/Snir's conflict graph and delay set theory with existing data race detection techniques. For each execution, we generate a race graph, which contains the improperly synchronized conflict accesses, called race accesses, and the race cycles formed by those accesses. As a race cycle would probably lead to a non-sequentially-consistent execution, we call it a potential violation of sequential consistency (PVSC) bug. We then compute the race delays of the race cycles, and suggest to programmers where to insert fences into the source code to eliminate the PVSC bugs. We further convert a race graph into a PC race graph, which improves cycle detection and race delay computation to O(n^2), where n is the number of race access instructions.

We evaluate our approach with the SPLASH-2 benchmarks, two large real-world applications (MySQL and Apache), and several multi-threaded Cilk programs. The results show that (1) the proposed approach can effectively detect PVSC bugs in real-world applications with good scalability; (2) it retains most of the performance of the concurrent program after inserting the required fence instructions, with a 6.3% performance loss on average; and (3) the additional cost of our approach over traditional race detection techniques is quite low, 3.3% on average.
Keywords: sequential consistency; data race detection; delay set; fence; relaxed memory model.
I. INTRODUCTION
The sequential consistency (SC) model [27] is inarguably the most intuitive and natural memory consistency model for programmers developing concurrent programs. However, the execution of such programs on a relaxed consistency (RC) architecture, on which some load/store instructions are allowed to be reordered within a processor for higher performance, can cause SC violations and produce incorrect results. To avoid such SC violations, programmers need to write data-race-free programs using appropriate synchronization, such as locks. However, such synchronizations could serialize concurrent program execution, and cause severe performance degradation in some cases. Programmers therefore often avoid such synchronizations as much as possible to achieve higher performance. A well-known example is Double Checked Locking (DCL) for the Lazy Initialized Singleton [33], as illustrated in Figure 1.a. Because these programs are not data-race-free, they might produce non-sequentially-consistent (non-SC) results (see Figure 1.b) when executed on machines with an RC model in which memory access reordering is allowed [28], [34].
Existing data race detection tools can help to detect and locate data races. However, data races in low-lock or lock-free programs, such as the DCL in Figure 1, are likely to be deliberately employed by programmers. Even though data races are intimately related to SC violations, not all data races cause SC violations on a particular RC model. Without a clear understanding of memory consistency models and a well-defined context, programmers would probably ignore the data race warnings generated by these detection tools.
One way to guarantee SC without locking is to insert memory ordering fences between shared memory accesses, delaying the issue of the next memory access until previous ones are completed, as illustrated in Figure 1.c. Sufficient insertion of fences can guarantee SC for the program. Although fence operations are cheaper than locks (they are more fine-grained), excessive use of fences still hurts performance. Making these programs execute both correctly and with high performance on relaxed machines requires proper insertion of memory fence instructions. Several existing schemes that employ program verification and compiler analysis could be used to tackle the fence insertion problem. However, as will be discussed in Section V, they have their own limitations and disadvantages.
In this paper, we combine Shasha/Snir's conflict graph and delay set theory [1] with existing data race detection techniques to insert fences that enforce an SC model on an RC platform. For each execution of a program, we build a race graph, which contains the detected data race accesses that are improperly synchronized and the race cycles formed by those accesses, as illustrated in Figure 2.
(a) Double Checked Locking (DCL) for the Lazy Initialized Singleton, in which a data race on _instance is deliberately employed:

Object* Object::getInstance() {
  if (!_instance) {
    lock(L);
    if (!_instance) {
      // three steps in new Object()
      tmp = malloc(sizeof(Object));
      tmp->field = 100;
      _instance = tmp;
    }
    unlock(L);
  }
  return _instance;
}

void useInstance() {
  Object* ins = Object::getInstance();
  int f = ins->field;
}

(b) A possible execution that violates sequential consistency: two threads execute useInstance() concurrently.

Thread 1 (T1):                        Thread 2 (T2):
ins1 = Object::getInstance();         ins2 = Object::getInstance();
f1 = ins1->field;                     f2 = ins2->field;

Reordering in T1:                     Read of the uninitialized field in T2:
4: tmp->field = 100;                  2: if (!_instance) {};
1: _instance = tmp;                   3: read _instance->field;

(c) Correcting DCL with two memory ordering fences:

Object* Object::getInstance() {
  tmp = _instance;
  fence; // prevent r-r reordering
  if (!tmp) {
    lock(L);
    if (!_instance) {
      tmp = malloc(sizeof(Object));
      tmp->field = 100;
      fence; // prevent w-w reordering
      _instance = tmp;
    }
    unlock(L);
  }
  return tmp;
}

Figure 1: The DCL example that potentially violates SC on RC models [33], [34].
Thread 1:                       Thread 2:
// _instance->field = 100;      // if (!_instance) {};
A: store addr1, 100;            C: load addr2, reg2;
// _instance = tmp;             // read _instance->field
B: store addr2, tmp;            D: load addr1, reg1;

Figure 2: Race cycle extracted from the DCL example. Program order edges run A→B and C→D; data race edges connect B with C and D with A. Given the race cycle A→B→C→D→A, it is possible to generate a non-SC execution, e.g., E = {B→C, D→A}.
A race cycle could lead to a possible SC violation, and is thus referred to as a potential violation of sequential consistency (PVSC) bug in the rest of the paper. We further convert the race graph into a more efficient PC race graph to make the approach scalable to large applications. Detecting PVSC bugs then reduces to finding all strongly connected components (SCCs) in the graph, which takes polynomial time O(n^2), where n is the number of race access instructions. We then compute the race delay set according to the SCCs. Finally, we use a straightforward fence insertion algorithm that places one fence instruction for each delay, and suggest to programmers where to insert fences into the source code. The inserted fences are respected by the compiler and enforced in hardware, thus eliminating the detected PVSC bugs.
Our proposed scheme has been implemented on top of the open-source data race detector Helgrind [36]. We apply it to a wide range of concurrent programs including the SPLASH-2 benchmarks, two real-world applications (MySQL and Apache), and several multi-threaded Cilk [35] programs. We identified and fixed several PVSC bugs that could cause SC violations in those applications, some of which have been confirmed by the developers. Further, experimental results show that the proposed scheme inserts fewer unnecessary fences than existing static compiler techniques, and the incurred overhead is negligible.
In short, our contributions are as follows:

- An effective and efficient scheme is proposed to detect potential violation of sequential consistency (PVSC) bugs in concurrent C/C++ programs. Our scheme can determine whether multiple data races as a whole could cause violations of SC - a step further than existing data race detection schemes, which detect and report only individual data races. We have discovered several PVSC bugs in large real-world applications, some of which have been confirmed by the developers. We select two of them to present and analyze in detail in Section IV-A. Such bugs are difficult to detect using existing approaches.

- Our approach to PVSC detection is scalable. By converting the race graph into a PC race graph, we are able to scale to large applications with low overhead. The additional cost of PVSC detection over data race detection is quite low (3.3% on average), making it easy to port to any data race detection tool.

- Our proposed approach requires fewer fence instructions to prevent PVSC bugs, and thus retains most of the performance of the original concurrent program. Using our scheme, the performance degradation for the SPLASH-2 programs is 6.3% on average compared to their original performance with compiler optimization flags turned on.
As our approach to detecting SC violations is based on data race detection schemes, it inherently suffers from the same limitations as those schemes. False negatives resulting from data race detection could lead to undetected PVSC bugs, and false positives would lead to unnecessary fence instructions. Progress in data race detection techniques could help to overcome these limitations.
The remainder of the paper is organized as follows. Section II describes the background of data race detection techniques, existing compiler techniques, and verification tools. Section III describes our data-race-based scheme to detect and eliminate PVSC bugs. Section IV analyzes the results of our scheme on the tested applications. Section V presents related work, and Section VI concludes the paper.
II. BACKGROUND
A. Sequential Consistency
In uniprocessor systems, programmers expect a load from a memory location to return the value of the last store to that location. On a multiprocessor system, this expectation can be extended intuitively and naturally to the sequential consistency model. In the sequentially consistent model, all memory accesses appear to be executed atomically and follow a total order, while the memory accesses of each execution thread follow the program order [27].
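As a concrete illustration (our own, not from the paper), consider the classic Dekker-style litmus test below. Under SC, the interleaving semantics forbid both threads from reading 0; a machine with a relaxed model may reorder each thread's independent store and load and produce exactly that outcome.

```c
/* Dekker-style litmus test: a minimal sketch. Under SC, r1 == 0 &&
 * r2 == 0 is impossible, since whichever load executes last must see
 * the other thread's store. On an RC machine that reorders the
 * independent store/load pair in each thread, both loads may return 0. */
#include <pthread.h>
#include <stdio.h>

int x = 0, y = 0;   /* shared locations */
int r1, r2;         /* results observed by each thread */

void *t1(void *arg) { x = 1; r1 = y; return NULL; }
void *t2(void *arg) { y = 1; r2 = x; return NULL; }

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, t1, NULL);
    pthread_create(&b, NULL, t2, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    if (r1 == 0 && r2 == 0)
        printf("non-SC outcome observed\n");  /* never under SC */
    return 0;
}
```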
B. Conflict Graph and Delay Set Computation
To guarantee sequential consistency of a concurrent program, we need not execute every instruction atomically in a total order, in which each instruction is delayed until the previous one completes. In fact, only a part of the delays is necessary. Shasha and Snir characterized the minimal set of delays required to preserve sequential consistency [1].

Two memory accesses conflict if they access the same memory location and at least one of them is a store (this definition is slightly more conservative than that in [1]; however, it does not affect correctness). We refer to U as the set of memory accesses during an execution of a concurrent program, C as the conflict relation on U, E as the execution order, and P as the program order. We also view P ∪ C as the conflict graph.

E is an orientation of C, and is consistent with P if P ∪ E can be extended to a total order, which means that P ∪ E contains no cycles. If E is consistent with P, then we get a sequentially consistent result from E. We refer to such an execution E as a sequentially consistent execution.
A critical cycle is a cycle in the conflict graph P ∪ C, which indicates that we might get an E that is not consistent with P. If the program order edges (P edges) of all the critical cycles are enforced by delays, we get an E of C that is sequentially consistent. We refer to such a delay relation as D; uDv indicates that u must complete before v is issued. The following lemma formally describes the behavior of memory access pairs in a delay relation:

Delay Lemma [1]. For any execution, E is consistent with D.

Further, if the delay relation D contains all P edges of all the critical cycles in P ∪ C, it enforces sequential consistency [1]. We also refer to all the relations in D as the delay set. A minimal set of D that enforces SC is called D_m.
Finding the minimal delay set D_m is difficult [3]. Instead, some compiler schemes [2], [3] try to find a delay set D that is sufficient to enforce sequential consistency. Some schemes, e.g., the scheme in [3], simply treat all memory accesses that might touch shared locations as conflict accesses, and put all pairs of successive conflict accesses into the delay set. As a result, such delay relations in the delay set D are likely to be redundant. Schemes that aim to reduce such redundancy in the delay set have been proposed for concurrent languages that employ simple synchronization and little aliasing, such as Titanium [2]. However, their techniques cannot be extended to C/C++ programs because the extensive use of pointers in such programs seriously complicates the analysis. In Section III, we describe our approach, which significantly reduces redundancy in D for concurrent C/C++ programs.
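The following sketch shows, under our own simplified representation (an Access record holding only a PC, and an emit callback), what the conservative successive-pair policy described above amounts to; it is an illustration, not the actual algorithm of [3].

```c
/* Conservative delay-set construction: a sketch. Every shared access
 * is treated as a conflict access, and each pair of successive
 * conflict accesses in one thread's program order becomes a delay. */
typedef struct { int pc; } Access;

void naive_delays(const Access *acc, int n,
                  void (*emit)(int from_pc, int to_pc)) {
    for (int i = 0; i + 1 < n; i++)
        emit(acc[i].pc, acc[i + 1].pc);   /* acc[i] D acc[i+1] */
}
```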
C. Preserving SC with fences
Modern commercial multiprocessors provide memory ordering fence instructions to enforce the delays that might otherwise be violated by the reordering allowed in the processors. Commercial architectures use various names and semantics for their fence instructions. For the sake of clarity, we use the definition in [4]: before a fence instruction executes, all previous instructions of that execution thread must be completed.
Different relaxed memory consistency models have their own ordering constraints [3]. We call the delays that are already enforced by a memory consistency model (often implemented in hardware) implicit delays; e.g., a store-store delay is implicit on the x86 model [38] because the x86 model enforces the order of two consecutive store operations in its memory consistency model. Implicit delays need not be enforced explicitly with fence instructions at the binary code level, but they must still be respected when programs are compiled from source code to binary.
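For instance, the following sketch (ours, assuming GCC-style inline assembly on x86) enforces a single store-store delay. The mfence orders the stores in hardware, and the "memory" clobber also acts as a compiler barrier, which is what keeps an implicit delay respected during compilation.

```c
/* Enforcing one delay data D flag on x86: a sketch. On x86 the
 * store-store order is an implicit delay, so the hardware fence is
 * redundant there, but the compiler barrier is still needed to stop
 * the compiler from reordering the two stores. */
int data, flag;

void publish(void) {
    data = 42;
    __asm__ __volatile__("mfence" ::: "memory");  /* fence + barrier */
    flag = 1;
}
```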
If we insert a fence instruction for each delay in the delay set D, then D is enforced properly and sequential consistency is guaranteed. A naive algorithm may insert more fences than needed to enforce D [4]. Optimization schemes have been proposed to reduce the number of fences for D [4], [3]. These optimizations can improve significantly on conservatively generated delay sets, but offer little improvement on highly optimized, manually generated delay sets [3]. Furthermore, fences that already exist in a concurrent program can also enforce the delays. Some verification techniques [25], [24] recognize such pre-existing fences and take them into account, which reduces the number of fence instructions that must be inserted.
III. DETECTION AND ELIMINATION OF POTENTIAL
VIOLATION OF SEQUENTIAL CONSISTENCY
As described above, existing compiler approaches rely on accurate concurrency analyses to achieve good performance for the fence-inserted program. However, most general C/C++ programs are difficult to analyze precisely for concurrency because of pointer aliasing. We instead detect and eliminate potential violations of SC using information collected during a program execution.
Figure 3: Race Graph and PC Race Graph. (a) A race graph: nodes are private, conflict, and race accesses of two threads (including Lock(L)/Unlock(L)), connected by program order edges, race edges, and conflict edges. (b) The corresponding PC race graph, where A-G are the program counter (PC) values of the access instructions.
A. Race Graph and Race Delay Set
A data race occurs when two conflict accesses are not separated by proper synchronization. This implies that conflict accesses do not cause a data race if properly synchronized, while a data race is always caused by conflict accesses. Correctly synchronized conflict accesses do not produce non-SC results [26]. Hence, we only need to detect data races, which involve a small subset of all conflict accesses because, in most well-written concurrent applications, few of the conflict accesses are improperly synchronized.

We refer to R as the race relation; uRv indicates that u and v form a data race, and that both are race accesses. Similar to the conflict graph, we refer to P ∪ R as the race graph, shown in Figure 3.a. The race graph is very similar to the conflict graph in [1]. The main difference is that edges between conflict accesses are removed while edges between race accesses are added.
Informally, we call a race cycle in the race graph a Potential Violation of Sequential Consistency (PVSC) bug, since its existence potentially leads to a non-SC execution (see Figure 2), similar to cycles in the conflict graph. To eliminate PVSC bugs, the minimal delay set has to be enforced. We call the delay relations indicated by the P edges of the race cycles the race delay set, denoted D_r. In fact, D_r is the subset of D that can be obtained from the critical cycles of the conflict graph by removing non-race accesses and their edges. We have the following theorem:
THEOREM 3.1. D_r enforces sequential consistency for the execution that generates it.

Proof: Suppose D_r does not enforce SC for the execution that generates it. Then there is a non-SC E which is consistent with D_r (by the delay lemma). Let D be the delay set obtained from the critical cycles. Since D enforces SC [1], it is only possible that E is inconsistent with D - D_r; otherwise, E would be sequentially consistent. However, the memory accesses associated with the delays in D - D_r are conflict accesses, not data race accesses. These memory accesses are properly synchronized and sequentially executed in any E. This contradicts the assumption.
As there are far fewer cycles in the race graph than in the conflict graph, the number of delays is reduced. This allows us to avoid most of the unnecessary delays introduced by other existing compiler techniques, especially for concurrent C/C++ programs.
B. PC Race Graph
The race graph is sufficient to detect SC violations. However, as it is produced from information collected during a program execution, its size and complexity render it impractical for large real-world applications. These large applications pose three major challenges:

- Large code size. Real-world concurrent applications, e.g., MySQL and Apache, have thousands of lines of code. A scheme that cannot handle such applications would not be practical. Most existing verification tools [25], [24] suffer seriously from this problem, and compiler analysis can also be sensitive to it. The dynamic data race detection scheme adopted in our approach is not sensitive to code size.

- Long program execution time. Many applications, especially those on servers, may execute for a long time once started. Meanwhile, potential violations of SC might appear only sporadically during the long program execution. Hence, the ability to detect PVSC scenarios after a long program execution is required. Also, as the race graph needs to record every executed instruction during the long program execution, the cost of its storage and the race cycle detection overhead could be substantial.

- Many threads. Since long-running server applications may continuously spawn and kill execution threads to start and stop server sessions, there can be a large number of execution threads. Maintaining the race graph across all the execution threads needs to be efficient.
To overcome the last two challenges, we convert the race graph into the PC race graph. The transformation follows two simple steps (illustrated in Figure 3):

1) Instructions from the same execution thread that have the same program counter are combined into a single node - a program counter node. Program order edges and data race edges are updated during this process. With this transformation, the graph size for a long-running application would not exceed O(PN), where P is the program size and N is the thread count.

2) Nodes from different threads with the same PC are also combined into the same node. After this step, the original race graph has been converted into the PC race graph, with the size further reduced to O(P).
Intuitively, all race cycles in the race graph can also be found in the PC race graph. Thus, all PVSC bugs in the race graph can be detected in the PC race graph. Furthermore, any delay relation found in the race graph can also be found in the PC race graph. That is, the PC race graph does not introduce additional false negatives.
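A minimal sketch of the merging, under our own data layout (a hash map keyed by PC; edge lists omitted): because the key is the program counter alone, dynamic instances within a thread (step 1) and nodes from different threads (step 2) collapse into one node, bounding the graph at O(P).

```c
/* PC race graph node lookup: a sketch, not the prototype's code. */
#include <stdint.h>
#include <stdlib.h>

#define NBUCKETS 4096

typedef struct PCNode {
    uintptr_t pc;          /* program counter of the access instruction */
    struct PCNode *next;   /* hash-bucket chain */
    /* program order and race edge lists omitted for brevity */
} PCNode;

static PCNode *buckets[NBUCKETS];

/* Return the unique node for a PC, creating it on first access. */
PCNode *pc_node(uintptr_t pc) {
    unsigned h = (unsigned)(pc % NBUCKETS);
    for (PCNode *n = buckets[h]; n != NULL; n = n->next)
        if (n->pc == pc) return n;
    PCNode *n = malloc(sizeof *n);
    n->pc = pc;
    n->next = buckets[h];
    buckets[h] = n;
    return n;
}
```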
C. Construction of PC Race Graph
As described in the previous section, the PC race graph consists of three parts: (1) data race edges, (2) program order edges of the race accesses in each thread, and (3) the program counter value of each race access.

Data race edges can be obtained by data race detection. Many schemes have been proposed for data race detection, such as lock-set, happens-before, and hybrid schemes. Intuitively, the accuracy of data race detection affects the accuracy of PVSC detection. In Section IV-B, we evaluate these different schemes.
The program order edges of the race accesses can be obtained by dynamic program execution profiling. Because some compiler optimizations would change the order of certain memory accesses, distorting the original intent of the programmer and interfering with PVSC detection, no compiler optimization is allowed during PVSC detection. However, this usually has minimal impact on the program's runtime performance because, after the PVSC bugs are identified and proper fences are added, the corrected program can be re-compiled with all of the optimization flags set to the originally intended level for high runtime performance. The fence instructions added to avoid PVSC bugs using our proposed scheme cause very little performance impact, as shown in Section IV-B.
The program counter value of each instruction can also be obtained during dynamic program profiling. Besides identifying memory access instructions, the program counter values also help the programmer to trace and analyze PVSC bugs at the source code level.
The overhead of updating the PC race graph during data race detection is usually quite low, because (1) data races are relatively rare in well-synchronized concurrent applications, and (2) adding a data race edge to the PC race graph takes only O(1) time for each detected data race pair.
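Continuing the sketch above, this is the O(1) hook we imagine a race detector invoking for each reported race pair; add_race_edge stands in for a hypothetical constant-time edge-list insertion and is not part of any real detector's API.

```c
/* Race-edge hook: a sketch that reuses pc_node from the previous
 * listing; add_race_edge is hypothetical. */
extern void add_race_edge(PCNode *u, PCNode *v);  /* O(1) list insert */

void on_data_race(uintptr_t pc_u, uintptr_t pc_v) {
    PCNode *u = pc_node(pc_u);   /* expected O(1) lookup or creation */
    PCNode *v = pc_node(pc_v);
    add_race_edge(u, v);         /* one edge per detected race pair */
}
```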
D. Detecting Race Cycle and Computing Delay Set
Since the data race edges and program order edges of the race graph are maintained in the PC race graph, the critical cycles in the race graph also appear as cycles in the PC race graph. Thus, we can detect the race cycles of the race graph by detecting cycles in the PC race graph. Furthermore, as every cycle in a graph lies within a strongly connected component (SCC), this problem can be solved by finding all SCCs in the PC race graph in O(n^2) time, where n is the number of race access instructions. Similarly, we put all of the program order edges within the SCCs into the race delay set D_r.
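A compact sketch of the cycle detection step, using Tarjan's SCC algorithm over a simplified array-based adjacency representation (ours, not the prototype's): Tarjan runs in time linear in nodes plus edges, which is O(n^2) here since the PC race graph can have O(n^2) edges. P edges inside an SCC that also contains a race edge go into the race delay set.

```c
/* Tarjan's strongly connected components on the PC race graph:
 * a sketch. scc_id[v] identifies the component containing node v. */
#include <string.h>

#define MAXN 1024
int nnodes;
int nadj[MAXN], adj[MAXN][MAXN];   /* out-edges: P and race edges */
int idx[MAXN], low[MAXN], onstk[MAXN], scc_id[MAXN];
int counter, nscc, stk[MAXN], top;

static void strongconnect(int v) {
    idx[v] = low[v] = counter++;
    stk[top++] = v; onstk[v] = 1;
    for (int i = 0; i < nadj[v]; i++) {
        int w = adj[v][i];
        if (idx[w] < 0) {                       /* tree edge */
            strongconnect(w);
            if (low[w] < low[v]) low[v] = low[w];
        } else if (onstk[w] && idx[w] < low[v]) {
            low[v] = idx[w];                    /* back/cross edge */
        }
    }
    if (low[v] == idx[v]) {                     /* v roots an SCC */
        int w;
        do { w = stk[--top]; onstk[w] = 0; scc_id[w] = nscc; } while (w != v);
        nscc++;
    }
}

void find_sccs(void) {
    memset(idx, -1, sizeof idx);   /* -1 marks "unvisited" */
    counter = nscc = top = 0;
    for (int v = 0; v < nnodes; v++)
        if (idx[v] < 0) strongconnect(v);
}
```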
Thread 1            Thread 2
A1: store x         A2: store x
B1: store y         B2: store y
C1: store z         C2: store z
D1: store flag      D2: store flag

Figure 4: The delay set computed using the algorithm in [5] (program order edges and delays within each thread; conflict/race edges between the threads). At least four of the delays (A1→B1, B1→C1, B2→C2, C2→D2) are redundant.

However, a cycle in the PC race graph may not necessarily correspond to a critical cycle in the race graph, and thus the computed delay set can be redundant. There are mainly
two sources of false positives introduced by the PC race graph: (1) program order cycles, and (2) non-critical cycles originally in the race graph. The first source can be eliminated by treating data race edges and program order edges differently in the SCC detection algorithm; that is, SCCs that do not contain any race edges are not considered PVSC bugs. However, the second source of false positives cannot be avoided easily. In [5], it is claimed that the algorithm computes the minimal delay set, but in some cases non-critical pairs can also be included, as illustrated in Figure 4. Because all P edges in SCCs are included in the delay set, the non-critical P edges (A1, B1), (B1, C1), (B2, C2), and (C2, D2) are all redundantly included. Currently, our prototype suffers from the same inaccuracy.

False positives in the delay set lead to redundant fences that can hurt performance. Fortunately, as our experimental results in Section IV-B indicate, the delays and fences introduced by these false positives cause minimal performance degradation.
E. Eliminating Potential Violation of SC
Once the delays are computed, we can generate fence insertion suggestions for the programmer. For simplicity, we use a naive fence insertion algorithm (sketched below) that inserts one fence instruction for each delay. As mentioned in Section II-C, although implicit delays are enforced by hardware and thus need not be enforced explicitly by fence instructions, we still use fence instructions to limit reordering by compiler optimizations, as most existing compilers still do not consider any memory consistency model. Optimizations that reduce the number of fence instructions for a given delay set are complementary to our work.
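The naive policy can be summarized by the sketch below (our illustration; the Delay record and the output format are assumptions, not the prototype's actual interface): each delay u D v becomes one suggestion telling the programmer where in the source a fence should go.

```c
/* One fence suggestion per delay: a sketch. */
#include <stdio.h>

typedef struct {
    const char *file;      /* source file of the racing accesses */
    int line_u, line_v;    /* source lines of the delayed pair u D v */
} Delay;

void suggest_fences(const Delay *delays, int n) {
    for (int i = 0; i < n; i++)
        printf("%s: insert a fence between lines %d and %d\n",
               delays[i].file, delays[i].line_u, delays[i].line_v);
}
```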
Our scheme is implemented on top of Helgrind [36]. We first compile the source code into a binary without any optimization. During the test execution, the tool dynamically generates a PC race graph and then detects PVSC bugs. Finally, it gives suggestions for inserting fences into the source code. The fences will be respected by the compiler and the hardware, thus eliminating the PVSC bugs discovered during the test execution. After fence insertion, the code is safe to be re-optimized by the compiler for production runs.
// sync0sync.ic
mutex_exit(mtx) {
  // release mtx; signal waiters
  release mutex;
  if (mtx->waiters != 0) {
    lock(wait_array);
    signal cells in wait_array;
    unlock(wait_array);
  }
}

// sync0arr.c
arr_rsv_cell(arr, mtx) {
  lock(arr);
  find an un-used cell in arr;
  cell->rsv_time = time();
  cell->object = mtx;   // may be reordered before the line above
  unlock(arr);
  return cell;
}

// sync0sync.c
mutex_spin_wait(mtx) {
 loop:
  try N times CAS(&mtx); if succeeds, return;
  c = arr_rsv_cell(wait_array, mtx);  // reserve a cell
  mtx->waiters = 1;
  try 4 times CAS(mtx); if succeeds, free c; return;
  wait for signal;
  goto loop;
}

// srv0srv.c
monitor_thread() {
  int fatal_count = 0;
 loop:
  if (sync_array_long_waits())
    fatal_count++;
  if (fatal_count > 5)
    crash server;
  sleep for 10 sec;
  goto loop;
}

// sync0arr.c
bool sync_array_long_waits() {
  bool fatal = false;
  for each cell in wait_array {
    if (cell->object != NULL)
      if (time() - cell->rsv_time > FATAL_TIME)
        fatal = true;
  }
  return fatal;
}

Figure 5: (a) mutex_spin_wait and (b) the error monitor thread in MySQL (pseudocode; the file-name comments indicate where each routine lives).
IV. EXPERIMENTAL RESULTS
A. Detected PVSC Bugs
We evaluate our scheme on the SPLASH-2 benchmarks, two real-world applications (MySQL and Apache), and several multi-threaded Cilk-5 [35] programs. We find that PVSC bugs fall into two categories: (1) failure to consider the reordering allowed by relaxed consistency models, and (2) insufficient fence insertion. We present two examples, one for each category, in the following two subsections.
1) Server Error Monitoring Thread in MySQL: MySQL is a popular concurrent database server application. One of the synchronization operations it implements is mutex_spin_wait. As illustrated in Figure 5.a, when a thread calls mutex_spin_wait to get a mutex but fails in the first N rounds of trying, it reserves a cell and sets the reservation time in the wait_array (named sync_primary_wait_array in the original code). It sets mtx->waiters to indicate that there is a thread waiting on this mutex. It then tries a few more times (as the owner might release mtx just after it sets mtx->waiters, in which case the thread would miss the releasing thread's signal and wait infinitely). If it fails to get the lock again, it starts to wait for the release signal. A thread releasing a mutex signals all threads waiting for that mutex.
The error monitor in the server, illustrated in Figure 5.b, asynchronously checks the wait_array. If a cell has been reserved by a thread for too long, it increases the fatal count. If the fatal count exceeds five, the server crashes itself. However, the programmer did not realize that even if cell->object is not NULL, cell->rsv_time might hold an old value if the two assignments cell->rsv_time = time() and cell->object = mtx were reordered by the compiler or hardware. In this case, the monitor thread would see an old, fatal reservation time for a cell, increasing the chance that the server crashes itself.
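One possible repair (our sketch, not necessarily the developers' actual patch) is to order the two stores with a write-write fence so the monitor can never observe a non-NULL object paired with a stale reservation time. The Cell type and fence macro here are illustrative:

```c
/* Ordering the timestamp store before the object store: a sketch.
 * On x86, where store-store order is implicit, a compiler barrier
 * suffices; weaker models need a real fence instruction here. */
#include <time.h>

#define STORE_STORE_FENCE() __asm__ __volatile__("" ::: "memory")

typedef struct {
    volatile time_t rsv_time;
    void *volatile object;
} Cell;

void reserve_cell(Cell *cell, void *mtx) {
    cell->rsv_time = time(NULL);
    STORE_STORE_FENCE();      /* prevent w-w reordering */
    cell->object = mtx;       /* never visible before rsv_time */
}
```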
static inline void Cilk_lock(Cilk_lockvar v) {
  while (Cilk_xchg((int *)v, 1) != 0) {   // CAS-like
    while (v[0]) ;  /* spin using only reads - reduces bus traffic */
  }
}

static inline void Cilk_unlock(Cilk_lockvar v) {
  Cilk_membar_StoreStore();  /* prevents store-store reordering */
  v[0] = 0;
}

Initially, a = b = 0;

Thread 1:              Thread 2:
Cilk_lock(l);          Cilk_lock(l);
a = 1;                 b = 1;
R1 = b;                R2 = a;
Cilk_unlock(l);        Cilk_unlock(l);

{R1=1, R2=1} is a non-sequentially-consistent result, but it is possible if R1=b is reordered out of the critical section.

Figure 6: (a) The implementation of Cilk_lock and Cilk_unlock, and (b) an example program that leads to a PVSC bug.
2) Cilk 5.4.6 Cilk_unlock() implementation: Cilk [35] is a multi-threaded language. The Cilk concurrency library provides customized lock operations, such as Cilk_lock and Cilk_unlock, as illustrated in Figure 6.a. Cilk_unlock places a store-store fence before resetting the lock variable, to prevent memory access reordering. However, our prototype detects that this is insufficient to guarantee sequential consistency, even for a program that properly uses Cilk_lock and Cilk_unlock, as shown in Figure 6.b. If thread 1 gets the lock first, {R1=0, R2=1} is the only correct result. However, since a store-store fence does not prevent load-store reordering, on an RC architecture, e.g., a DEC Alpha processor, the load operation R1=b could be reordered after the store operation v[0]=0 in Cilk_unlock(), and thus the non-SC result {R1=1, R2=1} is possible. This problem was recently revealed by our prototype, and has been confirmed by the developer.
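Under our reading of the bug, one fix is to strengthen the barrier in Cilk_unlock to a full fence, so that the loads inside the critical section also complete before the releasing store. The sketch below uses GCC's __sync_synchronize as the full barrier; that choice is ours, not Cilk's actual API or official fix:

```c
/* A sketched repair for Cilk_unlock: a full fence orders the
 * preceding loads (e.g., R1 = b) as well as the stores before the
 * lock-releasing store v[0] = 0. */
typedef volatile int Cilk_lockvar[1];

static inline void Cilk_unlock(Cilk_lockvar v) {
    __sync_synchronize();   /* full memory barrier (GCC builtin) */
    v[0] = 0;               /* release the lock */
}
```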
3) Bug summary: Apart from the above two bugs, we have detected another two PVSC bugs in MySQL, three in Apache, and several in the SPLASH-2 programs, as summarized briefly in Table I. These results demonstrate the validity of our approach.
B. Impact of fence insertion on performance
1) Evaluation methodology: To see the impact of fence insertion on program performance using our approach, we adopt it into three race detection algorithms. We compare their performance with a scheme that uses compiler analysis to insert fences, on nine of the SPLASH-2 benchmarks. We replace the user-defined barrier operations with our own recognizable versions. In test runs, we disable compiler optimizations to preserve the original intention of the programmers.
Table I: A brief summary of some detected PVSC bugs in the evaluated applications.

- MySQL 5.0.x, sql/sql_class.cc, add_to_status(): Inconsistent thread status recording. add_to_status() was not properly synchronized, so the resulting status vector can be inconsistent.
- MySQL 5.0.x, sql/slave.cc, handle_slave_io(): mi->slave_running=0 could become visible to other threads before the cleanup is completed, causing an assertion failure during slave shutdown.
- httpd 2.2.x, modules/cache/mod_cache.c, cache_store_content(): store_header() might become visible to other threads before store_body(), so mod_cache might serve old content even though new content has been fetched.
- httpd 2.2.x, prefork/prefork.c, ap_mpm_run(): restart_pending=shutdown_pending=0; might become visible to child threads after set_signal(), so if httpd receives SIGTERM while child processes are being spawned, the signal will be ignored.
- barnes, load.c, hackcofm(): Done(p) might become visible to other threads before POS(p), etc.
- raytrace, taskman.c, putjob(): The fields of a job might become visible to other threads after the job pointer.

- No Fence: No fence instruction is inserted. Since violations of SC seldom appear in practice, the executions in our tests are all correct.
- Static analysis: Since static concurrency analysis is difficult to make precise for C/C++ programs, we approximate the best fence insertion achievable by compiler analysis. We implement a conflict detector to find pairs of conflicting statements that access the same memory location between consecutive barrier operations, and execute the detector until no more conflicting statements are found after a certain period of time. The result mimics the best that can be achieved by a static concurrency analysis [2].

- Lock-set: We adopt the lock-set algorithm by modifying the race detector, to see the effect of its false positives on PVSC detection.

- Happens-Before: We adopt an approximated happens-before race detection scheme similar to [37], which detects slightly more than the true happens-before data races for the SPLASH-2 programs. Since the happens-before algorithm is theoretically the most accurate, we evaluate it to see the accuracy potential of our approach.

- Hybrid: The original algorithm adopted in Helgrind [36], which combines the lock-set and happens-before detection algorithms.
As our prototype can only detect PVSC bugs that appear in a particular execution, we keep testing until no new fence is inserted after a pre-determined period of program execution with various inputs. We then determine whether the inserted fences are sufficient to guarantee SC by manually checking whether the race accesses detected by the happens-before detector are covered by enough fences. This verification is currently impractical for large applications. However, we are now evaluating how much redundancy can be reduced by our scheme; the problem of guaranteeing sufficiency is left as future work.
2) Number of inserted fence instructions: As illustrated in Figures 7 and 8, the numbers of static and dynamic fences inserted by our approach are substantially reduced compared to static analysis for most benchmarks. Among the three race detection algorithms, happens-before inserted the fewest fences and lock-set the most. Five of the ten programs have no fences inserted when using the happens-before PVSC detector.

Figure 7: Static fence counts using different analyses (log scale; Static Analysis, Lock-set, Hybrid, Happens-Before).

Figure 8: Dynamic fence counts using different analyses (log scale; Static Analysis, Lock-set, Hybrid, Happens-Before).

Figure 9: Performance on a 4-core Intel Xeon processor (normalized execution time for No_Fence, Static Analysis, Lock-set, Hybrid, and Happens-Before).
3) Execution time after PVSC elimination: To see the performance impact of our fence insertion scheme, we evaluate the performance loss of the programs after PVSC elimination, using the originally intended level of compiler optimization flags. We test the SPLASH-2 programs with 4 threads on three different multiprocessors: a 4-core Intel Xeon (2.33 GHz, using xchgl as the fence), a 4-core AMD Opteron (2.0 GHz, using mfence), and a 2-processor SMP Itanium (1.6 GHz, using xchg). We compile the programs with GCC 3.4.6 using the -O2 option both before and after fence insertion, and run them on Red Hat Linux (AS4.0 for Xeon and Opteron, 9.0 for Itanium).

Figure 10: Performance on SMP Intel Itanium (normalized execution time for No_Fence, Static Analysis, Lock-set, Hybrid, and Happens-before).

The performance comparison on the Intel Xeon and Itanium processors is illustrated in Figures 9 and 10, respectively. The data for the AMD Opteron is quite similar to that of the Xeon, so we omit it for the sake of space. The
performance slowdown of the scheme using static analysis is the most serious for seven of the ten programs because of the high dynamic counts of fence instructions. For the remaining three applications, the compiler with static analysis works relatively well, because the inserted fences mostly fall on non-critical paths and their overhead has very little impact on the total execution time. In contrast, our race-detection-based approach works well for almost all applications, except fft and lu with the lock-set algorithm. We find that in these two programs the fences are largely inserted in hot functions, and the high dynamic counts degrade performance. The average slowdown of the hybrid scheme is 6.3% on the Xeon, 6.0% on the Opteron, and 2.1% on the Itanium SMP.
Another interesting finding is that on the Itanium SMP, the performance loss is nearly half of that on the Intel Xeon or AMD Opteron. After examination, we find that the Itanium processor has a lower CPU frequency, while the absolute time cost per fence instruction is about the same as on the Xeon or Opteron; thus the additional cost of fences on the Itanium is smaller percentage-wise compared to the overall execution time. This result suggests that redundancy in fence insertion will cause greater performance loss as processors get faster.
C. Overhead evaluation
Our main approach is to base our PVSC bug detection scheme on existing data race detection techniques. The results in Figure 11 show that the additional cost of PVSC bug detection over the base race detection is very low: 5.1% on average for lock-set and 3.3% for hybrid on the SPLASH-2 benchmarks. This is somewhat expected because constructing the PC race graph is not expensive after data races are detected, and finding cycles in the PC race graph
requires only polynomial time. Thus, even if a more optimized race detector than Helgrind were used, the additional overhead would still be low.

Figure 11: Cost of PVSC detection over race detection (normalized detection time for No_PVSC Detection, Lock-set, and Hybrid).

Table II: False positives of the PVSC detection scheme in the 10 SPLASH-2 benchmarks and the two real applications. LS is short for lock-set, Hbr for hybrid, and HB for happens-before. We determine each false positive by manually examining the source code.

Prog.      LS   Hbr  HB      Prog.      LS   Hbr  HB
MySQL      408  272  160     Apache     184  106  94
water-ns   81   50   0       ocean      573  124  0
water-sp   42   37   0       fft        73   73   0
barnes     116  64   25      fmm        444  178  6
raytrace   36   23   1       cholesky   140  56   8
lu         22   12   0       radix      22   15   2
One thing worth mentioning is that the PVSC detection overhead using lock-set is slightly higher than that using the hybrid race detection algorithm. The reason is that the former has a higher race detection rate but a lower base overhead; thus, the PVSC detection work contributes a larger percentage of the overall detection time under lock-set than under the hybrid scheme. As our approximate implementation of the happens-before algorithm can distort the execution time abnormally, we do not report its PVSC detection overhead. A full implementation is left as future work, but its PVSC detection overhead should not differ much, as the additional cost would still be quite low.
D. False Positives
Table II shows the number of false positives introduced by the different race detection techniques on the SPLASH-2 benchmarks and the two real applications. As we can see, the happens-before scheme inserts significantly fewer unnecessary fences than the lock-set scheme for the SPLASH-2 programs. There are two reasons. First, the lock-set algorithm generally suffers from more false positives; as a result, the corresponding PVSC detector finds more false PVSC bugs. Second, the SPLASH-2 benchmarks heavily use barrier synchronizations, which cannot be handled by the lock-set algorithm, worsening the situation. However, for MySQL and Apache, the disparity in false positives is rather small, since they mostly use lock synchronizations that are handled well by both techniques.
hash_delete(HASH *h, byte *r) {
  blength = h->blength;
  data = &h->array.buf;
  pos = data + hash_mask(r);
  gpos = 0;
  while (pos->data != r) {
    gpos = pos;
    pos = data + pos->next;
  }
  if (--(h->records) < h->blength >> 1)
    h->blength >>= 1;
  lastpos = data + h->records;
  empty = pos;
  empty_idx = (uint)(empty - data);
  if (gpos) {
    tmp = pos->next;
    gpos->next = tmp;
  } else {
    empty_idx = pos->next;
    empty = data + empty_idx;
    tmp = empty->data;
    pos->data = tmp;
    pos->next = tmp;
  }
  tmp = empty->next;
  if (array->elements) {
    --h->array->elements;
    return (h->array.buf +
            h->array.elements * h->array.size);
  }
  return 0;
} // end hash_delete

Thread 1:                        Thread 2:
pthread_mutex_lock(L1);          pthread_mutex_lock(L2);
hash_delete(&table1, r1);        hash_delete(&table2, r2);
pthread_mutex_unlock(L1);        pthread_mutex_unlock(L2);

Figure 12: (a) hash_delete code from MySQL 5.0.2/hash.c and (b) the threads that execute hash_delete. With a conservative compiler analysis, at least 20 fences would be inserted. Our scheme does not insert any fence since no race cycle is found.

Another cause of false positives for both techniques comes from the impreciseness of our delay computation algorithm, which is discussed in Section III-D. For example, among
the 30 fences inserted by the happens-before algorithm in barnes, 25 are redundant for this reason. Although the number of false positives seems rather high, the performance loss is tolerable even for lock-set. This is because the dynamic count of fences is actually rather low compared to the total memory access count, and thus the impact of the dynamic fences is small percentage-wise compared to the overall execution time. This result indicates that it is still fairly affordable to insert all the fences generated by our scheme even without further optimization.
V. RELATED WORK
Language Memory Model Emerging language-level
memory models, such as Java Memory Model [39] and
C++ Concurrency Model [26], suggest programmers to use
volatiles or atomics instead of explicit fences to
impose ordering. These models, however, are still under
development [40] and are not supported by most compilers
yet. Further, even if they become widely adopted, our
tool could still help programmers to identify data variables
that should be marked as volatile or atomic, e.g. by
marking the variables accessed in race cycles as volatile.
Data Race Detection. Previous work in data race detection can be divided into dynamic and static approaches. Dynamic detection includes lock-set [7], [13], [14], happens-before [9], [10], [12], and hybrid schemes using both [8], [15], [6], [20]. Some work gives special consideration to weak memory models [19]. Static data race detection techniques generally require type-safe systems [16], [17]. Tools have also been developed to classify data races [11], [21]. However, as discussed in Section I, data race detectors alone do not directly help in PVSC detection and elimination.
Verification. Verification tools [25], [24] aim at inserting fence instructions accurately. These tools take a concurrent program and a relaxed memory consistency model, e.g., TSO [24], as inputs, then enumerate all possible execution patterns and simulate them according to the memory consistency model. Fences can then be inserted according to the executions that lead to non-SC results. Verification tools work well for relatively small applications that involve a small number of memory accesses. However, even with some proposed optimization techniques [24], they still cannot handle large applications with many shared-memory accesses.
Compiler Analysis. Compiler techniques [2], [3] statically analyze a concurrent program and identify all possible concurrent accesses to shared memory locations. Then, primarily based on Shasha/Snir's algorithm [1], a delay set is computed. Finally, fences are inserted (with some possible optimizations [4], [3]) according to the delay set. Compiler approaches can be quite effective for strongly-typed programs with simple synchronization support [2]. However, they can be quite conservative for general concurrent C/C++ programs, which are hard to analyze statically because of pointer aliasing and more complex synchronization schemes. As shown in Figure 12, a compiler would identify at least 20 possible delays and fences if it could not figure out that hash_delete is actually correctly synchronized with locks by the callers from different threads. The unnecessary fences could badly hurt performance.
Other concurrency bug detection schemes. Atomicity violation (serializability violation) detection has been studied in recent years [29], [30]. MUVI [32] identifies correlated variables and can detect concurrency bugs associated with different variables. However, as seen in Section I, the nature of PVSC bugs is different from that of atomicity violation bugs, so these tools cannot help. Although PVSC bugs are not characterized in [31], we believe they are important due to the subtlety of, and difficulty in detecting, such bugs.
VI. CONCLUSION
In this paper, we proposed an effective and efficient scheme to detect and eliminate bugs called potential violations of sequential consistency (PVSC) using existing data race detection techniques. A PVSC bug refers to a set of data races that might lead to a non-sequentially-consistent execution, and it can be eliminated by inserting fences.
Compared with static compiler analysis schemes, our approach has less impact (6.3% or less) on the performance of the original concurrent programs because unnecessary fences are substantially reduced. Compared with existing verification tools, our approach is more scalable. With our implemented prototype, we have detected and eliminated PVSC bugs in real-world software, including MySQL, Apache, the SPLASH-2 benchmarks, and Cilk programs. Moreover, the cost of our scheme over race detection is low, 3.3% on average.
Our approach inherently suffers from the limitations of data race detection. However, as data race detection techniques improve, our approach will show even more potential, since it requires only a small extension to them.
ACKNOWLEDGMENT
This work is supported by a project of the National Basic Research Program of China (No. 2005CB321602), a project of the National Natural Science Foundation of China (No. 60736012), and a project of the National High Technology Research and Development Program of China (No. 2007AA01Z110).
REFERENCES
[1] D. Shasha, M. Snir. Efficient and correct execution of parallel programs that share memory. ACM Trans. Program. Lang. Syst., 10(2):282-312, 1988.
[2] A. Kamil, J. Su, K. Yelick. Making Sequential Consistency Practical in Titanium. In Proc. of the ACM/IEEE SC 2005 Conf. on Supercomputing, 2005.
[3] X. Fang, J. Lee, S. P. Midkiff. Automatic Fence Insertion for Shared Memory Multiprocessing. In Proc. of the Intl. Conf. on Supercomputing, 2003.
[4] J. Lee, D. A. Padua. Hiding Relaxed Memory Consistency with a Compiler. In Proc. of the Intl. Conf. on Parallel Architectures and Compilation Techniques, 2000.
[5] W. Y. Chen, A. Krishnamurthy, K. Yelick. Polynomial-Time Algorithms for Enforcing Sequential Consistency in SPMD Programs with Arrays. In Languages and Compilers for Parallel Computing, 2003.
[6] Y. Yu, T. Rodeheffer, W. Chen. RaceTrack: Efficient Detection of Data Race Conditions via Adaptive Tracking. In 20th ACM Symposium on Operating Systems Principles, 2005.
[7] J.-D. Choi et al. Efficient and precise data race detection for multithreaded object-oriented programs. In Programming Language Design and Implementation, 2002.
[8] R. O'Callahan, J.-D. Choi. Hybrid Dynamic Data Race Detection. In Principles and Practice of Parallel Programming, 2003.
[9] A. Dinning, E. Schonberg. An empirical comparison of monitoring algorithms for access anomaly detection. In Principles and Practice of Parallel Programming, 1990.
[10] R. H. B. Netzer, B. P. Miller. Improving the accuracy of data race detection. In Principles and Practice of Parallel Programming, 1991.
[11] S. Narayanasamy, Z. Wang, J. Tigani, A. Edwards, B. Calder. Automatically classifying benign and harmful data races using replay analysis. In Programming Language Design and Implementation, 2007.
[12] D. Perkovic, P. J. Keleher. Online data-race detection via coherency guarantees. In Operating Systems Design and Implementation, 1996.
[13] S. Savage, M. Burrows, G. Nelson, P. Sobalvarro, T. Anderson. Eraser: A dynamic data race detector for multithreaded programs. ACM Trans. on Computer Systems, 1997.
[14] C. von Praun, T. R. Gross. Object race detection. In Object-Oriented Programming, Systems, Languages and Applications, 2001.
[15] E. Pozniansky, A. Schuster. Efficient on-the-fly data race detection in multithreaded C++ programs. In Principles and Practice of Parallel Programming, 2003.
[16] C. Boyapati, R. Lee, M. Rinard. Ownership types for safe programming: Preventing data races and deadlocks. In Object-Oriented Programming, Systems, Languages and Applications, 2002.
[17] C. Flanagan, S. N. Freund. Type-based race detection for Java. In Programming Language Design and Implementation, 2000.
[18] K. Gharachorloo, P. B. Gibbons. Detecting violations of sequential consistency. In Symposium on Parallel Algorithms and Architectures, 1991.
[19] S. V. Adve, M. D. Hill, B. P. Miller, R. H. B. Netzer. Detecting data races on weak memory systems. In Intl. Symposium on Computer Architecture, 1991.
[20] M. Prvulovic. CORD: Cost-effective (and nearly overhead-free) Order-Recording and Data race detection. In High Performance Computer Architecture, 2006.
[21] M. Prvulovic, J. Torrellas. ReEnact: Using thread-level speculation mechanisms to debug data races in multithreaded codes. In Intl. Symposium on Computer Architecture, 2003.
[22] S. L. Min, J.-D. Choi. An efficient cache-based access anomaly detection scheme. In Architectural Support for Programming Languages and Operating Systems, 1991.
[23] J.-D. Choi, S. L. Min. Race Frontier: Reproducing Data Races in Parallel Program Debugging. In Principles and Practice of Parallel Programming, 1991.
[24] S. Burckhardt, M. Musuvathi. Effective Program Verification for Relaxed Memory Models. In Computer Aided Verification, 2008.
[25] S. Burckhardt, R. Alur, M. M. K. Martin. CheckFence: checking consistency of concurrent data types on relaxed memory models. In Programming Language Design and Implementation, 2007.
[26] H. J. Boehm, S. V. Adve. Foundations of the C++ Concurrency Memory Model. In Programming Language Design and Implementation, 2008.
[27] L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Trans. on Computers, 1979.
[28] S. V. Adve, K. Gharachorloo. Shared Memory Consistency Models: A Tutorial. IEEE Computer, 1995.
[29] S. Lu, J. Tucek, F. Qin, Y. Y. Zhou. AVIO: Detecting atomicity violations via access interleaving invariants. In Architectural Support for Programming Languages and Operating Systems, 2006.
[30] M. Xu, R. Bodik, M. Hill. A serializability violation detector for shared-memory server programs. In Programming Language Design and Implementation, 2005.
[31] S. Lu, S. Park, E. Seo, Y. Y. Zhou. Learning from mistakes - a comprehensive study on real world concurrency bug characteristics. In Architectural Support for Programming Languages and Operating Systems, 2008.
[32] S. Lu, S. Park, C. Hu, X. Ma, W. Jiang, Z. Li, R. Popa, Y. Y. Zhou. MUVI: automatically inferring multi-variable access correlations and detecting related semantic and concurrency bugs. In Symposium on Operating Systems Principles, 2007.
[33] D. Schmidt, T. Harrison. Double-checked locking: an optimization pattern for efficiently initializing and accessing thread-safe objects. In Programming Language Design and Implementation, 1996.
[34] The "Double-Checked Locking is Broken" Declaration. http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html.
[35] M. Frigo, C. E. Leiserson, K. H. Randall. The Implementation of the Cilk-5 Multithreaded Language. In Programming Language Design and Implementation, 1998.
[36] N. Nethercote, J. Seward. Valgrind: A Program Supervision Framework. Electr. Notes Theor. Comput. Sci., 2003.
[37] P. Zhou, R. Teodorescu, Y. Zhou. HARD: Hardware-assisted lockset-based race detection. In High Performance Computer Architecture, 2007.
[38] Intel 64 Architecture Memory Ordering White Paper. http://developer.intel.com/products/processor/manuals/318147.pdf.
[39] J. Manson, W. Pugh, S. V. Adve. The Java memory model. In Proc. Symposium on Principles of Programming Languages, 2005.
[40] D. Aspinall, J. Sevcik. Java Memory Model Examples: Good, Bad and Ugly. VAMP 2007 Proceedings, http://www.cs.ru.nl/~chaack/VAMP07/, 2007.