
CS3211 Project 1

Wu Wenqi A0124278A


PART 1
1.1 Hardware
The hardware comprises a lab machine and the Tembusu cluster. The lab machine runs on a quad-core Intel i7-2600 processor with Ubuntu 14.04 LTS.
Unlike concurrency, parallelism is bound by the number of processing cores available. The number of processes that can run in parallel cannot exceed the number of cores available, assuming there is no simultaneous multithreading. If a program uses more processes than there are cores, some cores will run multiple processes serially via context switching; this adds no parallelism, and no performance gain can be reaped from it.
For this write-up, it is assumed that all threads run on separate cores. As such, cores, processors, processes and threads all refer to the parallel execution of multiple threads on separate cores, and the terms may be used interchangeably.
The lab machine and the Tembusu cluster were given a matrix multiplication problem of varying size to run with different numbers of threads using the program mm-shmem. Matrix multiplication is easy to parallelize: the output matrix can be partitioned into p blocks of rows, where p is the number of processes, with each process computing its block independently.
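mm-shmem's source is not reproduced in this report, so the following is only a minimal sketch, under the assumption that it follows the usual shared-memory pattern: an OpenMP parallel for loop that hands each thread a block of rows of the output matrix, with the familiar triple-nested loop inside.

/* Minimal sketch of a row-partitioned shared-memory matrix multiplication.
   This is an assumption about the general shape of mm-shmem, not its actual source.
   a, b and c are n x n matrices stored in row-major order in shared memory. */
#include <omp.h>

void matmul(const double *a, const double *b, double *c, int n, int nthreads)
{
    /* each thread is handed a contiguous block of rows of c */
    #pragma omp parallel for num_threads(nthreads) schedule(static)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;   /* n*n*n multiply-adds in total */
        }
    }
}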
We define two experiments to be the same if the problems are of the same size and the same number of threads is used. Each experiment was run for 3 iterations, and the runtime was found to be abnormally long for some of the iterations (see Figures 1.1.1 and 1.1.2). It is not clear why, but a possible cause could be external processes running on the machine; for example, other users could have logged on to the same lab machine or Tembusu node and competed for CPU time slices.
Threads   Size 128   Size 256   Size 512   Size 1024   Size 2048
1         0.0186*    0.1193     0.9327     8.9612      102.9916
          0.0189     0.1078     0.9315*    8.8607*     91.2449*
          0.0209     0.1044*    0.9343     8.9087      103.3026
2         0.0094     0.0543*    0.4772*    4.5599*     52.5290
          0.0091*    0.0543     0.4793     4.5756      52.1254*
          0.0095     0.0556     0.4775     4.5607      52.8648
4         0.0066*    0.0522     0.4482     2.4031*     32.8559
          0.0075     0.0519     0.2521*    3.9799      28.2128
          0.0077     0.0309*    0.4431     3.9673      27.1650*
8         0.0042*    0.0304     0.2360*    2.1543      20.7708*
          0.0052     0.0294*    0.2383     2.1345*     21.1138
          0.0047     0.0715     0.2406     2.2571      31.3899
Figure 1.1.1: Result of timings (in seconds) of three runs of mm-shmem using the lab machine, for each combination of problem size (columns) and thread count (rows). The lowest of the three timings for each problem size and thread count is marked with *.

It has been discussed earlier that having more processes than available cores will not lead to added parallelism. However, some superscalar processors, such as the Intel i7-2600 on the lab machine, are able to carry out simultaneous multithreading, which allows instructions to be issued from multiple threads on a single core. This is why there is an improvement in runtime when the thread count doubles from 4 to 8 for mm-shmem, even though the processor on the lab machine has only 4 physical cores (see Figure 1.1.1).

Threads   Size 128   Size 256   Size 512   Size 1024   Size 2048
1         0.0323     0.1856     1.2223     8.9999      112.2641
          0.0332     0.1813*    1.2161     8.9704*     109.9655
          0.0320*    0.1846     1.1944*    8.9881      109.6149*
2         0.0171*    0.1064*    0.7077*    4.6102      55.3838
          0.0630     0.1101     0.7191     4.6227      55.1213
          0.0172     0.1123     2.5302     4.4945*     54.7532*
4         0.0090*    0.0620     0.3845     2.5900      27.4877*
          0.0090     0.0625     0.3731*    2.6020      79.8925
          0.0090     0.0619*    3.2281     2.4220*     28.2584
8         0.0045*    0.0348     0.2103     1.4655      66.6549
          0.0376     0.0336*    0.2101*    1.3957      15.3964*
          0.0046     0.0342     0.2101     1.3953*     15.6514
16        0.0046     0.0347     0.9479     1.1386      47.0230
          0.0046     0.0321*    0.1916     1.1300*     51.6956
          0.0045*    0.1957     0.1912*    1.1681      10.6101*
32        0.0114     0.0269     0.1638     1.1365      9.4150
          0.0047     0.1252     0.1498*    1.0927      9.3256*
          0.0036*    0.0268*    0.7528     1.0764*     9.4156
40        0.0038*    0.0272     0.1449     1.0062*     26.4869
          0.0044     0.0247*    0.1380*    1.0109      9.4006
          0.0041     0.1122     0.1420     1.0827      9.2925*
Figure 1.1.2: Tabulated result of time (in seconds) of three runs of mm-shmem using Tembusu, for each combination of problem size (columns) and thread count (rows). The lowest timing for a particular thread count and problem size is marked with *.

Simultaneous multithreading contributes to parallelism only when the execution unit is idle for the current thread, such as when there is a cache miss and instruction-level parallelism is unavailable (all independent instructions for the current thread have been exhausted). As such, it will not be as effective as running the threads on separate cores. If we assume that the OS preferentially assigns threads to separate physical cores first, this explains why the runtime is almost halved each time the number of threads doubles in the range of 1 to 4 threads, holding problem size constant; however, when the thread count doubles from 4 to 8, the runtime is only cut by at most 36% (see Figure 1.1.1).

Average runtimes for the experiments are not considered, since abnormally long runtimes are not indicative of the actual performance. Needless to say, the maximum runtime should not be considered either. Instead, the minimum runtime is considered, with the assumption that the minimum runtime is a good approximation of the optimal runtime under identical conditions.

The results in Figure 1.1.1 suggest that the runtime for a problem keeps improving with greater parallelism, but this is usually not the case. When mm-shmem is run on Tembusu, the timing improves less than proportionately for thread sizes of 16 and above; at a thread size of 40, the performance gain from additional threads is almost negligible (see Figure 1.1.2). At this stage, the overhead incurred from added parallelism offsets any benefit that can be gained from it. Such overhead costs include shared-memory contention, cache synchronization (for per-core caches) and competition for memory bandwidth. As parallelism increases, these overheads also increase.


1.2 Speedup
Amdahl's law:     S_latency(s) = 1 / ((1 - p) + p / s)

Gustafson's law:  S_latency(s) = 1 + (s - 1) * p

Figure 1.2.1: For both formulae, S_latency refers to the speedup in latency of the program's execution, s is the speedup in latency of the part of the program that benefits from parallel execution, and p is the percentage of program runtime that would benefit from parallel execution if the program were running serially.

With Amdahl's law (see Figure 1.2.1), we hold the size of the problem constant and calculate the speedup as the thread count increases. We see that the speedup of mm-shmem increases less than proportionately as the thread count doubles (see Figure 1.2.2). The speedup trend is observed to be logarithmic (see Figure 1.2.3). This suggests that there is an upper bound for the maximum speedup of a problem using parallel processing if the problem size is held constant.
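As an illustration of this bound (my own arithmetic applied to the measured data, so the derived numbers are estimates rather than measurements): for size 2048, the measured speedup at s = 40 threads is 11.80 (Figure 1.2.2). Solving Amdahl's law, 1 / ((1 - p) + p/40) = 11.80, gives p ≈ 0.94, i.e. roughly 94% of the serial runtime parallelizes. The implied ceiling as s grows without bound is 1 / (1 - p) ≈ 16, so no thread count would push the speedup for this fixed problem size much beyond that.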
Threads   Size 128   Size 256   Size 512   Size 1024   Size 2048
1         1          1          1          1           1
2         1.87       1.70       1.68       2.00        2.00
4         3.56       2.93       3.20       3.70        3.99
8         7.11       5.39       5.68       6.43        7.11
16        7.11       5.64       6.25       7.94        10.33
32        8.89       6.74       7.97       8.33        11.75
40        8.42       7.34       8.66       8.92        11.80
Figure 1.2.2: Table showing speedup of Tembusu running mm-shmem as the number of cores (threads) increases

Thread size            1        2        4        8
Problem size           480      600      740      890
Actual runtime (sec)   1.0437   1.0538   0.9821   0.9497
Figure 1.2.4: Table showing problem size of mm-shmem that will result in a runtime of 1.0 ± 0.06 seconds on Tembusu for varying thread sizes

Figure 1.2.5: Graph showing the problem size (amount of work) of mm-shmem that will result in a runtime of 1.0 ± 0.06 seconds on Tembusu for varying thread sizes (x-axis: no. of processors (threads); y-axis: amount of work, up to about 8 x 10^8).

If we scan Figure 1.2.2 row-wise, we observe that the speedup generally improves as problem size increases, holding thread count constant. Besides increasing the number of threads, increasing the problem size can also improve the speedup in latency. The application of Amdahl's law is limited here since it requires the problem size to be fixed. As such, we will consider Gustafson's law, which provides a more realistic assessment of parallel performance by considering the speedup under constant time instead of constant problem size. To apply Gustafson's law, I attempted to find the problem size that results in a runtime of 1 second for varying thread sizes. As it is challenging to get an exact runtime for each program, an error margin of 0.06 seconds is allowed. The results are presented in Figure 1.2.4.

Figure 1.2.5 is a graphical representation of the obtained results. Originally, the problem size referred to the matrix order. However, this is not equal to the actual work done: since mm-shmem performs its matrix multiplication within a triple-nested for loop, the actual amount of work done is directly proportional to the cube of the matrix order. From here on, we take the problem size to be the actual amount of work done by the processor. From the graph (see Figure 1.2.5), we observe that the problem size scales linearly with the number of processors under constant time, which is the ideal behaviour in parallel processing. With Gustafson's law, we are presented with a less pessimistic assessment of parallel performance. This is also more applicable in practice, since an increase in computational power is usually used to solve larger problems rather than to solve the same problem within a shorter time.
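To make the cube relationship noted above concrete (my own arithmetic, based on the matrix orders in Figure 1.2.4): a matrix of order 480 corresponds to roughly 480^3 ≈ 1.1 x 10^8 multiply-add operations, while order 890 corresponds to about 890^3 ≈ 7.0 x 10^8, which matches the scale of the vertical axis in Figure 1.2.5.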


Figure 1.2.3: Graph showing speedup of Tembusu running mm-shmem with varying numbers of processors (threads) used (x-axis: no. of processors (threads), 0 to 50; y-axis: speedup, 0 to 15; one line per problem size: 128, 256, 512, 1024 and 2048).

1.3 Memory effects


In computing, there is a growing disparity between the speed of the processor and that of main memory, known as the memory wall. To bridge this gap, caches are used to store recently accessed data and, by doing so, exploit the spatial and temporal locality of memory accesses.



Before retrieving data from main memory, the processor first checks its cache to determine if it contains a copy of that data. If so, a cache hit occurs and the processor retrieves the data directly from the cache. If the cache does not contain the data, the processor has to request it from main memory. This is known as a cache miss, and it incurs the penalty of additional delay.
The effectiveness of caches is not solely dependent on the caching algorithm; it can also be affected by the program. To demonstrate this, testmem was run on a lab machine, and the results are presented in Figure 1.3.1; a graphical representation of the results is shown in Figure 1.3.2. testmem reads data sequentially from a varying range of contiguous memory (simulated by an array) for several iterations.

Array size (kB)   Reading 1 (sec)   Reading 2 (sec)   Reading 3 (sec)   Average (sec)
4                 0.131194          0.131179          0.132424          0.131599
8                 0.124843          0.124554          0.124934          0.124777
16                0.125007          0.124953          0.124902          0.124954
32                0.125009          0.124575          0.124900          0.124828
64                0.132756          0.132920          0.132894          0.132857
128               0.133108          0.133114          0.133099          0.133107
256               0.136035          0.135417          0.135805          0.135752
512               0.163580          0.162282          0.159539          0.161800
1024              0.166573          0.165959          0.165906          0.166146
2048              0.167007          0.165969          0.165975          0.166317
4096              0.166615          0.165706          0.165597          0.165973
8192              0.254448          0.250943          0.253606          0.252999
16384             0.464711          0.464884          0.463626          0.464407
32768             0.467734          0.467380          0.467257          0.467457
65536             0.465453          0.466999          0.463359          0.465270
131072            0.465964          0.464905          0.464421          0.465097
Figure 1.3.1: Tabulated result of timing (in seconds) to run testmem using lab machine.

Figure 1.3.2: Graph representation of average timing to run testmem using lab machine (x-axis: log(array size); y-axis: average time in seconds, from 0 to about 0.6).

Runtime increases significantly when the array size grows from 8192 kB to 16384 kB (see Figure 1.3.2). It is not coincidental that the processor on the lab machine has a cache size of 8192 kB. For an 8192 kB array, it is still possible to cache most of the data and get cache hits. However, when the array size is doubled to 16384 kB, less than half the data can be cached. After the top half of the data has been read, the bottom half has to be loaded into the cache from main memory before it can be read; at the start of the next iteration, the cache contains the bottom half of the data, and the top half has to be reloaded from main memory. This causes a tremendous number of cache misses and leads to significantly longer runtimes. When testmem is run on an Ubuntu virtual machine with a cache size of 3072 kB, the runtime instead increases significantly between array sizes of 2048 kB and 4096 kB. We can see that once the array size exceeds the cache size, cache misses become more frequent, leading to longer runtimes.
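/* lengthMod is presumably the array length minus one (a power-of-two mask),
   so the stride-16 walk below wraps around the array for all steps accesses. */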
for (i = 0; i < steps; i++) {
    arr[(i * 16) & lengthMod]++;
}

Figure 1.3.3: Snippet of testmem showing how the array is accessed

This does not imply that we should limit data to cache-sized chunks. Instead, we should consider how the sequence in which data is accessed can be changed to improve the cache's hit ratio. For example, instead of traversing the array several times (see Figure 1.3.3), we can traverse it only once (see Figure 1.3.4). Also, we should preferentially access data that is nearby or has been accessed recently, since there is a higher likelihood of it being cached.
int q = 16 * steps / lengthMod;
int r = 16 * steps % lengthMod;
for (i = 0; i < lengthMod; i += 16) {
    arr[i] += q;
    if (i <= r) arr[i]++;
}

Figure 1.3.4: Improved method of accessing the array in testmem





1.4 Accuracy
In computers, numbers are represented externally as decimal, yet they are stored and processed in binary. This is the fundamental reason for the observed peculiarities of floating point arithmetic. Some decimal numbers, while appearing simple, cannot be represented exactly in binary. For example, a floating point number cannot give an exact representation of the decimal 0.1.
[sign: 1 bit] [exponent: 8 bits] [mantissa: 23 bits]
Figure 1.4.1: Format of a 32-bit floating point number. A sign bit of 0 means the number is positive, while a sign bit of 1 means it is negative. The exponent has a bias of 127, i.e. the zero offset is 127 in decimal, or (0111 1111)2. Raw exponent values 0 and 255 are reserved (for zero/denormals and infinity/NaN), so the effective exponent of a normalized number ranges from -126 to 127.

We look at the fpadd1 program, which adds 1 repeatedly to a floating point number whose value is initially 0. After running the program for several iterations, it is observed that the repeated addition of 1 never takes the floating point number past an upper bound of 16777216.0. In binary, this is represented as:
[0] [1001 0111] [000 0000 0000 0000 0000 0000]
Unlike the real number system, floating point numbers are not continuous, due to their finite size. The greatest precision available for the mantissa of a 32-bit floating point number is 24 bits (including the hidden bit), which can represent roughly 7 significant decimal figures. A number that cannot be represented in exact form is rounded to the nearest representable value. Since the mantissa is fixed at 24 bits, there is a trade-off between the range of a floating point number and its precision. When a small number is added to a large number, its contribution may be lost if its exponent is too small.

Adding 16000000 1s to 0 gives this result: 16000000.0
Adding 16500000 1s to 0 gives this result: 16500000.0
Adding 17000000 1s to 0 gives this result: 16777216.0
Adding 17500000 1s to 0 gives this result: 16777216.0
Adding 18000000 1s to 0 gives this result: 16777216.0

Figure 1.4.2: Result of running fpadd1 on lab machine

For example in fpadd1, when 1 is added to 16777216, the result remains 16777216 (see Figure 1.4.2). The correct result should be 16777217, or (1 0000 0000 0000 0000 0000 0001)2. We see that 16777217 requires 25 bits for an exact binary representation; however, the mantissa of a 32-bit floating point number only allows for 24 bits of precision. 16777217 is therefore rounded off to the nearest representable value, which happens to be 16777216. This is known as a rounding error. Figure 1.4.3 shows another scenario where the addition of a small number to a large number results in inaccuracy due to rounding of the result to the nearest value that can be represented by a floating point number.
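This behaviour can be reproduced with a few lines of C (a standalone demonstration of my own, not part of fpadd1):

#include <stdio.h>

int main(void)
{
    float f = 16777216.0f;          /* 2^24 */
    printf("%.1f\n", f + 1.0f);     /* prints 16777216.0: 16777217 needs 25 bits of mantissa */

    double d = 16777216.0;
    printf("%.1f\n", d + 1.0);      /* prints 16777217.0: a double has a 53-bit significand */
    return 0;
}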

Adding 500000 1s to 0.3 gives this result:  500000.3
Adding 1000000 1s to 0.3 gives this result: 1000000.3
...
Adding 3000000 1s to 0.3 gives this result: 3000000.2
Adding 3500000 1s to 0.3 gives this result: 3500000.2
...
Adding 4500000 1s to 0.3 gives this result: 4500000.0
Adding 5000000 1s to 0.3 gives this result: 5000000.0

Figure 1.4.3: Result of running fpadd2 on lab machine

In the presence of rounding errors, floating point arithmetic is not associative. A change in the order of operations results in unequal sums 60% of the time (see Figure 1.4.4).

Reading   Sum from 1 to 20   Sum from 20 to 1
1         587.750305         587.750366
2         595.166748         595.166748
3         583.424011         583.424072
4         655.115967         655.115967
5         669.330811         669.330872
Figure 1.4.4: Result of running fporder on lab machine

fpomp is a program that finds the sum of an array of floating point numbers by partitioning the addition problem across separate threads. Each thread finds the sum of its individual partition before the results are combined to get the overall sum. While there are no logical errors, we can observe some inconsistency in the results (see Figures 1.4.5 and 1.4.6).

Reading 1        Reading 2        Reading 3
3062545.500000   3063308.000000   3065318.500000
3062545.500000   3063308.000000   3065318.250000
3062545.500000   3063308.000000   3065318.250000
3062545.750000   3063307.750000   3065318.250000
3062545.750000   3063308.000000   3065318.250000
Figure 1.4.5: Result of running fpomp on lab machine using 8 threads
Reading 1        Reading 2        Reading 3
3065761.000000   3048338.500000   3057366.500000
3065760.500000   3048338.750000   3057366.250000
3065761.000000   3048338.250000   3057366.000000
3065760.250000   3048338.500000   3057366.500000
3065760.500000   3048338.250000   3057366.500000
Figure 1.4.6: Result of running fpomp on Tembusu using 24 threads

This is because for each iteration, the threads are not always
assigned the same partition of numbers to add, and they do not
always add their individual sums to the overall sum in the same
order. This leads to a change in the order of addition operations
performed for each iteration. In the presence of rounding errors,
floating point arithmetic is not associative; any change in order
of operations can lead to inconsistency of results. A
comparison between Figure 1.4.5 and Figure 1.4.6 suggests
that inconsistency of results worsens when more threads are
used.
Inconsistency of results stems from the non-deterministic
execution order of threads in parallel programs. For example, a
change in the order of operations for fpomp leads to multiple
results. If consistency is key, additional rules as to how
parallelism is carried out should be implemented. For example,
we can fix the way tasks are assigned to threads and the order
in which individual sums are added in fpomp to ensure
consistency.
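A minimal sketch of such a scheme, assuming an array arr of n floats and a fixed thread count (the names are illustrative; this is not the original fpomp source):

#include <omp.h>

/* Deterministic parallel sum: a fixed (static) work assignment plus a fixed,
   serial combination order, so every run with the same thread count performs
   the same additions in the same order and returns identical results. */
float deterministic_sum(const float *arr, int n, int nthreads)
{
    float partial[256] = {0.0f};          /* one slot per thread; assumes nthreads <= 256 */

    #pragma omp parallel num_threads(nthreads)
    {
        int tid = omp_get_thread_num();
        #pragma omp for schedule(static)  /* thread t always gets the same contiguous chunk */
        for (int i = 0; i < n; i++)
            partial[tid] += arr[i];
    }

    float sum = 0.0f;
    for (int t = 0; t < nthreads; t++)    /* combine partial sums in thread-ID order */
        sum += partial[t];
    return sum;
}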
If accuracy is key, we have to ensure that the set of possible
values for all variables can be represented with sufficient
precision and rounding errors are minimized. The simplest way
is to use a larger-sized floating point number, such as a double
or a long double. However, there is always a trade-off between
range and precision for floating point numbers. If our program
comprises floating point operations between large and small
numbers, and high precision of result is required, it may be
better to do away with floating point numbers altogether.
Instead, other data types besides floating point can be considered for storing real numbers. For example, the unum datatype allows a number to be stored using a variable number of bits, which lets it be smaller than a float when performance matters and more precise than a float when accuracy is needed.



1.5 Communication and speedup
No. of slave    Communication time for slave process (sec)    Computation time for slave process (sec)
processes       Min      Max      Avg                          Min      Max      Avg
8               0.39     1.45     0.87                         16.75    17.26    17.11
16              0.48     8.84     1.91                         13.93    16.29    14.10
24              0.70     4.92     2.78                         8.39     12.48    11.41
32              1.12     4.48     3.17                         8.94     12.83    11.45
40              3.30     7.77     5.98                         8.17     12.57    11.24
48              4.47     7.11     6.06                         8.49     11.68    10.58
56              8.18     10.90    9.82                         9.56     12.29    11.59
64              17.49    20.29    18.86                        8.87     12.21    10.67
Figure 1.5.1: Tabulated result of communication and computation times (minimum, maximum and average) of slave processes when mm-mpi is run on Tembusu

mm-mpi, a parallelized matrix multiplication program implemented with OpenMPI, was run on Tembusu to assess its parallel performance. With an increased number of processors, the average communication time for each slave node increases while the computation time for each node decreases (see Figure 1.5.1).
Fewer messages are sent and received by each slave process when there are more slave processes. As such, we would expect the communication time for each slave process to decrease as the number of slave processes increases. However, it
turns out that average communication time increases with more
slave processes. The master process uses MPI_Send and
MPI_Recv, both of which are blocking calls, to communicate
with the slave processes in order of increasing slave process
ID. This means that even when a slave process is ready to send
its results to the master process, it is unable to do so until the
master process has received results from all the slave
processes before it, causing added delays in communication
time. Communication time is generally shorter when the master
process sends and receives messages from fewer slave
processes, even though the total number of messages passed
is similar.
With more slave processes, each slave process will be
allocated a smaller portion of work. Hence, we expect the
average computation time to decrease with more slave
processes. We do see such a trend in Figure 1.5.1. However, it
is noted that average computation time stops decreasing once
past 48 slave processes.
Studying the source code, it is found that each slave process
calls allocate_matrix prior to doing matrix multiplication.
allocate_matrix appears to be serial since the size of the
matrix allocated is independent of the number of slave
processes. Initially, it was thought that allocate_matrix was to blame for the poor scaling. This would also explain why mm-shmem performs better in comparison, since its matrix only needs to be allocated once and stored in shared memory.

However, it is found that allocate_matrix takes up only a tiny portion of the computation time. It is slave_compute that takes up most of the computation time. Strangely, its runtime does not show a noticeable decrease when the number of rows processed by each slave decreases at larger numbers of slave processes.
This OpenMPI implementation of matrix multiplication does not scale well on multiple processors. However, there are several changes we can make to improve its communication time. We can use a logarithmic fan-out architecture for passing messages instead of having the master process communicate serially with each slave process. The total communication time then has O(log n) complexity instead of O(n), where n is the number of slave processes. The average communication time for slave processes will also improve.
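A hypothetical sketch of this change is shown below; it is not mm-mpi's actual code, and the matrix order N is assumed to be divisible by the number of processes.

#include <mpi.h>

/* Distribute work and collect results with collectives instead of a serial
   MPI_Send/MPI_Recv loop over slave IDs. MPI_Bcast is typically implemented
   as a binomial-tree fan-out, so distributing B completes in O(log n)
   communication steps rather than n sequential sends; MPI_Scatter and
   MPI_Gather likewise let the library pick an optimised pattern. */
void distribute_and_collect(double *A, double *B, double *C,
                            double *localA, double *localC,
                            int N, int nprocs, MPI_Comm comm)
{
    int rowsPerProc = N / nprocs;

    MPI_Bcast(B, N * N, MPI_DOUBLE, 0, comm);                   /* everyone needs all of B */

    MPI_Scatter(A, rowsPerProc * N, MPI_DOUBLE,                 /* each rank gets its rows of A */
                localA, rowsPerProc * N, MPI_DOUBLE, 0, comm);

    /* ... each rank multiplies its rows of A by B into localC here ... */

    MPI_Gather(localC, rowsPerProc * N, MPI_DOUBLE,             /* rows of C return to rank 0 */
               C, rowsPerProc * N, MPI_DOUBLE, 0, comm);
}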

PART 2 - Parallelized Numerical Integration


2.1 Speedup
We cannot directly use the runtime as a gauge for comparing the performance of integralMP and integralMPI, because they were run on different machines. However, it is observed that both implementations saw shorter runtimes as more threads were utilized.
Under Amdahl's law, both implementations scale sublinearly with the increase in the number of threads, holding problem size constant. For example, the runtime of integralMP with 2^20 strips running on a lab machine improves by a factor of only 2.37 when its thread count is increased by a factor of 8 (see Figure 2.1.2).
Threads                        1        2        4        8
integralMP with 2^20 strips    0.0197   0.0131   0.0107   0.0083
integralMP with 2^22 strips    0.0719   0.0427   0.0275   0.0199
integralMP with 2^24 strips    0.2725   0.1404   0.0853   0.0529
integralMP with 2^26 strips    1.0751   0.5565   0.3022   0.1720
integralMP with 2^28 strips    4.2982   2.2101   1.2238   0.6543
integralMPI with 2^20 strips   0.0430   0.0230   0.0119   0.0061
integralMPI with 2^22 strips   0.1319   0.0789   0.0440   0.0233
integralMPI with 2^24 strips   0.4375   0.2343   0.1451   0.0762
integralMPI with 2^26 strips   1.3524   0.7909   0.4492   0.2379
integralMPI with 2^28 strips   5.1215   2.6807   1.4548   0.8273
Figure 2.1.1: Runtime (in seconds) for both programs on a single processor machine. integralMP was run on a lab machine while integralMPI was run on Tembusu.



Threads                        1   2      4      8
integralMP with 2^20 strips    1   1.50   1.84   2.37
integralMP with 2^22 strips    1   1.68   2.61   3.61
integralMP with 2^24 strips    1   1.94   3.19   5.15
integralMP with 2^26 strips    1   1.93   3.56   6.25
integralMP with 2^28 strips    1   1.94   3.51   6.57
integralMPI with 2^20 strips   1   1.87   3.61   7.05
integralMPI with 2^22 strips   1   1.67   3.00   5.66
integralMPI with 2^24 strips   1   1.88   3.02   5.74
integralMPI with 2^26 strips   1   1.71   3.01   5.68
integralMPI with 2^28 strips   1   1.91   3.52   6.19
Figure 2.1.2: Speedup for both programs on a single processor machine. integralMP was run on a lab machine while integralMPI was run on Tembusu.

As discussed earlier in this write-up, a more realistic perspective is Gustafson's law, where we hold the time constant and see how the problem size scales with the number of threads executed on individual cores, for both the OpenMP and OpenMPI implementations. To consider this, an attempt was made to find the problem size (strip count) that will result in a runtime of 1 second for varying thread sizes. As it is challenging to get an exact runtime of 1 second, a margin of error is allowed. The results are presented in Figures 2.1.4, 2.1.5, 2.1.6 and 2.1.7.
Thread size             1       2        4       8
Strip count (in 10^7)   6.215   12.115   21.5    41.5
Actual time (sec)       1.002   1.001    1.008   1.004
Figure 2.1.4: Table showing problem size of integralMP that will result in a runtime of 1.0 ± 0.008 seconds on lab machine for varying thread sizes

Figure 2.1.5: Graph showing strip count of integralMP that will result in a runtime of 1.0 ± 0.008 seconds on lab machine for varying thread sizes (x-axis: no. of processors (threads); y-axis: strip count, up to about 6 x 10^8).

Studying Figure 2.1.2 column-wise, we see that the speedup of integralMP improves for the same thread count when the problem size increases. This suggests that the program has better parallel scaling if the problem size is not held constant but is allowed to increase with the thread count.


Thread size             1       2       4       8
Strip count (in 10^7)   5       10      18      20
Actual time (sec)       1.020   1.028   1.020   0.992
Figure 2.1.6: Table showing problem size of integralMPI that will result in a runtime of 1.0 ± 0.03 seconds on Tembusu for varying thread sizes

Figure 2.1.3: Graph representation of speedup for integralMP on lab machine (x-axis: no. of processors; y-axis: speedup; one line per strip count, 2^20, 2^22, 2^24, 2^26 and 2^28, plus an 'Ideal' linear-speedup line).

From Figure 2.1.3, it is observed that the speedup for a fixed problem size (strip count) follows a logarithmic trend; at higher processor counts, there is negligible improvement in speedup as the number of processors increases. If the problem size were to remain constant, it seems almost impossible to achieve better-than-logarithmic speedups. However, if we allow the problem size to increase with the number of processors, linear speedup is possible. For example, with 4 processors, the speedup is sublinear for a problem of 2^20 strips, but it becomes linear when the problem size is increased to 2^28 strips (the blue line cuts the green line; see Figure 2.1.3).

Figure 2.1.7: Graph showing strip count of integralMPI that will result in a runtime of 1.0 ± 0.03 seconds on Tembusu for varying thread sizes (x-axis: no. of processors (threads); y-axis: strip count, up to about 3 x 10^8).

There is an almost linear speedup in latency of integralMP while holding time constant (see Figure 2.1.5). However, the same cannot be said of integralMPI, whose problem size seems to increase only logarithmically with the number of threads (see Figure 2.1.7).


For integralMPI, the communication and computation are non-overlapping. Hence, the runtime for a node can be found using the following formula:

t_node = t_communication + t_computation



Threads                    16     32     64     128
Fastest computation time   2.51   1.39   0.68   0.30
Slowest computation time   3.20   2.40   1.75   1.41
Total runtime              3.24   2.41   1.77   1.47
Figure 2.1.8: Table showing the computation time taken by the fastest and slowest nodes, as well as the total runtime, for integralMPI run on Tembusu with a strip count of 10^9

The poor scaling of integralMPI suggests some kind of bottleneck in the communication time between the nodes. To be certain, an experiment was carried out to find the computation time taken by each node, with the results presented in Figure 2.1.8. We see that the fastest nodes demonstrated almost perfect scaling in terms of speedup. However, there is a huge variance in computation times between the nodes, with the slowest node being considerably slower than the fastest. The difference between the slowest computation time and the total runtime gives us the communication time, which is very small in comparison. Hence it is unlikely for communication to be the cause of the poor performance.
Here, it is the slowest node that is the bottleneck for the entire
program since the master node will have to wait for it to finish
its tasks before it can return the result. After several runs, it is
observed that the computation time follows a random
distribution, with no particular node being consistently faster or
slower than the rest.
Threads                         16     32     64     128    256    384
integralMPI with 10^9 strips    1.51   0.79   2.64   1.34   0.68   0.71
integralMPI with 10^10 strips   2.10   1.11   3.73   1.89   0.93   1.04
integralMPI with 10^11 strips   1.79   0.95   3.19   1.63   0.98   0.91
Figure 2.1.9: Runtime (in seconds) for integralMPI on Tembusu using multiple machines

From Figure 2.1.9, we see that integralMPI scales poorly with increasing thread count. As discussed earlier, this is likely due to bottlenecks from the slower nodes. To prevent such scenarios, it is important to ensure that all nodes offer similar performance, since the program will only be as fast as its slowest node.
Even though the communication time is short, we should still try to minimize it. For instance, having the master thread communicate with all the other threads sequentially using MPI_Send/MPI_Recv to consolidate the strip areas is generally inefficient and leads to O(n) communication time, where n is the number of processes. Instead, MPI_Reduce, which adopts a more efficient tree reduction algorithm, is called to sum up the strip areas.
An alternative would be an OpenMPI implementation that uses shared memory for communication between nodes on the same machine (cores on the same processor), instead of having messages sent through the kernel's TCP stack and back again. For example, a hybrid OpenMP/MPI program can be implemented which uses OpenMP for communication within a machine and OpenMPI for communication between machines; a sketch of this approach is shown below.
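A minimal sketch of such a hybrid, applied to the integration problem (illustrative only; it is not the project's integralMP/integralMPI source, and the strip count is assumed to be divisible by the number of ranks):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Hybrid scheme: OpenMP threads share memory within one machine, while MPI_Reduce
   combines the per-machine partial sums across machines with a tree reduction. */
int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const long long stripCount = 1LL << 28;         /* assumed divisible by nprocs */
    const double width = 1.0 / (double)stripCount;
    long long perRank = stripCount / nprocs;
    long long start = rank * perRank;

    double localSum = 0.0;
    #pragma omp parallel for reduction(+:localSum)   /* threads within one machine */
    for (long long i = start; i < start + perRank; i++) {
        double x = (i + 0.5) * width;                /* midpoint of strip i */
        localSum += 4.0 / (1.0 + x * x) * width;
    }

    double totalSum = 0.0;
    MPI_Reduce(&localSum, &totalSum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("pi is approximately %.15f\n", totalSum);

    MPI_Finalize();
    return 0;
}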

2.2 Accuracy
Integrating ∫[0,1] 4 / (1 + x^2) dx gives us Leibniz's series, which, when evaluated, gives us the value of π:

π = 4 tan^-1(1) = 4 (1 - 1/3 + 1/5 - 1/7 + 1/9 - ...)

Strip count   Calculated integral value
10            3.3311788072817956
10^3          3.1435917356731302
10^6          3.1415946535889016
10^9          3.1415926555896049
(The actual value of π is 3.1415926535897931.)
Figure 2.2.1: Results of π calculated using integralMP and integralMPI with varying strip counts

From Figure 2.2.1 we see that even when 10^9 strips were used, the actual value of π (if it were represented by a double variable) could still not be obtained. More strips can be used to obtain a more accurate value, but there is a limit to the number of strips that can be used. We have seen previously that adding a very small floating point number to a very large floating point number causes rounding errors. A similar problem surfaces if too many strips are used: the strip area calculated becomes too small to be taken into account when it is added to sum (see Figure 2.2.2).

#pragma omp parallel for reduction(+:sum) shared(stripCount, width) private(i, x)
for (i = 0; i < stripCount; i++) {
    x = i * width - width / 2;
    sum += 4 / (1 + x * x) * width;
}

Figure 2.2.2: Snippet of integerMP.c


Strip count   Lower bound for integral   Upper bound for integral
10^2          3.1368000000               3.1466000000
10^5          3.1415876436               3.1415976436
10^8          3.1415926486               3.1415926586
Figure 2.2.3: Table showing results of integralMPI2

An alternative is to use only integers in the calculation, so as to avoid the rounding errors of floating point numbers. One way to do this is to divide the space into small squares and count the number of squares bounded by the curve and the x-axis within the domain 0 <= x <= 1. For each strip, we can find lower and upper bounds for the number of squares. Adding them up gives us lower and upper bounds for the total number of squares under the curve. We can then translate the number of squares into an actual value by multiplying it by the area of a square. This way, we can obtain lower and upper bounds for the value of π. The file integerMPI2.c contains the source code for this program, and Figure 2.2.3 shows the results obtained from running integralMPI2.
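With squares of side 1/N and the curve y = 4 / (1 + x^2) decreasing on [0, 1], the square in row j of strip i lies wholly under the curve exactly when (j + 1)(N^2 + (i + 1)^2) <= 4N^3, a test that needs only integer arithmetic. A small serial sketch of this idea follows (my own illustration; integerMPI2.c is not reproduced here and may count the squares differently):

#include <stdio.h>

int main(void)
{
    const long long N = 100000;        /* strips; 4*N^3 must fit in 64 bits, so N <= ~10^6 here */
    long long lower = 0, upper = 0;    /* counts of squares of side 1/N */

    for (long long i = 0; i < N; i++) {
        /* f(x) = 4/(1+x^2) is decreasing on [0,1], so the right edge of a strip
           gives its minimum height and the left edge its maximum height. */
        long long dRight = N * N + (i + 1) * (i + 1);
        long long dLeft  = N * N + i * i;

        lower += (4 * N * N * N) / dRight;              /* squares fully under the curve */
        upper += (4 * N * N * N + dLeft - 1) / dLeft;   /* squares touching the region (ceiling division) */
    }

    /* each square has area 1/N^2, so the counts bracket the integral (and hence pi) */
    printf("pi is between %.10f and %.10f\n",
           (double)lower / ((double)N * N), (double)upper / ((double)N * N));
    return 0;
}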

CONCLUSION
While parallel computing holds a lot of promise, we must consider carefully how we implement it in order to reap the maximum performance gain.
