CS3211 Project 1
Wu Wenqi A0124278A
PART 1
1.1 Hardware
The hardware comprises a lab machine and the Tembusu
cluster. The lab machine runs on a quad-core Intel i7-2600
processor with Ubuntu 14.04 LTS.
Unlike concurrency, parallelism is bound by the number of
processing cores available. The number of processes that
can run in parallel cannot exceed the number of cores,
assuming there is no simultaneous multithreading. If the
program uses more processes than there are cores,
some cores will run multiple processes serially
via context switching. This adds no parallelism, so
no performance gain can be reaped from it.
For this write-up, it is assumed that all threads run on
separate cores. As such, cores, processors, processes and
threads all refer to the parallel execution of multiple threads on
separate cores and may be used interchangeably.
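As a concrete illustration (a minimal OpenMP sketch, not part of any of the project programs), a program can query how many logical processors are available and cap its thread count accordingly, so that it never requests more parallelism than the hardware can deliver:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int cores = omp_get_num_procs();   /* logical processors visible to the OpenMP runtime */
    omp_set_num_threads(cores);        /* avoid oversubscription: extra threads would only be time-sliced */

    #pragma omp parallel
    {
        #pragma omp single
        printf("Running with %d threads on %d logical processors\n",
               omp_get_num_threads(), cores);
    }
    return 0;
}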
The lab machine and the Tembusu cluster were given matrix
multiplication problems of varying size to run with different
numbers of threads using the program mm-shmem. Matrix
multiplication is easily parallelized since the matrix can be
partitioned into p blocks that are computed independently,
where p is the number of processes.
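A minimal sketch of such a partitioning (an illustration only, not the actual mm-shmem source; the row-block split is assumed) parallelizes the outer loop so that each thread computes its own block of rows of the result:

#include <omp.h>

/* C = A * B for n x n matrices stored row-major in flat arrays.
 * Each thread is handed a block of rows of C, so no two threads
 * ever write to the same element. */
void matmul(const double *A, const double *B, double *C, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}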
We define two experiments to be the same if the problems are
of the same size and the same number of threads is used.
Each experiment was run for 3 iterations, and the runtime was
found to be abnormally long for some of the iterations (see
Figures 1.1.1 and 1.1.2). It is not clear why, but a possible
cause is external processes running on the machine. For
example, other users could have logged on to the same lab
machine or to Tembusu and competed for CPU time slices.
Threads   Size   Runtimes over 3 iterations (sec)
1         128    0.0323, 0.0332, 0.0320
1         256    0.1856, 0.1813, 0.1846
1         512    1.2223, 1.2161, 1.1944
1         1024   8.9999, 8.9704, 8.9881
1         2048   112.2641, 109.9655, 109.6149
2         128    0.0171, 0.0630, 0.0172
2         256    0.1064, 0.1101, 0.1123
2         512    0.7077, 0.7191, 2.5302
2         1024   4.6102, 4.6227, 4.4945
2         2048   55.3838, 55.1213, 54.7532
4         128    0.0090, 0.0090, 0.0090
4         256    0.0620, 0.0625, 0.0619
4         512    0.3845, 0.3731, 3.2281
4         1024   2.5900, 2.6020, 2.4220
4         2048   27.4877, 79.8925, 28.2584
8         128    0.0045, 0.0376, 0.0046
8         256    0.0348, 0.0336, 0.0342
8         512    0.2103, 0.2101, 0.2101
8         1024   1.4655, 1.3957, 1.3953
8         2048   66.6549, 15.3964, 15.6514
16        128    0.0046, 0.0046, 0.0045
16        256    0.0347, 0.9479, 0.0321
16        512    0.1916, 0.1912, 0.1957
16        1024   1.1386, 1.1300, 1.1681
16        2048   47.0230, 51.6956, 10.6101
32        128    0.0114, 0.0047, 0.0036
32        256    0.0269, 0.1252, 0.0268
32        512    0.1638, 0.1498, 0.7528
32        1024   1.1365, 1.0927, 1.0764
32        2048   9.4150, 9.3256, 9.4156
40        128    0.0038, 0.0044, 0.0041
40        256    0.0272, 0.0247, 0.1122
40        512    0.1449, 0.1380, 0.1420
40        1024   1.0062, 1.0109, 1.0827
40        2048   26.4869, 9.4006, 9.2925
Figure 1.1.2: Time (in seconds) taken to run mm-shmem on Tembusu;
each row lists the three iterations for one combination of thread
count and problem size.
1.2 Speedup
Amdahl's law:
S_latency(s) = 1 / ((1 - p) + p/s)
Gustafson's law:
S_latency(s) = 1 - p + s * p
Figure 1.2.1: For both formulae, S_latency refers to the speedup in latency
of the whole program execution; s is the speedup in latency of the part of the
program that benefits from parallel execution; p is the percentage of the
program's runtime that would benefit from parallel execution if the program
were run serially.
With Amdahl's law (see Figure 1.2.1), we hold the size of the
problem constant and calculate the speedup as the number of
threads increases. We see that the speedup of mm-shmem increases
less than proportionately as the thread count doubles (see Figure
1.2.2). The speedup trend is observed to be logarithmic (see
Figure 1.2.3). This suggests that there is an upper bound on the
maximum speedup obtainable through parallel processing if the
problem size is held constant.
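For illustration (the value of p here is assumed, not measured): if p = 0.9 of the serial runtime is parallelizable, Amdahl's law predicts a speedup of 1 / ((1 - 0.9) + 0.9/32) ≈ 7.8 with 32 threads, and at most 1 / (1 - 0.9) = 10 no matter how many threads are added. A larger matrix spends a greater fraction of its runtime in the parallelizable multiplication loops (a larger p), which is consistent with the larger sizes in Figure 1.2.2 continuing to gain speedup at high thread counts.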
Threads   Size 128   Size 256   Size 512   Size 1024   Size 2048
1         1          1          1          1           1
2         1.87       1.70       1.68       2.00        2.00
4         3.56       2.93       3.20       3.70        3.99
8         7.11       5.39       5.68       6.43        7.11
16        7.11       5.64       6.25       7.94        10.33
32        8.89       6.74       7.97       8.33        11.75
40        8.42       7.34       8.66       8.92        11.80
Figure 1.2.2: Table showing speedup of Tembusu running mm-shmem as
number of cores (threads) increases
Thread size          1        2        4        8
Problem size         480      600      740      890
Actual runtime (sec) 1.0437   1.0538   0.9821   0.9497
Figure 1.2.4: Table showing problem size of mm-shmem that will result in
a runtime of 1.0 ± 0.06 seconds on Tembusu for varying thread sizes
Figure 1.2.5: Graph showing problem size of mm-shmem that will result
in a runtime of 1.0 ± 0.06 seconds on Tembusu for varying thread sizes
Figure 1.2.3: Graph of speedup against number of processors (threads)
for problem sizes 128, 256, 512, 1024 and 2048.
Before retrieving data from main memory, the processor
first checks its cache to determine whether it holds a copy of that
data. If it does, a cache hit occurs and the processor retrieves
the data directly from the cache. If the cache does not contain
the data, the processor has to request it from main memory.
This is known as a cache miss, and it incurs the penalty of
additional delay.
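The effect can be observed by timing repeated accesses to arrays of increasing size. The following is a minimal sketch of such a measurement (an illustration only, not the actual testmem source; the stride, step count and clock used here are assumptions):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Touch an array of a given size many times with a fixed stride and
 * time the accesses.  Once the array no longer fits in a cache level,
 * the same number of accesses takes noticeably longer. */
static double time_accesses(int *arr, size_t len, long steps)
{
    size_t mask = len - 1;                /* len is assumed to be a power of two */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < steps; i++)
        arr[(i * 16) & mask]++;           /* stride of 16 ints = 64 bytes, one cache line */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    long steps = 64L * 1024 * 1024;
    for (size_t kb = 4; kb <= 131072; kb *= 2) {
        size_t len = kb * 1024 / sizeof(int);
        int *arr = calloc(len, sizeof(int));
        printf("%8zu kB: %.6f s\n", kb, time_accesses(arr, len, steps));
        free(arr);
    }
    return 0;
}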
Array size (kB)   Reading 1 (sec)   Reading 2 (sec)   Reading 3 (sec)   Average (sec)
4                 0.131194          0.131179          0.132424          0.131599
8                 0.124843          0.124554          0.124934          0.124777
16                0.125007          0.124953          0.124902          0.124954
32                0.125009          0.124575          0.124900          0.124828
64                0.132756          0.132920          0.132894          0.132857
128               0.133108          0.133114          0.133099          0.133107
256               0.136035          0.135417          0.135805          0.135752
512               0.163580          0.162282          0.159539          0.161800
1024              0.166573          0.165959          0.165906          0.166146
2048              0.167007          0.165969          0.165975          0.166317
4096              0.166615          0.165706          0.165597          0.165973
8192              0.254448          0.250943          0.253606          0.252999
16384             0.464711          0.464884          0.463626          0.464407
32768             0.467734          0.467380          0.467257          0.467457
65536             0.465453          0.466999          0.463359          0.465270
131072            0.465964          0.464905          0.464421          0.465097
Figure 1.3.1: Tabulated results of timings (in seconds) to run testmem
on the lab machine.
When the array is larger than the cache, the cache cannot hold all of
it at once: loading the bottom half of the data evicts the top half,
and the top half will need to be reloaded into the cache from main
memory. This causes tremendous numbers of cache misses and leads to
significantly longer runtimes. When testmem is run on an Ubuntu
virtual machine with a cache size of 3072 kB, runtime increases
significantly between problems with array size 2048 kB and 4096 kB. We
can see that once the array size exceeds the cache size, cache
misses become more frequent, leading to longer runtimes.
/* Traverse the array many times: every 16th int (one per cache line)
 * is touched, wrapping around with the lengthMod mask. */
for (i = 0; i < steps; i++)
{
    arr[(i * 16) & lengthMod]++;
}
This does not imply that we should limit data to the cache size.
Instead, we should consider how the sequence in which data is
accessed can be changed to improve the cache's hit ratio. For
example, instead of traversing the array several times (see
Figure 1.3.3), we can traverse it only once (see Figure 1.3.4).
We should also preferentially access data that is stored nearby or
has been accessed recently, since it is more likely to be cached.
/* Single pass over the array: each touched element receives its full
 * count at once.  Assumes lengthMod is the array length minus one and
 * the length is a power of two. */
int len = lengthMod + 1;
int q = steps / (len / 16);    /* full passes over the touched elements */
int r = steps % (len / 16);    /* elements that receive one extra pass  */
for (i = 0; i < len; i += 16) {
    arr[i] += q;
    if (i < r * 16) arr[i]++;
}
1.4 Accuracy
In computers, numbers are represented externally as decimal
yet they are stored and processed as binary. This is the
fundamental reason for the observed peculiarities in floating
point arithmetic. Some decimal numbers, while appearing
simple, are unable to be represented in exact form as binary.
For example, a floating point number cannot give an exact
representation of the decimal 0.1.
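A quick way to see this (a standalone illustration, not taken from the project code) is to print the value that is actually stored when 0.1 is assigned to a float:

#include <stdio.h>

int main(void)
{
    float x = 0.1f;          /* the nearest float to 0.1, not 0.1 itself */
    printf("%.20f\n", x);    /* prints 0.10000000149011611938 */
    return 0;
}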
[sign: 1 bit] [exponent: 8 bits] [mantissa: 23 bits]
Figure 1.4.1: Format of a 32-bit floating point number. A sign bit of 0
means the number is positive, while a sign bit of 1 means it is
negative. The exponent has a bias of 127, i.e. the zero offset is 127 in
decimal or (0111 1111)2. After removing the bias the exponent field spans
-127 to 128, but the two extreme values are reserved, so normalized
numbers use exponents from -126 to 127.
The mantissa is therefore effectively 24 bits (including the hidden
bit), which can represent approximately 7 significant figures in
decimal. A number will be rounded off to the nearest representable
value if it cannot be represented in exact form. Since the mantissa is
fixed at 24 bits, there is a trade-off between the range of a floating
point number and its precision. When a small number is added to a
large number, its contribution may be lost entirely if the difference
between their exponents is too large.
Adding 16000000 1s to 0 gives this result: 16000000.0
Adding 16500000 1s to 0 gives this result: 16500000.0
Adding 17000000 1s to 0 gives this result: 16777216.0
Adding 17500000 1s to 0 gives this result: 16777216.0
Adding 18000000 1s to 0 gives this result: 16777216.0
Reading 1        Reading 2        Reading 3
3062545.500000   3063308.000000   3065318.500000
3062545.500000   3063308.000000   3065318.250000
3062545.500000   3063308.000000   3065318.250000
3062545.750000   3063307.750000   3065318.250000
3062545.750000   3063308.000000   3065318.250000
Figure 1.4.5: Result of running fpomp on lab machine using 8 threads
Reading 1        Reading 2        Reading 3
3065761.000000   3048338.500000   3057366.500000
3065760.500000   3048338.750000   3057366.250000
3065761.000000   3048338.250000   3057366.000000
3065760.250000   3048338.500000   3057366.500000
3065760.500000   3048338.250000   3057366.500000
Figure 1.4.6: Result of running fpomp on Tembusu using 24 threads
This is because for each iteration, the threads are not always
assigned the same partition of numbers to add, and they do not
always add their individual sums to the overall sum in the same
order. This leads to a change in the order of addition operations
performed for each iteration. In the presence of rounding errors,
floating point arithmetic is not associative; any change in order
of operations can lead to inconsistency of results. A
comparison between Figure 1.4.5 and Figure 1.4.6 suggests
that inconsistency of results worsens when more threads are
used.
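The non-associativity is easy to reproduce in isolation (a standalone illustration, separate from fpomp): the same three numbers summed in a different order give different single-precision results.

#include <stdio.h>

int main(void)
{
    float big = 16777216.0f;             /* 2^24: above this, floats are spaced 2 apart */
    float left  = (big + 1.0f) + 1.0f;   /* each +1 is lost to rounding: stays 16777216.0 */
    float right = big + (1.0f + 1.0f);   /* the 1s are combined first: gives 16777218.0 */
    printf("left  = %.1f\n", left);
    printf("right = %.1f\n", right);
    return 0;
}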
Inconsistency of results stems from the non-deterministic
execution order of threads in parallel programs. For example, a
change in the order of operations for fpomp leads to multiple
results. If consistency is key, additional rules as to how
parallelism is carried out should be implemented. For example,
we can fix the way tasks are assigned to threads and the order
in which individual sums are added in fpomp to ensure
consistency.
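A minimal sketch of such a scheme (an illustration only, not the original fpomp source; the name reproducible_sum, the 64-slot partial array and the contiguous block partitioning are assumptions) gives each thread a fixed slice of the data and combines the per-thread sums in a fixed order:

#include <stdio.h>
#include <omp.h>

/* Each thread sums a fixed, contiguous slice of the data; the partial
 * sums are then combined serially in thread order, so every run performs
 * the same additions in the same order and returns the same value. */
float reproducible_sum(const float *data, int n, int nthreads)
{
    float partial[64] = {0.0f};               /* assumes nthreads <= 64 */

    #pragma omp parallel num_threads(nthreads)
    {
        int tid   = omp_get_thread_num();
        int chunk = (n + nthreads - 1) / nthreads;
        int start = tid * chunk;
        int end   = start + chunk < n ? start + chunk : n;
        float s = 0.0f;
        for (int i = start; i < end; i++)     /* fixed partition per thread */
            s += data[i];
        partial[tid] = s;
    }

    float total = 0.0f;
    for (int t = 0; t < nthreads; t++)        /* fixed combination order */
        total += partial[t];
    return total;
}

int main(void)
{
    enum { N = 1000000 };
    static float data[N];
    for (int i = 0; i < N; i++)
        data[i] = 0.1f;
    printf("%f\n", reproducible_sum(data, N, 8));   /* same value on every run */
    return 0;
}

Because both the partitioning and the combination order are independent of thread scheduling, repeated runs of this sketch perform exactly the same additions and produce the same result.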
If accuracy is key, we have to ensure that the set of possible
values for all variables can be represented with sufficient
precision and rounding errors are minimized. The simplest way
is to use a larger-sized floating point number, such as a double
or a long double. However, there is always a trade-off between
range and precision for floating point numbers. If our program
comprises floating point operations between large and small
numbers, and high precision of result is required, it may be
better to do away with floating point numbers altogether.
Instead, other data types besides floating point can be considered
for storing real numbers. For example, the unum datatype allows a
number to be stored using a variable number of bits. This enables it
to be smaller than a float when performance is the concern and more
precise than a float when accuracy is needed.
1.5 Communication and speedup
Threads                         1      2      4      8
integralMP with 2^20 strips     1      1.50   1.84   2.37
integralMP with 2^22 strips     1      1.68   2.61   3.61
integralMP with 2^24 strips     1      1.94   3.19   5.15
integralMP with 2^26 strips     1      1.93   3.56   6.25
integralMP with 2^28 strips     1      1.94   3.51   6.57
integralMPI with 2^20 strips    1      1.87   3.61   7.05
integralMPI with 2^22 strips    1      1.67   3.00   5.66
integralMPI with 2^24 strips    1      1.88   3.02   5.74
integralMPI with 2^26 strips    1      1.71   3.01   5.68
integralMPI with 2^28 strips    1      1.91   3.52   6.19
Figure 2.1.2: Speedup for both programs on a single processor
machine. integralMP was run on a lab machine while integralMPI
was run on Tembusu.
An attempt was made to find the problem size (strip count) that results
in a runtime of 1 second for varying thread sizes. As it is
challenging to obtain a runtime of exactly 1 second, a margin of
error is allowed. The results are presented in Figures 2.1.4,
2.1.5, 2.1.6 and 2.1.7.
Thread size             1       2        4       8
Strip count (in 10^7)   6.215   12.115   21.5    41.5
Actual time (sec)       1.002   1.001    1.008   1.004
Figure 2.1.4: Table showing problem size of integralMP that will result
in a runtime of 1.0 ± 0.008 seconds on the lab machine for varying thread
sizes
Studying Figure 2.1.2 column-wise, we see that for the same thread
count the speedup of integralMP improves as the problem size
increases. This suggests that the program scales better when the
problem size is not held constant but is allowed to grow with the
number of threads.
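This behaviour is what Gustafson's law (Figure 1.2.1) describes. For illustration (the value of p is assumed, not measured): with p = 0.9 and s = 8 threads, Gustafson's law gives a scaled speedup of 1 - 0.9 + 8 × 0.9 = 7.3, which keeps growing linearly with s, whereas Amdahl's law with the same p and a fixed problem size would give only 1 / (0.1 + 0.9/8) ≈ 4.7.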
Figure 2.1.5: Graph showing strip count of integralMP that will result
in a runtime of 1.0 ± 0.008 seconds on the lab machine for varying thread
sizes
Thread size             1       2       4       8
Strip count (in 10^7)   5       10      18      20
Actual time (sec)       1.020   1.028   1.020   0.992
Figure 2.1.6: Table showing problem size of integralMPI that will
result in a runtime of 1.0 ± 0.03 seconds on Tembusu for varying thread
sizes
Figure 2.1.7: Graph showing strip count of integralMPI that will result
in a runtime of 1.0 ± 0.03 seconds on Tembusu for varying thread sizes
Threads                          16     32     64     128
Fastest computation time (sec)   2.51   1.39   0.68   0.30
Slowest computation time (sec)   3.20   2.40   1.75   1.41
Total runtime (sec)              3.24   2.41   1.77   1.47
Figure 2.1.8: Table showing the computation time (in seconds) taken by the
fastest and slowest nodes, as well as the total runtime, for integralMPI run
on Tembusu with a strip count of 10^9
No. of processes                 16     32     64     128    256    384
integralMPI with 10^11 strips    1.51   0.79   2.64   1.34   0.68   0.71
                                 2.10   1.11   3.73   1.89   0.93   1.04
                                 1.79   0.95   3.19   1.63   0.98   0.91
Figure 2.1.9: Runtime (in seconds) for integralMPI on Tembusu using
multiple machines
Communication between MPI processes incurs overhead even within a
single machine, as it requires the message to be sent through the
kernel's TCP stack and come back again. For example, a hybrid
OpenMP/MPI program can be implemented which uses OpenMP for
communication within the machine and MPI for communication with other
machines.
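A minimal sketch of that idea follows (an illustration only, not integralMPI's actual source; the integrand f and the strip count shown here are assumptions). Each MPI process sums its own share of strips with an OpenMP reduction, and only one small partial sum per process crosses the network:

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

/* f() stands in for whichever integrand the real program uses. */
static double f(double x) { return 4.0 / (1.0 + x * x); }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long long strips = 1LL << 28;      /* illustrative total strip count */
    const double width = 1.0 / (double)strips;

    double local = 0.0;
    /* Threads within one machine share memory, so this sum needs no messages. */
    #pragma omp parallel for reduction(+:local)
    for (long long i = rank; i < strips; i += size) {
        double x = (i + 0.5) * width;        /* midpoint of strip i */
        local += f(x) * width;
    }

    /* Only one small message per process crosses the network. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("integral = %.15f\n", total);

    MPI_Finalize();
    return 0;
}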
2.2 Accuracy
Both programs estimate π by numerical integration: the area under the
curve is divided into strips, and the areas of the strips are summed.
Strip count   Calculated integral value
10            3.3311788072817956
10^3          3.1435917356731302
10^6          3.1415946535889016
10^9          3.1415926555896049
Actual value is 3.1415926535897931.
Figure 2.2.1: Results of π calculated using integralMP and
integralMPI with varying strip counts
From Figure 2.2.1 we see that even when 10^9 strips were used,
the actual value of π (as represented by a double variable)
could still not be obtained. More strips can be used to obtain
a more accurate value, but there is a limit to the number
of strips that can usefully be used. We have seen previously that
adding a very small floating point number to a very large floating
point number causes rounding errors. A similar problem surfaces if too
many strips are used: the calculated area of each strip becomes too
small to be taken into account when it is added to the sum (see Figure
2.2.2).
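The same absorption is easy to demonstrate in double precision (a standalone illustration, not the project code): once the running sum is large enough, a sufficiently small strip area adds nothing.

#include <stdio.h>

int main(void)
{
    double sum  = 1.0;
    double tiny = 1e-17;              /* smaller than half an ulp of 1.0 (~1.1e-16) */
    printf("%.17g\n", sum + tiny);    /* prints 1: the contribution is lost */
    return 0;
}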
source code to this program. Figure 2.2.3 shows the results
obtained from running integralMPI2.
CONCLUSION
While parallel computing holds a lot of promise, we must consider
carefully how it is implemented in order to reap the maximum
performance gain.