Contents
1 Introduction
2 Communication (I)
3 Communication (II)
4 Synchronization (I)
5 Synchronization (II)
6 Consistency and Replication
7 Fault Tolerance
8 Distributed File System
9 Naming
10 Peer-to-Peer Systems
11 Web Services
Chapter 1
Introduction
• Tuesday 18.15-19.00,
• Thursday 14.15-15.00
Course outline:
1. Introduction
2. Communication (I)
3. Communication (II)
4. Synchronization (I)
5. Synchronization (II)
6. Consistency and replication
7. Fault tolerance
8. File systems
9. Naming
10. Peer-to-peer systems
11. Web services
12. Security
Goals:
• transparency,
• scalability.
A distributed system organized as middleware: distributed applications running on top of a middleware service, which runs on top of the network.
• blindly hiding all distribution aspects from users is not always a good idea,
Parallelism transparency
The level of transparency with which a distributed system is supposed to appear to its users as a traditional uniprocessor timesharing system.
[1.7] Openness
Completeness and neutrality of specifications are important factors for interoperability and portability of distributed solutions.
Interoperability
The extent to which two implementations of systems from different manufacturers can cooperate.
Portability
To what extent an application developed for system A can be executed, without modification, on a system B that implements the same interfaces as A.
The difference between letting (a) a server or (b) a client check forms as they are being filled (example entries: FIRST NAME: MAARTEN, LAST NAME: VAN STEEN, E-MAIL: STEEN@CS.VU.NL).
An example of dividing the DNS name space into zones (Z1, Z2): generic domains (int, com, edu, gov, mil, org, net) and country domains (jp, us, nl), down to nodes such as robot and pc24.
Shared-memory multiprocessor organizations: bus-based and switch-based (P: processor, M: memory).
A bus-based multiprocessor.
a. a crossbar switch connecting CPUs and memories,
b. an omega switching network (2^k inputs and an equal number of outputs; log2 N stages, each with N/2 exchange elements).
Homogeneous multicomputer systems: a. a grid, b. a hypercube.
Separating applications from operating system code through a microkernel: the user application and the memory, process, and file modules run in user mode; system calls cross into kernel mode, where the microkernel sits on top of the hardware.
Message passing between a sender and a receiver through sender and receiver buffers across the network, with possible synchronization points S1-S4.
Distributed shared memory:
a. pages of a shared address space distributed among four machines,
b. the situation after CPU 1 references page 10,
c. the situation if page 10 is read-only and replication is used.
Different clients may mount the servers in different places (client 1 sees /games and /work; client 2 sees /private/games and /work).
The general structure of a distributed system as middleware: distributed applications on top of middleware services, on top of the network.
In an open middleware-based distributed system, the middleware layers on different machines speak a common protocol, each running on top of its own network OS.
General interaction between a client and a server: the client sends a request and waits; the server provides the service and returns a reply.
The general organization of an Internet search engine into three different layers: a user-interface level (HTML page containing a keyword expression, HTML generator), a processing level (query generator, ranking component), and a data level (database queries returning a ranked list of page titles).
Alternative client-server organizations: the user interface, application, and database parts can be split between client machine and server machine in different ways, from a client holding only the user interface to a client holding everything but the database server.
Horizontal distribution
A client or server may be physically split into logically equivalent parts, each operating on its own share of the complete data set, thus balancing the load.
An example of horizontal distribution of a Web service: a front end handles incoming requests from the Internet in round-robin fashion, dispatching them to replicated Web servers that each contain the same Web pages on their disks.
Chapter 2
Communication (I)
1. Layered Protocols
2. Remote Procedure Call
3. Remote Object Invocation
4. Message-oriented Communication
5. Stream-oriented Communication
• How many volts should be used to signal a 0-bit, and how many for a
1-bit?
• How does the receiver know which is the last bit of the message?
Protocol
A well-known set of rules and formats to be used for communication between
processes in order to perform a given task.
Layers, interfaces, and protocols in the OSI model: application (7), presentation (6), session (5), transport (4), network (3), data link (2), physical (1), with a peer protocol at each layer across the network.
Physical layer
Contains the specification and implementation of bits, and their transmission
between sender and receiver.
Network layer
Describes how packets in a network of computers are to be routed.
Transport Layer
Provides the actual communication facilities for most distributed systems.
Discussion between a receiver and a sender in the data link layer: events between parties A and B over time.
Network layer:
• IP packets
Transport layer:
• TCP, UDP
a. the normal operation of TCP: a three-way handshake (SYN; SYN, ACK(SYN); ACK(SYN)), followed by the request, the answer, and connection teardown with FIN and ACK(FIN),
b. transactional TCP: the client combines SYN, the request, and FIN in a single segment; the server replies with SYN, ACK(FIN), the answer, and FIN; the client finishes with ACK(FIN).
• NAT.
Middleware invented to provide common services and protocols that can be used
by many different applications:
Example protocols:
An adapted reference model for networked communication: application (6), middleware (5), transport (4), network (3), data link (2), physical (1).
Parameter passing in a local procedure call: the stack (with stack pointer) before and during the call.
• call-by-value,
• call-by-reference,
• call by copy/restore.
a. the original message on a little-endian machine (the integer 5 and the string "JILL"), sent across the network,
b. the message after receipt on a big-endian machine,
c. the message after being inverted. The little numbers in boxes indicate the address of each byte.
Door
A procedure in the address space of a server process that can be called by processes collocated with the server.
[2.22] Doors
The principle of using doors as an IPC mechanism: the client's call passes through the operating system of the same computer; the server calls the local procedure and returns the results.
deferred synchronous RPC – asynchronous RPC combined with a second call, made by the server, to deliver the result,
one-way RPC – the client does not wait for acceptance of the request; a problem with reliability.
The DCE RPC toolchain: uuidgen produces an interface definition file; the IDL compiler generates headers and the client and server stubs; each side is compiled and linked with the runtime library into a client binary and a server binary.
Steps in writing a client and a server in DCE RPC. Let the developer concentrate
only on the client- and server-specific code. Leave the rest for RPC generators
and libraries.
Client-to-server binding in DCE: the server registers its service (2) with the directory server on the directory machine; the client looks up the server (3) at the directory server, then contacts the server machine to obtain the endpoint before doing the RPC.
Common organization of a remote object with a client-side proxy: a marshalled invocation is passed across the network to the server-side skeleton.
Runtime objects
Can be implemented in any language, but require use of an object adapter that
makes the implementation appear as an object.
Transient object lives only by virtue of a server: if the server exits, so will the
object.
Persistent object lives independently from a server: if a server exits, the object's state and code remain (passively) on disk.
Explicit: client must first explicitly bind to object before invoking it.
Passing references as parameters: machine A holds a local object O1 (local reference L1) and a remote reference R1 to the remote object O2 on machine B.
• the client calls the server with two references as parameters, O1 and O2,
to local and remote objects,
Chapter 3
Communication (II)
1. Layered Protocols
2. Remote Procedure Call
3. Remote Object Invocation
4. Message-oriented Communication
5. Stream-oriented Communication
• the server essentially waits only for incoming requests and subsequently processes them.
The general organization of a communication system: application programs on hosts use a messaging interface; communication servers buffer messages independently of the communicating hosts and route them to other (remote) communication servers.
Persistent communication
A message is stored at a communication server as long as it takes to deliver it
at the receiver.
Transient communication
A message is discarded by a communication server as soon as it cannot be
delivered at the next server or at the receiver.
Persistent communication of letters back in the days of the Pony Express: mail is stored and sorted at post offices, to be sent out depending on the destination and on when a pony and rider are available.
Six different forms of communication:
a. persistent asynchronous: A sends a message and continues; the message is stored at B's location for later delivery; B need not be running,
b. persistent synchronous: A sends a message and waits until it is accepted; B need not be running when it is sent,
c. transient asynchronous: A sends a message and continues; the message can be sent only if B is running,
d. receipt-based transient synchronous: A sends the request and waits until it is received by B,
e. delivery-based transient synchronous: A sends the request and waits until it is taken up for processing (B may be running but doing something else),
f. response-based transient synchronous: A sends the request and waits for the reply.
socket
A communication endpoint to which an application writes data that are to be sent over the underlying network, and from which incoming data can be read.
Socket primitives for TCP/IP used by the server: socket, bind, listen, accept, read, write, close.
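As a minimal sketch (not part of the original notes), these primitives map directly onto Python's standard socket module; recv and send play the role of read and write, and the host and port are arbitrary choices:

import socket

def server():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)   # socket
    s.bind(("localhost", 9090))                             # bind
    s.listen(1)                                             # listen
    conn, _ = s.accept()                                    # accept: blocks until a client connects
    request = conn.recv(1024)                               # read
    conn.sendall(b"echo: " + request)                       # write
    conn.close()                                            # close
    s.close()

def client():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)   # socket
    s.connect(("localhost", 9090))                          # connect
    s.sendall(b"hello")                                     # write
    print(s.recv(1024))                                     # read
    s.close()                                               # close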
• the target is to support message transfers that are allowed to take minutes instead of seconds or milliseconds,
Most queuing systems also allow a process to install handlers as callback functions.
Queue-level addressing: the sender looks up the transport-level address of the receiver's queue.
The general organization of a message-queuing system with routers: the sending application (A) puts a message into a send queue; queue managers, acting as routers, forward it to the receive queue of the receiving application (B).
The general organization of a message broker: a broker program sits between source and destination clients in the queuing layer and converts message formats using a database with conversion rules.
• range of applications:
The general organization of IBM's WebSphere MQ: programs access queue managers through the MQ Interface (via stubs); message channel agents (MCAs) connect queue managers across servers, and routing tables direct messages from the sending client's send queue to the receiving client's receive queue.
[3.21] Channels
Setting up a stream:
a. between two processes across a network,
b. directly between two devices (e.g. camera and display).
An example of multicasting a stream to several receivers: from a source via an intermediate node, possibly with filters for lower-bandwidth links, to the sinks.
A flow specification (characteristics of the input and of the service required).
The principle of a token bucket algorithm: an application's irregular stream of data units is shaped into a regular stream.
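A minimal token-bucket sketch (an illustration, not from the notes; rate and capacity are arbitrary parameters):

import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate                  # tokens added per second
        self.capacity = capacity          # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, n=1):
        now = time.monotonic()
        # Refill according to elapsed time, never beyond the bucket capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n              # a data unit passes only if it can take a token
            return True
        return False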
The basic organization of RSVP for resource reservation: an RSVP-enabled host passes reservation requests from other RSVP hosts to admission control in the local OS and data link layer, and sends setup information to other RSVP hosts over the local network.
The principle of explicit synchronization on the level of data units: on the receiver's machine, an application procedure reads two audio data units for each video data unit from the incoming stream.
The principle of synchronization as supported by high-level interfaces: multimedia control is part of the middleware layer, and the application tells the middleware what to do with incoming streams.
An alternative: multiplexing all substreams into a single stream and demultiplexing at the receiver, with synchronization handled at the multiplexing/demultiplexing point (as in MPEG).
Chapter 4
Synchronization (I)
1. Clock synchronization
2. Logical clocks
3. Global state
4. Election algorithms
5. Mutual exclusion
Synchronization
Setting the time order of the set of events caused by concurrent processes.
When each machine has its own clock, an event that occurred after another event
may nevertheless be assigned an earlier time.
[4.3] Timers
• timer,
• registers associated with each crystal:
– counter,
– holding register;
• an interrupt is generated when the counter reaches 0,
• each interrupt corresponds to one clock tick,
• it is impossible to guarantee that two crystals run at exactly the same frequency,
• after getting out of sync, the difference in time values is called clock skew.
Computation of the mean solar day: the earth on day 0 and on day n at the transit of the sun, sighted against a distant galaxy – the period of the earth's rotation is not constant.
Transit of the sun the event of the sun reaching its highest apparent point in the sky.
Solar day the interval between two consecutive transits of the sun.
• mean solar second (300 million years ago a year had about 400 days),
• TAI: based on the number of transitions per second of the cesium 133 atom (very accurate),
• UTC: introduces a leap second from time to time to compensate for the fact that days are getting longer.
NIST operates a shortwave radio station with call letters WWV from Fort Collins
in Colorado (a short pulse at the start of each UTC second). UTC is broadcast
through short wave radio and satellite. Satellites can give an accuracy of about
±0.5 ms.
Does this solve all our problems? Don’t we now have some global timing
mechanism? This timing is still way too coarse for ordering every event.
TAI seconds are of constant length, unlike solar seconds. Leap seconds are introduced when necessary to keep in phase with the sun.
• 86400 TAI seconds is about 3 msec less than a mean solar day,
• UTC – TAI with leap seconds whenever the discrepancy between TAI and
solar time grows to 800 msec.
• every machine has a timer that generates an interrupt H times per second,
• ideally, for each machine p we have C_p(t) = t, or, in other words, dC/dt = 1.
The relation between clock time C and UTC t when clocks tick at different rates: a fast clock has dC/dt > 1, a perfect clock has dC/dt = 1, and a slow clock has dC/dt < 1.
Principle I Every machine asks a time server for the accurate time at least once
every δ/(2ρ) seconds.
Principle II Let the time server scan all machines periodically, calculate an
average, and inform each machine how it should adjust its time relative to
its present time.
• Averaging Algorithms
Getting the current time from a time server (Cristian's algorithm): the client records T0 when sending the request and T1 when the reply carrying C_UTC arrives; I is the server's interrupt handling time.
• the propagation delay is estimated as (T1 − T0)/2, or (T1 − T0 − I)/2 when I is known,
• the message that came back fastest gives the most accurate estimate.
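A sketch of this estimate in code (request_server_time is an assumed callback that returns the server's C_UTC; the interrupt-handling refinement is omitted):

import time

def cristian_sync(request_server_time):
    t0 = time.monotonic()                 # T0: request sent
    server_time = request_server_time()   # C_UTC as reported by the time server
    t1 = time.monotonic()                 # T1: reply received
    delay = (t1 - t0) / 2                 # estimated one-way propagation time
    return server_time + delay            # the value to set the local clock to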
The Berkeley algorithm: 1. the time daemon asks all the other machines for their clock values; 2. the machines answer; 3. the time daemon computes an average and tells each machine how to adjust its clock (e.g. 0, +5).
• decentralized algorithms:
• Internet: the Network Time Protocol (NTP), accuracy in the range of 1-50
msec.
• only internal consistency matters, not whether the clocks are particularly close to real time,
• what usually matters is not that all processes agree on what time is, but
rather that they agree on the order in which events occur,
• Lamport’s algorithm, which synchronizes logical clocks,
• an extension to Lamport’s approach, called vector timestamps.
1. For any two successive events that take place within Pi, Ci is incremented by 1.
2. Each time a message m is sent by Pi, it carries the timestamp Tm = Ci; upon receipt, Pj adjusts its clock to Cj := max{Cj + 1, Tm + 1}.
a. three processes, each with its own clock, ticking at different rates; messages A, B, C, D are exchanged,
b. Lamport's algorithm corrects the clocks: a message's receive time always exceeds its send time.
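A minimal sketch of the two rules as a hypothetical clock class:

class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):                  # rule 1: increment on each local event
        self.time += 1
        return self.time

    def send(self):                  # the timestamp Tm carried by an outgoing message
        return self.tick()

    def receive(self, t_m):          # rule 2: Cj := max{Cj + 1, Tm + 1}
        self.time = max(self.time + 1, t_m + 1)
        return self.time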
Updating a replicated database: without agreed ordering, one replica performs update 1 before update 2 while another performs update 2 before update 1, leaving the replicas inconsistent.
• each message is always timestamped with the current logical time of the sender,
• a received message is put into a local queue, ordered according to its timestamp; the receiver multicasts an acknowledgement to the others,
• Lamport timestamps do not guarantee that if C(a) < C(b), then a indeed happened before b; vector timestamps are required for that.
Example
Given V3 = [0, 2, 2], vt(m) = [1, 3, 0]:
What information does P3 have, and what will it do after receiving m (from P1 )?
• all messages sent by one process received in the same order by each other
process,
Rules
When a message m is sent by process P, it carries a vector timestamp vt(m) built up in the following way:
1. before sending, P increments its own entry: Vp[p] := Vp[p] + 1,
2. vt(m) := Vp.
A message m received from P is delivered to process Q only if the following conditions are met:
1. vt(m)[p] = Vq[p] + 1 (m is the next message Q expects from P),
2. vt(m)[k] ≤ Vq[k] for all k ≠ p (Q has already seen every message that causally precedes m).
Upon delivery, Q updates Vq[k] := max{Vq[k], vt(m)[k]} for all k.
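A sketch of the delivery test under these rules, using 0-based indices for processes P1..Pn:

def can_deliver(V_q, vt_m, sender):
    # Condition 1: m is the next message Q expects from the sender.
    if vt_m[sender] != V_q[sender] + 1:
        return False
    # Condition 2: Q has already seen everything the sender had seen from others.
    return all(vt_m[k] <= V_q[k] for k in range(len(V_q)) if k != sender)

# The example above: P3 holds V3 = [0, 2, 2] and receives m from P1 with
# vt(m) = [1, 3, 0]; delivery is delayed because P3 is still missing a
# message from P2 (3 > 2).
print(can_deliver([0, 2, 2], [1, 3, 0], sender=0))   # -> False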
Goal:
All processes should deliver message m2 only after delivering message m1. If m2 is received first by the transport layer of some process, delivery of m2 must be postponed until m1 has been received and delivered.
Comment:
We should not deliver the message m2 sent by B to process C yet, because at the moment B sent m2, B had already received a message from process A that C has not yet seen. Perhaps that earlier message, received by B but not yet by C, contains something important that C should see before receiving m2. C therefore first has to deliver the message that B had already delivered before it sent m2.
a. a consistent cut,
b. an inconsistent cut: the receipt of m2 is recorded, but the sender of m2 cannot be identified with this cut.
When a process Q receives a marker over an incoming channel C and has:
• not yet recorded its state: it records its local state, and sends the marker along each of its outgoing channels,
• already recorded its state: the marker on C indicates that the channel's state should be recorded: the channel state consists of all messages received on C since the time Q recorded its own state and before the marker.
Organization of a process Q with incoming and outgoing channels for taking a distributed snapshot:
1. Process Q receives a marker for the first time and records its local state.
2. Q subsequently records all incoming messages.
3. Q receives a marker for its incoming channel and finishes recording the state of the incoming channel.
• in many systems the coordinator is chosen by hand (e.g. file servers); this leads to centralized solutions ⇒ a single point of failure,
• election algorithms: the bully algorithm and a ring algorithm.
Each process has an associated priority (weight). The process with the highest
priority should always be elected as the coordinator.
How to find the heaviest process?
The bully election algorithm:
a. process 4 notices that the previous coordinator (7) has crashed and holds an election, sending ELECTION messages to 5 and 6,
b. processes 5 and 6 respond with OK, telling 4 to stop,
c. now 5 and 6 each hold an election,
d. process 6 tells 5 to stop,
e. process 6 wins and announces itself as COORDINATOR to everyone.
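A sketch of one election round from the point of view of a single process (send and alive are assumed primitives; process identifiers double as priorities):

def hold_election(me, processes, send, alive):
    higher = [p for p in processes if p > me]
    if not any(alive(p) for p in higher):
        # Nobody with a higher priority answers: announce victory to all.
        for p in processes:
            if p != me:
                send(p, ("COORDINATOR", me))
        return me
    # Some higher-priority process is alive: it replies OK and takes over
    # the election with its own ELECTION messages.
    for p in higher:
        send(p, ("ELECTION", me))
    return None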
• if a message is passed on, the sender adds itself to the list; when it gets back to the initiator, every process has had a chance to make its presence known,
• the initiator then sends a COORDINATOR message around the ring containing a list of all living processes; the one with the highest priority is elected as coordinator.
Election in a ring: after the previous coordinator (7) has crashed and gives no response, processes 2 and 5 each start an election message that accumulates the list of live processes ([2], [2,3], ..., [5], [5,6], [5,6,0], ...) as it travels around the ring.
Mutual exclusion via a centralized coordinator:
1. Process 1 asks the coordinator for permission to enter a critical region; the queue is empty, so permission is granted (OK).
2. Process 2 then asks permission to enter the same critical region. The coordinator does not reply, and queues the request.
3. When process 1 exits the critical region, it tells the coordinator, which then replies to 2.
• in all other cases, the reply is deferred, implying some more local administration.
Distributed mutual exclusion (Ricart-Agrawala):
1. Two processes (0 and 2) want to enter the same critical region at the same moment; 0 sends requests with timestamp 8, 2 with timestamp 12.
2. Process 0 has the lowest timestamp, so it wins and enters the critical region.
3. When process 0 is done, it sends the deferred OK, so 2 can now enter the critical region.
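A sketch of the request-handling rule of the distributed algorithm (the state names and the reply/defer callbacks are assumptions):

def on_request(state, my_ts, req_ts, reply, defer):
    if state == "RELEASED":
        reply()                 # not interested in the region: answer OK at once
    elif state == "HELD":
        defer()                 # inside the region: queue the request
    else:                       # "WANTED": both want in; the lowest timestamp wins
        if req_ts < my_ts:
            reply()
        else:
            defer()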
a. an unordered group of processes on a network,
b. a logical ring constructed in software, around which a token circulates.
Chapter 5
Synchronization (II)
1. The transaction model
• ACID properties
2. Classification of transactions
• flat transactions,
• nested transactions,
• distributed transactions.
3. Concurrency control
• serializability,
• synchronization techniques
– two-phase locking,
– pessimistic timestamp ordering,
– optimistic timestamp ordering.
Updating a master tape is fault tolerant: the previous inventory and today's updates are mounted as input tapes; the computer writes the new inventory to an output tape.
Atomicity All operations either succeed, or all of them fail. When the transaction fails, the state of the object remains unaffected by the transaction.
Consistency A transaction establishes a valid state transition. This does not exclude the possibility of invalid, intermediate states during the transaction's execution.
Isolation Concurrent transactions do not interfere with each other: to each transaction it appears that the others take place either before it, or after it.
Durability After the execution of a transaction, its effects are made permanent: changes to the state survive failures.
Nested transactions
A hierarchy of transactions that allows (1) concurrent processing of subtransactions, and (2) recovery per subtransaction.
Distributed transactions
A (flat) transaction that is executed on distributed data. Often implemented as a
two-level nested transaction with one subtransaction per node.
• the strength of the atomicity property of a flat transaction also is partly its
weakness,
• difficult scenarios:
1. private workspace
• use a private workspace, by which the client gets its own copy of (part of) the database; when things go wrong, delete the copy, otherwise commit the changes to the original,
• optimization: do not copy everything up front.
2. write-ahead log
• use a write-ahead log in which changes are recorded, allowing you to roll back when things go wrong.
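A write-ahead-log sketch over a plain dictionary (hypothetical; a real implementation would also flush the log to stable storage before applying changes):

class Transaction:
    def __init__(self, db):
        self.db = db
        self.log = []                              # write-ahead log of (key, old value)

    def write(self, key, value):
        self.log.append((key, self.db.get(key)))   # record the old value first...
        self.db[key] = value                       # ...then apply the change

    def rollback(self):
        for key, old in reversed(self.log):        # undo in reverse order
            if old is None:
                del self.db[key]
            else:
                self.db[key] = old
        self.log.clear()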
The private workspace organization:
a. the file index and disk blocks for a three-block file,
b. the situation after a transaction has modified block 0 and appended block 3 (the private workspace holds its own index; free blocks are used for the copies),
c. after committing, the private index replaces the original.
General organization of managers for handling transactions: a transaction manager hands operations to a scheduler, which issues LOCK/RELEASE or timestamp operations towards the data manager.
Concurrent transactions and their possible schedules (some interleavings are serializable, some are not).
Two operations OPER(Ti, x) and OPER(Tj, x) on the same data item x, from a set of logs, may conflict at a data manager:
read-write conflict (rw) one is a read operation while the other is a write operation on x,
write-write conflict (ww) both are write operations on x.
1. Two-phase locking
Before reading or writing a data item, a lock must be obtained. After a
lock is given up, the transaction is not allowed to acquire any more locks.
2. Timestamp ordering
Operations in a transaction are time-stamped, and data managers are forced
to handle operations in timestamp order.
3. Optimistic control
Don’t prevent things from going wrong, but correct the situation if conflicts
actually did happen. Basic assumption: you can pull it off in most cases.
1. When a client submits OPER(Ti, x), the scheduler tests whether it conflicts with an operation OPER(Tj, x) from any other client. If there is no conflict, it grants LOCK(Ti, x), otherwise it delays execution of OPER(Ti, x).
2. If LOCK(Ti, x) has been granted, the lock is not released until OPER(Ti, x) has been executed by the data manager.
3. Once RELEASE(Ti, x) has taken place, no more locks may be granted to Ti.
Two-phase locking: the number of locks held grows until the lock point is reached, after which locks are only released.
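A sketch of the growing/shrinking discipline for a single transaction (the lock manager with acquire/release is an assumed object):

class TwoPhaseTransaction:
    def __init__(self, locks):
        self.locks = locks          # shared lock manager (assumed interface)
        self.shrinking = False      # True once the lock point has been passed

    def lock(self, item):
        # Growing phase only: acquiring after any release violates 2PL.
        assert not self.shrinking, "2PL violated: lock after first release"
        self.locks.acquire(item)

    def release(self, item):
        self.shrinking = True       # lock point passed: shrinking phase begins
        self.locks.release(item)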
Primary 2PL Each data item is assigned a primary site to handle its locks.
Data is not necessarily replicated,
Problems:
• deadlock is possible – handled by ordering lock acquisition, deadlock detection, or a timeout scheme,
• every data item x has a read timestamp ts_RD(x) and a write timestamp ts_WR(x),
• if operations conflict, the data manager processes the one with the lowest timestamp,
• compared to locking (like 2PL): aborts are possible, but the scheme is deadlock free.
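A sketch of the pessimistic rules (item is an assumed record with value, ts_rd, and ts_wr fields; Abort signals that the transaction must restart with a new timestamp):

class Abort(Exception):
    pass

def read(item, ts):
    if ts < item.ts_wr:                        # a younger transaction already wrote x
        raise Abort("too old to read")
    item.ts_rd = max(item.ts_rd, ts)
    return item.value

def write(item, value, ts):
    if ts < item.ts_rd or ts < item.ts_wr:     # a younger transaction read or wrote x
        raise Abort("too old to write")
    item.value = value
    item.ts_wr = ts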
Features:
• deadlock free with maximum parallelism,
• under conditions of heavy load, the probability of failure (and abort) goes
up substantially,
• SET AUTOCOMMIT = {0 | 1}
• MySQL uses table-level locking for MyISAM and MEMORY tables, page-
level locking for BDB tables, and row-level locking for InnoDB tables.
• SAVEPOINT identifier
Description:
• All savepoints of the current transaction are deleted if you execute a COMMIT, or a ROLLBACK that does not name a savepoint.
• SELECT @@global.tx_isolation;
• SELECT @@tx_isolation;
• Suppose that you are running in the default REPEATABLE READ isolation level. When you issue a consistent read (that is, an ordinary SELECT statement), InnoDB gives your transaction a timepoint according to which your query sees the database. If another transaction deletes a row and commits after your timepoint was assigned, you do not see the row as having been deleted. Inserts and updates are treated similarly.
REPEATABLE READ This is the default isolation level of InnoDB. All consistent reads within the same transaction read the snapshot established by the first such read in that transaction. You can get a fresher snapshot for your queries by committing the current transaction and after that issuing new queries.
Chapter 6
Consistency and Replication
1. Introduction
2. Data-centric consistency models
3. Client-centric consistency models
4. Consistency protocols
[6.2] Introduction
Two primary reasons for replicating data: reliability and performance.
• when and how modifications need to be carried out determines the price of replication.
Conflicting operations:
read–write conflict a read operation and a write operation act concurrently,
write–write conflict two concurrent write operations.
Guaranteeing global ordering on conflicting operations may be a costly operation,
downgrading scalability.
The general organization of a logical data store, physically distributed and replicated across multiple processes, each with a local copy.
Consistency model
A contract between a (distributed) data store and processes, in which the data
store specifies precisely what the results of read and write operations are in the
presence of concurrency.
• release consistency,
• entry consistency.
Strict consistency: any read to a shared data item X returns the value stored by the most recent write operation on X.
Four valid execution sequences for the presented processes. The vertical axis is
time.
This sequence is allowed with a causally-consistent store, but not with a sequentially or strictly consistent store.
P1: W(x)a
P2: R(x)a W(x)b W(x)c
P3: R(x)b R(x)a R(x)c
P4: R(x)a R(x)b R(x)c
Statement execution as seen by the three earlier presented processes. The statements in bold are the ones that generate the output shown.
Properties:
Additional issues:
• with release consistency, all local updates are propagated to other copies/servers
during release of shared data.
• with entry consistency, each shared data item is associated with a synchronization variable.
• when acquiring the synchronization variable, the most recent values of its
associated shared data item are fetched.
Note: Where release consistency affects all shared data, entry consistency affects
only those shared data associated with a synchronization variable.
Question: What would be a convenient way of making entry consistency more
or less transparent to programmers?
1. System model
2. Coherence models
• monotonic reads,
• monotonic writes,
• read-your-writes,
• write-follows-reads.
DNS updates are propagated slowly, and inserts may not be immediately visible.
NEWS articles and reactions are pushed and pulled throughout the Internet,
such that reactions can be seen before postings.
WWW caches all over the place, but there need be no guarantee that you are
reading the most recent version of a page.
• at location B you continue your work, but unless you access the same
server as the one at location A, you may detect inconsistencies:
Note: The only thing you really want is that the entries you updated and/or read
at A, are in B the way you left them in A. In that case, the database will appear
to be consistent to you.
Eventual consistency
The read operations performed by a single process P at two different local copies
of the same data store.
Example
Reading (not modifying) incoming mail while you are on the move.
Each time you connect to a different e-mail server, that server fetches
(at least) all the updates from the server you previously visited.
[6.31] Examples
Read-your-writes example
Updating your Web page and guaranteeing that your Web browser
shows the newest version instead of its cached copy.
Writes-follow-reads example
See reactions to posted articles only if you have the original posting
(a read “pulls in” the corresponding write operation).
Consistency protocol
Describes the implementation of a specific consistency model. We will concentrate only on sequential consistency.
• Primary-based protocols
– remote-write protocols,
– local-write protocols.
• Replicated-write protocols
– active replication,
– quorum-based protocols.
Primary-based remote-write protocol with a fixed (single) server to which all read and write operations for item x are forwarded: the write (W1-W4) and the read (R1-R4) are each forwarded to that server and acknowledged.
[6.34] Remote-Write Protocols (2)
The principle of the primary-backup protocol: a write (W1) is forwarded to the primary server for item x, which updates the backup servers (W3) and waits for their acknowledgements (W4) before acknowledging the client (W5); a read (R1, R2) can be handled locally by any server.
Primary-based local-write protocol in which a single copy of item x is migrated from the current server to the new server before operations proceed.
Primary-backup protocol in which the primary migrates to the process wanting to perform an update: the new primary takes over from the old primary and then propagates the updates to the backup servers.
The problem of replicated invocations: object A invokes replicated object B (replicas B1-B3), which in turn invokes C. Using a coordinator per replicated object: a. the coordinator of B forwards the invocation request to C only once, b. the coordinator of C returns a single result to B.
Three examples of the quorum-based voting algorithm with N = 12 replicas (A-L):
a. NR = 3, NW = 10: a correct choice of read and write quorum,
b. NR = 7, NW = 6: a choice that may lead to write-write conflicts (NW ≤ N/2),
c. NR = 1, NW = 12: a correct choice, known as ROWA (read one, write all).
The constraints are NR + NW > N and NW > N/2.
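The two constraints can be checked mechanically; a small sketch over the three examples:

def valid_quorums(n_r, n_w, n):
    overlap_rw = n_r + n_w > n     # every read quorum meets every write quorum
    overlap_ww = n_w > n / 2       # any two write quorums overlap
    return overlap_rw and overlap_ww

for n_r, n_w in [(3, 10), (7, 6), (1, 12)]:        # cases a, b, c with N = 12
    print(n_r, n_w, valid_quorums(n_r, n_w, 12))   # True, False, True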
• coherence enforcement strategy – how caches are kept consistent with the copies stored at servers.
Chapter 7
Fault Tolerance
1. Introduction
2. Process resilience
3. Reliable communication
4. Distributed commit
[7.2] Dependability
A component provides services to clients. To provide services, the component
may require the services from other components ⇒ a component may depend
on some other component.
Dependability
A component C depends on C∗ if the correctness of C’s behavior depends on
the correctness of C∗’s behavior.
Properties of dependability:
• fault tolerance: build a component in such a way that it can meet its
specifications in the presence of faults (i.e., mask the presence of faults),
• fault forecasting: estimate the present number, future incidence, and the
consequences of faults.
Different types of failures. Crash failures are the least severe, arbitrary failures
are the worst.
Triple modular redundancy: a. a circuit of components A, B, and C without redundancy; b. each component is triplicated, with voters (V1-V9) after each stage.
a. communication in a flat group, b. communication in a simple hierarchical group with a coordinator and workers.
Group tolerance
When a group can mask any k concurrent member failures, it is said to be k-fault
tolerant (k is called degree of fault tolerance).
Assume that all members are identical and process all input in the same order. How large does a k-fault-tolerant group need to be?
Assumption: Group members are not identical, i.e., we have a distributed computation.
Problem: Nonfaulty group members should reach agreement on the same value.
We are trying to reach a majority vote among the group of loyalists, in the
presence of k traitors ⇒ we need 2k+1 loyalists. This is also known as Byzantine
failures.
The Byzantine agreement problem for four processes, of which one (process 3) is faulty:
a. each process sends its value to the others; the faulty process sends arbitrary values x, y, z,
b. the vectors that each process assembles from the values received, e.g. 1 Got (1, 2, y, 4),
c. the vectors that each process receives when every process passes its vector to the others; a majority vote lets the nonfaulty processes agree on the values 1, 2, and 4.
The same as before, except now with two loyal generals and one traitor: with only three processes, of which one is faulty, no agreement can be reached.
Classes of failures in RPC systems:
1. the client is unable to locate the server,
2. the request message to the server is lost,
3. the server crashes after receiving a request,
4. the reply message from the server is lost,
5. the client crashes after sending a request.
Notes:
3: server crashes are harder, as no one knows what the server had already done.
A server crash: (a) the normal case, (b) a crash after execution of the operation, (c) a crash before execution.
4: Detecting lost replies can be hard, because it can also be that the server had
crashed. You don’t know whether the server has carried out the operation.
Possible solution: None, except that one can try to make your operations
idempotent – repeatable without any harm done if it happened to be
carried out before.
5: Problem: The server is doing work and holding resources for nothing
(called doing an orphan computation).
Possible solutions:
Reliable multicasting with a history buffer: the sender keeps message M25 buffered; each receiver tracks the last message received (Last = 24, or Last = 23 for the receiver that missed message #24), so the missing message can be retransmitted.
A simple solution to reliable multicasting when all receivers are known and are
assumed not to fail: (a) message transmission and (b) reporting feedback.
Idea: Let a process P suppress its own feedback when it notices another
process Q is already asking for a retransmission.
Assumptions:
• a random schedule is needed to ensure that only one feedback message is eventually sent.
Feedback suppression: several receivers schedule a NACK, but the first retransmission request multicast over the network suppresses the others.
The essence of hierarchical reliable multicasting: each local-area network has a coordinator (C); the sender's coordinator acts as the root and forwards messages over (long-haul) connections to the other coordinators, which handle retransmissions for their receivers (R).
The principle of virtual synchrony: the group view G = {P1,P2,P3,P4} changes to {P1,P2,P4} when P3 crashes, and back to {P1,P2,P3,P4} when it rejoins; message delivery is consistent with these views.
The logical organization of a distributed system, distinguishing between a message being received (by the communication layer) and a message being delivered (to the application).
1. for each consistent state, there is a unique view on which all its members
agree. Note: implies that all non-faulty processes see all view changes in
the same order,
a. process 4 notices that process 7 has crashed and sends a view change.
b. process 6 sends out all its unstable messages, followed by a flush message.
c. process 6 installs the new view when it has received a flush message from
everyone else.
Phase 1a Coordinator sends VOTE-REQUEST to all participants.
Phase 1b Each participant votes YES or NO.
Phase 2a Coordinator collects all votes; if all are YES, it sends COMMIT to all participants, otherwise it sends ABORT.
Phase 2b Each participant waits for COMMIT or ABORT and handles accordingly.
The finite state machines for two-phase commit:
a. the coordinator: INIT → (commit / vote-request) → WAIT; then → ABORT on vote-abort (sending global-abort) or → COMMIT on vote-commit (sending global-commit),
b. a participant: INIT → (vote-request / vote-commit) → READY, or → ABORT on vote-abort; READY → ABORT on global-abort (ACK) or → COMMIT on global-commit (ACK).
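A sketch of the coordinator side of 2PC (send and wait_vote are assumed primitives; timeouts and the logging of decisions are omitted):

def two_phase_commit(participants, send, wait_vote):
    # Phase 1: solicit votes.
    for p in participants:
        send(p, "VOTE-REQUEST")
    votes = [wait_vote(p) for p in participants]   # blocks on each participant
    # Phase 2: commit only if every participant voted YES.
    decision = "GLOBAL-COMMIT" if all(v == "YES" for v in votes) else "GLOBAL-ABORT"
    for p in participants:
        send(p, decision)
    return decision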
abort state merely make entry into abort state idempotent, e.g., removing the
workspace of results,
commit state also make entry into commit state idempotent, e.g., copying workspace
to storage.
Result: If all participants are in the READY state, the protocol blocks; apparently, the coordinator has failed.
Note: not often applied in practice as the conditions under which 2PC blocks
rarely occur.
[7.35] Three-Phase Commit (2)
The finite state machines for three-phase commit:
a. the coordinator: INIT → WAIT → PRECOMMIT (on vote-commit, sending prepare-commit) → COMMIT (on ready-commit, sending global-commit), with ABORT on vote-abort,
b. a participant: INIT → READY → PRECOMMIT (on prepare-commit, answering ready-commit) → COMMIT (on global-commit, ACK), with ABORT on global-abort.
Chapter 8
Distributed File System
a. the remote access model: requests from the client to access the remote file are handled while the file stays on the server,
b. the upload/download model: 1. the file is moved to the client, 2. accesses are done on the client, 3. when the client is done, the file is returned to the server.
• version 4
[8.7] Communication
a. reading data from a file in NFS version 3: separate open and read operations, each a round trip to the server,
b. reading data using a compound procedure in NFS version 4: open and read combined in a single round trip.
• NFS version 3:
• NFS version 4:
Mounting (part of) a remote file system in NFS: a client mounts a directory exported by the server into its own name space; an exported directory may itself contain an imported subdirectory.
Mounting nested directories from multiple servers in NFS: the client imports directory draw from server A, and server A imports directory install from server B; the client needs to explicitly import that subdirectory from server B.
A simple automounter for NFS: 1. a lookup of /home/alice is intercepted by the automounter, which 2. creates the subdirectory alice and 3. issues a mount request to the NFS server.
Using symbolic links (e.g. /tmp_mnt/home/alice) avoids passing every file operation through the automounter.
Some general mandatory (a) and recommended (b) file attributes in NFS.
Moreover one may have named attributes – an array of pairs (attribute, value).
Semantics of file sharing:
a. on a single machine, when a read follows a write (process B writes "c" to the file containing "ab"), the value returned by the read is the value just written ("abc"),
b. in a distributed system with client caching, a read by process A on client machine #1 may return an obsolete value ("ab") even after process B on client machine #2 has written "c" to the original file on the file server.
• lock failed ⇒
Client-side caching in NFS: a memory cache and a disk cache on the client machine, consulted before going over the network.
The duplicate-request cache at the server, using transaction identifiers (XID = 1234) to recognize retransmissions:
a. the request is still in progress when the retransmission arrives,
b. the reply has just been returned,
c. the reply was returned some time ago, but was lost.
• grace period:
[8.21] Security
The NFS security architecture: secure RPC between client and server, with access control on each side.
• system authentication,
• Kerberos.
Secure RPC in NFS version 4: both sides use RPCSEC_GSS, built on the GSS-API, with security mechanisms such as Kerberos and LIPKEY (and others) plugged in underneath, communicating across the network.
The various kinds of users and processes distinguished by NFS with respect to
access control.
• both Vice file server processes and Venus processes run as user-level processes,
The internal organization of AFS/Coda: Virtue client machines get transparent access to Vice file servers.
The internal organization of a Virtue workstation: Venus sits below the virtual file system layer, next to the local file system interface, and talks to Vice servers through an RPC client stub on top of the local OS.
Side effects in Coda's RPC2 system: an application-specific protocol runs between a client-side and a server-side "side effect" alongside the RPC between client application and server.
a. sending invalidation messages one at a time, b. sending invalidation messages in parallel.
[8.31] Naming
Clients in Coda have access to a single shared name space (e.g. /bin, /pkg).
• volumes,
• file identifiers,
The implementation and resolution of a Coda file identifier: the RVID is mapped through the volume replication database to volume identifiers (VID1, VID2); each VID locates, via a server, the file handle on that server.
The transactional behavior in sharing files in Coda: a client that has file f open in session S_A keeps working on its own copy, even while another client opens f for writing in session S_B; updates become visible only at Close.
• an update from a client is accepted only when it leads to the next version of a file,
• when a conflict occurs, the updates from the client's session are undone and the client is forced to save its local version of the file for manual reconciliation,
• callback promise,
• callback break.
The use of local copies when opening a session in Coda: client A opens file f for reading and obtains a callback promise from the server; when client B opens the file for writing and closes its session S_B, the server invalidates A's copy (callback break), so A fetches f again when opening its next session.
Two clients with a different accessible volume storage group (AVSG) for the same replicated file, e.g. one client reaching server S1 and the other S3.
• anyway no guarantee.
The state-transition diagram of a Coda client with respect to a volume: HOARDING → (disconnection) → EMULATION → (reconnection) → REINTEGRATION → (reintegration completed) → HOARDING.
• http://www.coda.cs.cmu.edu/
[8.42] Plan 9
• bringing back the idea of having a few centralized servers and numerous
client machines,
• the Internet Link (IL) reliable datagram protocol is used on a LAN, TCP across a WAN.
General organization of Plan 9: file servers NS1, NS2, and NS3 connected to the Internet; a client may mount several of them (here NS1 and NS2) into a single name space.
[8.44] Communication
• opening a telnet connection requires writing a special string, e.g. "connect 192.31.231.42!23", to the ctl file.
[8.45] Processes
The Plan 9 file server's storage hierarchy: an in-memory cache, a disk cache, and a write-once read-many (WORM) device holding the permanent file store.
• multiple name spaces can be mounted at the same mount point, leading to a union directory,
[8.47] Naming
A union directory in Plan 9: file systems FS_A and FS_B mounted at the same mount point /remote merge their directories (/home, /usr, /bin, /src, /lib).
• http://cm.bell-labs.com/plan9/
• http://www.vitanuova.com/inferno/
Chapter 9
Naming
• to share resources,
• to refer to locations.
• the address of an entity's access point is simply called an address of the entity,
• if an entity offers more than one access point, it is not clear which address to use as a reference,
Remarks:
• absolute path name (starts with root) vs. relative path name,
A general naming graph with a single root node n0: directory nodes (n0, n1, n5) store (label, node identifier) pairs such as n2: "elke", n3: "max", n4: "steen"; leaf nodes hold the entities (e.g. .twmrc, mbox). The path names /keys and /home/steen/keys resolve to the same node, and /home/steen/mbox names the leaf mbox.
• a name lookup returns the identifier of a node from where the name resolution process continues,
– Unix file system: the inode of the root directory is the first inode in
the logical disk,
– "000312345654" is not meaningful as a plain string, but is recognizable as a phone number,
The concept of a symbolic link explained in a naming graph: the leaf node n6, reached via /home/steen/keys, stores the absolute path name "/keys"; resolution then continues at the root with that name.
Remarks:
• NFS as an example.
Mounting remote name spaces through a specific process protocol: the node remote/vu stores the URL nfs://flits.cs.vu.nl//home/steen, a reference to a foreign name space; resolution of e.g. mbox continues in the remote server's /home/steen directory.
• in DEC's GNS (Global Name Service) a new root is added, making all existing root nodes its children,
• names in GNS always (implicitly) include the identifier of the node from where resolution should normally start,
• hidden expansion,
The organization of a GNS root node m0 above the existing name spaces NS1 (root n0) and NS2: a name such as m0:/mbox includes the identifier of the node where resolution starts.
– global layer,
– administrational layer,
– managerial layer.
An example partitioning of the DNS name space into a global layer (com, edu, gov, mil, org, net, jp, us, nl, ...), an administrational layer, and a managerial layer, with zones at different levels (down to nodes such as robot, pc24, and pub/globe/index.txt).
• ftp://ftp.cs.vu.nl/pub/globe/index.txt
• a name resolver hands the complete name to the root name server, but the root resolves only nl and returns the address of the associated name server (iterative resolution),
• alternatively, a name server passes the remaining name on to the next name server it finds (recursive resolution),
The principle of iterative name resolution of <nl,vu,cs,ftp>: the client's name resolver sends the full name to the root name server (1), which returns #<nl> plus the remaining name <vu,cs,ftp> (2); the resolver then asks the nl name server (3, 4), the vu name server (5, 6), and the cs name server (7, 8) in turn, finally obtaining #<ftp>. Nodes managed by the same server are resolved in one step.
Recursive name resolution of <nl,vu,cs,ftp>: the client sends the full name to the root name server (1), which forwards <vu,cs,ftp> to the nl name server (2), and so on down; the result #<nl,vu,cs,ftp> propagates back up (8). Name servers cache intermediate results for subsequent lookups.
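A sketch contrasting the two schemes (resolve is an assumed per-server operation returning the next name server, or the looked-up node for the final label):

def iterative_resolve(root, labels):
    server = root
    for label in labels:                 # e.g. ["nl", "vu", "cs", "ftp"]
        server = server.resolve(label)   # the client contacts each server itself
    return server

def recursive_resolve(server, labels):
    if not labels:
        return server
    child = server.resolve(labels[0])
    # In true recursive resolution this recursion happens inside the name
    # server, not on the client as it does here.
    return recursive_resolve(child, labels[1:])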
The comparison between recursive and iterative name resolution with respect to (long-distance) communication costs.
The most important types of resource records forming the contents of nodes in
the DNS name space.
Part of the description for the vu.nl domain which contains the cs.vu.nl domain.
[9.22] X.500
• DIT forms a naming graph in which each node represents a directory entry,
Part of the X.500 directory information tree: C = NL → O = Vrije Universiteit → ... → CN = Main server, with entries Host_Name = star and Host_Name = zephyr.
• /.../ENG.IBM.COM.US/nancy/letters/to/lucy
• /.../Country=US/OrgType=COM/OrgName=IBM/
Dept=ENG/nancy/letters/to/lucy
• /.:/nancy/letters/to/lucy
[9.27] LDAP
• LDAP contains:
• http://www.openldap.org
• human-friendly names,
• identifiers,
• addresses.
• recording the name of the new machine in the DNS for cs.vu.nl but,
a. a single-level mapping from names directly to addresses via a naming service,
b. a two-level mapping: the naming service maps a name to an entity ID, and a location service maps the entity ID to its (possibly many) addresses.
1. simple solutions
2. home-based approaches,
3. hierarchical approaches.
The organization of proxies and skeletons for a remote object: a proxy p in process P1 sends an invocation request via interprocess communication to the skeleton, which performs a local invocation on the object in process P4; identical skeletons may exist, and a skeleton no longer referenced by any proxy can be reclaimed.
The principle of Mobile IP as a home-based approach: 1. the client sends a packet to the host at its home location; 2. the home location returns the address of the host's current location; 3. subsequent packets are tunneled to the current location.
Hierarchical organization of a location service into domains, each having an associated directory node dir(S); a subdomain S is contained in a top-level domain T.
Looking up a location in a hierarchically organized location service: a look-up request for entity E starts in the leaf domain D; a node that has no record for E forwards the request to its parent, until a node that knows about E forwards it down to the appropriate child (domains D1, D2); a leaf location record has only one field, containing an address.
Inserting a location record for entity E:
1. An insert request from domain D is forwarded upward until it reaches the first node that knows about entity E (nodes without a record forward it to their parent).
2. A chain of pointers is then created downward: each node creates a record storing a pointer to its child, and the leaf node stores the address.
Chapter 10
Peer-to-Peer Systems
• the scalability of standard services is limited when all the hosts must be owned and managed by a single service provider,
• administration and fault recovery costs tend to dominate.
– algorithms for data placement across many hosts and subsequent access to it,
– key issues of these algorithms: workload balancing, ensuring availability without adding undue overheads.
• usage for objects with dynamic state more challenging, usually addressed
by addition of trusted servers for session management and identification.
• Scale:
IP: IPv4 is limited to 2^32 addressable nodes (IPv6 to 2^128); addresses are hierarchically structured and much of the space is preallocated according to administrative requirements.
OR: The GUID name space is very large and flat (> 2^128), allowing it to be much more fully occupied.
• Load balancing:
• Fault tolerance:
• Target identification:
IP: Addressing is only secure when all nodes are trusted. Anonymity for
the owners of addresses is not achievable.
OR: Security can be achieved even in environments with limited trust. A
limited degree of anonymity can be provided.
• work with the first personal computers at Xerox PARC showed the feasibility of performing loosely-coupled compute-intensive tasks by running background processes on about 100 computers linked by a local network,
• climate prediction.
Grid projects – distributed platforms that support data sharing and the coordination of computation between participating computers on a large scale. Resources are located in different organizations and are supported by heterogeneous computer hardware, operating systems, programming languages and applications.
Napster: P2P file sharing with a centralized, replicated index. 1. a client sends a file location request to a Napster index server; 2. the server returns a list of peers offering the file; 3. the client requests the file from a peer; 4. the file is delivered; 5. the client sends an index update, since clients are expected to add their own files to the pool of shared resources.
• global scalability,
Prefix routing - narrowing the search for the next node along the route by
applying a binary mask that selects an increasing number of hexadecimal digits
from the destination GUID after each hop.
put(GUID, data)
The data is stored in replicas at all nodes responsible for the object identified by
GUID.
remove(GUID)
Deletes all references to GUID and the associated data.
value = get(GUID)
The data associated with GUID is retrieved from one of the nodes responsible
for it.
publish(GUID)
GUID can be computed from the object (or some part of it, e.g. its name). This
function makes the node performing a publish operation the host for the object
corresponding to GUID.
unpublish(GUID)
Makes the object corresponding to GUID inaccessible.
sendToObj(msg, GUID, [n])
Following the object-oriented paradigm, an invocation message is sent to an
object in order to access it. This might be a request to open a TCP connection
for data transfer or to return a message containing all or part of the object’s state.
The final optional parameter [n], if present, requests the delivery of the same
message to n replicas of the object.
Basic programming interface for distributed object location and routing (DOLR)
as implemented by Tapestry.
• when data is submitted to be stored with its GUID, the DHT layer takes responsibility for choosing a location, storing it (with replicas), and providing access,
• a data item with GUID X is stored at the node whose GUID is numerically closest to X, and moreover at the r hosts with GUIDs numerically closest to it, where r is a replication factor chosen to ensure high availability.
DOLR:
• locations for the replicas of data objects decided outside the routing layer,
• host address of each replica notified to DOLR using the publish() operation.
• tracker – server that keeps track of which seeds and peers are in the
swarm, not directly involved in the data transfer, does not have copies of
data files.
• each active node stores a leaf set – a vector L (of size 2l) containing the GUIDs and IP addresses of the nodes whose GUIDs are numerically closest on either side of its own (l above and l below),
Black depicts live nodes. The space is circular: node 0 is adjacent to node (2^128 − 1). The diagram illustrates the routing of a message from node 65A1FC to D46A1C using leaf set information alone, assuming leaf sets of size 8 (l = 4; in Pastry l is usually 8). This is a degenerate form of routing that would scale very poorly; it is not used in practice.
[10.23] Pastry Routing
• new nodes use a joining protocol: they compute a suitable GUID (typically by applying SHA-1 to the node's public key) and then make contact with a nearby (in network distance) Pastry node.
First four rows of a Pastry routing table located in a node whose GUID begins
with 65A1.
• each entry represents a [GUID, IP address] pair specifying the next hop to be taken by messages addressed to GUIDs that match the given prefix,
• grey-shaded entries indicate that the prefix matches the current GUID up to the given value of p: the next row down or the leaf set should be examined to find a route,
• although there is a maximum of 128 rows in the table, only log16 N rows will be populated on average in a network with N active nodes.
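A sketch of Pastry's routing decision (GUIDs as fixed-length hex strings; leaf_set, table, and forward are assumptions, and the real algorithm additionally handles ties and repair):

def route(msg, D, A, leaf_set, table, forward):
    def dist(g):                          # numeric distance between GUIDs
        return abs(int(g, 16) - int(D, 16))
    if leaf_set and min(leaf_set) <= D <= max(leaf_set):
        # D falls within the leaf set: deliver to the numerically closest
        # node (possibly the current node A itself).
        forward(min(leaf_set + [A], key=dist), msg)
        return
    # Find p, the length of the longest common prefix of D and A, and i,
    # the (p+1)th hexadecimal digit of D.
    p = next(k for k in range(len(D)) if D[k] != A[k])
    i = int(D[p], 16)
    if table[p][i] is not None:
        forward(table[p][i], msg)         # a node sharing a longer prefix with D
    else:
        # No entry: fall back to any known node sharing a prefix of length p
        # that is numerically closer to D than A is.
        closer = [g for g in leaf_set if g[:p] == D[:p] and dist(g) < dist(A)]
        forward(closer[0], msg)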
Routing a message from node 65A1FC to D46A1C: with the aid of a well-populated routing table the message can be delivered in log16(N) hops.
• nodes may fail or depart without warning, node considered failed when its
immediate neighbours (in GUID space) can no longer communicate with
it,
• to repair leaf set, the node looks for a live node close to the failed one
and requests a copy of its leaf set (one value to replace),
[10.28] Tapestry
• 160-bit identifiers used to refer both to objects and to nodes that perform
routing actions,
• for any resource with GUID G there is a unique root node whose GUID R_G is numerically closest to G,
• on receipt of a publish message, R_G enters the mapping (G, IP_H) between G and the publishing host's IP address into its routing table; the same mapping is cached along the publication path.
Replicas of the file Phil's Books (G = 4378), hosted at nodes 4228 and AA93. Node 4377 is the root node for object 4378. The routings shown are some of the entries in routing tables. The location mappings (cached while servicing publish messages) are subsequently used to route messages sent to 4378.
• Squirrel: a P2P web caching service for use in local networks, developed by the authors of Pastry,
• a conditional GET (cGET) request is issued to the next level for validation,
• the SHA-1 hash function is applied to the URL of each cached object to produce a 128-bit Pastry GUID; the GUID is not used to validate content (see the sketch below),
• Squirrel routes a Get or a cGet request via Pastry to the object's home node.
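A sketch of the GUID computation with Python's standard hashlib (truncating SHA-1's 160 bits to the 128 bits used here; the URL is a hypothetical example):

import hashlib

def guid(url: str) -> str:
    digest = hashlib.sha1(url.encode()).digest()
    return digest[:16].hex().upper()      # 128-bit GUID as a hex string

print(guid("http://www.cs.vu.nl/index.html"))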
Evaluation, two real working environments within Microsoft, 105 active clients
(Cambridge), 36000 active clients (Redmond):
• goal: a very large scale, scalable persistent storage facility for mutable data objects, with long-term persistence and reliability in an environment of changing network and computing resources,
• data is stored in a set of blocks; data blocks are organized and accessed through a metadata block called the root block,
Version i + 1 has been updated in blocks d1, d2 and d3. The certificate and the
root blocks include some data not shown. All unlabelled arrows are BGUIDs.
Times in seconds to run different phases of the Andrew benchmark. (1) recursive
subdirectory creation, (2) source tree copying, (3) status only examining of all
the files in the tree, (4) every data byte examining in all the files, (5) compiling
and linking the files.
• stores the state of files as logs of the file update requests issued by Ivy
clients,
• version vectors to impose a total order on log entries when reading from
multiple logs,
• shared file system seen as a result of merging all the updates performed
by (dynamically selected – views) set of participants,
• the ability to support large numbers of clients and hosts, with adequate balancing of the loads on network links and host computer resources,
Chapter 11
Web Services
• support for Unicode, allowing almost any information in any human language to be communicated,
• the strict syntax and parsing requirements that allow the necessary parsing
algorithms to remain simple, efficient, and consistent.
DTD Document Type Definition, inherited from SGML, included in the XML
1.0 standard,
XSD XML Schema Definition, schema with rich datatyping system and XML
syntax,
• Other systems interact with the Web service in a manner prescribed by its
description using SOAP messages, typically conveyed using HTTP with
an XML serialization in conjunction with other Web-related standards.
• like CORBA and Java, the interface of web services can be described in an
IDL. But for web services, additional information including the encoding
and communication protocols in use and the service location need to be
described,
XML All data to be exchanged is formatted with XML tags. The encoded
message may conform to a messaging standard such as SOAP or the older
XML-RPC. The XML-RPC scheme calls functions remotely, whilst SOAP
favours a more modern (object-oriented) approach based on the Command
pattern.
UDDI protocol for publishing the web service information. Enables applications
to look up web services information in order to determine whether to use
them.
Web Services Protocol Stack standards and protocols used to consume a web
service.
Common protocols protocols for data transport such as HTTP, FTP and SMTP.
The web services infrastructure and components: applications built on a directory service, security, and choreography, all layered on SOAP.
• SOAP protocol specifies the rules for using XML to package messages,
for example to support a request-reply protocol,
• SOAP used to encapsulate these messages and transmit them over HTTP
or another protocol,
• Web services do not provide means for coordinating their operations with
one another.
Envelope top level root element of a SOAP message, which contains the header
and body element.
Header a collection of zero or more SOAP header blocks each of which might
be targeted at any SOAP receiver within the SOAP message path.
• the rules for how the recipients of messages should process the XML elements that they contain,
• how HTTP and SMTP should be used to communicate SOAP messages.
It is expected that future versions of the specification will define how to
use other transport protocols, for example, TCP.
Structure of a SOAP request message: an envelope containing a header and a body; within env:body, the element m:exchange (with xmlns:m = namespace URI of the service description) carries the arguments m:arg1 = Hello and m:arg2 = World.
Structure of the corresponding SOAP reply: within env:body, the element m:exchangeResponse (with xmlns:m = namespace URI of the service description) carries the results m:res1 = World and m:res2 = Hello, closing with </env:Envelope>.
REST – (common meaning:) any simple web-based interface that uses XML
and HTTP without the extra abstractions of MEP-based approaches like the
web services SOAP protocol. It is possible to design web service systems in
accordance with Fielding’s REST architectural style (RESTful systems).
REST is an architectural style and not a standard.
• wscompile and wsdeploy to generate the skeleton class and the service
description (in WSDL),
• a client program may use static proxies, dynamic proxies, or a dynamic invocation interface.
• supported operations and messages are described abstractly, and then bound
to a concrete network protocol and message format.
The main elements in a WSDL description: definitions containing types, message, and interface (the abstract part), and bindings and services (the concrete part).
[11.21] WSDL Example: WSDL request and reply messages for the newShape operation.