
A Simple Optimal Parallel Algorithm for Reporting

Paths in a Tree
Andrzej Lingas
Department of Computer Science
Lund University
Box 118, Lund, S-22 100, Sweden
Andrzej.Lingas@dna.lth.se
Anil Maheshwari
Computer Systems and Communications Group
Tata Institute of Fundamental Research
Homi Bhabha Road
Bombay - 400 005, India
manil@tifrvax.tifr.res.in

Running title: Parallel Algorithm for Reporting Paths in a Tree.

Mailing address:
Andrzej Lingas
Department of Computer Science
Lund University
Box 118, Lund, S-22 100, Sweden.
Andrzej.Lingas@dna.lth.se

Abstract

We present optimal parallel solutions to reporting paths between pairs of nodes in an n-node tree. Our algorithms are deterministic and designed to run on an exclusive read exclusive write parallel random-access machine (EREW PRAM). In particular, we provide a simple optimal parallel algorithm for preprocessing the input tree such that path queries can be answered efficiently. Our preprocessing algorithm runs in O(log n) time using O(n/log n) processors. Using the preprocessing, we can report paths between k node pairs in O(log n + log k) time using O(k + (n + S)/log n) processors on an EREW PRAM, where S is the size of the output. In particular, we can report the path between a single pair of distinct nodes in O(log n) time using O(L/log n) processors, where L denotes the length of the path.

1 Introduction
In many problems in computer science, the underlying structure is a tree. For several fundamental algorithmic problems on trees, optimal parallel algorithms are known. Examples include computing the Euler tour and several functions on trees [13], evaluating expression trees [2], finding lowest common ancestors [11], and maintaining dictionaries on 2-3 trees [10].
Enumerating paths between pairs of nodes is one of the fundamental operations on trees. For instance, a request for reporting a path from a vertex in a connected graph to another vertex in the graph can be reduced to the corresponding path query on any spanning tree of the graph [12]. Analogously, a request for a shortest path from a distinguished vertex v of a graph or a geometric structure to another vertex reduces to the corresponding query in the tree of shortest paths rooted at v [6]. Also, in computational geometry, dominance problems require enumerating paths between pairs of nodes as one of their basic routines [3, 5, 7]. In these problems a range-range priority search tree is computed over the given planar point set. The dominance-related problems are solved by computing paths from many leaves to the root of the search tree and carrying out certain computations at each internal node on the path. In [4], Djidjev et al. used path queries for trees as subroutines in their parallel solutions to shortest path queries for planar digraphs. After an O(log n)-time and O(n/log n)-processor preprocessing on an EREW PRAM, they can report a path of length L in the n-node input tree in O(L) time using a single processor.
In this paper we present new results on path queries in trees which yield substantial improvements over the aforementioned results. We provide a simple optimal (in the time-processor product sense, see [8, 9]) logarithmic-time algorithm for preprocessing the input tree such that path queries can be answered efficiently. Using the preprocessing, we can first reduce the time of reporting a path of length L to O(log n) by using the optimal number of EREW PRAM processors, i.e., O(L/log n). Our method avoids applying the standard concurrent-read pointer-jumping technique. This enables us to derive our main result on parallel path queries in the EREW PRAM model: we answer k path queries in parallel in O(log n + log k) time using O(k + (n + S)/log n) processors on an EREW PRAM, where S is the size of the output.
In our algorithms, we assume the so-called adjacency-list representation of the input tree. The paths are reported in an array.
Our algorithms use an output-sensitive number of processors. The output size S is not known in advance. We compute the size of the output on the fly and allocate the required number of processors in the spirit of [5]. The processors are `spawned' depending on the output size. When new processors are allocated, a global array of pointers is created. A processor can determine the exact location from where it should start working by accessing this array. For the details of this computation model, see Goodrich [5].
In our optimal parallel solutions, we use the following tools previously developed in the area of parallel computing: parallel merge sort [1], list ranking and doubling, parallel
prefix sums (e.g., see [8, 9]), tree operations including the Euler tour technique [13], lowest common ancestor in a tree [11], and searching in search trees [10].
The remainder of the paper is organized as follows. In Section 2, we prove several
basic technical lemmas. In Section 3 we describe and analyze the tree preprocessing step
and then provide the optimal algorithms for reporting paths relying on the preprocessing.
In Section 4 we conclude with a few open problems.

2 Preliminaries
Let T denote an n-node rooted binary tree, and assume that T is represented in such a way that the Euler tour of T can be computed in O(log n) time using O(n/log n) processors [13] (e.g., the adjacency-list representation). Let size(v), postorder(v) and level(v) respectively denote the number of vertices in the subtree rooted at v, the number of v in the postorder traversal of T, and the level number of v, where v is a node of T. To facilitate our discussion, we prove the following lemmas.
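
As a fixed point of reference for the notation, the following is a minimal sequential sketch that computes size(v), postorder(v) and level(v) for a tree given by child lists; the names `children` and `root` and the convention that the root is at level 0 are illustrative assumptions, and the paper of course computes these quantities in parallel via the Euler tour technique [13].

```python
# Sequential sketch only: size(v), postorder(v), level(v) for a rooted tree.
# Assumes child lists; root at level 0 and postorder numbers 1..n are
# conventions chosen here for illustration.
def annotate(root, children):
    size, postorder, level = {}, {}, {}
    counter = 0

    def dfs(v, depth):
        nonlocal counter
        level[v] = depth
        s = 1
        for c in children.get(v, []):
            s += dfs(c, depth + 1)
        counter += 1
        postorder[v] = counter
        size[v] = s
        return s

    dfs(root, 0)
    return size, postorder, level
```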

Lemma 2.1 Let A be an array of O(log n) elements. We can make k copies of A in O(log n + log k) time using O(k) processors on an EREW PRAM.

Proof: The proof uses the technique of pipelining. First build a balanced binary tree on k nodes and assume that there is a processor at each node of the tree. Copy the first element of the input array A into the zeroth level, i.e., the root of the tree. In the next step this element is copied into every node on the first level of the tree, and so on. Thus, after O(log k) stages there are k copies of this element and each leaf of the tree contains one copy. While the first element of A is being copied to the nodes on the first level, we can initiate the process for the second element of A. Clearly, after O(log n + log k) stages, a copy of each element of A resides in every leaf of the binary tree.
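
The following is a hedged, purely sequential simulation of this pipelining argument, assuming a complete binary tree with 2^⌈log k⌉ leaves of which the first k are used; each iteration of the outer loop plays the role of one parallel stage.

```python
import math

# Round-by-round simulation of the pipelined broadcast of Lemma 2.1:
# element j of A enters the root at stage j and descends one level per
# stage, so every leaf holds a full copy after O(|A| + log k) stages.
def pipelined_copies(A, k):
    depth = max(1, math.ceil(math.log2(k)))
    # node (l, i) = i-th node on level l; each holds the prefix of A received so far
    received = {(l, i): [] for l in range(depth + 1) for i in range(2 ** l)}
    for stage in range(len(A) + depth):
        # deepest level first, so an element moves only one level per stage
        for l in range(depth, 0, -1):
            for i in range(2 ** l):
                parent = received[(l - 1, i // 2)]
                mine = received[(l, i)]
                if len(mine) < len(parent):
                    mine.append(parent[len(mine)])   # copy the next missing element
        if stage < len(A):
            received[(0, 0)].append(A[stage])        # feed the pipeline at the root
    return [received[(depth, i)] for i in range(k)]  # the k copies, one per leaf
```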
1. Construct an Euler tour of T. For each node v of T compute level(v).
2. For each leaf node v of T, create a linear array Av of size level(v).
3. Label the leaves of T by distinct consecutive numbers.
4. For each node v of T, compute the interval, range(v), of the labels of the leaves in its subtree.
5. For each z ∈ range(v), Az[level(v)] := v.

Algorithm 1: An algorithm for reporting paths from all leaves to the root
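
For concreteness, here is a hedged sequential rendering of Algorithm 1 in which each exclusive parallel write of Step 5 becomes an iteration of a loop; leaf labels and range(v) are obtained by a plain DFS rather than by the Euler tour and list ranking, and indexing levels from 0 at the root is a convention chosen for this sketch.

```python
# Sequential sketch of Algorithm 1. A[z][j] ends up holding the node at
# level j on the path from leaf number z to the root (root at level 0 here).
def leaf_to_root_paths(root, children):
    level, label, rng, leaves = {}, {}, {}, []

    def dfs(v, depth):                       # Steps 1-4 folded into one DFS
        level[v] = depth
        if not children.get(v):              # leaf: give it the next label
            label[v] = len(leaves)
            leaves.append(v)
            rng[v] = (label[v], label[v])
        else:
            lo = hi = None
            for c in children[v]:
                dfs(c, depth + 1)
                clo, chi = rng[c]
                lo = clo if lo is None else min(lo, clo)
                hi = chi if hi is None else max(hi, chi)
            rng[v] = (lo, hi)

    dfs(root, 0)
    A = [[None] * (level[leaf] + 1) for leaf in leaves]
    for v in level:                          # Step 5: all writes are independent
        lo, hi = rng[v]
        for z in range(lo, hi + 1):
            A[z][level[v]] = v
    return A
```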

Lemma 2.2 Let T be a rooted binary tree on n nodes. Algorithm 1 correctly computes all paths from the leaves to the root of T in O(log n) time using O(S/log n) processors on an EREW PRAM, where S is the total number of nodes in the paths from all leaves to the root (i.e., the size of the output).

Proof: First we show the correctness of the algorithm and then we analyze its complexity. Clearly, for any leaf v of T, the path to the root is of length level(v). For any internal node v of T, only the leaves in its subtree are in range(v). Hence, for any internal node v of T, we know precisely where v is going to appear in the paths from the leaves in range(v) to the root.
Now we analyze the complexity stepwise.
Step 1 takes O(log n) time using O(n/log n) processors [13].
Consider Step 2. Let S be the sum of level(v) over all v, where v is a leaf of T. The sum S can be computed in O(log n) time using O(n/log n) processors. Clearly S is Ω(n). Next, by using parallel prefix sums, the optimum number of processors to compute the arrays Av corresponding to each leaf v of T can be assigned. Hence this step can be performed in O(log n) time using O(S/log n) processors on an EREW PRAM.
Consider Step 3. We can label the leaves by using the Euler tour technique as follows. First compute the Euler tour of T. Then mark all leaves in the tour. Now compute the postorder numbering of only the marked nodes in the tree traversal. So, this step can be performed in O(log n) time using O(n/log n) processors.
Consider Step 4. We can compute range(v) for every node v by using the Euler tour of T and parallel list ranking. Hence this step requires O(log n) time using O(n/log n) processors.
Finally, consider Step 5. Note that the total size of the output to be written is S. Using parallel prefix sums, allocate O(S/log n) processors such that each processor, with only exclusive read capabilities, writes in the appropriate locations of the arrays Az in O(log n) time, where z ∈ range(v).
Lemma 2.3 Let T be a rooted binary tree on n nodes with k ≤ n distinct marked nodes. We can compute paths in T from all marked nodes to the root of T in O(log n) time using O((n + S)/log n) processors on an EREW PRAM, where S is the total number of nodes in the paths from the marked nodes to the root (i.e., the size of the output).

Proof: The algorithm and the proof are analogous to Algorithm 1 and the proof of Lemma 2.2, respectively. The marked nodes play the role of the leaves. Note that the postorder numbering restricted to the marked nodes takes logarithmic time using O(n/log n) processors. As S may be smaller than n, the total number of processors used by the algorithm is O((n + S)/log n).

Lemma 2.4 Let T be a rooted binary tree on n nodes. There exists an integer m0 ∈ [0, ⌈log n⌉) such that there are at most ⌈n/log n⌉ nodes v of T satisfying m0 = level(v) mod ⌈log n⌉. Furthermore, such an m0 can be computed in O(log n) time using O(n/log n) processors on an EREW PRAM.

Proof: First we show the existence of m0. Partition the nodes of T into sets Si, where v ∈ Si if and only if i = level(v) mod ⌈log n⌉. There are ⌈log n⌉ sets in all; therefore there exists a set Sm0 consisting of at most ⌈n/log n⌉ elements.
Now we show that m0 can be computed within the claimed complexity bounds. First, the set of nodes of T is divided into O(n/log n) subsets U kept in arrays of size O(log n), by applying the optimal EREW parallel prefix sums algorithm [8, 9]. Next, for each subset U, the integers level(v) mod ⌈log n⌉ over the nodes in U are bucket sorted by a single processor in O(log n) time. Then, for all integers m in [0, ⌈log n⌉), the numbers nm(U) of occurrences of m in the sorted sequence for U are computed, again in O(log n) time using a single processor. Now, it is sufficient to compute the sums Σ_U nm(U) in parallel for all m in [0, ⌈log n⌉) in order to set m0 to an m minimizing Σ_U nm(U). This takes O(log n) time and O(((n/log n)/log n) · log n) = O(n/log n) processors in total.
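
A hedged sequential sketch of this computation is given below; it simply buckets the residues level(v) mod ⌈log n⌉ and picks a least-populated residue class, whereas the lemma does the same work block by block with one processor per O(log n)-sized block. Taking logarithms to base 2 here is an assumption about the intended base.

```python
import math

# Sequential sketch of Lemma 2.4: choose m0 as a residue class (mod ceil(log n))
# that contains the fewest node levels; by pigeonhole it has <= ceil(n/log n) nodes.
def find_m0(levels):                     # levels: list of level(v) over all nodes of T
    n = len(levels)
    L = max(1, math.ceil(math.log2(n)))  # plays the role of ceil(log n)
    counts = [0] * L
    for lv in levels:
        counts[lv % L] += 1
    return min(range(L), key=lambda m: counts[m])
```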
Lemma 2.5 A binary tree on n nodes can be preprocessed in O(log n) time using O(n/log n) processors such that k lowest common ancestor queries can be answered in O(log k) time using O(k) processors on an EREW PRAM.

Proof: Schieber and Vishkin [11] have proposed an algorithm for preprocessing a binary tree in O(log n) time using O(n/log n) processors such that a single processor can answer a lowest common ancestor query in O(1) time. Hence k queries can be answered in O(1) time using O(k) processors on a CREW PRAM. By using the standard simulation of a CREW PRAM on an EREW PRAM, we get the desired complexity.

Lemma 2.6 (Paul, Vishkin and Wagener [10]) A sequence of k keys can be searched in an n-node binary search tree in O(log n + log k) time using O(k) processors on an EREW PRAM.

3 The preprocessing and the queries


In this section we first state an algorithm (Algorithm 2) for preprocessing the n-node binary tree T. Then we show that path queries can be answered efficiently using the
data computed in the preprocessing step.

Lemma 3.1 Algorithm 2 preprocesses an n-node binary tree in O(log n) time using O(n/log n) processors on an EREW PRAM, and the resulting data structure is of linear size.

1. Preprocess T for answering lowest common ancestor queries.
2. Perform an Euler tour traversal of T and compute postorder(v) and level(v) for each node v ∈ T.
3. Find m0 ∈ [0, ⌈log n⌉) such that there are O(n/log n) nodes v of T satisfying m0 = level(v) mod ⌈log n⌉, and mark all such nodes.
4. Group the marked nodes according to their level number in T and let Gi denote the marked nodes at level i in T, where m0 = i mod ⌈log n⌉.
5. For each group Gi, compute a binary search tree Bi whose leaf nodes are the nodes at level i in T (i.e., the nodes of Gi), sorted in increasing order of postorder(v).

Algorithm 2: An algorithm for preprocessing a binary tree


Proof: Steps 1 and 2 can be implemented in O(log n) time using O(n/log n) processors [11, 13].
Step 3 can be implemented as follows. First compute m0 using Lemma 2.4 and then broadcast m0 to every vertex in logarithmic time using O(n/log n) processors [8, 9]. A vertex v marks itself if level(v) mod ⌈log n⌉ = m0.
To implement Step 4, the marked vertices are extracted using the optimal parallel prefix sums algorithm [8, 9] and then sorted according to their level number. Since the number of marked nodes is O(n/log n), the sorting can be done in logarithmic time with O(n/log n) processors in the EREW PRAM model [1]. Thus the groups Gi can be computed in O(log n) time using O(n/log n) processors.
To implement Step 5, within each group Gi the nodes are sorted in increasing order of postorder(v). As in Step 4, the sorting takes logarithmic time and O(n/log n) processors in total. Further, binary search trees on O(n/log n) nodes altogether can be computed in O(log n) time using O(n/log n) processors.
We conclude that the algorithm can be implemented to run within the claimed time and processor bounds. Since each step of the algorithm requires only linear space, the space complexity follows.
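
To make the structure of the preprocessing concrete, the following hedged sequential sketch performs Steps 2-5 of Algorithm 2; the sorted per-level lists stand in for the balanced search trees Bi, and `annotate` and `find_m0` are the illustrative helpers sketched in Section 2.

```python
import math
from bisect import insort

# Sequential sketch of the preprocessing (Steps 2-5 of Algorithm 2).
# groups[i] is the group G_i kept sorted by postorder number; it plays
# the role of the binary search tree B_i of the paper.
def preprocess(root, children):
    size, postorder, level = annotate(root, children)
    n = len(level)
    L = max(1, math.ceil(math.log2(n)))          # stands in for ceil(log n)
    m0 = find_m0(list(level.values()))           # Step 3 (Lemma 2.4)
    groups = {}
    for v, lv in level.items():
        if lv % L == m0:                         # Step 3: mark v
            insort(groups.setdefault(lv, []), (postorder[v], v))   # Steps 4-5
    return {"postorder": postorder, "level": level,
            "m0": m0, "L": L, "groups": groups}
```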
First we discuss the problem of answering a single path query and then present an algorithm for answering k path queries using the above preprocessing.
Let the query be a pair of nodes (a, b) in T. We are asked to compute the path between nodes a and b, denoted path(a, b). In the following we show that path(a, b) can be computed in O(log n) time using O(max{1, |path(a, b)|/log n}) processors, where |path(a, b)| denotes the number of nodes in path(a, b).

First compute the lowest common ancestor of a and b in T and denote it by c. Now the problem reduces to reporting the paths path(a, c) and path(b, c). So, for simplicity, assume that b is an ancestor of a.
1. Compute the level and postorder numbers of nodes a and b in T.
2. Compute the set index(a, b) = {i | m0 = i mod ⌈log n⌉ and level(b) ≤ i ≤ level(a)}. Locate the groups Gi, where i ∈ index(a, b).
3. For each Bi, where i ∈ index(a, b), locate, using the postorder number of a, the leaf li of Bi such that a is in the subtree rooted at li.
4. Compute paths in T from each li to li+⌈log n⌉ whenever both i and i + ⌈log n⌉ are in index(a, b). Further, compute path(a, li) for the largest value of i ∈ index(a, b) and, analogously, compute path(li, b) for the smallest value of i ∈ index(a, b).
5. Compose the paths computed in Step 4 to obtain path(a, b).

Algorithm 3: An algorithm for computing path(a, b)


Theorem 3.2 Algorithm 3 computes the path between two query nodes a and b, path(a, b), in O(log n) time using O(max{1, |path(a, b)|/log n}) processors on an EREW PRAM, where b is an ancestor of a and |path(a, b)| denotes the number of nodes in path(a, b). (See Figure 1.)

Proof: The correctness of Algorithm 3 follows immediately from the fact that we first compute every ⌈log n⌉-th node on the path and then compute the rest of the path.
The number of nodes in path(a, b) is |path(a, b)| = level(a) − level(b) + 1. So we assign O(max{1, |path(a, b)|/log n}) processors to compute path(a, b) in O(log n) time. The set index(a, b) can be computed in O(log n) time. Note that |index(a, b)| ≤ max{1, |path(a, b)|/log n}. The binary search trees Bi are of height O(log n). Therefore, the i-th processor can search in Bi, in O(log n) time, for the leaf node li with the smallest postorder number greater than or equal to that of a. Note that a is in the subtree rooted at such an li. Once the leaves li are located in each Bi, where i ∈ index(a, b), the remaining task is to compute the paths in T from the nodes li to li+⌈log n⌉ for the appropriate values of i. Since each such path consists of O(log n) nodes, each of these paths can be computed in O(log n) time. Note that this method requires only exclusive read and write capabilities, since the nodes in each Bi are pairwise distinct.
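
A hedged sequential sketch of the query procedure is given below. It locates the anchor li at each marked level by binary search on postorder numbers in the sorted group (mirroring Step 3), and then fills in the short segments between consecutive anchors by walking a parent map; the parent map and the helper names are illustrative assumptions, and in the paper these segments are reported in parallel via Lemma 2.3 rather than by a sequential walk.

```python
from bisect import bisect_left

# Leaf l_i of B_i for query node a: the level-i node with the smallest
# postorder number >= postorder(a); this is exactly the ancestor of a at level i.
def locate_anchor(i, a, pre):
    grp = pre["groups"][i]                         # sorted list of (postorder, node)
    pos = bisect_left(grp, (pre["postorder"][a],))
    return grp[pos][1]

# Sequential sketch of Algorithm 3: report path(a, b), b an ancestor of a.
# parent is an assumed map from each non-root node to its parent in T.
def path_query(a, b, pre, parent):
    level, L, m0 = pre["level"], pre["L"], pre["m0"]
    marked = [i for i in range(level[a], level[b] - 1, -1) if i % L == m0]
    anchors = [locate_anchor(i, a, pre) for i in marked]   # deepest anchor first
    path, v = [a], a
    for stop in anchors + [b]:
        while v != stop:                           # each segment is O(log n) hops
            v = parent[v]
            path.append(v)
    return path
```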
Now, using the theory developed so far, we present the main result of this paper. The problem is to answer k path queries simultaneously. Using Lemma 2.5, compute the lowest common ancestor of each query pair. As before, we can then assume that each query is a pair of nodes ai and bi of T, where bi is an ancestor of ai, for i = 1, ..., k.

[Figure 1: Figure for Algorithm 3, illustrating a query node a, a group Gi, and the located leaf li.]
1. For each query pair (ai, bi) compute level(ai), level(bi), |path(ai, bi)| and postorder(ai).
2. Compute S = Σ_{i=1}^{k} |path(ai, bi)|.
3. For each query pair (ai, bi), locate the binary search trees Bj such that m0 = j mod ⌈log n⌉ and level(bi) ≤ j ≤ level(ai). Let index(ai, bi) be the set of values which j takes. For each j ∈ index(ai, bi), generate a search pair (Bj, ai).
4. For each search pair (Bj, ai), mark the leaf lj of Bj such that ai is in the subtree rooted at lj.
5. For each query pair (ai, bi), compute paths in T from lj to lj+⌈log n⌉ for all j ∈ index(ai, bi) such that j + ⌈log n⌉ ∈ index(ai, bi). Further, compute path(ai, lj) for the largest value of j ∈ index(ai, bi) and, analogously, compute path(lj, bi) for the smallest value of j ∈ index(ai, bi).
6. Compose the paths computed in Step 5 to obtain path(ai, bi).

Algorithm 4: An algorithm for answering k path queries



Theorem 3.3 Algorithm 4 correctly computes the paths for k query pairs in O(log n + log k) time using O(k + (n + S)/log n) processors on an EREW PRAM, where S is the size of the output.

Proof: The correctness is straightforward and follows from the fact that, for any pair of query nodes ai and bi, we first compute every ⌈log n⌉-th node on the path and then determine the remaining logarithmic-length path segments.
Now we analyze the complexity of the algorithm. Using Lemma 2.5, the lowest common ancestors of k node pairs can be computed in O(log k) time using O(k) processors.
Consider Step 1 of the algorithm. For each pair of query nodes, we can compute level
and postorder information in O(1) time on the CREW PRAM. Hence in O(log k) time
using O(k) processors, we can compute the required quantities in Step 1 on an EREW
PRAM for all queries.
The sum of k numbers, in Step 2, can be computed in O(log k) time using O(k/log k) processors. Once we know the value of S, we assign O(S/log n) processors in O(log S) time to perform our task. Assign max{1, |path(ai, bi)|/log n} processors to each pair of query nodes (ai, bi).
Consider Step 3. Using max{1, |path(ai, bi)|/log n} processors, we can compute the set index(ai, bi) and generate the search pair (Bj, ai) for each j ∈ index(ai, bi) in O(log n) time.
Consider Step 4. We write (Bj, ai) ≤ (Bp, al) if and only if either j < p, or j = p and postorder(ai) ≤ postorder(al). Sort all search pairs according to this lexicographic order by the algorithm of Cole [1]. We have at most O(S/log n) search pairs, since for any path(ai, bi), |index(ai, bi)| ≤ max{1, |path(ai, bi)|/log n}. Hence, sorting the search pairs requires O(log S) time using O(S/log n) processors. Now the task is to mark the leaves in the search tree Bj corresponding to the query nodes ai, for all desired values of i and j. Consider one such search tree, say Bj, and let there be l search pairs involving Bj. Using Lemma 2.6, we can mark the leaves corresponding to these search pairs in O(log |Bj| + log l) time using O(l) processors. Since we search for a total of at most O(S/log n) search pairs and the number of nodes in all the search trees is at most O(n), the overall complexity of this step is O(log n + log S) time using O(S/log n) processors.
Consider Step 5. We have several marked nodes in T, which are either the leaves lj marked in the previous step or the nodes ai. The task is to report the path starting from each marked node x up to a node y such that (i) y is on the path from x to the root of T, (ii) level(y) mod ⌈log n⌉ = m0 or y = bi, and (iii) |path(x, y)| ≤ ⌈log n⌉. We accomplish the task as follows. First partition the tree into subtrees Ti by cutting the tree at level m0, and then at every ⌈log n⌉-th level thereafter. The resulting subtrees have height at most ⌈log n⌉, and the root ri of any subtree Ti different from the top one rooted at the root of T satisfies level(ri) mod ⌈log n⌉ = m0. Now the problem reduces to that of reporting all paths from the marked nodes in Ti to the root ri of Ti, for every subtree Ti of T. Using Lemma 2.3, we can compute the paths in each Ti from each marked node to ri. Note that, in all, we have O(n) vertices partitioned among the subtrees Ti, and the total number of nodes to be reported in all query paths is S. Hence the overall complexity of computing the paths from each marked node to the root of its subtree is O(log n) time using O((n + S)/log n) processors.
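
The following is a hedged sequential sketch of the path-reporting pattern behind Step 5: from a marked node x, the reported segment runs upward until it reaches either the designated endpoint, the root, or the next ancestor whose level is congruent to m0 modulo ⌈log n⌉, and is therefore at most ⌈log n⌉ hops long. The parent map is again an illustrative assumption, and the boundary convention for nodes lying exactly on a cut level, handled by the subtree decomposition and Lemma 2.3 in the paper, is glossed over here.

```python
# Sequential sketch of one Step-5 segment: walk from a marked node x toward
# the root until hitting `stop` (the query endpoint b_i), the root of T, or
# the next cut level (level congruent to m0 mod ceil(log n)).
def segment_to_next_cut(x, stop, pre, parent, root):
    level, L, m0 = pre["level"], pre["L"], pre["m0"]
    seg, v = [x], x
    while v != stop and v != root:
        v = parent[v]
        seg.append(v)
        if level[v] % L == m0:           # reached the next cut level
            break
    return seg                           # at most ceil(log n) + 1 nodes
```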
Consider Step 6. For each query path path(ai, bi), we have computed subpaths consisting of O(log n) nodes. Note that a subpath may be common to several path
queries. If a subpath is common to l path queries, we can replicate it l times using the
results of Lemma 2.1. The number of times a subpath needs to be replicated can be
computed as follows. While searching for the leaf nodes in the search tree in Step 4, we
can compute how many times each leaf is marked. Since each leaf is marked once for each
search query, the number of times it gets marked gives the number of times a subpath
needs to be replicated. Once we know all the subpaths for a pair of query nodes, we can
easily output the whole path.
All the steps of the algorithm use only exclusive read and exclusive write capabilities,
which proves our result.

4 Conclusion
In this paper we have presented optimal parallel solutions to reporting paths between
pairs of nodes in a tree. Using our results we can also compute the weight of the path
between query nodes in a weighted tree within the same complexity bounds. There
are several related problems which might be of interest. One of them is to consider
the dynamic version, where insertions and deletions of tree nodes are possible. Another
interesting problem is to study the lowest common ancestor query problem in the dynamic
setting.

Acknowledgements

The authors thank the anonymous STACS'94 referees for valuable comments on Lemmata 2.1 and 3.1.

References

[1] R. Cole. Parallel merge sort. SIAM J. Computing, 17 (1988), pp. 770-785.

[2] R. Cole and U. Vishkin. The accelerated centroid decomposition technique for optimal parallel tree evaluation in logarithmic time. Algorithmica, 3 (1988), pp. 329-346.

[3] A. Datta, A. Maheshwari and J.-R. Sack. Optimal parallel algorithms for direct dominance problems. Proc. First Annual European Symposium on Algorithms, Lecture Notes in Computer Science, Vol. 726, Springer-Verlag, 1993, pp. 109-120.

[4] H. N. Djidjev, G. E. Pantziou and C. D. Zaroliagis. Computing shortest paths and distances in planar graphs. Proc. 18th ICALP, Madrid, 1991, Lecture Notes in Computer Science, Vol. 510, Springer-Verlag, pp. 327-338.

[5] M. T. Goodrich. Intersecting line segments in parallel with an output-sensitive number of processors. SIAM J. Computing, 20 (1991), pp. 737-755.

[6] L. J. Guibas and J. Hershberger. Optimal shortest path queries in a simple polygon. J. Computer and System Sciences, 39 (1989), pp. 126-152.

[7] R. Güting, O. Nurmi and T. Ottmann. Fast algorithms for direct enclosures and direct dominances. J. Algorithms, 10 (1989), pp. 170-186.

[8] J. JáJá. An Introduction to Parallel Algorithms. Addison-Wesley, 1992.

[9] R. M. Karp and V. Ramachandran. Parallel algorithms for shared-memory machines. In: Handbook of Theoretical Computer Science (J. van Leeuwen, ed.), Vol. A, Elsevier Science Publishers B.V., 1990.

[10] W. Paul, U. Vishkin and H. Wagener. Parallel dictionaries on 2-3 trees. Proc. 10th ICALP, Lecture Notes in Computer Science, Vol. 154, 1983, pp. 597-609.

[11] B. Schieber and U. Vishkin. On finding lowest common ancestors: simplification and parallelization. SIAM J. Computing, 17 (1988), pp. 1253-1262.

[12] R. E. Tarjan. Data Structures and Network Algorithms. SIAM, Philadelphia, 1983.

[13] R. E. Tarjan and U. Vishkin. An efficient parallel biconnectivity algorithm. SIAM J. Computing, 14 (1985), pp. 862-874.

