
2009 Eighth International Conference on Grid and Cooperative Computing

Spatial Queries Evaluation with MapReduce


Shubin Zhang1,2, Jizhong Han1, Zhiyong Liu1, Kai Wang1,2, Shengzhong Feng3
1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, P. R. China
2. Graduate School of the Chinese Academy of Sciences, Beijing, 100190, P. R. China
3. Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, ShenZhen, 518055, P. R. China
{zhangshubin, hjz, zyliu, wangkai2008}@ict.ac.cn, sz.feng@siat.ac.cn

Abstract

Spatial queries include spatial selection queries, spatial join queries, nearest neighbor queries, etc. Most spatial queries are computing-intensive, and an individual query evaluation may take minutes or even hours. Parallelization seems a good solution for such problems. However, parallel programs must communicate efficiently, balance work across all nodes, and address problems such as failed nodes. We describe MapReduce and show how spatial queries can be naturally expressed in this model, without explicitly addressing any of the details of parallelization. We present performance evaluations for several spatial queries and show that MapReduce is also appropriate for small-scale clusters and computing-intensive applications.

1. INTRODUCTION
In order to provide efficient spatial data computing and management capabilities, a variety of spatial Database Management Systems (DBMS) based on single-node Relational DBMS (RDBMS) have been put forward. The two most popular types of architectures are RDBMS + spatial data engine (SDE) and extended Object-Relational DBMS. Above all, almost all commercial spatial data management systems are built on DBMS.

With the advancement of remote sensing and mapping technologies, we are currently on the verge of dramatic changes in both the quantity and complexity of geo-spatial data sets. Due to the complexity of spatial computing and extremely large data volumes, it is difficult to satisfy the performance, scalability and reliability requirements of spatial applications with a single-node DBMS. What's more, most spatial queries are computing-intensive, and an individual query evaluation may take minutes or even hours. Therefore, it is a good choice to apply parallel processing technologies to improve the performance of spatial query processing.

Unfortunately, all parallel programs face a wide range of problems. Inefficient communication or poor load balancing can keep a program from scaling to a large number of nodes. Parallel DBMS face the same problem and are not well suited to increasing the number of data nodes.

MapReduce [1] is a remarkable parallel programming model proposed by Google for processing large data sets in a massively parallel manner. Through a simple interface with two functions, map and reduce, this model facilitates parallel implementation of many real-world tasks. MapReduce provides many benefits over other parallel processing models like PVM [2] and MPI [3]. It allows simple programs to benefit from system mechanisms for communication, load balancing, task scheduling and fault tolerance.

The MapReduce framework is simple and effectively supports parallelism, so it is widely used at Google for data mining, machine learning and many other applications. However, MapReduce is mainly used in data-intensive applications, and little attention has been paid to computing-intensive applications in clusters. In this paper, we discuss parallelizing computing-intensive spatial applications with MapReduce.

There has been a heated controversy between MapReduce and RDBMS [4]. We prefer MapReduce because the model is strikingly simple and effectively supports parallelism. The programmer is relieved of the issues of distributed and parallel programming, because it is the MapReduce implementation that takes care of load balancing, data distribution, fault tolerance, resource allocation, file distribution, etc. The programming model provides a good fit for many problems encountered in the practice of processing large data sets. But some spatial queries, like spatial join, cannot be parallelized easily using MapReduce.

In this paper, we investigate how spatial queries can be decomposed and processed using MapReduce to optimize performance, such as response time. The main contributions of this work are: algorithms for processing spatial queries using MapReduce; and a performance evaluation, compared to Oracle Spatial [5], demonstrating the feasibility of parallelizing spatial query processing with MapReduce.

The remainder of this paper is organized as follows. Section 2 introduces MapReduce and related work in the area of spatial queries. Section 3 describes the design of spatial query processing algorithms with MapReduce. The performance evaluation is presented in Section 4. Finally, Section 5 contains our conclusions.

2. BACKGROUND AND RELATED WORK

2.1. MapReduce
MapReduce is a programming model and computing platform well suited to parallel computation [1]. In MapReduce, a program consists of a user-defined map function and a user-defined reduce function. The input data format is application-specific and is specified by the user. The output is a set of <key, value> pairs. The MapReduce signatures are listed here:

$$
\begin{aligned}
\text{map} &: (k_1, v_1) \rightarrow [(k_2, v_2)] \\
\text{reduce} &: (k_2, [v_2]) \rightarrow [k_3, v_3]
\end{aligned}
\tag{1}
$$
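To make the two signatures concrete, the following is a minimal in-memory sketch in Python. It is not how Hadoop or Google's runtime is implemented; the driver `run_mapreduce` and the toy functions `type_map` and `type_reduce` are hypothetical names introduced here only for illustration, and the grouping dictionary stands in for the shuffle that a real runtime performs across machines.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map_fn to every (k1, v1), group by k2, then apply reduce_fn."""
    groups = defaultdict(list)
    # Map phase: each input pair may emit any number of (k2, v2) pairs.
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # Reduce phase: every k2 is reduced together with its list of values.
    output = []
    for k2 in sorted(groups):
        output.extend(reduce_fn(k2, groups[k2]))
    return output


# Toy example (an assumption, not from the paper): count objects per type.
def type_map(object_id, object_type):
    yield object_type, 1


def type_reduce(object_type, counts):
    yield object_type, sum(counts)


records = [(1, "river"), (2, "settlement"), (3, "river")]
result = run_mapreduce(records, type_map, type_reduce)
```

Here `type_map` plays the role of map, turning each (k1, v1) into a list of (k2, v2) pairs, and `type_reduce` plays the role of reduce, receiving one k2 together with all of its values.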

978-0-7695-3766-5/09 $25.00 © 2009 IEEE
DOI 10.1109/GCC.2009.16
As shown in the signature, the map function applies user-defined logic to every input key/value pair $(k_1, v_1)$ and transforms it into a list of intermediate key/value pairs $[(k_2, v_2)]$. Then the reduce function applies user-defined logic to all intermediate values $[v_2]$ associated with the same $k_2$ and produces a list of final output key/value pairs $[k_3, v_3]$.

Figure 1 shows an example of a MapReduce application that calculates the settlements passed through by each river. The curves in Figure 1 denote rivers, and the polygons denote settlements.

    // input: spatial objects (rivers and settlements),
    //   key = spatial object ID,
    //   value = spatial object property

    map(Integer key, String value):
        for each value:
            EmitIntermediate(river ID, "ID:property");

    // intermediate: key = river ID,
    //   value = spatial object ID:property

    reduce(Integer key, Iterator values):
        // including the Shuffle step
        for each value sharing the same key:
            Emit(AsString(value));

    // output: settlements passed through by the same river

Figure 1. MapReduce Example

The MapReduce runtime accomplishes the parallelization of map and reduce operations by dividing the responsibilities of those operations among many tasks. Input data are partitioned among multiple mappers, which are tasks responsible for applying the map function to a portion of the input data. Mappers retrieve their input data from a distributed file system and write the emitted tuples to a collection of files on local disk. These files are then transferred to another group of tasks called reducers. Each mapper produces a single file for each reducer, and the tuples in such a file are all tuples whose keys produce the reducer's ID when hashed. Once a reducer has received its files from all mappers, it sorts and merges the files by key and reduces each key in turn, outputting the resulting tuples to files on the distributed file system.

Made famous by Google, MapReduce is used in a wide range of applications, such as web access log statistics, web link graph reversal, machine learning, statistical machine translation and so on. Such computations usually run on hundreds of computers. MapReduce is not only implemented for large cluster computing, but is also used in multi-core and other environments. Phoenix [6] is an implementation of the MapReduce model for programming multi-core chips as well as shared-memory multiprocessors. MapReduce has also been implemented on GPUs, for example in Mars [7]. This paper demonstrates that MapReduce can also achieve good performance in computing-intensive spatial applications and on small-scale clusters.

Hadoop [8] is a distributed open source platform based on Google technology. It is composed of the Hadoop Distributed File System (HDFS) and the MapReduce programming model. Our implementation and experiments are based on this platform.

2.2. Spatial Queries
There are several kinds of spatial queries, such as the spatial selection query, the spatial join query and the nearest neighbor query. They are basic operations of a large number of spatial applications.

2.2.1. Spatial Selection Query
It is difficult to find a compact set of operations fulfilling the requirements of all spatial applications. Spatial selection queries are of great importance among spatial queries and operations. They serve as the basis for many other spatial operations. Therefore, efficient evaluation of spatial selection queries is important for the overall performance of spatial data management systems. The two main representatives of spatial selection queries are the point query and the region query:

a) Point Query

$$Q(P) = \{O \mid O.contains(P),\ O \in M\} \tag{2}$$

Given a query point P and a set of objects M, the point query finds all the objects of M that geometrically contain P.

b) Region Query

$$Q(R) = \{O \mid O.G \cap R.G \neq \emptyset,\ O \in M\} \tag{3}$$

Given a query region R and a set of objects M, the region query finds all the objects of M that geometrically intersect R. A special case of the region query is the window query, whose query region is given by a rectilinear rectangle.

2.2.2. Spatial Join Query
The spatial join query can be used for implementing map overlay. The importance of the spatial join is comparable to that of the natural join in a relational DBMS. The spatial join operation combines objects from two spatial data sources according to their geometric attributes, i.e. their geometric attributes have to fulfill some spatial predicate. For example, a spatial join answers queries such as which roads cross some river in

China, given a road table and a hydrography table of China. The expression of spatial join is

$$SJ(R, S) = \{(r, s) \mid r.join(s),\ r \in R,\ s \in S\} \tag{4}$$

Spatial predicates may be intersection, containment, within distance, etc. The most popular spatial join query is the intersection join, where the predicate is intersection. In this paper, we discuss only the intersection join. However, our results can be easily transferred to spatial joins using other spatial predicates.

Efficient spatial join algorithms are a focus of spatial database research. However, parallel spatial join processing has not been studied extensively. The PPBSM (Parallel Partition Based Spatial-Merge join) algorithm [9, 10, 11] is one of the classical parallel spatial join algorithms.

2.2.3. Nearest Neighbor Queries
There are two main kinds of Nearest Neighbor queries: the k Nearest Neighbor Query (kNN) and the All Nearest Neighbor Query (ANN).

a) kNN:

The goal of a kNN query is to find the k objects in a data set M that are closest to a query point q. Existing algorithms presume that the data set is indexed by an R-tree and use various metrics to prune the search space. The Nearest Neighbor Query is the special case of kNN where k is 1. The Nearest Neighbor Query is defined as

$$1NN(q) = \{o \mid \forall o' : dist(q, o) \le dist(q, o'),\ o \in M,\ o' \in M\} \tag{5}$$

Then kNN ($k \ge 2$) is

$$
\begin{aligned}
kNN(q) = {} & \{o \mid \forall o' : dist(q, o) \le dist(q, o'), \\
& \ o \in M - (k-1)NN(q),\ o' \in M - (k-1)NN(q)\} \\
& \cup\ (k-1)NN(q)
\end{aligned}
\tag{6}
$$

b) ANN:

Let A and B be two spatial data sets and dist(a, b) be a distance metric. Then, the ANN [12] query is defined as:

$$ANN(A, B) = \{\langle a_i, b_j \rangle \mid \forall a_i \in A : \exists b_j \in B,\ \neg\exists b_k \in B\ \{dist(a_i, b_k) < dist(a_i, b_j)\}\} \tag{7}$$

The ANN query finds, for each object in A, its nearest neighbor in B. Notice that ANN(A, B) is not commutative, i.e., in general, $ANN(A, B) \neq ANN(B, A)$. A naive approach to processing an ANN query is to perform one NN query on data set B for each object in data set A.

The ANN problem has been studied in the context of computational geometry, where several main-memory techniques have been proposed for the case where A = B, i.e., the nearest neighbors are found in the same data set.

Many previous methods for ANN evaluation in secondary memory are inefficient and applicable only when both A and B are indexed, which is difficult in practice (e.g., when one or both query inputs are intermediate results of complex queries, the indexes have to be created online). Some new techniques have been proposed for ANN evaluation when one or both indexes do not exist.

3. PARALLELIZING SPATIAL QUERIES EVALUATION WITH MAPREDUCE
The representation of a spatial object can be very large and complex, and spatial queries are accordingly complex and time-consuming. Spatial queries are typically evaluated in two steps:

Filter Step: An approximation of each spatial object, such as its minimum bounding rectangle (MBR), is checked against the spatial predicate, eliminating tuples that cannot be part of the result. This step produces candidates that are a superset of the actual result.

Refinement Step: Each candidate produced by the filter step is examined to check whether its spatial properties satisfy the spatial predicate. A CPU-intensive computational geometry algorithm is generally used in this examination.

3.1. Spatial Selection Queries
The spatial selection queries, including the point query and the region query, can be resolved with one MapReduce job, which includes a Map function only.

In the Map function, the filter-and-refinement strategy can be used to find the objects in the input split that meet the requirement of the point query or region query. The results of the Map stage are stored in the distributed file system directly.

As shown in Fig. 2, the input spatial objects are split among several map tasks. Every object is examined to determine whether it contains the query point or intersects the query region. If an object matches the query condition, the Map function outputs the object's ID (OID).

Figure 2. Spatial Selection with MapReduce

3.2. Spatial Join
The Spatial Join with MapReduce (SJMR) algorithm homogenizes the objects from different data sources at the beginning of the Map stage. Then the Map function puts the objects into disjoint tiles and merges the tiles into buckets

with the Z-Order algorithm at the Map stage. The tile and bucket information of every object is also saved to be used at the Reduce stage. The bucket number of every object is used as its intermediate key, and the object itself is set as the intermediate value.

With the help of the MapReduce framework, objects with the same bucket number are shuffled to the same Reduce task. Then, at each Reduce task, every bucket is joined using a new strip-based plane sweeping algorithm which makes full use of the location information contained in the tile information of every object. The reference tile method is used at the same time to avoid generating duplicates. The principles of SJMR can also be used where neither of the inputs has a spatial index, and in other parallel environments. The data flow of SJMR is shown in Fig. 3.

Figure 3. Spatial Join with MapReduce

The methodology used in spatial join is important and can be used extensively in other spatial operations. Therefore, SJMR is described in detail.

3.2.1. Map stage
The tuples of R and S are distributed across the data nodes according to the distribution strategy of HDFS (Hadoop's Distributed File System). So the first stage of SJMR needs to redistribute the tuples of R and S to different Reduce tasks according to a spatial partitioning function. The goal is to distribute the tuples so that each Reduce task performs roughly equal work and the distribution does not affect the validity of the results.

The filter step of the Reduce stage needs the tiling information of every tuple, so the tiling information needs to be saved as another attribute of every tuple. The intermediate value $v_2$ of every tuple therefore has the following common attributes: OID, MBR, spatial property, data source and tiling information.

Now the Reduce task can be applied to all the homogenized data sets combined. Tuples from different data sets with the same key $k_2$ will be grouped in the same Reduce task. User-defined logic can extract the data source from the values to identify their origins, and the entries from different sources can be merged.

3.2.2. Reduce stage
Spatial join is carried out in two steps: the filter step and the refinement step.

Filter Step

The filter step of the Reduce stage begins by reading the intermediate value $v_2$ of every tuple belonging to the same partition $k_2$.

The main goal of the filter step is to "pair" tuples from the same partition such that their MBRs overlap. Given two sets of rectangles that both fit entirely in main memory, efficient plane sweeping techniques exist for reporting all pairs of intersecting rectangles between the two sets. A plane sweeping algorithm can be used to find all pairs of key-pointer elements that have overlapping MBRs. For such "matching" key-pointer element pairs, the OID information is extracted and added to the output of this step.

To make full use of the location information that the tuples obtained during splitting, SJMR adopts a strip-based plane sweeping method. In this method, every partition is divided equally into several strips according to the tiling information. The strips lie side by side, parallel to the x-axis. Every tuple is assigned to strips according to its tiling information, and every strip is then filtered with a plane sweeping algorithm.

Refinement Step

After filtering the partition belonging to this Reduce task, the result is a temporary relation whose tuples have the form $\langle OID_R, OID_S \rangle$, such that the MBR of the tuple corresponding to $OID_R$ overlaps with the MBR of the tuple corresponding to $OID_S$.

The refinement step examines the spatial properties of $OID_R$ and $OID_S$ to see whether the tuples actually satisfy the join condition. The same strategy as that used in [9, 13] is employed.

3.3. Nearest Neighbor Queries

3.3.1. kNN
The kNN query can be resolved with one MapReduce job. In each Map task, the distance between the query point and every object in the input data split is calculated, and the k nearest neighbors of the query point within this split are emitted to the MapReduce framework. The ID of every such kNN object is set as the intermediate key, and its distance is set as the intermediate value. There is only one Reduce task for this job, so the MapReduce framework shuffles the output of all Map tasks and merges it at the same Reduce task. The intermediate <key, value> pairs are sorted together as a list according to the distances. Then the Reduce function selects the k objects with the shortest distances from the list.

3.3.2. ANN
ANN is complex, and two MapReduce programs are needed. The following algorithm can be used in many spatial applications, such as closest pair queries [14] and spatial distance joins [15].
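Returning briefly to the single-job kNN procedure of Section 3.3.1, a minimal Python sketch may help fix the idea. It makes several assumptions that are not in the paper: objects are reduced to single (x, y) points, distance is Euclidean, and the splits are plain in-memory lists rather than HDFS blocks. The names `knn_map`, `knn_reduce` and `knn_query` are hypothetical.

```python
import heapq
import math


def knn_map(split, query, k):
    """Map task: emit the k objects of one input split closest to the query."""
    scored = ((math.dist(query, point), oid) for oid, point in split)
    for distance, oid in heapq.nsmallest(k, scored):
        # Intermediate key = object ID, intermediate value = distance.
        yield oid, distance


def knn_reduce(pairs, k):
    """Single Reduce task: merge all local candidates and keep the k closest."""
    return heapq.nsmallest(k, pairs, key=lambda pair: pair[1])


def knn_query(splits, query, k):
    """Simulate the job: run every Map task, then the one Reduce task."""
    candidates = []
    for split in splits:
        candidates.extend(knn_map(split, query, k))
    return knn_reduce(candidates, k)


# Two input splits of (OID, point) records; the data is made up.
splits = [
    [(1, (0.0, 0.0)), (2, (5.0, 5.0))],
    [(3, (1.0, 1.0)), (4, (0.5, 0.0))],
]
nearest = knn_query(splits, (0.0, 0.0), 2)
nearest_ids = [oid for oid, _ in nearest]
```

Because every Map task forwards at most k candidates, the single Reduce task sees at most k times the number of splits, which is why one reducer is enough for this query.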

ANN queries constitute a hybrid of nearest neighbor search and spatial joins; therefore, the implementation of ANN is similar to that of spatial join. Due to limited space, ANN is not discussed in as much detail as SJMR in this paper.

The Map task of the first MapReduce job partitions A and B into different buckets, just as in SJMR. The Reduce task of the first MapReduce job generates two kinds of output files: the first part of the result files, and the pending files. The first part of the result files contains information about the objects whose nearest neighbor has been found in the bucket corresponding to this Reduce task. The pending files contain information about the objects in A whose nearest neighbors are not guaranteed to be in the corresponding blocks. An entry in the pending files contains the object, its current nearest neighbor and its current $NNdist(a_i, B)$.

Every Map task of the second MapReduce program shuffles the pending files completely and finds the nearest neighbor of every object in the corresponding block of B. It then uses the object $A_p$ from the pending files as the intermediate key, and a string combining $NN(A_p, BlockB)$ with $NNdist(A_p, BlockB)$ as the intermediate value. All intermediate values belonging to the same object $A_p$ are sent to the same Reduce task and merged together as a list. The Reduce task finds $NN(A_p, B)$ from the list.

4. PERFORMANCE EVALUATION

4.1. Experiment environment and data sets
The experiments were performed with Hadoop 0.18.1 running on DELL PowerEdge SC430 servers, each with an Intel Pentium 4 2.80 GHz processor, 1 GB main memory and an 80 GB SATA disk, running RedHat AS4.4 with kernel 2.6.9 and the ext3 file system.

The experiment results were obtained on different configurations where each node acted as a tasktracker and datanode, with an independent server as jobtracker and namenode.

The experiments used two data sets from TIGER/Line files [16]. One of the data sets includes the road information of California, and the other includes hydrography in California. The statistics of the two data sets are given in Table 1.

Table 1. California TIGER data
Data Set       Num of objects   Size     Average point number
Road           2092079          529 MB   3.87
Hydrography    373950           135 MB   10.77

For comparison with the algorithms implemented with MapReduce, we also evaluate spatial queries in Oracle Spatial. The Oracle Database version is 11g Release 1. When querying spatial data in Oracle Spatial, a spatial index is necessary in the database. After spatial data have been loaded into Oracle, a spatial index should be created to enable efficient spatial query processing on the data. Creating the index takes 411.4 seconds for Road and 61.35 seconds for Hydrography.

4.1.1. Spatial Selection Queries
A spatial index accelerates the evaluation of spatial selection queries, especially the point query. Without reading the hard disk, a point query in Oracle Spatial consumes no more than one second. Querying a random point with MapReduce in the Hydrography data set takes 24.43 seconds when the node number is 6. But considering the time needed to create the index, the time of point query evaluation with MapReduce on an unindexed data set is acceptable.

When we evaluate the region selection query, three rectangular regions are created, and the sizes of the rectangles are 38.86%, 1.03% of the MBR of the Road data set. The performance of the different region queries is shown in Figure 4 and Figure 5.

Figure 4. Performance of Region Query in Large Window

Figure 5. Performance of Region Query in Small Window

The results show that the time spent on a region query with MapReduce is related to the size of the region, because a large query window leads to a large set of results from the filter stage and a high cost in the refinement stage. The time spent on a region query in Oracle Spatial, in turn, is positively correlated with the size of the region, because the query algorithm in Oracle Spatial is based on the spatial index. When the query region is large enough, the spatial index loses effectiveness, and the performance of Oracle Spatial is worse than that of the region query with MapReduce.
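For reference, the map-only region query measured in this subsection (the filter-and-refinement Map function of Section 3.1) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation that was benchmarked: objects are assumed to be polylines with at least two vertices, the query window is an axis-aligned rectangle `(xmin, ymin, xmax, ymax)`, and the refinement uses a standard slab clipping test. All function names are hypothetical.

```python
def mbr(points):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of a vertex list."""
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)


def mbr_overlaps(a, b):
    """Filter test: do two axis-aligned rectangles share at least one point?"""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def segment_intersects_window(p, q, window):
    """Refinement test: clip segment p-q against the window slab by slab."""
    xmin, ymin, xmax, ymax = window
    (x0, y0), (x1, y1) = p, q
    t0, t1 = 0.0, 1.0  # parameter range of the segment still inside
    for delta, lo, hi in ((x1 - x0, xmin - x0, xmax - x0),
                          (y1 - y0, ymin - y0, ymax - y0)):
        if delta == 0:
            if lo > 0 or hi < 0:
                return False  # parallel to this slab and outside it
        else:
            ta, tb = lo / delta, hi / delta
            if ta > tb:
                ta, tb = tb, ta
            t0, t1 = max(t0, ta), min(t1, tb)
            if t0 > t1:
                return False
    return True


def region_query_map(split, window):
    """Map function: emit the OID of every object intersecting the window."""
    for oid, points in split:
        if not mbr_overlaps(mbr(points), window):  # filter step
            continue
        segments = zip(points, points[1:])
        if any(segment_intersects_window(p, q, window) for p, q in segments):
            yield oid  # refinement step passed


window = (0.0, 0.0, 10.0, 10.0)
split = [
    (1, [(-5.0, 5.0), (15.0, 5.0)]),    # crosses the window
    (2, [(20.0, 20.0), (30.0, 30.0)]),  # rejected by the MBR filter
    (3, [(8.0, 13.0), (13.0, 8.0)]),    # MBR overlaps, geometry does not
]
matches = list(region_query_map(split, window))
```

Object 3 illustrates why the refinement step is needed: its MBR overlaps the window, so it survives the filter, but the segment itself passes outside the window corner.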

The execution time of the region query and the number of nodes are in an approximately linear relationship. When the node number is 6, the time consumed by the region query with MapReduce is 9.38% of that in Oracle Spatial, taking the time of creating the index into consideration.

4.1.2. Spatial Join Query

Figure 6. Performance of SJMR

Figure 6 shows the impact of different node numbers on the performance of SJMR. From this figure, it can be concluded that the performance of the SJMR algorithm is directly related to the number of nodes. As the number of nodes increases, the performance of SJMR improves markedly. From the analysis of SJMR, it can be deduced that better performance would be obtained with more nodes added. Significantly, the performance of SJMR shows good scalability with the number of nodes.

Compared with Oracle Spatial, SJMR shows a significant advantage. Parallel spatial algorithms with MapReduce perform efficiently, especially when the application is both data-intensive and computing-intensive.

5. CONCLUSIONS
This paper studies the evaluation of spatial queries with MapReduce. We describe how spatial queries can be naturally expressed with the MapReduce programming model, without explicitly addressing any of the details of parallelization.

The performance evaluation demonstrates the feasibility of processing spatial queries with MapReduce. Compared with Oracle Spatial, a traditional spatial data management architecture, spatial query evaluation with MapReduce achieved more efficient performance in spatial selection queries and the spatial join query, especially where there are no indexes on the data sets.

In the near future, we will optimize the spatial query algorithms with MapReduce and add new spatial analysis operators. We are also improving the MapReduce model so that it becomes more suitable for implementing spatial applications.

6. ACKNOWLEDGEMENT
This work was supported by the National Basic Research Program of China (973 Program, Grant No. 2004CB318202 and 2007CB310805), the National Natural Science Foundation of China (Grant No. 60752001), and the National High-Tech Research and Development Program of China (2009AA12Z226).

REFERENCES
[1] J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Communications of the ACM, vol. 51, no. 1, pp. 107, 2008.
[2] A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam, PVM: Parallel Virtual Machine: A Users' Guide and Tutorial for Networked Parallel Computing, MIT Press, Cambridge, MA, USA, 1995.
[3] W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message Passing Interface, MIT Press, 1999.
[4] David J. DeWitt and Michael Stonebraker, MapReduce: A major step backwards, Jan 2008.
[5] Chuck Murray, Oracle Spatial, User's Guide and Reference, 2005.
[6] Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, and Christos Kozyrakis, Evaluating MapReduce for multi-core and multiprocessor systems, hpca, vol. 0, pp. 13-24, 2007.
[7] Bingsheng He, Wenbin Fang, Qiong Luo, Naga K. Govindaraju, and Tuyong Wang, Mars: a MapReduce framework on graphics processors, in PACT '08: Proceedings of the 17th international conference on Parallel architectures and compilation techniques, New York, NY, USA, 2008, pp. 260-269, ACM.
[8] A. Bialecki, M. Cafarella, D. Cutting, and O. O'Malley, Hadoop: a framework for running applications on large clusters built of commodity hardware, Wiki at http://lucene.apache.org/hadoop.
[9] J.M. Patel and D.J. DeWitt, Partition based spatial-merge join, in Proceedings of the 1996 ACM SIGMOD international conference on Management of data. ACM New York, NY, USA, 1996, pp. 259-270.
[10] J.M. Patel and D.J. DeWitt, Clone join and shadow join: two parallel spatial join algorithms, in Proceedings of the 8th ACM international symposium on Advances in geographic information systems. ACM New York, NY, USA, 2000, pp. 54-61.
[11] J. Patel, J.B. Yu, N. Kabra, K. Tufte, B. Nag, J. Burger, N. Hall, K. Ramasamy, R. Lueder, C. Ellmann, et al., Building a scaleable geo-spatial DBMS: technology, implementation, and evaluation, ACM SIGMOD Record, vol. 26, no. 2, pp. 336-347, 1997.
[12] Jun Zhang, N. Mamoulis, D. Papadias, and Yufei Tao, All-nearest-neighbors queries in spatial databases, June 2004, pp. 297-306.
[13] P. Valduriez, Join Indices, ACM Transactions on Database Systems, vol. 12, no. 2, pp. 218-246, 1987.
[14] Yannis Manolopoulos, Closest pair queries in spatial databases, in Proceedings of the ACM-SIGMOD Conference on Management of Data, 2000, pp. 189-200.
[15] Hyoseop Shin, Bongki Moon, and Sukho Lee, Adaptive multi-stage distance join processing, in SIGMOD, 1999, pp. 343-354.
[16] UC Bureau, Census 2007 Tiger/Line data, 2007.
