
2010-2011 Facebook Computer Science Clinic Project

Improving the Hive Database System


Facebook

Facebook has over 500 million active users, who generate terabytes of data every day. Analyzing this data presents a challenge unsolvable with traditional database systems. Facebook chose to create the Hive distributed database system to address their needs.

• Hive exists on top of Hadoop, an open source distributed computation framework.
• It provides a familiar SQL-like syntax and table based storage to the users.
• It enables high-throughput queries on massive datasets.

[Figure: two layers, labeled "Hive: QL queries" and "Hadoop: MapReduce Jobs".]
Hive is a database layer on top of Hadoop, which uses MapReduce to distribute computation across multiple servers.

Indexing

Indexing is a technique that can be used to improve data lookup times in a table. An index is an auxiliary data structure that provides a faster means to access rows of a table by using the values of a particular set of columns as a key.

Hive supported a rudimentary indexing framework that only contained a single type of index. The Facebook clinic team worked on improving this framework by adding support for bitmap indexes, and by adding support for using indexes automatically when running queries.

Query Evaluation

Hive works by compiling query statements into a directed acyclic graph of MapReduce jobs, which can then be run on the underlying Hadoop cluster. The compiler is divided into various sections that are shown in the figure below.

Our work on automatic index usage was focused on the Optimizer. The Optimizer applies a series of optimizations, each of which rearranges the graph of MapReduce jobs to improve query run-time.

[Figure: The Hive Compiler. Input: SELECT user_id, gender FROM Data WHERE age="22". Stage labels: Parser, Semantic Analyzer, Logical Plan Generator, Physical Plan Generator, Optimizer. Intermediate results: Parse Tree, Logical Plan (graph of OP nodes), Physical Plan (graph of MR nodes), Optimized Plan. Output: Hadoop.]

Automatic Index Usage

Using indexes was difficult in the previous Hive system, as it required the user to understand how each type of index was implemented. Using an index to speed up a query required the user to:

• Query the index to produce an intermediate file of relevant regions of the table. The form of this intermediate query depends on the specific type of index being used.
• Set Hive to scan only the regions referenced by this intermediate file.

Speedup then comes from Hive having only to read in specific parts of the table to evaluate the query, instead of having to read in the entire table.

Automatic index usage allows users to benefit from indexes without having to understand implementation details. Our team worked on allowing indexes to be automatically used to speed up queries that contain WHERE clauses.

Automatic index usage was implemented as a standalone optimization. It receives a graph of MapReduce jobs and works by:

• Determining if the graph came from a query that can be sped up using an index.
• Generating additional MapReduce jobs that do the intermediary work of querying the index.
• Augmenting the original graph of jobs with the new jobs.

[Figure: the query SELECT user_id, gender FROM Data WHERE age="22", a row of eight boxes labeled "file", a row of four boxes labeled "file", and the label "Answer!".]

Bitmap Indexing

Bitmap indexing is an indexing technique that is effective for columns that hold few distinct values. Examples of such columns that may be present in Facebook’s databases include genders and relationship statuses.

A bitmap index uses a series of binary bit vectors to represent a column. The index uses one bit vector for each possible value of the column. Each entry in a vector represents a row, and it is set to 1 if the row contains the value of the vector and 0 otherwise.

Bitmap indexes are powerful because bit vectors can be efficiently combined using bit-wise operations, quickly eliminating rows that would otherwise need to be accessed in queries that combine several predicates.

User Statistics

user_id  gender  browser  os
100      Male    Chrome   Linux
101      Female  Firefox  Linux
102      Female  Chrome   Windows
103      Male    Safari   Mac OS X
104      Male    Firefox  Windows
105      Female  Chrome   Linux
106      Male    Safari   Windows
107      Male    Firefox  Windows

Bitmap Index

gender          browser                   os
Male  Female    Chrome  Firefox  Safari   Linux  Windows  OS X
1     0         1       0        0        1      0        0
0     1         0       1        0        1      0        0
0     1         1       0        0        0      1        0
1     0         0       0        1        0      0        1
1     0         0       1        0        0      1        0
0     1         1       0        0        1      0        0
1     0         0       0        1        0      1        0
1     0         0       1        0        0      1        0

This shows the layout of a bitmap index. There is a bit vector column for every option in every column of the original table.

Using the bitmap index eliminates many rows when combining the predicates in the query:

SELECT * WHERE gender=Male AND browser=Firefox AND os=Windows

                                                User Statistics
Male  Firefox  Windows  Male, Firefox, & Windows   user_id  gender  browser  os
1     0        0        0                          100      Male    Chrome   Linux
0     1        0        0                          101      Female  Firefox  Linux
0     0        1        0                          102      Female  Chrome   Windows
1 AND 0    AND 0      = 0                          103      Male    Safari   Mac OS X
1     1        1        1                          104      Male    Firefox  Windows
0     0        0        0                          105      Female  Chrome   Linux
1     0        1        0                          106      Male    Safari   Windows
1     1        1        1                          107      Male    Firefox  Windows

Deliverables

We have delivered the following products to Facebook:

• Source code patches for our work on automatic indexing and bitmap indexing.
• Documentation of our new features on their wiki.
• Benchmarking results shown below.

Benchmarking Results

Facebook requested we conduct benchmark tests to test the efficacy of the indexing framework. We executed test queries on columns with different numbers of distinct values:

• Many Distinct Values (e.g. user_id): Both indexing methods showed significant improvement. Compact was better than Bitmap, as expected.
• Few Distinct Values (e.g. browser): Both index methods also showed significant improvement. Bitmap was better than Compact, as expected.
• Average Distinct Values (e.g. access_date): On a column with an average number of distinct values, if many results are returned from the query, indexing does not give any advantage. If few results are returned from the query, both index methods are helpful.

All tests were conducted using a 5GB table with 45 million rows from the generic TPC-H dataset, not user data from Facebook.

[Figure: chart of Query Execution Time [s] (axis from 0 to 300) for No Indexes, Compact Index, and Bitmap Index in four cases: Many Distinct Values, Few Distinct Values, Many Values Returned, Few Values Returned.]
Execution time for different numbers of distinct values in column Col for queries of type SELECT * FROM Data WHERE Col = val

Acknowledgments

Facebook Liaisons: Jonathan Hsu ’01, John Sichi, Yongqiang He
Team Members: Skye Berghel, Jeffrey Lym, Russell Melick, Marquis Wang
Faculty Advisor: Robert Keller
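
As an illustration of the Bitmap Indexing section, the following is a minimal Python sketch of the idea, not the Hive implementation. It builds one bit vector per distinct value for the gender, browser, and os columns of the User Statistics table, then ANDs three vectors to answer the gender=Male AND browser=Firefox AND os=Windows query. The function names (build_bitmap_index, and_vectors, matching_user_ids) are hypothetical, and bit vectors are stored as plain Python lists for readability; a real index would presumably pack them into machine words.

```python
from functools import reduce
from operator import and_

# Rows of the User Statistics table: (user_id, gender, browser, os).
ROWS = [
    (100, "Male", "Chrome", "Linux"),
    (101, "Female", "Firefox", "Linux"),
    (102, "Female", "Chrome", "Windows"),
    (103, "Male", "Safari", "Mac OS X"),
    (104, "Male", "Firefox", "Windows"),
    (105, "Female", "Chrome", "Linux"),
    (106, "Male", "Safari", "Windows"),
    (107, "Male", "Firefox", "Windows"),
]
# Hypothetical mapping from column name to its position in each row tuple.
COLUMNS = {"gender": 1, "browser": 2, "os": 3}


def build_bitmap_index(rows, column_pos):
    """Return {value: bit vector}, with one vector per distinct column value."""
    index = {}
    for row_number, row in enumerate(rows):
        vector = index.setdefault(row[column_pos], [0] * len(rows))
        vector[row_number] = 1  # this row holds the vector's value
    return index


def and_vectors(*vectors):
    """Combine bit vectors position by position with bitwise AND."""
    return [reduce(and_, bits) for bits in zip(*vectors)]


def matching_user_ids(rows, indexes, predicates):
    """Return user_ids of rows that satisfy every column=value predicate."""
    vectors = [indexes[column][value] for column, value in predicates.items()]
    combined = and_vectors(*vectors)
    return [row[0] for row, bit in zip(rows, combined) if bit]


indexes = {name: build_bitmap_index(ROWS, pos) for name, pos in COLUMNS.items()}
result = matching_user_ids(
    ROWS, indexes, {"gender": "Male", "browser": "Firefox", "os": "Windows"}
)
print(result)
```

Only the rows whose combined bit is 1 would need to be read from the table, which is where the savings described in the poster come from.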
