
HIVE AND PRESTO FOR BIG DATA ANALYTICS IN THE CLOUD
Siva Narayanan
Qubole
snarayanan@qubole.com
@k2_181
`WHOAMI`
PhD in Large-scale scientific data management
Parallel query processing,
Greenplum Parallel Database
Hadoop, Hive, Presto at Qubole
Niche scientific simulation apps
Fortune companies
Small and medium enterprises
WHAT'S NEW ABOUT BIG DATA, YOU SAY?
Traditionally, analytics on data internal to an organization
Customer data
ERP data
Some pre-digested external data like market research
Sophisticated analytics using new data sources
Social data
Website data
Low density, fine grained and massive
Most EDWs are < 2TB
LOW DENSITY, HIGH VOLUME DATA
Amul comment data: 18000 * 140 * 60 * 24 * 30 = 100 GB per month
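The back-of-the-envelope arithmetic above can be checked quickly. The per-unit interpretation (18,000 comments per minute at roughly 140 bytes each) is an assumption, since the slide gives only the bare numbers:

```python
# Back-of-the-envelope check of the slide's arithmetic.
# Assumed interpretation: 18,000 comments/minute at ~140 bytes each.
comments_per_minute = 18_000
bytes_per_comment = 140
minutes_per_hour, hours_per_day, days_per_month = 60, 24, 30

bytes_per_month = (comments_per_minute * bytes_per_comment
                   * minutes_per_hour * hours_per_day * days_per_month)
gb_per_month = bytes_per_month / 1e9
print(f"{gb_per_month:.0f} GB per month")  # ~109 GB, i.e. on the order of 100 GB
```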
Category                  Unique visitors
Retail: luxury goods      20 million
Retail: consumer goods    30 million
Retail: tickets           26 million
(Social media / website data)
Traditional technologies cannot handle this low-density, high volume data
SKELETON OF A BIG DATA PROJECT
(Diagram: internal data and external data, TBs to PBs, feed an analytics workflow that produces actionable reports)
HOW DO THE BIG GUYS DO IT?
Build data centers
Buy or build custom big-data software
Hire ETL engineers who manage bringing data into the system
Hire admins to keep it all running
Hire data scientists to come up with interesting questions
Hire developers who can translate questions into programs
Lots of upfront investment
Long time to get started
Lots of risks
BIG DATA PROJECT ENTAILS
LANDSCAPE IS CHANGING
Advent of public clouds
Cheap, reliable storage
Provision 10-1000s of machines in a couple of minutes
Pay as you go, grow as you please
Free / inexpensive big-data software
Hadoop, Hive, Presto
CLOUD PRIMITIVES
Persistent object store e.g. AWS S3
Reliability is basically solved for you (*)
Ability to provision clusters with pre-built images in a couple of minutes
Pay by the hour (or by the minute)
Spot instances (AWS)
Relational DB as a Service
MySQL, PostgreSQL etc
THE CLOUD CAN HANDLE YOUR DATA
CLOUD'S COMPUTE FLEXIBILITY
Analytics workloads tend to be bursty
Most orgs struggle to predict usage 2-3 months down the line
Tend to overprovision compute
Result: < 30% utilization of their hardware
Cloud allows you to scale up and down
Trickier for a big data system, but possible
Provision for peak workload (Chen et al., VLDB 2012)
BIG DATA SOFTWARE
Many open source projects
Hadoop, based on Google's MapReduce paper (Yahoo)
Hive (SQL-on-Hadoop)
Presto (Fast SQL)
Production ready, running at scale at Yahoo, FB and many other environments
ENTER HADOOP
Open-source implementation of the MapReduce approach Google used to index trillions of web pages
Allows programmers to write distributed programs using map and reduce abstractions
Ability to run these programs on large amounts of data
Uses a bunch of cheap hardware, can tolerate failures
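The map and reduce abstractions can be illustrated with a toy word count. This is a local Python sketch of the programming model, not actual Hadoop code; the "shuffle" step that Hadoop performs between the phases is simulated with a dictionary:

```python
from collections import defaultdict

# Toy word count illustrating the map/reduce programming model
# (runs locally; Hadoop would distribute these phases across machines).
def map_phase(line):
    # Mapper: emit (word, 1) for every word in an input line.
    for word in line.split():
        yield word, 1

def reduce_phase(word, counts):
    # Reducer: sum all partial counts for one key.
    return word, sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle: group mapper output by key, as the framework would.
grouped = defaultdict(list)
for line in lines:
    for word, count in map_phase(line):
        grouped[word].append(count)

result = dict(reduce_phase(w, c) for w, c in grouped.items())
print(result)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```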
HADOOP SCALES!
HIVE: SQL ON HADOOP
Facebook had a multi-petabyte warehouse
Had 80+ engineers writing Hadoop jobs
Files are insufficient data abstractions
Need tables, schemas, partitions, indices
SQL is highly popular
So, implement SQL on top of Hadoop
Allowed non-programmers to process all the data
FB open-sourced it
Production ready
Processes 25PB of data in FB
Processes 20PB of data at Qubole
HIVE ALLOWS YOU TO DESCRIBE DATA
Example
My data lives in Amazon S3 in a specific location
It is in delimited text format
Please create a virtual table for me
Number of data formats: JSON, Text, Binary, Avro, ProtoBuf, Thrift
Analytics is often a downstream process
Conversion of data is time consuming and not productive
create external table nation (
    N_NATIONKEY INT,
    N_NAME STRING,
    N_REGIONKEY INT,
    N_COMMENT STRING)
ROW FORMAT DELIMITED
STORED AS TEXTFILE
LOCATION 's3n://public-qubole/datasets/tpch5G/nation';
HIVE EXTENSIBILITY
Connect to external data sources like MongoDB
Write code to understand new data formats - serdes
Custom UDFs in Java
Plug in custom code in python or any other language
SELECT
TRANSFORM (hosting_ids, user_id, d)
USING 'python combine_arrays.py' AS (hosting_ranks_array, user_id, d)
FROM s_table;
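The deck does not show `combine_arrays.py` itself. A Hive TRANSFORM script is simply a program that reads tab-separated rows on stdin and writes tab-separated rows to stdout; the sketch below illustrates that protocol with a made-up transformation (reversing the id list), not the actual script's logic:

```python
import sys

# Hypothetical TRANSFORM script: Hive streams each row to stdin as a
# tab-separated line and reads tab-separated output lines back.
# The transformation here is illustrative only.
def transform(line):
    hosting_ids, user_id, d = line.rstrip("\n").split("\t")
    # Example transformation: reverse the comma-separated id list.
    ranks = ",".join(reversed(hosting_ids.split(",")))
    return "\t".join([ranks, user_id, d])

def main(stream=sys.stdin):
    # Inside Hive, every input row would flow through here.
    for line in stream:
        print(transform(line))

# Local demonstration with one sample row:
print(transform("h1,h2,h3\tu42\t2014-01-01"))  # h3,h2,h1   u42   2014-01-01
```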
HIVE ALLOWS YOU TO QUERY THE DATA
SQL-Like
Query is parallelized using Hadoop as execution engine
Select count(*) from nation;
(Diagram: each node computes a partial Count(*); a final Sum() combines the partials)
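The partial-aggregation idea behind the distributed count(*) can be sketched in miniature: each worker counts only its own rows, and a final step sums the partial counts:

```python
# Partial aggregation: each worker counts its own rows,
# and a final step sums the partial counts.
partitions = [["row"] * 100, ["row"] * 250, ["row"] * 75]  # 3 workers' data

partial_counts = [len(part) for part in partitions]  # per-worker Count(*)
total = sum(partial_counts)                          # final Sum() of partials
print(total)  # 425
```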
HIVE EXECUTION
Split Hive query into multiple Hadoop/MR jobs
Run Job 1, save intermediate output to HDFS
Run Job 2..
Return results
Data parallel: every Hadoop job runs across a number of machines
(Diagram: task T1 runs in two stages, 100 MB each, reading 10 input files and producing 5 + 5 intermediate files)
TASK PARALLELISM
(Diagram: tasks T1, T2, T3, 100 MB each, over 10 input files)
EXECUTION MODEL 1
(Diagram: tasks T1, T2, T3, 100 MB each, run one after another over 10 input files)
Only 100MB of memory required
Can stop and resume
Allows for multiplexing multiple pipelines
Can tolerate failures
Spilling can be expensive
Time to first result is high
EXECUTION MODEL 2
(Diagram: tasks T1, T2, T3, 100 MB each, run simultaneously as a pipeline over 10 input files)
Task parallelism
Needs 3X memory
No spilling, hence much faster
Early first results
Stop and resume is trickier
Multiplexing is more difficult
Cannot tolerate failures
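The two models can be contrasted in miniature: EM1 materializes each stage's full output before the next stage starts (as Hive does via HDFS between MR jobs), while EM2 pipelines rows through all stages at once (as Presto does in memory). A toy sketch using Python generators as pipeline stages:

```python
# Three pipeline stages, written as generators.
def scan():
    for i in range(10):
        yield i

def double(rows):
    for r in rows:
        yield r * 2

def keep_small(rows):
    for r in rows:
        if r < 10:
            yield r

# EM1: materialize each stage's entire output before the next runs
# (Hive writes these intermediates to HDFS between MR jobs).
stage1 = list(scan())
stage2 = list(double(stage1))
em1_result = list(keep_small(stage2))

# EM2: chain the generators so each row flows through the whole
# pipeline without materializing intermediates (Presto keeps them
# in memory).
em2_result = list(keep_small(double(scan())))

assert em1_result == em2_result  # same answer, different execution
print(em1_result)  # [0, 2, 4, 6, 8]
```

EM1 needs memory (or disk) for each intermediate but can restart from any stage; EM2 holds all stages live at once, which is why it needs more memory and cannot tolerate mid-query failures.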
ENTER PRESTO
Hive used EM1 and had its associated disadvantages
Internal project at Facebook to implement EM2 (Presto)
Use case was interactive queries over the same data
Open sourced late 2013
Promised much faster query performance
In-memory processing, aggressive pipelining
Supports all the data formats that Hive does
Can't plug in user code at this point; vanilla SQL only
CONTRASTING HIVE AND PRESTO
Hive                                    Presto
Uses Hadoop MR for execution (EM1)      Pipelined execution model (EM2)
Spills intermediate data to FS          Intermediate data in memory
Can tolerate failures                   Does not tolerate failures
Automatic join ordering                 User-specified join ordering
Can handle joins of two large tables    One table needs to fit in memory
Supports grouping sets                  Does not support grouping sets
Plug in custom code                     Cannot plug in custom code
More data types                         Limited data types
Hive 0.11 vs Presto 0.60
PERFORMANCE COMPARISON
Presto is 2.5-7x faster
But, some queries just run out of memory
Contrasts the execution models
IN A NUTSHELL
SAMPLE SETUP
(Diagram: application data syncs into cloud storage via Sqoop; Hive serves the heavy-duty queries and Presto the interactive queries over the same data)
CRYSTAL BALL
Hive is actively working on task parallelism as part of the Stinger Initiative
Presto is also making rapid progress in bridging some of its gaps
There are other open source projects:
Impala, Shark, Drill, Tajo
Lots of goodies for users
CONCLUSION
Big Data Analytics is becoming accessible and affordable
Public clouds give flexibility and change economics
Hive and Presto provide intuitive and powerful ways to interact with your data
Sign up for a free trial at Qubole.com
Get access to Hive, Presto, Hadoop, Pig as a Service on Amazon and Google cloud services
Siva snarayanan@qubole.com / @k2_181
QUESTIONS
Where should data be stored?
What formats are appropriate?
What kinds of processing needs to happen?
What parts are expressible in ANSI-SQL?
How can I plug-in proprietary business logic?
How much compute power is required?
How do I put it all together?
