
Everest

Scaling to Petabytes

Yahoo!
May 2008
Everest Architecture

• Massively Parallel (Tens of PB)
– Commodity Clusters
– Multi-tier scalability
– Distributed Columnar Storage
• Smart
– Optimized compression
– Parallel Vector Query Processing
– Query and Storage optimizations
– Query Expression and Columnar caching
• Leverage PostgreSQL
– Tools and Connectivity (ODBC)
– Extensibility
– UDF & UDAF framework
• Inexpensive
– COTS

[Architecture diagram. Labels, top to bottom: Segment Scripts, PgAdmin, Manager Apps, and Clients (PERL/Ruby.DBI, ODBC, ADO.NET, PQLib); PostgreSQL Lib; PostgreSQL Server with Everest Extensions (Segmentation QP); Distributed QP Platform (Query, Trans Proxy, Mgmt Proxy, LSM Proxy, Shared Memory); Server Mgmt Services, Logical Storage Manager, Trans Server, Volume Server; Node Storage Manager (Storage Provider, Storage Proxy, Storage Cache, Asynchronous Communications); Volume Storage Manager (Storage Server, Storage Provider, Storage Proxy, Storage Cache, Asynchronous Communications); Chunk Storage, Volume Services, Storage Metadata; Volumes.]
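The "Distributed QP" and "UDF & UDAF framework" items can be illustrated with a user-defined aggregate that is evaluated per segment and then combined. The following is a minimal sketch, assuming a hypothetical accumulate/merge/finalize contract; the slides do not describe Everest's actual UDAF interface, and all names here are illustrative.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class AvgState:
    """Partial state for AVG: running sum and row count (hypothetical)."""
    total: float = 0.0
    count: int = 0


def accumulate(state: AvgState, value: float) -> AvgState:
    """Fold one row into a segment-local partial state."""
    return AvgState(state.total + value, state.count + 1)


def merge(a: AvgState, b: AvgState) -> AvgState:
    """Combine two partial states produced on different segments."""
    return AvgState(a.total + b.total, a.count + b.count)


def finalize(state: AvgState) -> Optional[float]:
    """Turn the combined state into the final answer (None if no rows)."""
    return state.total / state.count if state.count else None


def distributed_avg(segments: Iterable[List[float]]) -> Optional[float]:
    # Phase 1: each segment aggregates only its own rows.
    partials = []
    for rows in segments:
        state = AvgState()
        for value in rows:
            state = accumulate(state, value)
        partials.append(state)
    # Phase 2: a coordinator merges the small partial states.
    combined = AvgState()
    for partial in partials:
        combined = merge(combined, partial)
    return finalize(combined)


segments = [[10.0, 20.0], [30.0], [40.0, 50.0, 60.0]]
result = distributed_avg(segments)
```

Only the small partial states cross segment boundaries, which is the usual reason a merge step is part of an aggregate contract in a parallel engine.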
Performance and Scale

• Proven petabyte scale in production
– Approaching 2 PB, projected to grow to more than 30 PB by 2009
– Largest table: 3.5 trillion rows (time partitioned)
• 10x Price-Performance relative to commercial systems

Data size             Everest   Vendor A   Vendor B
                      (min)     (min)      (min)
90 TB (600 B rows)    177       414        325
30 TB (200 B rows)    60        95         91
HW Cost (1 PB)        250       1200       1200

[Chart: Performance comparison. Response Time (min) against Data size (30 TB, 90 TB) for Vendor A, Vendor B, and Everest.]
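The 10x price-performance figure can be checked against the numbers on this slide. The sketch below assumes price-performance gain is the response-time ratio multiplied by the hardware-cost ratio; that definition is an assumption, since the slide does not say how the 10x was derived.

```python
# Figures copied from the comparison table on this slide.
response_min = {
    "30 TB": {"Everest": 60, "Vendor A": 95, "Vendor B": 91},
    "90 TB": {"Everest": 177, "Vendor A": 414, "Vendor B": 325},
}
hw_cost = {"Everest": 250, "Vendor A": 1200, "Vendor B": 1200}


def price_performance_gain(size: str, vendor: str) -> float:
    """Assumed definition: (how much faster) x (how much cheaper)."""
    speedup = response_min[size][vendor] / response_min[size]["Everest"]
    cost_ratio = hw_cost[vendor] / hw_cost["Everest"]
    return speedup * cost_ratio


for size in response_min:
    for vendor in ("Vendor A", "Vendor B"):
        gain = price_performance_gain(size, vendor)
        print(f"{size} vs {vendor}: {gain:.1f}x")
```

Under this assumption the gain ranges from about 7x (30 TB) to about 11x (90 TB against Vendor A), which is consistent with a rounded 10x claim.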

Everest Performance Advantages

• Source of Performance and Scale
– Distributed Compressed Columnar Storage
– Highly Parallel and Asynchronous
  • Multi-threaded Query Execution as well as Storage
– Vector Query Processing
– Multi-level data partitioning and query partitioning
– Cluster-level Compressed Columnar caching
– Query expression caching
– Yahoo!-specific language extensions, UDFs, and UDAFs
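To make the benefit of compressed columnar storage concrete, here is a minimal sketch of a run-length-encoded column that is filtered and counted without decompressing it. Run-length encoding is chosen only for illustration; the slides do not say which compression schemes Everest uses, and every function name below is hypothetical.

```python
from itertools import groupby


def rle_encode(column):
    """Compress a column into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, length in runs for _ in range(length)]


def count_where(runs, predicate):
    """Count matching rows by testing each run once instead of each row."""
    return sum(length for value, length in runs if predicate(value))


# A sorted or time-partitioned column tends to contain long runs.
country = ["US"] * 4 + ["IN"] * 3 + ["US"] * 2
runs = rle_encode(country)
us_rows = count_where(runs, lambda value: value == "US")
```

Here nine rows compress to three runs, and the predicate is evaluated three times rather than nine. Operating on whole runs or blocks of a single column at a time is the general idea behind pairing columnar compression with vector-style processing.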
