By: Venkat
Problems with traditional RDBMS
Stores relational data
Works well for a limited number of records
SQL JOINs become a bottleneck
Storage space for NULL values
Static Schema
Traditional RDBMS Table - Employee
Column Oriented Database - Dynamic Columns
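The contrast between a static schema that stores NULLs and a column-oriented layout with dynamic columns can be sketched in a few lines of Python. This is only an illustration: the Employee columns, the sample values, and the helper names are hypothetical, and plain dictionaries stand in for real storage.

```python
# Fixed-schema (RDBMS-style) rows: every row carries every column,
# so a missing value still occupies a slot as None (NULL).
COLUMNS = ["id", "name", "age", "phone", "fax"]  # hypothetical Employee columns
rdbms_rows = [
    (1, "Asha", 31, None, None),
    (2, "Ravi", None, "555-0100", None),
]

# Column-oriented style: each row stores only the cells it actually has,
# keyed by "family:qualifier". Absent columns cost nothing.
column_store = {
    1: {"info:name": "Asha", "info:age": 31},
    2: {"info:name": "Ravi", "contact:phone": "555-0100"},
}


def count_null_slots(rows):
    """Count the NULL placeholders a fixed schema has to keep."""
    return sum(1 for row in rows for value in row if value is None)


def count_stored_cells(store):
    """Count the cells a sparse, column-oriented layout actually stores."""
    return sum(len(cells) for cells in store.values())


null_slots = count_null_slots(rdbms_rows)
cells_before = count_stored_cells(column_store)

# Dynamic column: add a new qualifier to one row without altering any other row
# and without a schema change.
column_store[2]["contact:twitter"] = "@ravi"
cells_after = count_stored_cells(column_store)
```

In this toy data the fixed-schema rows hold four NULL placeholders, while the sparse layout holds none and grows by exactly one cell when the new column is added.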
Problems with HIVE
No DML Operations
Query performance on very large Datasets
Static Schema
Versions
What is HBase?
Built on top of Hadoop
Distributed; uses HDFS for storage
Column Oriented Database
Multi-Dimensional (Versions)
Read / Write access to data on HDFS
Storage System
What HBase is NOT?
A SQL Database - No JOINs, No Query Engine,
No Datatypes, No SQL
No Schema
No DBA needed
HBase vs RDBMS
Data Layout: Row-oriented (RDBMS); Column-family-oriented (HBase)
Example cell: Row Key 1, Column Info:age, Timestamp 1273871824184, Value 21
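Reading the example cell above as row key 1, column Info:age, timestamp 1273871824184, and value 21, the multi-dimensional (versioned) layout can be modeled as a nested map. The sketch below is a toy in-memory model, not the HBase client API: the class name `ToyTable`, the `max_versions` default, and the second timestamp and value are assumptions made for illustration.

```python
from collections import defaultdict


class ToyTable:
    """Toy model of HBase's logical layout: row -> column -> timestamp -> value."""

    def __init__(self, max_versions=3):
        # Assumed to be at least 1; how many timestamped versions a cell keeps.
        self.max_versions = max_versions
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        """Store a value as a new version instead of overwriting the old one."""
        versions = self._rows[row][column]
        versions[timestamp] = value
        # Drop the oldest versions once the cell holds more than max_versions.
        for old in sorted(versions)[:-self.max_versions]:
            del versions[old]

    def get(self, row, column, timestamp=None):
        """Return the newest value, or the newest one at or before `timestamp`."""
        versions = self._rows.get(row, {}).get(column, {})
        candidates = [ts for ts in versions if timestamp is None or ts <= timestamp]
        if not candidates:
            return None
        return versions[max(candidates)]


table = ToyTable()
table.put("1", "Info:age", 1273871824184, 21)  # the cell shown in the text
table.put("1", "Info:age", 1273871824185, 22)  # hypothetical later version

latest = table.get("1", "Info:age")
as_of_first_write = table.get("1", "Info:age", timestamp=1273871824184)
```

Here `latest` is the newer value and `as_of_first_write` is the original one, which is what "Versions" refers to: a cell is addressed by row key, column, and timestamp rather than by row and column alone.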
An HBase Namespace provides users with a project space in which they can create and
manage their own tables.
Table: All tables are members of a namespace; a table with no explicit namespace
is a member of the default namespace. A table can be a member of only one
namespace, and once defined, that membership is permanent.
RSG: A Namespace may optionally have a default region server group. All the tables
created in the Namespace will be members of the namespace's region server group.
A namespace can reference only one region server group, after which that group can
no longer be referenced by other namespaces. This can only be set during namespace creation.
Permissions: A Namespace can have ACLs defined. Write access granted
to a namespace permits table creation in that Namespace.
This gives tenants their own domain of administration within the HBase cluster.
Quota: Quotas provide the level of control required to ensure that shared resources
are allocated fairly. As a first step, the intent is only to limit the number of tables and
regions a given namespace may contain.
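The namespace rules above (a default namespace for unqualified tables, one namespace per table, and a per-namespace table quota) can be sketched as a small in-memory registry. This is a toy model under stated assumptions, not HBase's implementation: `ToyNamespaceRegistry`, `QuotaExceeded`, and the `max_tables` parameter are hypothetical names, and only the table limit is modeled, not the region limit.

```python
class QuotaExceeded(Exception):
    """Raised when a namespace already holds its maximum number of tables."""


class ToyNamespaceRegistry:
    DEFAULT = "default"  # tables without an explicit namespace land here

    def __init__(self):
        self._tables = {self.DEFAULT: set()}  # namespace -> table names
        self._max_tables = {}                 # namespace -> table quota, if any

    def create_namespace(self, name, max_tables=None):
        if name in self._tables:
            raise ValueError(f"namespace {name!r} already exists")
        self._tables[name] = set()
        if max_tables is not None:
            self._max_tables[name] = max_tables

    def create_table(self, qualified_name):
        """Accept 'namespace:table' or a bare 'table' (default namespace)."""
        namespace, sep, table = qualified_name.partition(":")
        if not sep:
            namespace, table = self.DEFAULT, qualified_name
        if namespace not in self._tables:
            raise KeyError(f"unknown namespace {namespace!r}")
        limit = self._max_tables.get(namespace)
        if limit is not None and len(self._tables[namespace]) >= limit:
            raise QuotaExceeded(f"{namespace!r} is limited to {limit} table(s)")
        self._tables[namespace].add(table)
        return f"{namespace}:{table}"


registry = ToyNamespaceRegistry()
registry.create_namespace("sales", max_tables=1)    # hypothetical tenant namespace
orders = registry.create_table("sales:orders")      # explicit namespace
employee = registry.create_table("employee")        # falls into the default namespace
```

A second `create_table("sales:...")` call would raise `QuotaExceeded` in this model, which mirrors the idea of limiting how many tables a tenant's namespace may contain.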
CLI Commands :
Ways to access HBase Data
HBase Shell
Thrift Server
REST Clients
Hadoop ecosystem clients (Hive, Pig, HCatalog...etc)
Hive vs HBase
Hive | HBase
Hive is an SQL-like engine that runs MapReduce jobs on Hadoop | HBase is a NoSQL key/value database
Hive can be used for analytical querying, such as on data collected over a period of time | HBase can be used for real-time querying
Provides data summarization and ad-hoc querying | Supports data storage for large tables
Data can even be read and written from Hive to HBase and back again | Not possible in HBase
Where to use HBase (Use Cases)?
To have random, real-time read/write access to Big Data
Fast random access to available data
Variable schema where each row is slightly different
Loading, searching, querying data by Row Key
Retrieving a small set of data from billions of records
Where not to use HBase?
If you plan to scan the entire HBase table or the majority of it
If you are not using a filter against the rowkey column in your query
Using "LIKE" against the rowkey column does not perform well
When creating external tables in Hive against HBase tables,
map the HBase rowkey against a string column in Hive.
If this is not done, the rowkey is not used in the query and the entire
table is scanned.
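Why a rowkey filter is cheap while a "LIKE"-style match forces a full scan can be illustrated with a sorted list of keys. The sketch assumes only that rows are kept sorted by row key; the key format `userNNNN` and the function names are hypothetical, and a Python list stands in for the table.

```python
from bisect import bisect_left

# Hypothetical row keys, kept in sorted order as a stand-in for an HBase table.
row_keys = sorted(f"user{n:04d}" for n in range(1000))


def prefix_scan(keys, prefix):
    """Seek to the first key >= prefix, then read until the prefix stops matching.

    Returns (matches, rows_examined).
    """
    start = bisect_left(keys, prefix)
    matches = []
    examined = 0
    for index in range(start, len(keys)):
        examined += 1
        if not keys[index].startswith(prefix):
            break  # sorted order guarantees no later key can match
        matches.append(keys[index])
    return matches, examined


def contains_scan(keys, fragment):
    """A LIKE '%fragment%'-style match cannot use the sort order: every row is read.

    Returns (matches, rows_examined).
    """
    matches = [key for key in keys if fragment in key]
    return matches, len(keys)


prefix_matches, prefix_examined = prefix_scan(row_keys, "user012")
like_matches, like_examined = contains_scan(row_keys, "012")
```

With these 1,000 keys the prefix scan examines 11 rows to return 10 matches, while the substring match examines all 1,000 rows. The same gap, scaled to billions of records, is the reason queries should filter on the row key.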