Hadoop Vs SQL Processing

HADOOP
VS
SQL
A COMPARATIVE INDEPENDENT STUDY
SUBMITTED BY
LAXMAN PANDRAMISH
INDEPENDENT STUDY HADOOP VS SQL

TABLE OF CONTENTS
What’s the Study About?
HADOOP – Open Source Project
SQL – Structured Query Language
HADOOP Processing
SQL Processing
Traditional Differences
Practical Differences
Overview
References

WHAT’S THE STUDY ABOUT?
Hadoop is replacing relational database management systems (RDBMS) in
many cases, especially in data warehousing, business intelligence
reporting, and other analytical processing. Complex reporting in these
applications becomes a real challenge as the size of the data grows
exponentially. At the same time, customers demand complex analysis and
reporting on that data. So Hadoop vs SQL database is a pertinent question
when you select the data storage and processing framework for your next
project.
Many people are concerned about this question: IS SQL BETTER? Or IS
HADOOP BETTER?
This study briefly explains SQL and Hadoop and compares them based on
execution and outputs.
It covers the traditional differences and the practical differences
observed in a real project executed in both Hadoop and SQL.

HADOOP – OPEN SOURCE PROJECT
Apache Hadoop is a collection of open-source software utilities that
facilitate using a network of many computers to solve problems involving
massive amounts of data and computation. It provides a software
framework for distributed storage and processing of big data using the
MapReduce programming model.
The base Apache Hadoop framework is composed of the following
modules:
• Hadoop Common – contains libraries and utilities needed by other
Hadoop modules;
• Hadoop Distributed File System (HDFS) – a distributed file system
that stores data on commodity machines, providing very high
aggregate bandwidth across the cluster;
• Hadoop YARN – introduced in 2012, a platform responsible for
managing computing resources in clusters and using them for
scheduling users' applications; and
• Hadoop MapReduce – an implementation of the MapReduce
programming model for large-scale data processing.

Apache Hadoop's MapReduce and HDFS components were inspired
by Google's papers on MapReduce and the Google File System.
The Hadoop framework itself is mostly written in the Java programming
language, with some native code in C and command-line utilities
written as shell scripts. Though MapReduce Java code is common, any
programming language can be used with "Hadoop Streaming" to
implement the "map" and "reduce" parts of the user's program.
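The "map" and "reduce" parts mentioned above can be illustrated with a small in-memory sketch. It is not Hadoop code: plain Scala collections stand in for the cluster, and the names MapReduceSketch, mapPhase, shuffle, reducePhase and countByKey are hypothetical. It only shows how records are turned into key-value pairs, grouped by key, and then reduced.

```scala
// Minimal in-memory sketch of the map -> shuffle -> reduce flow.
// Assumption: plain Scala collections stand in for a Hadoop cluster;
// all names here are hypothetical and chosen for illustration.
object MapReduceSketch {

  // "map" phase: emit one (key, 1) pair per input record.
  def mapPhase(records: Seq[String]): Seq[(String, Int)] =
    records.map(record => (record, 1))

  // "shuffle" phase: group the emitted values by key.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (key, kvs) => (key, kvs.map(_._2)) }

  // "reduce" phase: sum the values collected for each key.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (key, values) => (key, values.sum) }

  // The three phases chained together.
  def countByKey(records: Seq[String]): Map[String, Int] =
    reducePhase(shuffle(mapPhase(records)))

  def main(args: Array[String]): Unit = {
    val outcomes = Seq("success", "failure", "failure", "unknown", "success", "failure")
    println(countByKey(outcomes))
  }
}
```

On a real cluster the map and reduce functions run on different nodes and the shuffle moves data over the network; the sketch keeps only the data flow.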
Benefits of Hadoop
• Scalability and Performance – distributed processing of data local to
each node in a cluster enables Hadoop to store, manage, process and
analyze data at petabyte scale.
• Reliability – large computing clusters are prone to failure of individual
nodes in the cluster. Hadoop is fundamentally resilient – when a node
fails, processing is redirected to the remaining nodes in the cluster
and data is automatically re-replicated in preparation for future node
failures.
• Flexibility – unlike traditional relational database management
systems, you don’t have to create structured schemas before storing
data. You can store data in any format, including semi-structured or
unstructured formats, and then parse and apply a schema to the data
when it is read.
• Low Cost – unlike proprietary software, Hadoop is open source and
runs on low-cost commodity hardware.
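The flexibility point (store first, apply a schema when reading) can be sketched as follows. This is a simplified illustration under stated assumptions: the "_" delimiter, the three fields, and the names SchemaOnRead, BankRecord and parse are hypothetical and do not describe any particular data set.

```scala
// Schema-on-read sketch: raw text is stored untouched and a schema is
// applied only when the data is read.
// Assumptions: the "_" delimiter, the three fields, and all names here
// are illustrative only.
object SchemaOnRead {

  final case class BankRecord(age: Int, marital: String, balance: Double)

  // Apply the schema to one raw line; malformed lines yield None
  // instead of failing the whole read.
  def parse(line: String, delimiter: Char = '_'): Option[BankRecord] =
    line.split(delimiter) match {
      case Array(age, marital, balance) =>
        scala.util.Try(
          BankRecord(age.trim.toInt, marital.trim, balance.trim.toDouble)
        ).toOption
      case _ => None
    }

  def main(args: Array[String]): Unit = {
    val rawLines = Seq("58_married_2143", "not a record", "33_single_2")
    // Only lines that fit the schema survive the read.
    rawLines.flatMap(parse(_)).foreach(println)
  }
}
```

A relational database would reject the malformed line at load time; here the decision is deferred to the reader.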

SQL – STRUCTURED QUERY LANGUAGE
SQL (Structured Query Language) is a domain-specific language used in
programming and designed for managing data held in a relational
database management system (RDBMS), or for stream processing in
a relational data stream management system (RDSMS).
Originally based upon relational algebra and tuple relational calculus,
SQL consists of many types of statements, which may be informally
classed as sublanguages, commonly: a data query language (DQL), a data
definition language (DDL), a data control language (DCL), and a data
manipulation language (DML). The scope of SQL includes data query, data
manipulation (insert, update and delete), data definition (schema
creation and modification), and data access control.
The SQL language is subdivided into several language elements,
including:
• Clauses, which are constituent components of statements and
queries. (In some cases, these are optional.)
• Expressions, which can produce either scalar values, or tables
consisting of columns and rows of data.
• Predicates, which specify conditions that can be evaluated to SQL
three-valued logic (true/false/unknown) or Boolean truth values and
are used to limit the effects of statements and queries, or to change
program flow.
• Queries, which retrieve the data based on specific criteria. This is an
important element of SQL.

• Statements, which may have a persistent effect on schemata and
data, or may control transactions, program flow, connections,
sessions, or diagnostics.
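The three-valued logic mentioned for predicates can be made concrete with a short sketch. It models UNKNOWN (the result of comparing with NULL) as None; the names ThreeValued, and3, or3, not3 and keepsRow are hypothetical.

```scala
// Sketch of SQL three-valued logic, modelling UNKNOWN as None.
// All names here are hypothetical and chosen for illustration.
object ThreeValued {
  // Some(true), Some(false), or None for UNKNOWN.
  type Truth = Option[Boolean]

  def and3(a: Truth, b: Truth): Truth = (a, b) match {
    case (Some(false), _) | (_, Some(false)) => Some(false) // FALSE dominates AND
    case (Some(true), Some(true))            => Some(true)
    case _                                   => None
  }

  def or3(a: Truth, b: Truth): Truth = (a, b) match {
    case (Some(true), _) | (_, Some(true)) => Some(true) // TRUE dominates OR
    case (Some(false), Some(false))        => Some(false)
    case _                                 => None
  }

  // NOT UNKNOWN stays UNKNOWN.
  def not3(a: Truth): Truth = a.map(!_)

  // A WHERE clause keeps a row only when the predicate is TRUE.
  def keepsRow(predicate: Truth): Boolean = predicate.contains(true)
}
```

This is why a row with a NULL value is filtered out by both a condition and its negation: neither evaluates to TRUE.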
Advantages of SQL
SQL queries can be used to retrieve large numbers of records from a
database quickly and efficiently.
SQL views let users see data without storing it in a separate object.
SQL can join two or more tables and present the result to the user as
a single object.
SQL databases use a long-established standard that has been adopted
by ANSI and ISO. Non-SQL databases do not adhere to any clear
standard.
Using standard SQL, it is easier to manage database systems without
having to write a substantial amount of code.
SQL can restrict access to a table so that unauthorized users cannot
insert rows into it.

HADOOP Processing
To compare Hadoop and SQL processing, a project named Banking
Analysis was selected.
It uses a large Excel data set of approximately 45,000 rows, processed
on both the Hadoop and SQL platforms.
Data set showing bank data details

Hadoop processing was carried out on an Oracle VM, with Spark initialized
Oracle VirtualBox Installation
Oracle VirtualBox is installed to run Hadoop on the system in use
In Phase 1 of this project, the data set is loaded on the virtual machine and processed using
the Hadoop installation that comes pre-installed on it
Oracle VirtualBox
The Oracle VirtualBox machine runs alongside the PC with the same network privileges; it has Eclipse, Java,
and Hadoop pre-installed

Data Set Analysis
Start the virtual machine
The data set selected is a Portuguese bank data set
It contains approximately 45,000 rows and must be organized before any analysis
is performed on it
Data Frame Creation
The data in the Excel sheet must be organized before it is analyzed
The Excel data is therefore converted to a data frame so that it can be analyzed and
the necessary operations can be performed on it

To create the data frame, we must first start Hadoop from the terminal
$ hadoop
Then copy the local file to the Hadoop cluster
$ hadoop fs -mkdir project
$ hadoop fs -copyFromLocal final.csv project
$ hadoop fs -ls project
Then start the Spark shell with the Databricks spark-csv package
$ spark-shell --packages com.databricks:spark-csv_2.10:1.4.0
Code for data frame creation
// Read the CSV with a header row, infer column types, and split on "_".
// The path is relative to the HDFS home directory, matching the copy above.
val df = sqlContext.read.format("com.databricks.spark.csv").
  option("header", "true").
  option("inferSchema", "true").
  option("delimiter", "_").
  load("project/final.csv")

After the data frame is created successfully, the data is organized and displayed below
Data Frame creation
Now filter the data based on the required conditions (success and failure rates)
val success = df.filter($"poutcome" === "success")
val s = success.count()
val r = df.count() // total count
// count() returns Long, so convert before dividing to avoid integer division
val successrate = s.toDouble / r
val failure = df.filter($"poutcome" === "failure")
val f = failure.count()

// r is the total count computed above
val failurerate = f.toDouble / r
The success and failure rates are shown below
Success and failure rates
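The arithmetic behind a rate can be checked outside Spark with a minimal sketch. It assumes a rate means the number of matching records divided by the total number of records; the object name OutcomeRates, the function rate, and the sample values are hypothetical and are not results from the bank data set.

```scala
// Sketch of the rate arithmetic on a plain Scala collection.
// Assumption: rate = matching records / total records.
// The names and sample values are hypothetical.
object OutcomeRates {

  // Share of records whose outcome equals `label` (0.0 for empty input).
  // toDouble matters: dividing two integers would truncate to 0.
  def rate(outcomes: Seq[String], label: String): Double =
    if (outcomes.isEmpty) 0.0
    else outcomes.count(_ == label).toDouble / outcomes.size

  def main(args: Array[String]): Unit = {
    val sample = Seq("success", "failure", "failure", "unknown")
    println(rate(sample, "success"))
    println(rate(sample, "failure"))
  }
}
```

With the four sample values, one success out of four records gives 0.25 and two failures out of four give 0.5.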

Feature Engineering on the Data Set
scala> df.groupBy("age", "y").count().sort($"count".desc).show
Here the data is grouped by age and outcome (y), and the group counts are sorted in descending order
The average age of people who say yes and no can be found by applying aggregation
scala> import org.apache.spark.sql.functions.avg
scala> df.groupBy("y").agg(avg($"age")).show

Different Ways of Processing the Data
Impact of marriage and age on the data set
// Keep the aggregated data frame in the value, then display it
val marriage = df.groupBy("y", "marital").agg(avg($"age"))
marriage.show
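What the grouped aggregation computes can be sketched on an in-memory collection. This is not the Spark API: the case class Client, the function averageAgeBy, and the sample rows are hypothetical, and the column names age, marital and y are borrowed from the commands above.

```scala
// Sketch of "group by key, then average age" on plain Scala collections.
// Assumption: Client, averageAgeBy and the sample rows are hypothetical;
// only the column names come from the document.
object GroupAverage {

  final case class Client(age: Int, marital: String, y: String)

  // Group clients by an arbitrary key and average the age in each group.
  def averageAgeBy[K](clients: Seq[Client])(key: Client => K): Map[K, Double] =
    clients.groupBy(key).map { case (k, group) =>
      (k, group.map(_.age).sum.toDouble / group.size)
    }

  def main(args: Array[String]): Unit = {
    val clients = Seq(
      Client(30, "single", "yes"),
      Client(50, "married", "yes"),
      Client(40, "married", "no")
    )
    println(averageAgeBy(clients)(_.y))                 // one key column
    println(averageAgeBy(clients)(c => (c.y, c.marital))) // two key columns
  }
}
```

Grouping by a tuple plays the role of passing two column names to groupBy.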

Creating Temporary Tables and Calculating the Median
scala> df.registerTempTable("BankDetails")
scala> sqlContext.sql("SELECT percentile(balance, 0.5) AS median, avg(balance) AS average FROM BankDetails").show
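The difference between the median and the average requested in the query can be seen in a small sketch. It uses the common definition of the median (middle value, or the mean of the two middle values for an even count); the object name MedianSketch and the sample balances are hypothetical.

```scala
// Sketch of median versus mean on a plain Scala collection.
// Assumption: the names and sample balances are hypothetical.
object MedianSketch {

  def mean(values: Seq[Double]): Double = {
    require(values.nonEmpty, "mean of an empty sequence is undefined")
    values.sum / values.size
  }

  // Middle value of the sorted data; for an even count, the mean of
  // the two middle values.
  def median(values: Seq[Double]): Double = {
    require(values.nonEmpty, "median of an empty sequence is undefined")
    val sorted = values.sorted
    val mid = sorted.size / 2
    if (sorted.size % 2 == 1) sorted(mid)
    else (sorted(mid - 1) + sorted(mid)) / 2.0
  }

  def main(args: Array[String]): Unit = {
    val balances = Seq(100.0, 200.0, 300.0, 400.0, 10000.0)
    println(median(balances))
    println(mean(balances))
  }
}
```

A single very large balance pulls the mean far above the median, which is why reporting both gives a better picture of skewed data.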

SQL Processing
The same data set was processed in SQL Server
The first step was to transfer the data from the Excel sheet to SQL
Server using the Microsoft Import and Export Wizard
Before the transfer, the Access Database Engine must be
installed
The engine moves the transferred data directly into a table, which
must then be organized with a few commands
First the database is created, and then the table is created by the
Import and Export Wizard
Steps to transfer data from Excel to SQL

Processing the data by SQL queries

Hadoop Vs SQL Comparison Table

Characteristics   Traditional SQL                  Hadoop
Data Size         Gigabytes                        Petabytes
Access            Interactive & Batch              Batch
Updates           Read and Write – Multiple times  Write once, read Multiple times
Structure         Static Schema                    Dynamic Schema
Integrity         High                             Low
Scaling           Non-Linear                       Linear
The table above lists the basic differences
Elementary description
FUNCTIONAL PROGRAMMING
Hadoop supports functional-style programming in languages such as Java, Scala,
and Python. In a traditional RDBMS, user-defined functions (UDFs) are usually tied
to that database's own procedural language, which can increase the complexity of
writing SQL. Moreover, the data stored in HDFS can be accessed by the whole Hadoop
ecosystem, including Hive, Pig, Sqoop and HBase. So, once a UDF is written, it can
be reused by the applications mentioned above. This increases the performance and
supportability of the system.

DATA STORAGE
A crucial principle of relational databases is that data is stored in tables with a
relational structure characterized by defined rows and columns. Moreover, data is
stored in interrelated tables.
In Hadoop, data can begin in any shape. Eventually, however, it is transformed
into key-value pairs. Once the data enters Hadoop, it is replicated
across multiple nodes in the Hadoop Distributed File System (HDFS). It may seem like
a waste of storage space, but it’s the primary reason behind Hadoop’s massive
scalability.
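The effect of replicating data across nodes can be sketched with a toy placement model. It is a deliberate simplification: replicas are assigned round-robin, whereas real HDFS placement also takes racks into account, and the names ReplicationSketch, place and readable are hypothetical. The replication factor of 3 used in the example is the HDFS default.

```scala
// Toy model of block replication across nodes.
// Assumptions: round-robin placement (real HDFS placement is rack-aware);
// all names and the node/block labels are hypothetical.
object ReplicationSketch {

  // Assign each block to `replicas` distinct nodes, round-robin.
  def place(blocks: Seq[String], nodes: Seq[String], replicas: Int): Map[String, Set[String]] = {
    require(replicas >= 1 && replicas <= nodes.size, "need between 1 and nodes.size replicas")
    blocks.zipWithIndex.map { case (block, i) =>
      block -> (0 until replicas).map(r => nodes((i + r) % nodes.size)).toSet
    }.toMap
  }

  // Data stays readable while every block has a replica on a live node.
  def readable(placement: Map[String, Set[String]], failed: Set[String]): Boolean =
    placement.values.forall(holders => (holders -- failed).nonEmpty)

  def main(args: Array[String]): Unit = {
    val placement = place(Seq("b1", "b2"), Seq("n1", "n2", "n3", "n4"), replicas = 3)
    println(readable(placement, failed = Set("n1")))             // one node lost
    println(readable(placement, failed = Set("n1", "n2", "n3"))) // all holders of b1 lost
  }
}
```

The extra copies cost storage, but a single node failure no longer makes any block unavailable.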
ARCHITECTURE
Hadoop is meant for Big Data solutions, and a Hadoop architecture can consist of
a very large number of servers. Now let’s say that one of those servers goes down
or faces issues while processing data. In this case, the data processing will not halt.
Because every data block is replicated, data processing
continues without any interruption and maintains consistency. As a result, Hadoop
architecture is highly reliable for data.
On the other hand, a distributed SQL system needs complete consistency across all
its systems before it releases anything to the user. This is typically achieved with a
two-phase commit.
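The two-phase commit idea can be reduced to a very small sketch: in the first phase every participant votes, and in the second phase the coordinator commits only if all votes are yes. Real implementations also handle timeouts, logging and recovery, which are omitted here; the names TwoPhaseCommit, Participant and decide are hypothetical.

```scala
// Minimal sketch of the two-phase commit decision rule.
// Assumptions: votes are plain Booleans; timeouts, logging and recovery
// are left out; all names are hypothetical.
object TwoPhaseCommit {

  sealed trait Decision
  case object Commit extends Decision
  case object Abort extends Decision

  // A participant and the vote it casts in phase 1 (prepare).
  final case class Participant(name: String, canCommit: Boolean)

  // Phase 1: collect the votes. Phase 2: commit only if every
  // participant voted yes, otherwise abort everywhere.
  def decide(participants: Seq[Participant]): Decision =
    if (participants.forall(_.canCommit)) Commit else Abort

  def main(args: Array[String]): Unit = {
    println(decide(Seq(Participant("db1", true), Participant("db2", true))))
    println(decide(Seq(Participant("db1", true), Participant("db2", false))))
  }
}
```

A single "no" vote aborts the whole transaction, which is what keeps all systems consistent before anything is released to the user.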
COST FACTOR
Cost-effectiveness is always a concern for companies looking to adopt new
technologies. When implementing Hadoop, companies need to make the effort to
ensure that the realized benefits of a Hadoop deployment outweigh the costs.
Otherwise it would be best to stick with a traditional database to meet data storage
and analytics needs.
All things considered, big data using Hadoop has a number of things going for it that
make implementation more cost-effective than companies may realize.

The three main practical differences found are:
Usage of Delimiter
In Hadoop, a delimiter was used to separate the columns of data while
organizing the data set before execution, and it made creating the data
frame very easy
The delimiter enables the Spark cluster to organize and process
the data efficiently
In SQL, by contrast, data must already be in tabular format to be processed
Delimiters play no role in SQL itself: a SQL table is defined by its columns
and rows, and SQL queries operate on them

Offline and Online Processing
Hadoop is designed for offline processing and analysis of large-scale
data. It doesn’t work well for random reading and writing of a few records,
which is the type of load in online transaction processing. In fact, as of
this writing (and in the foreseeable future), Hadoop is best used as a
write-once, read-many-times type of data store. In this respect it is similar
to data warehouses in the SQL world.
While the data sets were being processed, Spark did not function when the
system was offline: the server was not initiated when the network was not
connected. SQL Server kept working in offline mode. The likely reason Spark
did not function is that the virtual machine could not start without a
network connection.
Functional Programming vs Queries
SQL is fundamentally a high-level declarative language. You query data
by stating the result you want and let the database engine figure out how
to derive it. Under MapReduce you specify the actual steps in processing
the data, which is more analogous to an execution plan for a SQL engine.
Under SQL you have query statements; under MapReduce you have
scripts and code. MapReduce allows you to process data in a more
general fashion than SQL queries. For example, you can build complex
statistical models from your data or reformat your image data. SQL is not
well designed for such tasks.
In this project, SQL needed only direct and simple queries to process,
extract, and store data.
Hadoop required more complex programming statements than SQL, which
made SQL the more user-friendly of the two.

OVERVIEW
Overall, Hadoop steps ahead of traditional SQL in terms of cost,
time, performance, reliability, supportability, and availability of data
to a very large user group. To handle efficiently the
tremendous amount of data generated every day, the Hadoop
framework helps in capturing, storing, processing, and filtering data
in a timely manner and finally storing it in a centralized place.

REFERENCES
