Professional Documents
Culture Documents
HADOOP
VS
SQL
SUBMITTED BY
LAXMAN PANDRAMISH
TABLE OF CONTENTS
HADOOP Processing
SQL Processing
Traditional Differences
Practical Differences
Overview
References
This study briefly explains about SQL HADOOP and their differences and
comparison based on execution and outputs
Benefits of Hadoop
Advantages of SQL
SQL Queries can be used to retrieve large amounts of records from a
database quickly and efficiently.
SQL is used to view the data without storing the data into the object.
SQL joins two or more tables and show it as one object to user.
SQL databases use long-established standard, which is being adopted
by ANSI & ISO. Non-SQL databases do not adhere to any clear
standard.
Using standard SQL it is easier to manage database systems without
having to write substantial amount of code.
SQL restricts the access of a table so that nobody can insert the rows
into the table.
HADOOP Processing
Oracle Virtual Box is installed to run Hadoop on the system currently using
In Phase 1 of this project , Data set is being run on the virtual box and its processed using
hadoop pre-installed on the system
Oracle virtual box runs along with the PC with the same network privileges , it has eclipse , java
, and hadoop pre installed
The dataset selected is a portugese bank data set and its being analyzed
Dataset is huge containing 45000 rows approx and it must be organized and analyzed ,
performing analysis on the organized data
So using the data used in excel is being converted to a dataframe so that it can be analyzed and
necessary operations can be performed on it
Scala>hadoop
Scala>hadoop fs -ls
Val df =
sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("inferSche
ma","true").option("delimiter","_").load("/project/final.csv");
Now filtering data based on required conditions (success and failure rates)
Scala>df.groupBy(“age”,”y”).count().sort($”count”.desc).show
Here data is grouped by age , success and failure rate arranged in descending order
Average age of people who say yes and no can be solved by applying aggregation principles
df.groupBy(“y”).agg(avg($age)).show
Scala>df.registerTempTable(“BankDetails”);
Scala>sqlcontext.sql(“Select percentile(balance,0.5) as median , avg(balance) as average from
BankDetails”).show;
SQL Processing
Traditional Differences
Practical Differences
Traditional Differences
Elementary description
FUNCTIONAL PROGRAMMING
Hadoop supports writing functional programming in languages like java, scala,
and python. In RDBMS, there is no possibility of writing UDF and this increases
the complexity of writing SQL. Moreover the data stored in HDFS can be accessed
by all the ecosystem of Hadoop like Hive, Pig, Sqoop and HBase. So, if the UDF is
written it can be used by any of the above mentioned application. It increases
the performance and supportability of the system.
DATA STORAGE
In Hadoop, a basic data can begin in any shape. However, in the long run, it changes
into a key-value pair. Because once the data enters into Hadoop, it is replicated
across multiple nodes in the Hadoop Distributed File System (HDFS). It may seem like
a waste of storage space, but it’s the primary reason behind Hadoop’s massive
scalability.
ARCHITECTURE
Hadoop is meant for Big Data solution, and usually, Hadoop architecture consists of
an unlimited number of servers. Now let’s say that one of those servers gets down
or faces issues while processing data. In this case, the data processing will not hold.
Because every time data gets replicated in each data blocks, hence data processing
continues without any interruption and maintains consistency. As a result, Hadoop
architecture is highly reliable for data.
On the other hand, for SQL you need complete consistency across all the systems
before it releases anything to the user. This is called a two-phase commit.
COST FACTOR
All things considered, big data using Hadoop has number of things for it that make
implementation more cost-effective than companies may realize.
Practical Differences
Usage of Delimiter
In Hadoop while organizing the dataset before execution , a delimiter has
been used to differentiate columns of data and it made creation of data
frame very easy
The usage of delimiter enables the spark cluster to organize and process
data efficiently
Where as in SQL data must be in tabular format in order to get processed
Delimiters are of no use in SQL , tables columns rows typically form a
SQL table and SQL queries
OVERVIEW
REFERENCES