
HADOOP & MAP REDUCE

BY
K.KARTHIKEYAN

HDFS

Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably and to stream those data sets at high bandwidth to user applications. HDFS stores file system metadata and application data separately.
The HDFS namespace is a hierarchy of files and directories. Files and directories are represented on the NameNode by inodes, which record attributes like permissions, modification and access times, and namespace and disk space quotas. The file content is split into large blocks, and each block of the file is independently replicated at multiple DataNodes.
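As a minimal sketch of this division of labor, the Java snippet below writes a file through the Hadoop FileSystem API: the NameNode records the inode metadata while the content is split into blocks and replicated to DataNodes. The cluster address, replication factor, and file path are illustrative assumptions, not values from these slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical cluster address and replication factor.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        conf.set("dfs.replication", "3"); // each block stored on 3 DataNodes

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/sample.txt"); // hypothetical path

        // The NameNode records the inode; the file content is split into
        // blocks and replicated across DataNodes.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("hello hdfs");
        }
        fs.close();
    }
}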

HDFS keeps the entire namespace in RAM. The inode data and the list of blocks belonging to each file comprise the metadata of the name system, called the image. The persistent record of the image stored in the local host's native file system is called a checkpoint.

MAP REDUCE
MapReduce is a programming model and software framework, first developed by Google, intended to facilitate and simplify the processing of vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner.
MapReduce characteristics:
Very large-scale data.
Write-once, read-many data.
Map and reduce are the main operations (see the word-count sketch after this list).
All map tasks must complete before the reduce operation starts.

Map and reduce operations are typically performed by the same physical processor.
The number of map tasks and reduce tasks is configurable.
Operations are provisioned near the data.
Commodity hardware and storage are used.
The runtime takes care of splitting and moving data for operations.
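As a minimal sketch of the two main operations, the word-count classes below use the Hadoop Java API: map() emits a (word, 1) pair for every word in its split, and reduce(), which runs only after all map tasks complete, sums the counts per word. The class names are illustrative, not from the original slides.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map(): emit (word, 1) for every word in the input split.
// Both classes are package-private and assumed to share a package
// with the driver sketch shown later.
class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// reduce(): runs only after all maps finish; sums the counts per word.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}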

Input: This is the input data / file to be processed.
Split: Hadoop splits the incoming data into smaller pieces called "splits".
Map: In this step, MapReduce processes each split according to the logic defined in the map() function. Each mapper works on one split at a time. Each mapper is treated as a task, and multiple tasks are executed across different TaskTrackers and coordinated by the JobTracker.
Combine: This is an optional step, used to improve performance by reducing the amount of data transferred across the network. The combiner typically runs the same logic as the reduce step and aggregates the output of the map() function before it is passed to the subsequent steps.
Shuffle & Sort: In this step, the outputs from all the mappers are shuffled, sorted to put them in order, and grouped before being sent to the next step.

Reduce: This step aggregates the outputs of the mappers using the reduce() function. The output of the reducer is sent to the next and final step. Each reducer is treated as a task, and multiple tasks are executed across different TaskTrackers and coordinated by the JobTracker.
Output: Finally, the output of the reduce step is written to a file in HDFS. The driver sketch below shows how these steps are wired together.
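The driver below is a minimal sketch of these steps using the Hadoop Job API: it registers the mapper (Map step), the optional combiner (Combine step), and the reducer (Reduce step), sets the configurable number of reduce tasks, and points the job at input and output paths in HDFS. The class names and paths carry over from the earlier hypothetical sketches.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);     // Map step
        job.setCombinerClass(WordCountReducer.class);  // optional Combine step
        job.setReducerClass(WordCountReducer.class);   // Reduce step
        job.setNumReduceTasks(2);                      // configurable; illustrative value

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Hadoop splits the input; the final reduce output is written to HDFS.
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));    // hypothetical
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output")); // hypothetical

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}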
