
What is big data?

How Hadoop solves it:
Distributed file system
Ecosystem of tools
Solving big data using different tools
What is big data?
Do we have a big data problem?
If yes, what is the solution to tackle it?
How to use Hadoop core components to solve big data problems.
Big data:
3Vs
Volume, Velocity, and Variety are the factors that need to be understood.
Traditional databases cannot handle data at this scale.
Instead of buying one single high-powered machine, Google went for distributed computing. A set of computers
is called a cluster and each computer in it is called a node. Commodity nodes are easy to replace when they fail.
Google created Google File System (GFS).
GFS working:
Break the file into chunks.
Distribute these chunks across the nodes of a cluster.
Each chunk is replicated so the data is protected when a node is lost.
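
A minimal Python sketch of the idea (the 64 MB chunk size, node names, and round-robin replica placement are illustrative assumptions, not the actual GFS algorithm):

CHUNK_SIZE = 64 * 1024 * 1024   # GFS used 64 MB chunks
REPLICAS = 3                    # each chunk is stored on three nodes

def chunk_and_place(size_bytes: int, nodes: list[str]) -> dict[int, list[str]]:
    """Split a file of the given size into fixed-size chunks and assign
    each chunk to REPLICAS distinct nodes, so losing any one node never
    loses the only copy of a chunk."""
    n_chunks = -(-size_bytes // CHUNK_SIZE)   # ceiling division
    return {
        chunk_id: [nodes[(chunk_id + r) % len(nodes)] for r in range(REPLICAS)]
        for chunk_id in range(n_chunks)
    }

# chunk_and_place(200 * 1024**2, ["node1", "node2", "node3", "node4"])
# -> {0: ['node1', 'node2', 'node3'], 1: ['node2', 'node3', 'node4'],
#     2: ['node3', 'node4', 'node1'], 3: ['node4', 'node1', 'node2']}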
Big Table:
Big Table is a database system that uses GFS to store and retrieve data.
Big Table maps the data using a Row Key, Column Key, and Timestamp so that the same information can be captured
over time without overwriting the existing entries.
Rows are partitioned into Tablets, and the Tablets are distributed across files.
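
A toy in-memory sketch of that three-part key (the class and method names are invented for illustration; the real Big Table persists its tablets in GFS files):

import time
from collections import defaultdict

class ToyBigTable:
    """Maps (row key, column key, timestamp) -> value. A new write adds
    a new timestamped cell instead of overwriting the old one."""
    def __init__(self):
        self._cells = defaultdict(list)   # (row, col) -> [(ts, value), ...]

    def put(self, row, col, value, ts=None):
        self._cells[(row, col)].append((ts or time.time(), value))

    def get(self, row, col):
        """Most recent value for (row, col)."""
        versions = self._cells.get((row, col))
        return max(versions)[1] if versions else None

    def history(self, row, col):
        """Every version, newest first - nothing is ever lost."""
        return sorted(self._cells.get((row, col), []), reverse=True)

# t = ToyBigTable()
# t.put("com.example/", "contents:html", "<html>v1</html>")
# t.put("com.example/", "contents:html", "<html>v2</html>")
# t.get("com.example/", "contents:html")      -> '<html>v2</html>'
# t.history("com.example/", "contents:html")  -> both versions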
A parallel processing paradigm called MapReduce is used to process the data stored in GFS.
MapReduce has two steps in the process:
1. Mapping Step:
Data is logically split.
The Map function is applied to each of these splits.
2. Reducing Step:
The intermediate key-value pairs produced by the mappers are grouped by key, and the Reduce function aggregates the values for each key.
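
The classic word-count job makes the two steps concrete (a single-process sketch; a real MapReduce run executes the map and reduce functions in parallel across the cluster nodes):

from collections import defaultdict

def map_fn(split: str):
    # Mapping step: emit an intermediate (key, value) pair per word.
    for word in split.split():
        yield (word, 1)

def reduce_fn(word: str, counts: list[int]):
    # Reducing step: aggregate all values that share a key.
    return (word, sum(counts))

def mapreduce(splits: list[str]):
    grouped = defaultdict(list)
    for split in splits:                 # each split is mapped independently
        for key, value in map_fn(split):
            grouped[key].append(value)   # shuffle: group pairs by key
    return [reduce_fn(k, v) for k, v in grouped.items()]

# mapreduce(["big data big", "data data"]) -> [('big', 2), ('data', 3)]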
Distributed File System:
Handles locating files and transporting data.
Since the data is distributed across multiple nodes, the DFS uses a DataLocator to store information about
data locations. Much like the inode in a local file system, the DataLocator points to the data located
in the DFS.
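
A hypothetical sketch of that lookup ("DataLocator" is this document's term; the table below is invented purely for illustration):

# Maps a file name to where its chunks live in the DFS, much as an
# inode maps a local file to blocks on disk. All names are made up.
data_locator = {
    "/logs/day1.txt": {
        0: ["node1", "node3", "node5"],   # chunk 0 and its replica nodes
        1: ["node2", "node4", "node6"],   # chunk 1 and its replica nodes
    },
}

def locate(path: str, chunk_id: int) -> list[str]:
    """Return the nodes that hold a given chunk of a file."""
    return data_locator[path][chunk_id]

# locate("/logs/day1.txt", 1) -> ['node2', 'node4', 'node6']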
Data Management in MapR-FS
MapR-FS is the distributed file system. It supports all the functions of a local file system: read/write access, local/remote mirroring, and growing the available space.
Each node may have more than one hard disk.

In MapR-FS, disks are combined into groups called storage pools. By default, a storage pool consists of
three disks.
When data is written to a storage pool, it is striped across the three disks, which increases write speed. Each node consists of one or more storage pools, and the storage pools from all nodes together make up the total storage space available to MapR-FS.
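
A rough sketch of the striping idea (the 8 KB block size is an assumption for the example, not MapR's actual stripe unit):

STRIPE_WIDTH = 3            # default storage pool: three disks
BLOCK = 8 * 1024            # assumed stripe-unit size for illustration

def stripe(data: bytes, disks: list[list[bytes]]) -> None:
    """Write data to a storage pool by spreading fixed-size blocks
    across its disks round-robin; the disks can then be written
    (and read back) in parallel."""
    for i in range(0, len(data), BLOCK):
        disks[(i // BLOCK) % STRIPE_WIDTH].append(data[i:i + BLOCK])

pool = [[], [], []]             # one list per disk in the pool
stripe(b"x" * 40_000, pool)
# block 0 -> disk 0, block 1 -> disk 1, block 2 -> disk 2, block 3 -> disk 0, ...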
MapR-FS writes data into logical units called containers. A container has a default maximum size of 32 GB. A storage pool holds many containers, and the data in a container is replicated to protect it.
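
The container arithmetic in a quick sketch (the three-way replication factor is an assumption for the example):

CONTAINER_MAX = 32 * 2**30    # default container size limit: 32 GB
REPLICAS = 3                  # assumed replication factor

def containers_needed(volume_bytes: int) -> int:
    """How many containers a volume of data occupies."""
    return -(-volume_bytes // CONTAINER_MAX)   # ceiling division

# A 100 GB volume fills 4 containers; with 3-way replication the
# cluster keeps 12 container copies spread over its storage pools.
print(containers_needed(100 * 2**30))             # 4
print(containers_needed(100 * 2**30) * REPLICAS)  # 12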
