
The Enterprise Open Source Billing System

Hadoop/HBase Installation
Note: This installation is ONLY for developer machines. It assumes that the Hadoop/HBase
installation is on the same machine where JB will be running. This guide will not work for a
multi-node cluster. (These instructions are for the Ubuntu Linux OS.)

INSTALL JAVA
This step can be skipped if you are working on a reference machine where a JB installation already exists.
You should already have the Java 1.6 JDK installer in the /root folder of your VM. Follow these
installation instructions:
chmod 755 jdk-6u25-linux-i586.bin
./jdk-6u25-linux-i586.bin
sudo mv jdk1.6.0_25/ /opt
cd /opt
sudo ln -s jdk1.6.0_25 jdk1.6

Create a new profile script in /etc/profile.d/ to set JAVA_HOME to the location of the unpacked
JDK, then change the symlink in /etc/alternatives to point at the new java binary so that it is
available to all applications.
sudo vim /etc/profile.d/java.sh

java.sh:
#!/bin/bash
JAVA_HOME=/opt/jdk1.6
PATH=$PATH:$JAVA_HOME/bin
export JAVA_HOME PATH

Run the profile script and update the software alternatives symlinks.
sudo chmod 755 /etc/profile.d/java.sh
source /etc/profile.d/java.sh

sudo update-alternatives --install "/usr/bin/java" "java" "/opt/jdk1.6/bin/java" 1
sudo update-alternatives --set java /opt/jdk1.6/bin/java
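
To confirm that the new JDK is the one picked up system-wide, you can run the checks below; the exact version string depends on the JDK build installed, so treat the output as indicative only.
java -version
which java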

INSTALL HADOOP AND HBASE


CREATE HADOOP GROUP/USER
sudo groupadd hadoop
sudo useradd hadoop -m -s /bin/bash -g hadoop
sudo passwd hadoop

change the password to 'hadoop'
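
As a quick check that the account and group were created correctly (the uid/gid numbers will vary per system):
id hadoop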

INSTALL HADOOP/HBASE BINARIES


Be very careful about the Hadoop and HBase versions. Specific versions of Hadoop work only with
specific versions of HBase.
Download the binaries for Hadoop and HBase into the /root folder.
as root user:
cd ~
wget http://archive.apache.org/dist/hadoop/core/hadoop-1.1.2/hadoop-1.1.2-bin.tar.gz
tar zxvf hadoop-1.1.2-bin.tar.gz
wget http://archive.apache.org/dist/hbase/hbase-0.94.8/hbase-0.94.8.tar.gz
tar zxvf hbase-0.94.8.tar.gz

move the hadoop and hbase binaries into the /opt folder and create symbolic links for them:
mv hbase-0.94.8/ /opt
mv hadoop-1.1.2/ /opt
cd /opt
ln -s hbase-0.94.8 hbase
ln -s hadoop-1.1.2 hadoop

make hadoop (group and user) the owner of the newly created folders in the /opt directory:
chown hadoop:hadoop -R hbase-0.94.8/
chown hadoop:hadoop -R hadoop-1.1.2/
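
You can verify the symlinks and the new ownership before moving on (the listing should show the hadoop and hbase symlinks plus the two versioned folders owned by hadoop:hadoop):
ls -l /opt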

HADOOP PROFILE SCRIPT


as root user:
vim /etc/profile.d/hadoop.sh

hadoop.sh:
#!/bin/bash
HADOOP_HOME=/opt/hadoop
PATH=$PATH:$HADOOP_HOME/bin
export HADOOP_HOME PATH

make the hadoop profile script readable and executable for all users:
chmod 755 /etc/profile.d/hadoop.sh

HBASE PROFILE SCRIPT


vim /etc/profile.d/hbase.sh

hbase.sh:
#!/bin/bash
HBASE_HOME=/opt/hbase
PATH=$PATH:$HBASE_HOME/bin
export HBASE_HOME PATH

make the hbase profile script readable and executable for all users:
chmod 755 /etc/profile.d/hbase.sh

HADOOP USER ENV CONFIG


as hadoop user:
cd ~
vim .bashrc

at the bottom of the file add the following lines:


source /etc/profile.d/java.sh
source /etc/profile.d/hadoop.sh
source /etc/profile.d/hbase.sh
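
After logging in again as the hadoop user (or running 'source ~/.bashrc'), a quick way to confirm the environment is in place; the expected values are the paths used in this guide.
echo $JAVA_HOME $HADOOP_HOME $HBASE_HOME
which hadoop hbase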

SSH KEYS (HADOOP USER)


Generate ssh keys (for the hadoop user) to be able to ssh into the machine without a password.
as hadoop user:
cd ~
ssh-keygen -t rsa
cd .ssh
cat id_rsa.pub > authorized_keys
chmod 600 authorized_keys

use an empty passphrase.


The local machine also requires a running sshd server (openssh-server), which may need to be
installed if it is not already present.
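
On Ubuntu the ssh server can be installed with apt-get, and the passwordless login can then be verified with a single ssh call (the first connection will still ask to accept the host key; it should not ask for a password):
sudo apt-get install openssh-server
ssh localhost exit && echo "passwordless ssh OK"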

HADOOP CONFIGURATION
Create a folder for the Hadoop data. In this example we place it under the /opt folder, but this is not
mandatory; the Hadoop data folder can be anywhere in the system that has sufficient space. Note:
avoid the /tmp folder, since Linux distributions clean it automatically.
as root user:
mkdir /opt/hadoop-data
chown hadoop:hadoop /opt/hadoop-data
chmod 755 /opt/hadoop-data

Note that Hadoop will refuse to start if its data directory is not owned by the hadoop user or does
not have exactly 755 permissions.
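
A quick way to confirm the ownership and permissions are what Hadoop expects (it should report drwxr-xr-x and hadoop:hadoop):
ls -ld /opt/hadoop-data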
As 'hadoop' user, in $HADOOP_HOME/conf/hadoop-env.sh, uncomment and define the
JAVA_HOME variable:
..
export JAVA_HOME=/opt/jdk1.6
..

As 'hadoop' user, in $HADOOP_HOME/conf/core-site.xml:

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

As 'hadoop' user, in $HADOOP_HOME/conf/hdfs-site.xml:


<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<!--
<property>
<name>dfs.data.dir</name>
<value>/opt/hadoop-data</value>
<final>true</final>
</property>
-->
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/hadoop-data</value>
</property>
</configuration>

As 'hadoop' user, in $HADOOP_HOME/conf/mapred-site.xml:


<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>

In this file you should also configure the maximum number of mappers and reducers this node will
process. The default is only two, which is almost single threaded. If this node will be used
for any real processing, change the '2' to roughly '200'.
Once you decide the real number of mappers/reducers that will run on this node, adjust the
maximum number of connections for PostgreSQL accordingly; calculate 2 connections per reducer.
To modify the maximum number of connections, edit postgresql.conf, change the property
'max_connections', and then restart PostgreSQL.
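
For reference, in Hadoop 1.x the per-node task limits are controlled by the two tasktracker properties below, added to the same mapred-site.xml; the value 200 is the figure suggested above, not a requirement. With 200 reducers and 2 connections per reducer, max_connections in postgresql.conf would need to be at least 400.

<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>200</value>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>200</value>
</property>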

As 'hadoop' user, in $HADOOP_HOME/conf/log4j.properties, change the root logging level threshold to DEBUG:

..
hadoop.root.logger=DEBUG,console
..

HBASE CONFIGURATION
As 'hadoop' user, in $HBASE_HOME/conf/hbase-env.sh, uncomment and define the JAVA_HOME
variable and uncomment the HBASE_MANAGES_ZK variable definition:
..
export JAVA_HOME=/opt/jdk1.6
..
export HBASE_MANAGES_ZK=true
..

As 'hadoop' user, in $HBASE_HOME/conf/hbase-site.xml:
<configuration>
<!-- zoo keeper is also needed for this -->
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:9000/hbase</value>
</property>
<property>
<name>hbase.master</name>
<value>localhost:60000</value>
<description>The host and port that the HBase master runs
at.</description>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<!--
<property>
<name>dfs.support.append</name>
<value>true</value>
</property>
-->
<property>
<name>zookeeper.znode.parent</name>
<value>/hbase</value>
<description>Root ZNode for HBase in ZooKeeper. All of HBase's
ZooKeeper files that are configured with a relative path will go under this
node. By default, all of HBase's ZooKeeper file path are configured with a
relative path, so they will all go under this directory unless
changed.</description>

</property>
<property>
<name>zookeeper.znode.rootserver</name>
<value>root-region-server</value>
<description>Path to ZNode holding root region location. This is
written by the master and read by clients and region servers. If a relative
path is given, the parent folder will be ${zookeeper.znode.parent}. By default,
this means the root location is stored at /hbase/root-region-server.</description>
</property>
<!--ZooKeeper config -->
<property>
<name>hbase.zookeeper.quorum</name>
<value>localhost</value>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2181</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/opt/hadoop-data/zookeeper</value>
</property>
<property>
<name>hbase.zookeeper.property.tickTime</name>
<value>2000</value>
</property>
<property>
<name>hbase.zookeeper.property.initLimit</name>
<value>10</value>
</property>
<property>
<name>hbase.zookeeper.property.syncLimit</name>
<value>5</value>
</property>
</configuration>

INITIALIZE HDFS
Initialize HDFS by running the command:
$HADOOP_HOME/bin/hadoop namenode -format

output should look similar to this:


/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = HP610/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.1.1
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1 -r 1411108; compiled by 'hortonfo' on Mon Nov 19 10:48:11 UTC 2012
************************************************************/
13/01/02 19:07:47 INFO util.GSet: VM type = 64-bit
13/01/02 19:07:47 INFO util.GSet: 2% max memory = 17.77875 MB
13/01/02 19:07:47 INFO util.GSet: capacity = 2^21 = 2097152 entries
13/01/02 19:07:47 INFO util.GSet: recommended=2097152, actual=2097152
13/01/02 19:07:48 INFO namenode.FSNamesystem: fsOwner=hadoop
13/01/02 19:07:48 INFO namenode.FSNamesystem: supergroup=supergroup
13/01/02 19:07:48 INFO namenode.FSNamesystem: isPermissionEnabled=true
13/01/02 19:07:48 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
13/01/02 19:07:48 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
13/01/02 19:07:48 INFO namenode.NameNode: Caching file names occuring more than 10 times
13/01/02 19:07:48 INFO common.Storage: Image file of size 112 saved in 0 seconds.
13/01/02 19:07:49 INFO namenode.FSEditLog: closing edit log: position=4, editlog=/tmp/hadoop-hadoop/dfs/name/current/edits
13/01/02 19:07:49 INFO namenode.FSEditLog: close success: truncate to 4, editlog=/tmp/hadoop-hadoop/dfs/name/current/edits
13/01/02 19:07:49 INFO common.Storage: Storage directory /tmp/hadoop-hadoop/dfs/name has been successfully formatted.
13/01/02 19:07:49 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at HP610/127.0.1.1
************************************************************/

STARTING/STOPPING HADOOP AND HBASE


Always start Hadoop before HBase. HBase is configured to work on top of HDFS, which is started
and running along with Hadoop. Both Hadoop and HBase come with scripts that will start/stop
them.
STARTING HADOOP
$HADOOP_HOME/bin/start-all.sh

or alternatively start each process manually:


hadoop-daemon.sh start jobtracker
hadoop-daemon.sh start tasktracker
hadoop-daemon.sh start namenode
hadoop-daemon.sh start datanode

to check if all the processes are started, execute the 'jps' command.
hadoop@debian:~$ jps
3318 SecondaryNameNode
3506 TaskTracker
3193 DataNode
3090 NameNode
3397 JobTracker
3619 Jps

In the output the Hadoop processes are JobTracker, NameNode, DataNode, TaskTracker and
SecondaryNameNode.
Two web interfaces will be started that give monitoring options for Hadoop and HDFS.
http://X.Y.Z.Q:50030/jobtracker.jsp - cluster status and job monitoring
http://X.Y.Z.Q:50070/dfshealth.jsp - HDFS monitoring
STOPPING HADOOP
$HADOOP_HOME/bin/stop-all.sh

or alternatively stop each process manually:


hadoop-daemon.sh stop jobtracker
hadoop-daemon.sh stop tasktracker
hadoop-daemon.sh stop namenode
hadoop-daemon.sh stop datanode

STARTING HBASE
$HBASE_HOME/bin/start-hbase.sh

check if all HBase processes are started by using the 'jps' command:
hadoop@U10:~$ jps
17890 HMaster
17112 JobTracker
17811 HQuorumPeer
16811 DataNode
17312 TaskTracker
16608 NameNode
17018 SecondaryNameNode
18139 HRegionServer
18256 Jps

In the output, HBase processes are HMaster, HQuorumPeer and HRegionServer.


STOPPING HBASE
$HBASE_HOME/bin/stop-hbase.sh

After starting both Hadoop and HBase make sure they are working.
LOGS
Check the logs of both HBase and Hadoop and make sure there are no critical exceptions.
HBase logs path: /opt/hbase/logs
Hadoop logs path: /opt/hadoop/logs
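
A quick way to scan both sets of logs for problems (the grep pattern is only a suggestion; adjust it as needed):
grep -iE 'fatal|error|exception' /opt/hadoop/logs/*.log /opt/hbase/logs/*.log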
WARNING:
A known problem with some Linux distributions is a predefined /etc/hosts entry that starts with
127.0.1.1. When HBase starts, the first thing it does is insert some nodes into ZooKeeper (ZK) that
contain information about the location of the region servers. When clients want to talk to HBase,
they first ask ZK for the location of the region servers. The problem is that when HBase starts it
resolves the region server location against /etc/hosts and updates the ZK node with that information
instead of the configured data. If we leave the 127.0.1.1 entry in /etc/hosts, HBase may use this
entry and publish it in ZK. When clients query ZK they will receive this address and will not be able
to connect to the region server. The problem is much more visible when accessing Hadoop/HBase
from a different machine than the one with the Hadoop/HBase installation. The usual solution is to
remove this line from /etc/hosts and restart everything. For more details read "Why does HBase care
about /etc/hosts?"
/etc/hosts
The first entry in /etc/hosts should be similar to below:
127.0.0.1 localhost domain_name_or_machine_name
HDFS Sanity Test:
hadoop dfs -ls /

it should list the content of the HDFS root folder.
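
A slightly more thorough round trip writes a file into HDFS, reads it back and removes it again; the /sanity-test path and file name below are just illustrative choices for this check.
echo "hdfs sanity check" > /tmp/sanity.txt
hadoop dfs -mkdir /sanity-test
hadoop dfs -put /tmp/sanity.txt /sanity-test/
hadoop dfs -cat /sanity-test/sanity.txt
hadoop dfs -rmr /sanity-test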


HBASE Sanity Test:
Start HBase shell script, with:
hbase shell

Try to list all the tables, with:


list

Note: if the process hangs and does not list the tables (or say that there are no tables), then most
likely HDFS is not available and something is wrong.
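
Beyond 'list', a short create/put/scan/drop cycle in the HBase shell also exercises the region servers; the table name 't_sanity' and column family 'cf' are arbitrary names used only for this test.
create 't_sanity', 'cf'
put 't_sanity', 'row1', 'cf:col1', 'value1'
scan 't_sanity'
disable 't_sanity'
drop 't_sanity'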
As part of sanity testing, a good practice is to check the logs for startup errors.
The logs can be found at:
$HADOOP_HOME/logs

TROUBLESHOOTING
PID FILES
Files containing process PIDs are stored in /tmp. If you kill an HBase process, you will have to
delete these files in order for it to restart from the command line.
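
With the default settings used in this guide, the pid files follow the naming pattern below; double-check the exact file names on your machine before deleting anything.
ls /tmp/hadoop-hadoop-*.pid /tmp/hbase-hadoop-*.pid
rm /tmp/hbase-hadoop-*.pid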
