Hadoop/HBase Installation
Note: This installation is intended ONLY for developer machines. It assumes that the Hadoop/HBase
installation runs on the same machine as JB. This guide does not apply to multi-node clusters.
(These instructions target Ubuntu Linux.)
INSTALL JAVA
Skip this step if you are working on a reference machine where a JB installation already exists.
You should already have the java 1.6 jdk file in the /root folder of your VM. Follow these
installation instructions:
chmod 755 jdk-6u25-linux-i586.bin
./jdk-6u25-linux-i586.bin
sudo mv jdk1.6.0_25/ /opt
cd /opt
sudo ln -s jdk1.6.0_25 jdk1.6
Create a new profile script in /etc/profile.d/ to set JAVA_HOME to the location of the unpacked
JDK, then point the java symlink in /etc/alternatives at the new java binary (for example with
update-alternatives) so that it's available to all applications.
sudo vim /etc/profile.d/java.sh
java.sh:
#!/bin/bash
JAVA_HOME=/opt/jdk1.6
PATH=$PATH:$JAVA_HOME/bin
export JAVA_HOME PATH
Run the profile script and update the software alternatives symlinks.
sudo chmod 755 /etc/profile.d/java.sh
source /etc/profile.d/java.sh
Move the Hadoop and HBase binaries into the /opt folder and create symbolic links for them:
mv hbase-0.94.8/ /opt
mv hadoop-1.1.2/ /opt
cd /opt
ln -s hbase-0.94.8 hbase
ln -s hadoop-1.1.2 hadoop
Make the hadoop user and group the owner of the newly created folders in the /opt directory:
chown hadoop:hadoop -R hbase-0.94.8/
chown hadoop:hadoop -R hadoop-1.1.2/
sudo vim /etc/profile.d/hadoop.sh
hadoop.sh:
#!/bin/bash
HADOOP_HOME=/opt/hadoop
PATH=$PATH:$HADOOP_HOME/bin
export HADOOP_HOME PATH
Make the Hadoop profile script readable and executable for all users:
chmod 755 /etc/profile.d/hadoop.sh
sudo vim /etc/profile.d/hbase.sh
hbase.sh:
#!/bin/bash
HBASE_HOME=/opt/hbase
PATH=$PATH:$HBASE_HOME/bin
export HBASE_HOME PATH
Make the HBase profile script readable and executable for all users:
chmod 755 /etc/profile.d/hbase.sh
HADOOP CONFIGURATION
Create a folder for the Hadoop data. In this example we place it under /opt, but this is not
mandatory; the Hadoop data folder can live anywhere in the system that has sufficient space.
Avoid /tmp, however, since Linux distributions clean that folder automatically.
as root user:
mkdir /opt/hadoop-data
chown hadoop:hadoop /opt/hadoop-data
chmod 755 /opt/hadoop-data
Note that Hadoop will refuse to start if its data directory is not owned by the hadoop user or
does not have exactly 755 permissions.
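A quick way to verify this before starting Hadoop (a sketch assuming GNU coreutils stat; on a fresh machine the directory may not exist yet):

```shell
# Verify ownership and permissions of the Hadoop data directory.
# Hadoop expects hadoop:hadoop and mode 755.
data_dir=/opt/hadoop-data
if [ -d "$data_dir" ]; then
  stat -c '%U:%G %a' "$data_dir"
else
  echo "$data_dir does not exist yet"
fi
```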
As 'hadoop' user, in $HADOOP_HOME/conf/hadoop-env.sh, uncomment and define the
JAVA_HOME variable:
..
export JAVA_HOME=/opt/jdk1.6
..
In $HADOOP_HOME/conf/core-site.xml, set the default filesystem:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
In $HADOOP_HOME/conf/mapred-site.xml you should also configure the maximum number of mappers
and reducers this node will process. The default is only two, which is almost single-threaded.
If this node will be used for any real processing, raise the limit from 2 to around 200.
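As a sketch, the corresponding mapred-site.xml fragment would look like the following (the property names are the Hadoop 1.x task-slot limits; 200 matches the suggestion above, adjust for your hardware):

```xml
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>200</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>200</value>
  </property>
</configuration>
```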
When you decide the real number of mappers/reducers that will run in this node, adjust the
maximum connections for postgres accordingly. Calculate 2 connections per reducer.
To modify the maximum number of connections, edit postgresql.conf, change the
'max_connections' property, then restart PostgreSQL.
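For example, with 200 reducers the rule of thumb above gives (a sketch; the exact value depends on your node and any other clients of the database):

```
# postgresql.conf
max_connections = 400    # 200 reducers * 2 connections each
```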
HBASE CONFIGURATION
As 'hadoop' user, in $HBASE_HOME/conf/hbase-env.sh, uncomment and define the JAVA_HOME
variable and uncomment the HBASE_MANAGES_ZK variable definition:
..
export JAVA_HOME=/opt/jdk1.6
..
export HBASE_MANAGES_ZK=true
..
In $HBASE_HOME/conf/hbase-site.xml:
<configuration>
<!-- zoo keeper is also needed for this -->
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:9000/hbase</value>
</property>
<property>
<name>hbase.master</name>
<value>localhost:60000</value>
<description>The host and port that the HBase master runs
at.</description>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<!--
<property>
<name>dfs.support.append</name>
<value>true</value>
</property>
-->
<property>
<name>zookeeper.znode.parent</name>
<value>/hbase</value>
<description>Root ZNode for HBase in ZooKeeper. All of HBase's
ZooKeeper files that are configured with a relative path will go under this
node. By default, all of HBase's ZooKeeper file path are configured with a
relative path, so they will all go under this directory unless
changed.</description>
</property>
<property>
<name>zookeeper.znode.rootserver</name>
<value>root-region-server</value>
<description>Path to ZNode holding root region location. This is
written by the master and read by clients and region servers. If a relative
path is given, the parent folder will be ${zookeeper.znode.parent}. By default,
this means the root location is stored at /hbase/root-regionserver.</description>
</property>
<!--ZooKeeper config -->
<property>
<name>hbase.zookeeper.quorum</name>
<value>localhost</value>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2181</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/opt/hadoop-data/zookeeper</value>
</property>
<property>
<name>hbase.zookeeper.property.tickTime</name>
<value>2000</value>
</property>
<property>
<name>hbase.zookeeper.property.initLimit</name>
<value>10</value>
</property>
<property>
<name>hbase.zookeeper.property.syncLimit</name>
<value>5</value>
</property>
</configuration>
INITIALIZE HDFS
Initialize HDFS by running the command:
$HADOOP_HOME/bin/hadoop namenode -format
Then start the Hadoop daemons, either all at once:
$HADOOP_HOME/bin/start-all.sh
or one by one:
$HADOOP_HOME/bin/hadoop-daemon.sh start namenode
$HADOOP_HOME/bin/hadoop-daemon.sh start secondarynamenode
$HADOOP_HOME/bin/hadoop-daemon.sh start datanode
$HADOOP_HOME/bin/hadoop-daemon.sh start jobtracker
$HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker
To check whether all the processes have started, execute the 'jps' command:
hadoop@debian:~$ jps
3318 SecondaryNameNode
3506 TaskTracker
3193 DataNode
3090 NameNode
3397 JobTracker
3619 Jps
In the output hadoop processes are JobTracker, NameNode, DataNode, TaskTracker and
SecondaryNameNode.
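The check above can be scripted. A sketch that validates a jps-style listing for the expected daemons (here against the sample output above; on a real machine replace the sample with listing="$(jps)"):

```shell
# Check that every expected Hadoop daemon appears in a jps-style listing.
# The sample below mirrors the output shown above; replace with: listing="$(jps)"
listing='3318 SecondaryNameNode
3506 TaskTracker
3193 DataNode
3090 NameNode
3397 JobTracker
3619 Jps'
missing=0
for d in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
  if ! printf '%s\n' "$listing" | grep -qw "$d"; then
    echo "$d is NOT running"
    missing=1
  fi
done
[ "$missing" -eq 0 ] && echo "all Hadoop daemons running"
```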
Two web interfaces will be started that give monitoring options for Hadoop and HDFS.
http://X.Y.Z.Q:50030/jobtracker.jsp - cluster status and job monitoring
http://X.Y.Z.Q:50070/dfshealth.jsp - hdfs monitoring
STOPPING HADOOP
$HADOOP_HOME/bin/stop-all.sh
or stop the daemons one by one:
$HADOOP_HOME/bin/hadoop-daemon.sh stop jobtracker
$HADOOP_HOME/bin/hadoop-daemon.sh stop tasktracker
$HADOOP_HOME/bin/hadoop-daemon.sh stop namenode
$HADOOP_HOME/bin/hadoop-daemon.sh stop datanode
STARTING HBASE
$HBASE_HOME/bin/start-hbase.sh
Check whether all HBase processes have started by using the 'jps' command:
hadoop@U10:~$ jps
17890 HMaster
17112 JobTracker
17811 HQuorumPeer
16811 DataNode
17312 TaskTracker
16608 NameNode
17018 SecondaryNameNode
18139 HRegionServer
18256 Jps
After starting both Hadoop and HBase make sure they are working.
LOGS
Check the logs of both HBase and Hadoop and make sure there are no critical exceptions.
HBase logs path: /opt/hbase/logs
Hadoop logs path: /opt/hadoop/logs
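A quick scan for serious problems (a sketch; the paths assume the symlinks created earlier, and directories are skipped if absent):

```shell
# Print log files that contain ERROR or FATAL entries.
for d in /opt/hadoop/logs /opt/hbase/logs; do
  if [ -d "$d" ]; then
    grep -l -e ERROR -e FATAL "$d"/*.log 2>/dev/null || echo "no critical entries in $d"
  fi
done
```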
WARNING:
A known problem with some Linux distributions is a predefined /etc/hosts entry that starts with
127.0.1.1. When HBase starts, the first thing it does is insert nodes into ZooKeeper (ZK)
containing the locations of the region servers. When clients want to talk to HBase, they first
ask ZK for the location of the region servers. The problem is that when HBase starts, it
resolves the region server's address against /etc/hosts and updates the ZK node with that
result instead of the configured data. If the 127.0.1.1 entry is left in /etc/hosts, HBase may
pick it up and publish it in ZK. Clients querying ZK then receive this address and cannot
connect to the region server. The problem is much more visible when accessing Hadoop/HBase
from a different machine than the one hosting the installation. The usual solution is to
remove this line from /etc/hosts and restart everything. For more details read "Why does HBase
care about /etc/hosts?"
/etc/hosts
The first entry in /etc/hosts should be similar to below:
127.0.0.1 localhost domain_name_or_machine_name
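A quick way to spot the offending entry (a sketch; it only greps, it does not modify the file):

```shell
# Warn if /etc/hosts contains the problematic 127.0.1.1 entry.
if grep -q '^127\.0\.1\.1' /etc/hosts; then
  echo "WARNING: 127.0.1.1 entry found - remove it before starting HBase"
else
  echo "no 127.0.1.1 entry found"
fi
```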
HDFS Sanity Test:
hadoop dfs -ls /
Note: if the command hangs and does not return a listing, then most likely HDFS is not
available and something is wrong.
As part of sanity testing, it is good practice to check the logs for startup errors.
The logs can be found at:
$HADOOP_HOME/logs
TROUBLESHOOTING
PID FILES
Files containing process PIDs are stored in /tmp. If you kill an HBase process, you will have
to delete these files before it can be restarted from the command line.
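A sketch of the cleanup (the file names follow the default hadoop-<user>-<daemon>.pid / hbase-<user>-<daemon>.pid pattern; list first and verify the daemons really are dead before deleting on your machine):

```shell
# Remove stale Hadoop/HBase pid files so the daemons can be restarted.
pid_dir=/tmp
ls "$pid_dir"/hadoop-*.pid "$pid_dir"/hbase-*.pid 2>/dev/null
rm -f "$pid_dir"/hadoop-*.pid "$pid_dir"/hbase-*.pid
```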