How to Install and Configure Apache Hadoop on a Single Node in CentOS 7

Apache Hadoop is an open-source framework built for distributed Big Data storage and processing across computer clusters. The project is based on the following components:

  1. Hadoop Common – contains the Java libraries and utilities needed by other Hadoop modules.
  2. HDFS (Hadoop Distributed File System) – a Java-based, scalable file system distributed across multiple nodes.
  3. MapReduce – a YARN-based framework for parallel Big Data processing.
  4. Hadoop YARN – a framework for cluster resource management.

Install Hadoop in CentOS 7

This article will guide you through installing Apache Hadoop on a single-node cluster in CentOS 7 (it also works for RHEL 7 and Fedora 23+ versions). This type of configuration is also referred to as Hadoop Pseudo-Distributed Mode.

Step 1: Install Java on CentOS 7

1. Before proceeding with the Java installation, first log in as the root user (or a user with root privileges) and set your machine hostname with the following command.

# hostnamectl set-hostname master

Set Hostname in CentOS 7

Also, add a new record to the hosts file with your own machine FQDN pointing to your system's IP address.

# vi /etc/hosts

Add the below line:

192.168.1.41 master.hadoop.lan

Set Hostname in /etc/hosts File

Replace the above hostname and FQDN records with your own settings.
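Optionally, you can verify that the new hostname and the /etc/hosts record resolve correctly (the FQDN below is the example value used in this tutorial):

# hostnamectl status
# getent hosts master.hadoop.lan
# ping -c 1 master.hadoop.lan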

2. Next, go to the Oracle Java download page and grab the latest version of the Java SE Development Kit 8 on your system with the help of the curl command:

# curl -LO -H "Cookie: oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/8u92-b14/jdk-8u92-linux-x64.rpm"

Download Java SE Development Kit 8

3. After the Java binary download finishes, install the package by issuing the below command:

# rpm -Uvh jdk-8u92-linux-x64.rpm

Install Java in CentOS 7
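Optionally, you can confirm the JDK installation and the /usr/java/default path that the Oracle RPM is expected to create (the exact version string may differ on your system):

# java -version
# ls -l /usr/java/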

Step 2: Install Hadoop Framework in CentOS 7

4. Next, create a new user account on your system without root privileges, which we'll use for the Hadoop installation path and working environment. The new account's home directory will reside in the /opt/hadoop directory.

# useradd -d /opt/hadoop hadoop
# passwd hadoop
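You can quickly verify that the account was created with the expected home directory:

# id hadoop
# getent passwd hadoop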

5. In the next step, visit the Apache Hadoop download page to get the link to the latest stable version and download the archive on your system.

# curl -O http://apache.javapipe.com/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz 

Download Hadoop Package

6. Extract the archive and copy the directory contents to the hadoop account home path. Also, make sure you change the permissions of the copied files accordingly.

#  tar xfz hadoop-2.7.2.tar.gz
# cp -rf hadoop-2.7.2/* /opt/hadoop/
# chown -R hadoop:hadoop /opt/hadoop/

Extract and Set Permissions on Hadoop

7. Next, login with hadoop user and configure Hadoop and Java Environment Variables on your system by editing the .bash_profile file.

# su - hadoop
$ vi .bash_profile

Append the following lines at the end of the file:

## JAVA env variables
export JAVA_HOME=/usr/java/default
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/jre/lib:$JAVA_HOME/lib:$JAVA_HOME/lib/tools.jar
## HADOOP env variables
export HADOOP_HOME=/opt/hadoop
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Configure Hadoop and Java Environment Variables

8. Now, initialize the environment variables and check their status by issuing the below commands:

$ source .bash_profile
$ echo $HADOOP_HOME
$ echo $JAVA_HOME

Initialize Linux Environment Variables
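At this point the hadoop binaries should also be on your PATH, so a quick sanity check of the whole environment (assuming the variables above were loaded without errors) is:

$ hadoop version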

9. Finally, configure SSH key-based authentication for the hadoop account by running the below commands (replace the hostname or FQDN passed to the ssh-copy-id command accordingly).

Also, leave the passphrase field blank in order to log in automatically via SSH.

$ ssh-keygen -t rsa
$ ssh-copy-id master.hadoop.lan

Configure SSH Key Based Authentication
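You can test that the key-based login works by running a command over SSH against your own FQDN; it should complete without asking for a password:

$ ssh master.hadoop.lan hostname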

Step 3: Configure Hadoop in CentOS 7

10. Now it's time to set up the Hadoop cluster on a single node in pseudo-distributed mode by editing its configuration files.

The Hadoop configuration files are located in $HADOOP_HOME/etc/hadoop/, where $HADOOP_HOME corresponds to the hadoop account home directory (/opt/hadoop/) used in this tutorial.

Once you're logged in as the hadoop user, you can start editing the following configuration files.

The first file to edit is core-site.xml. It contains information such as the port number used by the Hadoop instance, the memory allocated for the file system, the memory limit for storing data, and the size of the read/write buffers.

$ vi etc/hadoop/core-site.xml

Add the following properties between the <configuration> ... </configuration> tags. Use localhost or your machine's FQDN for the Hadoop instance.

<property>
<name>fs.defaultFS</name>
<value>hdfs://master.hadoop.lan:9000/</value>
</property>

Configure Hadoop Cluster
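If you want to confirm that Hadoop picks up the new setting, you can query it directly (a quick check, assuming the environment variables from Step 2 are loaded):

$ hdfs getconf -confKey fs.defaultFS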

11. Next, open and edit the hdfs-site.xml file. The file contains information about the data replication value, the namenode path and the datanode path for the local file systems.

$ vi etc/hadoop/hdfs-site.xml

Here, add the following properties between the <configuration> ... </configuration> tags. In this guide we'll use the /opt/volume/ directory to store our Hadoop file system.

Replace the dfs.data.dir and dfs.name.dir values accordingly.

<property>
<name>dfs.data.dir</name>
<value>file:///opt/volume/datanode</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///opt/volume/namenode</value>
</property>

Configure Hadoop Storage
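Optionally, because this is a single-node cluster, you may also want to set the block replication factor to 1 in the same file (the default value of 3 only makes sense when multiple DataNodes are available):

<property>
<name>dfs.replication</name>
<value>1</value>
</property>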

12. Because we've specified /opt/volume/ as our Hadoop file system storage, we need to create those two directories (datanode and namenode) from the root account and grant all permissions to the hadoop account by executing the below commands.

$ su root
# mkdir -p /opt/volume/namenode
# mkdir -p /opt/volume/datanode
# chown -R hadoop:hadoop /opt/volume/
# ls -al /opt/  #Verify permissions
# exit  #Exit root account to turn back to hadoop user

Configure Hadoop System Storage

13. Next, create the mapred-site.xml file to specify that we are using the YARN MapReduce framework.
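Note that the Hadoop 2.7.2 archive usually ships only a mapred-site.xml.template file; if that is the case on your system, copy it first before editing:

$ cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml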

$ vi etc/hadoop/mapred-site.xml

Add the following excerpt to mapred-site.xml file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

Set Yarn MapReduce Framework

14. Now, edit the yarn-site.xml file and enclose the below statements between the <configuration> ... </configuration> tags:

$ vi etc/hadoop/yarn-site.xml

Add the following excerpt to yarn-site.xml file:

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

Add Yarn Configuration

15. Finally, set the Java home variable for the Hadoop environment by editing the below line in the hadoop-env.sh file.

$ vi etc/hadoop/hadoop-env.sh

Edit the following line to point to your Java system path.

export JAVA_HOME=/usr/java/default/

Set Java Home Variable for Hadoop

16. Also, replace the localhost value in the slaves file with the machine hostname set up at the beginning of this tutorial.

$ vi etc/hadoop/slaves
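After editing, the slaves file should contain a single line with your own hostname, for example:

$ cat etc/hadoop/slaves
master.hadoop.lan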

Step 4: Format Hadoop Namenode

17. Once the Hadoop single-node cluster has been set up, it's time to initialize the HDFS file system by formatting the /opt/volume/namenode storage directory with the following command:

$ hdfs namenode -format

Format Hadoop Namenode

Hadoop Namenode Formatting Process
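If the format operation succeeded, the NameNode metadata directory should now be populated with a current/ sub-directory holding files such as VERSION and an fsimage; a quick way to check:

$ ls -l /opt/volume/namenode/current/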

Step 5: Start and Test Hadoop Cluster

18. The Hadoop service scripts are located in the $HADOOP_HOME/sbin directory. In order to start the Hadoop services, run the below commands on your console:

$ start-dfs.sh
$ start-yarn.sh

Check the services status with the following command.

$ jps

Start and Test Hadoop Cluster
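On a healthy pseudo-distributed setup, jps should report the NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager daemons, roughly as in the example below (the PIDs are only illustrative and will differ on your system):

$ jps
2738 NameNode
2866 DataNode
3082 SecondaryNameNode
3246 ResourceManager
3356 NodeManager
3700 Jps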

Alternatively, you can view a list of all open sockets for Apache Hadoop on your system using the ss command.

$ ss -tul
$ ss -tuln # Numerical output

Check Apache Hadoop Sockets

19. To test the Hadoop file system cluster, create a directory in the HDFS file system and copy a file from the local file system to HDFS storage (i.e., insert data into HDFS).

$ hdfs dfs -mkdir /my_storage
$ hdfs dfs -put LICENSE.txt /my_storage

Check Hadoop Filesystem Cluster

To view the contents of a file or list a directory inside the HDFS file system, issue the below commands:

$ hdfs dfs -cat /my_storage/LICENSE.txt
$ hdfs dfs -ls /my_storage/

List Hadoop Filesystem Content

Check Hadoop Filesystem Directory

To retrieve data from HDFS to our local file system use the below command:

$ hdfs dfs -get /my_storage/ ./

Copy Hadoop Filesystem Data to Local System

Get the full list of HDFS command options by issuing:

$ hdfs dfs -help

Step 6: Browse Hadoop Services

20. In order to access Hadoop services from a remote browser, visit the following links (replace the IP address or FQDN accordingly). Also, make sure the below ports are open in your system firewall.
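If firewalld is running on the machine, one way to open the ports used below is the following sketch, executed as root (adjust the list if you changed the default ports):

# firewall-cmd --permanent --add-port=50070/tcp
# firewall-cmd --permanent --add-port=8088/tcp
# firewall-cmd --permanent --add-port=8042/tcp
# firewall-cmd --reload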

For Hadoop Overview of NameNode service.

http://192.168.1.41:50070 

Access Hadoop Services

For Hadoop file system browsing (Directory Browse).

http://192.168.1.41:50070/explorer.html

Hadoop Filesystem Directory Browsing

For Cluster and Apps Information (ResourceManager).

http://192.168.1.41:8088 

Hadoop Cluster Applications

For NodeManager Information.

http://192.168.1.41:8042 

Hadoop NodeManager

Step 7: Manage Hadoop Services

21. To stop all hadoop instances run the below commands:

$ stop-yarn.sh
$ stop-dfs.sh

Stop Hadoop Services

22. In order to start the Hadoop daemons automatically at system boot, log in as the root user, open the /etc/rc.local file for editing and add the below lines:

$ su - root
# vi /etc/rc.local

Add the following excerpt to the rc.local file.

su - hadoop -c "/opt/hadoop/sbin/start-dfs.sh"
su - hadoop -c "/opt/hadoop/sbin/start-yarn.sh"
exit 0

Enable Hadoop Services at System-Boot

Then, add executable permissions to the rc.local file, and enable, start, and check the service status by issuing the below commands:

# chmod +x /etc/rc.d/rc.local
# systemctl enable rc-local
# systemctl start rc-local
# systemctl status rc-local

Enable and Check Hadoop Services

That's it! The next time you reboot your machine, the Hadoop services will be started automatically for you. All you need to do is fire up a Hadoop-compatible application and you're ready to go!

For additional information, please consult the official Apache Hadoop documentation and the Hadoop Wiki page.
