Install Hadoop on Linux - Ultimate tutorial

Recently, in our Big Data course at university, we were required to install Hadoop and write a report about the installation process. Having completed it, I thought I should share my experience.

If you are into Big data, you must have already heard about Hadoop.

Apache Hadoop is an open source framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes of data. Instead of using one large computer to store and process the data, Hadoop allows clustering multiple computers to analyze massive datasets in parallel more quickly.

As popular as it is, though, the installation process is a bit intimidating for new users. You might think “oh, it’s popular, so it must have a straightforward installation process”, but no: at least the manual installation is anything but easy.

/2022/04/install-hadoop-on-linux/hadoop-big-data.jpg
Hadoop installation is scary

In this post, I will demonstrate two approaches to installing Hadoop: manual installation and Docker.

Info

The distribution installed in this post is Apache Hadoop downloaded from:

https://hadoop.apache.org/releases.html.

 Version: 3.2.3 (released on March 28, 2022).

First, to install Hadoop, we need a Java environment on the operating system. Check whether Java is available on the machine by typing the following command:

java -version
/2022/04/install-hadoop-on-linux/hadoop-install-1.png
Java checking

According to the output in the terminal, Java is already installed on my desktop, so we do not have to install it again. The JDK in use is OpenJDK 17.0.3. If you don’t have Java installed, please head over to the ArchWiki for a tutorial.
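If Java is missing and you are on Arch Linux like me, OpenJDK 17 can typically be installed straight from the official repositories. Package names vary by distribution, so treat this as an Arch-specific sketch:

```shell
# Install OpenJDK 17 on Arch Linux (package name is distribution-specific)
sudo pacman -S jdk17-openjdk

# List the installed Java environments and show which one is the default
archlinux-java status
```

On other distributions, look for a package along the lines of openjdk-17-jdk (Debian/Ubuntu) or java-17-openjdk (Fedora).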

We need OpenSSH for this installation so let’s install it, shall we? To install OpenSSH on Arch Linux, type the following command into the terminal:

sudo pacman -S openssh
/2022/04/install-hadoop-on-linux/hadoop-install-2.png
Install openSSH

Press Y to confirm the installation. Next, start the sshd systemd service:

sudo systemctl start sshd.service

Finally, we need to set up passwordless SSH. Type the following commands into your terminal:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

If the key was generated successfully, you should see output similar to the following:

Generating public/private rsa key pair.
Your identification has been saved in /home/ashpex/.ssh/id_rsa
Your public key has been saved in /home/ashpex/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:Ic5SYgbyl1S4gUbsEoi3pgLu/fA3FHkLmUNFXJLjaU ashpex@archlinux
The key's randomart image is:
+---[RSA 3072]----+
|..=o=+.     +.o=.|
| oS=S=     o *.o.|
| .. Xo+ o   = E  |
|   +.* = . =     |
|    . + S o .    |
|     . + . o     |
|      . = .      |
|       o.o       |
|      oo...      |
+----[SHA256]-----+
Caution
If you get the error ssh: connect to host localhost port 22: Connection refused, it means that OpenSSH is not installed or the ssh service has not been started. Please check the installation steps again.
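With the key in place, a quick way to confirm that passwordless SSH works (and that sshd is actually running) is to connect to localhost; it should log you in without prompting for a password:

```shell
# Should open a shell on localhost without asking for a password
ssh localhost

# Leave the test session again
exit
```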

We can go to the Apache Hadoop releases page to select and download the installation file.

Or use wget to download the package directly:

wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.3/hadoop-3.2.3.tar.gz
/2022/04/install-hadoop-on-linux/hadoop-install-3.png
Output

Check the downloaded archive:

/2022/04/install-hadoop-on-linux/hadoop-install-4.png
File has been downloaded

After downloading the package, you can make sure it has not been corrupted or tampered with by verifying its PGP signature or its SHA-512 checksum. To compute the checksum, type the following command in the directory containing the downloaded file:

shasum -a 512 hadoop-3.2.3.tar.gz
/2022/04/install-hadoop-on-linux/hadoop-install-5.png
Check checksum

Compare the result with the checksum file published by Apache: https://downloads.apache.org/hadoop/common/hadoop-3.2.3/hadoop-3.2.3.tar.gz.sha512
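Rather than comparing the two long hex strings by eye, you can let the shell do it. This is a small sketch that assumes both the tarball and the downloaded .sha512 file sit in the current directory; it pulls the published digest out of the checksum file with grep and compares it to the locally computed one:

```shell
# Compute the local digest (first field of the shasum output)
local_sum=$(shasum -a 512 hadoop-3.2.3.tar.gz | awk '{print $1}')

# Extract the 128-hex-digit digest from Apache's .sha512 file
published_sum=$(grep -oE '[0-9a-fA-F]{128}' hadoop-3.2.3.tar.gz.sha512 | head -n 1)

if [ -n "$local_sum" ] && [ "$local_sum" = "$published_sum" ]; then
    echo "Checksum OK"
else
    echo "Checksum MISMATCH - do not extract this archive" >&2
fi
```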

Type the following command to extract the installation file:

tar xzf hadoop-3.2.3.tar.gz

After extracting, we will get the following files in the directory hadoop-3.2.3:

/2022/04/install-hadoop-on-linux/hadoop-install-6.png
Hadoop directory

To make later steps in this tutorial easier, we will rename the extracted folder hadoop-3.2.3 to hadoop. The Hadoop directory will then be located at ~/Downloads/hadoop.

mv hadoop-3.2.3 hadoop
Caution
This is the most important step; failing to follow these instructions may lead to an incorrect Hadoop installation.

Next, we need to set the environment variables by editing the file .zshrc (depending on the shell in use, you may need to edit a different file; in most cases this is .bashrc, since Bash is the default shell on most Linux distributions):

Edit the file ~/.zshrc by typing the command:

vim ~/.zshrc

Add the following environment variables:

export JAVA_HOME='/usr/lib/jvm/java-17-openjdk'
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_HOME=~/Downloads/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
/2022/04/install-hadoop-on-linux/hadoop-install-7.png
Editing ~/.zshrc

After editing the .zshrc file, run the following command so the current shell picks up the changes:

source ~/.zshrc
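After reloading, it is worth sanity-checking that the variables are set and that the hadoop binary is on your PATH; the version banner should read 3.2.3 if the steps above were followed:

```shell
# Both variables should print non-empty paths
echo "JAVA_HOME   = $JAVA_HOME"
echo "HADOOP_HOME = $HADOOP_HOME"

# Should print the Hadoop version banner, e.g. "Hadoop 3.2.3"
hadoop version
```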

Next, we need to add the Java environment variable to the file hadoop-env.sh at the path ~/Downloads/hadoop/etc/hadoop/hadoop-env.sh:

Use a text editor (vim) to edit the file:

vim ~/Downloads/hadoop/etc/hadoop/hadoop-env.sh

Add the following Java environment variable:

export JAVA_HOME='/usr/lib/jvm/java-17-openjdk'

Similarly, edit the file etc/hadoop/core-site.xml (inside the Hadoop directory) to add the following lines:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
/2022/04/install-hadoop-on-linux/hadoop-install-8.png
core-site.xml

Next step, edit etc/hadoop/hdfs-site.xml to add the following lines:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
/2022/04/install-hadoop-on-linux/hadoop-install-9.png
hdfs-site.xml

Edit etc/hadoop/mapred-site.xml to add the following lines:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>
</configuration>
/2022/04/install-hadoop-on-linux/hadoop-install-10.png
mapred-site.xml

Finally, edit etc/hadoop/yarn-site.xml:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>127.0.0.1</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>
/2022/04/install-hadoop-on-linux/hadoop-install-11.png
yarn-site.xml

Format HDFS namenode:

hdfs namenode -format
/2022/04/install-hadoop-on-linux/hadoop-install-12.png
Format HDFS namenode

Run the following commands to start the NameNode and DataNode daemons:

cd ~/Downloads/hadoop/sbin
./start-dfs.sh
/2022/04/install-hadoop-on-linux/hadoop-install-13.png
Start namenode and datanode

After the NameNode and DataNode have started successfully, we proceed to start the YARN ResourceManager and NodeManager:

./start-yarn.sh
/2022/04/install-hadoop-on-linux/hadoop-install-14.png
Start resource manager
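Once the daemons are up, Hadoop also serves web UIs you can open in a browser. In Hadoop 3.x the NameNode UI listens on port 9870 by default and the YARN ResourceManager UI on port 8088 (both defaults can be changed in the config files); a quick curl probe confirms they respond:

```shell
# NameNode web UI (default port 9870 in Hadoop 3.x)
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:9870

# YARN ResourceManager web UI (default port 8088)
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8088
```

A 200 status code from each means the daemons are serving their UIs.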

Check the running Java processes with the jps command:

jps

When the services have started successfully, we will see the Hadoop daemon processes as shown below:

/2022/04/install-hadoop-on-linux/hadoop-install-15.png
jps output
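As a final smoke test, you can create a home directory in HDFS, upload a small file, and read it back. The file and directory names here are just examples:

```shell
# Create a home directory for the current user in HDFS
hdfs dfs -mkdir -p /user/$USER

# Upload a small test file and read it back from HDFS
echo "hello hadoop" > hello.txt
hdfs dfs -put hello.txt /user/$USER/
hdfs dfs -cat /user/$USER/hello.txt
```

If the last command prints "hello hadoop", the cluster is storing and serving data correctly.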

To stop Hadoop services, type the following commands:

cd ~/Downloads/hadoop/sbin
./stop-all.sh
/2022/04/install-hadoop-on-linux/hadoop-install-16.png
Stop Hadoop services

Congratulations! You have successfully installed Hadoop on your Linux machine. If you have any questions, feel free to ask in the comment section below or contact me directly. Until next time!