How to Install Hadoop on Ubuntu: Quick & Easy Steps

Learn how to install Hadoop on Ubuntu with our easy guide. Start your big data journey today!

Updated: 07 Oct 2024 by Lisa P

Can one person build a house alone? Perhaps, but it would take forever, right? The same goes for computers: what happens if we hand a single machine hundreds of tasks to finish in a short time? It quickly gets overwhelmed. If we want the work done faster and more easily, we need to connect lots of computers into a network and let them share the load. But then we need something to manage it all, and that's where Hadoop comes in. It acts as a smart cluster manager that keeps the connected computers working together smoothly and efficiently. If that sounds useful, you'll probably want to install Hadoop on Ubuntu, and that's exactly why we prepared this tutorial for you.

A single computer struggles to analyze mountains of data. But what if we connect many machines and turn them into a powerful team? Together they can handle huge tasks, and Hadoop is the set of tools that makes this possible. For roughly 15 years it has been helping people and businesses tackle big-data challenges and save a lot of money. Hadoop is highly versatile thanks to its four main parts, each with a specific job to do: HDFS, YARN, MapReduce, and Hadoop Common.

With all that said, what are the benefits of installing Hadoop? What are its use cases, and is it worth the effort? Understanding Hadoop's potential is one thing, but seeing how it is used in everyday situations shows its true value. Here are a few practical reasons to install Hadoop:

Analyzing Business Risks

Hadoop enables firms to handle and analyze large amounts of data effectively, making it easier to spot risks. For example, hospitals use Hadoop to assess treatment risks and forecast patient outcomes. The same approach can be applied in any industry to improve decision-making.

Safeguarding Against Security Breaches

As businesses grow and networks expand, the risk of a security breach increases. Hadoop can analyze huge quantities of data and quickly identify potential weaknesses, allowing companies to strengthen their security procedures.

Evaluating Consumer Insights

Collecting customer feedback is essential for improving products and planning new initiatives. Hadoop speeds this up by processing massive review datasets in a fraction of the time it would take a human, delivering near-immediate insights for decision-making.

Examining Market Dynamics

Evaluating the market potential for new products is essential but data-intensive. Hadoop's capacity to handle massive amounts of information lets even small organizations study market trends and make sound strategic choices without expensive tools.

Monitoring Log Activities

As businesses grow and adopt more software, keeping track of log files and finding problems gets harder. Hadoop makes this easier by scanning and analyzing log files quickly, detecting issues, and improving system efficiency.

To understand how Hadoop works, we must examine its building blocks. Hadoop is made up of several primary components that work together to handle massive amounts of data. Let's break down what these parts are.

HDFS: The City's Warehouse

HDFS is like the city's enormous warehouse. It holds all the data, from small files to huge datasets, and is designed to be extremely dependable even with the largest files. In short, it is the storage backbone of the Hadoop ecosystem: it splits data into blocks (usually 128 MB each) and stores several copies of each block on separate nodes for reliability and fault tolerance. The key features of HDFS are:

  • Data Distribution: Data is split into chunks and distributed across various nodes.
  • Fault Tolerance: By replicating data across multiple nodes, HDFS ensures data is safe even if some nodes fail.
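
Once your cluster is up and running (the installation steps follow later in this guide), you can see this block-and-replica layout for yourself with a couple of HDFS shell commands. This is just an illustrative sketch; `example.txt` is a placeholder for any local file you have handy:

# copy a local file into HDFS; it is split into blocks behind the scenes
hdfs dfs -put example.txt /example.txt
# show how the file was split into blocks and which nodes hold each replica
hdfs fsck /example.txt -files -blocks -locations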

YARN: The City's Traffic Controller

YARN is the city's traffic controller. It assigns work to individual computers (or "nodes") and keeps everything running smoothly, much like a traffic cop directing vehicles to the right places. In Hadoop terms, it is the resource manager, deciding how computing resources are shared among multiple applications. YARN consists of three major components:

  • Resource Manager: Allocates resources across the cluster.
  • Application Master: Manages the execution of applications.
  • Node Manager: Oversees resources and tasks on individual nodes.
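
Once the cluster is running, you can ask YARN directly what it is managing. A quick, illustrative check (assuming the YARN services from the later steps have been started):

# list the nodes YARN knows about and their current state
yarn node -list
# list applications currently submitted to the ResourceManager
yarn application -list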

MapReduce: The City's Data Processing Plant

MapReduce is the city's data processing facility. It takes raw data, divides it into smaller bits, processes each piece separately, and then reassembles it. It functions like a factory, producing completed goods from raw materials. For example, if you want to analyze a large dataset of text, MapReduce will break down the text into manageable chunks, process each chunk, and then combine the results to give you the final output. The process consists of two basic phases:

  • Map Phase: Data is split into smaller pieces and processed in parallel across the cluster.
  • Reduce Phase: Processed data is aggregated and combined to produce the final result.
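
A concrete way to see both phases in action, once your cluster is running, is the word-count example that ships with the Hadoop binary distribution. The jar path below assumes the 2.8.2 release used later in this guide (adjust the version if yours differs), and /input and /output are placeholders for HDFS directories you create:

# map phase splits and counts words in everything under /input; reduce phase combines the counts into /output
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar wordcount /input /output
# the reduce phase writes its combined result here
hdfs dfs -cat /output/part-r-00000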

Hadoop Common: The City's Watchman

Hadoop Common provides the shared libraries and utilities needed by the other Hadoop modules, supporting tasks such as configuration, serialization, and file management. Alongside it, many clusters rely on ZooKeeper, a separate Apache coordination service, to organize, manage, and synchronize distributed applications across the Hadoop ecosystem. ZooKeeper is especially handy for tasks like:

  • Coordination: Ensuring smooth operation between different components of Hadoop.
  • Failover Management: Detecting and recovering from failures to maintain system stability.

When these parts work together, they create a powerful system that can process huge amounts of data across a network of computers. Understanding how each part works matters if you want to use Hadoop for your own data tasks. With the theory covered, let's see how to install Hadoop on Ubuntu.

Now that you know what Hadoop is built from, let's get to the main part: installing Hadoop on an Ubuntu system. Before we begin, make sure you have everything you need.

Prerequisites 

  1. Hardware: Your computer should have at least 4GB of RAM and 60GB of storage for it to work well.
  2. Java: Make sure you have Oracle Java 8 installed. You can check if Java is installed by typing this in your terminal:

java -version
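
If the command reports that Java is missing, you will need to install a JDK first. This guide assumes Oracle Java 8, but on many Ubuntu releases the OpenJDK 8 package also works with Hadoop 2.x and is easier to install from the standard repositories (assuming the package is available for your Ubuntu version):

# install OpenJDK 8 as an alternative to Oracle Java 8
sudo apt-get update
sudo apt-get install openjdk-8-jdk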

Step 1: Set Up Secure, Passwordless SSH

Before Hadoop can work effectively, it needs to communicate between nodes without requiring passwords. This is achieved through SSH key pairs.

a) Install OpenSSH Server and Client

SSH (Secure Shell) is crucial for managing and connecting to remote systems. To install OpenSSH Server and Client, use:

sudo apt-get install openssh-server openssh-client

This command installs both the server (which allows other systems to connect to your machine) and the client (which allows your machine to connect to other systems).

b) Generate Public and Private Key Pairs

Create SSH key pairs that will be used for passwordless authentication:

ssh-keygen -t rsa -P ""

Here’s what each part means:

  • `-t rsa`: Specifies the type of key to create (RSA is a common type).
  • `-P ""`: Indicates an empty passphrase, meaning you won't need to enter a passphrase when using the key.

When prompted to enter a file name, just press Enter to use the default location (`~/.ssh/id_rsa`). This generates a private key (`id_rsa`) and a public key (`id_rsa.pub`).

c) Configure Passwordless SSH

Add your public key to the `authorized_keys` file to allow passwordless login:

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

This command appends the content of your public key file to the `authorized_keys` file, which is used to authenticate SSH connections.
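
Depending on your system's defaults, SSH may refuse to use the key if the file permissions are too open. If the next step still asks for a password, tightening the permissions is a common (if not always required) fix:

# restrict access to the SSH directory and the authorized_keys file
chmod 700 $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys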

d) Verify Passwordless SSH

Check that passwordless SSH is working by connecting to localhost:

ssh localhost

If everything is set up correctly, you should be able to connect without entering a password.

e) Install rsync

`rsync` is a tool for synchronizing files and directories between systems. Install it with:

sudo apt-get install rsync

This tool will be useful for managing data transfers.

Step 2: Getting Started with Hadoop Setup

Now, you need to set up Hadoop. To do that:

a) Download Hadoop

Download the Hadoop binary distribution from Apache Mirrors. For Hadoop 2.8.x, use this link: http://www-eu.apache.org/dist/hadoop/common/hadoop-2.8.2/

Download the tar.gz file to your local machine.
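
For example, you could fetch the tarball directly from a terminal with wget. Note that mirrors change over time; if this 2.8.2 link no longer works, older releases are normally kept on the Apache archive:

# download the Hadoop 2.8.2 binary tarball
wget http://www-eu.apache.org/dist/hadoop/common/hadoop-2.8.2/hadoop-2.8.2.tar.gz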

b) Extract Hadoop

Unpack the downloaded tarball to your home directory or another location:

tar xzf hadoop-2.8.2.tar.gz

This command extracts the Hadoop files into a directory named `hadoop-2.8.2`.

Step 3: Configure Essential Hadoop Parameters

In this step, you set up the parameters Hadoop needs. Follow these steps:

a) Configure Environment Variables

Edit your `.bashrc` file to set environment variables for Hadoop:

nano ~/.bashrc

Add the following lines at the end of the file:

export HADOOP_HOME=/home/hduser/hadoop-2.8.2
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

This sets up the necessary paths and environment variables for Hadoop. Adjust HADOOP_HOME so it points to the directory where you extracted Hadoop (the example assumes a user named hduser extracted it into their home directory).

Apply the changes to your current session:

source ~/.bashrc
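
To confirm the variables were picked up, you can run a quick sanity check; the first command should print the path you set, and the second should report version 2.8.2:

echo $HADOOP_HOME
hadoop version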

b) Update Hadoop Configuration Files

Edit `hadoop-env.sh`:

Set the path for Java, which Hadoop requires:

nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Add:

export JAVA_HOME=<root directory of your Java installation>

For example: export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_151/
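
If you are not sure where Java lives on your system, one common way to find out (assuming the java binary is on your PATH) is to resolve its symlink; the directory it prints, minus the trailing bin/java, is a suitable JAVA_HOME:

# print the real location of the java binary without the bin/java suffix
readlink -f /usr/bin/java | sed 's:/bin/java::'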

Edit `core-site.xml`:

Configure Hadoop to use the HDFS file system:

nano $HADOOP_HOME/etc/hadoop/core-site.xml

Add:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hduser/hdata</value>
  </property>
</configuration>

Edit `hdfs-site.xml`:

Set the replication factor for HDFS:

nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Add:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Edit `mapred-site.xml`:

Configure MapReduce to use YARN:

cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml

nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

Add:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Edit `yarn-site.xml`:

Configure YARN to handle MapReduce tasks:

nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Add:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>

Step 4: Activate Your Hadoop Cluster

In this step, you bring your Hadoop cluster online. Just follow these steps.

a) Format the Namenode

Initialize the HDFS file system:

hdfs namenode -format

This command prepares the namenode for first-time use.

b) Start HDFS

Start the HDFS services with:

start-dfs.sh

This script starts the DataNode, NameNode, and SecondaryNameNode processes.

c) Start YARN

Start YARN with:

start-yarn.sh

This script starts the ResourceManager and NodeManager processes.

d) Verify the Processes

Check that Hadoop services are running:

jps

You should see processes like `NameNode`, `DataNode`, `SecondaryNameNode`, `ResourceManager`, and `NodeManager`.

e) Access the Web Interfaces

NameNode Web UI: Go to http://localhost:50070/ to view the HDFS status.

ResourceManager UI: Visit http://localhost:8088 to monitor YARN resource usage and running jobs.
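
As a final sanity check, a few illustrative commands confirm that HDFS is accepting reads and writes (the /user/$USER path here is just an example location):

# summary of live DataNodes and overall capacity
hdfs dfsadmin -report
# create a working directory in HDFS and list the root
hdfs dfs -mkdir -p /user/$USER
hdfs dfs -ls /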

Step 5: Deactivate the Cluster

When you need to stop Hadoop services, use:

stop-dfs.sh

stop-yarn.sh

These scripts stop HDFS and YARN services, respectively.

You have successfully installed and configured Hadoop on Ubuntu. You can now start exploring Hadoop's capabilities for large-scale data processing. If you run into any problems or have questions, please do not hesitate to ask for help; you can leave your questions in the comments.

Conclusion

So, there you have it! You now know how to install Hadoop on an Ubuntu machine and are ready to step into the world of big data processing and analytics. Remember that Hadoop is a powerful tool for tackling complex data challenges, and that installing it on Ubuntu is a detailed process that rewards careful attention and a bit of study. With this guide, you can start using Hadoop's data analysis features to make the most of your data.

FAQ

What is the difference between HDFS and MapReduce?

HDFS is a distributed file system for storing large datasets across multiple machines. MapReduce is a processing framework that divides tasks into smaller chunks and processes them in parallel.

Is Hadoop a database?

No, Hadoop is not a database. It's a framework for processing and storing large datasets across a distributed cluster.

What are the four primary components of Hadoop?

The four primary components are HDFS (Hadoop Distributed File System), YARN (Yet Another Resource Negotiator), MapReduce, and Hadoop Common.

What is Hadoop typically used for?

Hadoop is typically used for large-scale data processing, storage, and analysis in industries like finance, healthcare, and e-commerce.

Lisa P

Hello, everyone, my name is Lisa. I'm a passionate electrical engineering student with a keen interest in technology. I'm fascinated by the intersection of engineering principles and technological advancements, and I'm eager to contribute to the field by applying my knowledge and skills to solve real-world problems.