Integrating LVM with Hadoop Cluster

Chetna Manku
6 min read · Mar 9, 2021

In this article, I am going to show how to provide elasticity to DataNode storage in a Hadoop cluster by integrating Hadoop with the LVM (Logical Volume Management) concept.

🔰 Pre-requisites: Hadoop should be installed on your system.

I am doing this practical on Red Hat Enterprise Linux 8.

🔰🔰 HADOOP 🔰🔰

Hadoop is an open-source framework from Apache used to store, process, and analyze big data, i.e., datasets too large in volume for a single machine. It provides massive storage for any kind of data, enormous processing power, and the ability to handle virtually limitless concurrent tasks or jobs.

  • NameNode is the master node of the Hadoop cluster and runs on its own machine.
  • DataNode is a slave node in the Hadoop cluster that stores the actual data, as instructed by the NameNode.

Our goal is to make the DataNode storage elastic, so that whenever we need to change its size we can do so without shutting down the node or stopping its services.

🔰🔰 Logical Volume Management 🔰🔰

  • Logical Volume Management (LVM) makes it easier to manage disk space.
  • LVM is a toolset for managing logical volumes, which includes allocating disks, striping, mirroring, and resizing logical volumes.
  • If a file system needs more space, free space from its volume group can be added to the logical volume, and the file system can then be resized as we wish.
LVM Architecture

🔰 Layers of LVM 🔰

LVM Storage Management Structure

The basic layers that LVM uses, starting with the most primitive, are:

👉Physical Volumes:

  • LVM utility prefix: pv...
  • Description: Physical block devices or other disk-like devices are used by LVM as the raw building material for higher levels of abstraction. Physical volumes are regular storage devices. LVM writes a header to the device to allocate it for management.

👉Volume Groups:

  • LVM utility prefix: vg...
  • Description: LVM combines physical volumes into storage pools known as volume groups. Volume groups abstract the characteristics of the underlying devices and function as a unified logical device with the combined storage capacity of the component physical volumes.

👉Logical Volumes:

  • LVM utility prefix: lv...
  • Description: A volume group can be sliced up into any number of logical volumes. Logical volumes are functionally equivalent to partitions on a physical disk, but with much more flexibility. Logical volumes are the primary component that users and applications will interact with.

🔰🔰 Getting Started with LVM 🔰🔰

👉Step-1 : Check list of partitions

  • To see the storage devices attached to your system and list their partitions, run this command:
# fdisk -l

👉Step-2 : Create Physical Volumes

  • Create the two physical volumes using this command:
# pvcreate <disk-name>
  • Use this command to display the physical volumes created:
# pvdisplay <disk-name>
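  • For concreteness, a sketch assuming the two new disks appear as /dev/sdb and /dev/sdc (hypothetical device names; confirm yours from the “fdisk -l” output):
# pvcreate /dev/sdb /dev/sdc
# pvdisplay /dev/sdb /dev/sdc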
  • In the above output, we can clearly see that no VG (volume group) has been allocated yet!
  • So, let's create a Volume Group.

👉Step-3 : Create Volume Group

  • Create a Volume Group from the physical volumes created above:
# vgcreate <vg-name> <pv1> <pv2>
# vgdisplay <vg-name>
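  • For example, carrying on the hypothetical disk names from above with the VG name “myvg” used in this article:
# vgcreate myvg /dev/sdb /dev/sdc
# vgdisplay myvg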
  • Volume Group “myvg” has (6 + 8) = 14 GB of storage.
  • Now, if you run the “pvdisplay” command again, you will see that the VG “myvg” created above has been allocated!

👉Step-4 : Create Logical Volume

  • Create one Logical Volume (LV) from the VG created above:
# lvcreate --size <size> --name <lv-name> <vg-name>
# lvdisplay /dev/<vg-name>/<lv-name>
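  • For example, with a hypothetical LV name “mylv” and a size of 10G (matching the ~10 GB seen later in the article):
# lvcreate --size 10G --name mylv myvg
# lvdisplay /dev/myvg/mylv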
  • After creating the LV, we can't use this storage directly.
  • To use it, we first need to format it.
  • Then we mount it on a folder, whose storage we are going to contribute to the Hadoop cluster.

👉Step-5 : Format the Logical Volume Partition

  • To format the LV, use the ext4 filesystem type:
# mkfs.ext4 /dev/<vg-name>/<lv-name>
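  • Continuing the hypothetical names from above:
# mkfs.ext4 /dev/myvg/mylv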

👉Step-6 : Mount the Logical Volume Partition

  • To mount the LV, first create a folder.
  • Then mount the LV on that folder, using this command:
# mount /dev/<vg-name>/<lv-name> <folder-name>
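  • For example, creating and using the “/lvm_datanode” folder from this article:
# mkdir /lvm_datanode
# mount /dev/myvg/mylv /lvm_datanode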

👉Step-7 : Check available disk space

  • To check whether the folder mounted above is available on the filesystem, use this command:
# df -h
  • As we can see, the folder “/lvm_datanode” of size ~10 GB is available!

🔰🔰 Integrating LVM with Hadoop 🔰🔰

  • Configure the system on which you created the LVM partition as a DataNode.
  • Provide “/lvm_datanode” as the DataNode storage directory for the Hadoop cluster.

🔰🔰 Getting Started with Hadoop 🔰🔰

  • For configuring the Hadoop cluster, I have two systems: one for the NameNode and one for the DataNode.
  • Install Hadoop on both systems, edit the configuration files in the “/etc/hadoop” directory, and then start the Hadoop services.

👉Step-1 : Configure hdfs-site.xml in NameNode
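  • A minimal sketch of the NameNode's hdfs-site.xml, assuming Hadoop 1.x property names (which the commands used later suggest) and a hypothetical metadata directory “/nn”:
<configuration>
    <property>
        <name>dfs.name.dir</name>
        <value>/nn</value>
    </property>
</configuration>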

👉Step-2 : Configure core-site.xml

  • Both the NameNode and the DataNode have the same configuration in core-site.xml.
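  • A minimal sketch, again assuming Hadoop 1.x property names; replace <namenode-ip> with the NameNode's IP address (port 9001 is an assumption, use whichever port you chose):
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://<namenode-ip>:9001</value>
    </property>
</configuration>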

👉Step-3 : Configure hdfs-site.xml in DataNode
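  • A minimal sketch for the DataNode, pointing dfs.data.dir (Hadoop 1.x property name) at the LVM-backed folder mounted earlier:
<configuration>
    <property>
        <name>dfs.data.dir</name>
        <value>/lvm_datanode</value>
    </property>
</configuration>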

👉Step-4 : Start the services

  • To start the services on the NameNode, run the following commands (the format step is needed only on first setup, as it erases existing HDFS metadata):
# hadoop namenode -format
# hadoop-daemon.sh start namenode
  • To start the services on the DataNode, run the following command:
# hadoop-daemon.sh start datanode
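  • To verify that the daemons are running, you can list the running Java processes with the “jps” command (it ships with the JDK); NameNode and DataNode should appear on their respective systems:
# jps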

👉Step-5 : Check the Hadoop Report

  • After starting the services, check the Hadoop report:
# hadoop dfsadmin -report
  • As we can see, 9.78 GB (~10 GB) of storage is allocated to the Hadoop cluster!
  • Now we are going to extend the LVM partition to see whether the DataNode's allocated storage increases as well.

👉Step-6 : Extend the DataNode Storage

  • To extend the partition, run this command:
# lvextend --size +<size> /dev/<vg-name>/<lv-name>
  • After extending, resize the filesystem so it uses the new space (this resizes in place; it does not reformat or erase data):
# resize2fs /dev/<vg-name>/<lv-name>
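  • For example, growing the hypothetical LV from earlier by 2 GB (which takes the DataNode from ~10 GB to ~12 GB):
# lvextend --size +2G /dev/myvg/mylv
# resize2fs /dev/myvg/mylv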
  • Run the report command again to see the storage now contributed by the DataNode.
  • As we can see, 11.75 GB (~12 GB) of storage is now allocated to the Hadoop cluster!
  • We can also run the “df -h” command to see the new size of “/lvm_datanode”.
  • We can also reduce an LVM partition, but shrinking storage is not good practice, as it can result in data loss.

Conclusion

The main advantages of LVM are increased abstraction, flexibility, and control. Volumes can be resized dynamically as space requirements change and migrated between physical devices within the pool on a running system or exported easily.

Integrating LVM with Hadoop is a very useful and time-saving approach. It provides elasticity to DataNode storage: we can increase a DataNode's storage without shutting it down or stopping its services.
