Automating Hadoop Services Using Ansible

Chetna Manku
7 min read · Mar 21, 2021

In this article, I’m going to show you how to start Hadoop Services using Ansible.

What is Ansible?

Ansible is an agentless automation tool that by default manages machines over the SSH protocol.

You only need to install Ansible on the controller node. The controller node then uses SSH (by default) to communicate with the managed or target nodes (the end devices you want to automate).

Refer to this article for more information on Ansible.

🔰Pre-requisites🔰

  • Ansible should be installed on your system.
  • Managed nodes (the systems on which you want to configure Hadoop) should have internet connectivity.

To know more about installation of Ansible, refer to this article.

🔰Ansible Playbooks🔰

  • Ansible playbooks are a vital part of Ansible and the core component of every Ansible configuration.
  • An Ansible playbook is a file where users write Ansible code: an organized collection of plays defining the desired configuration of servers. Playbooks are written in YAML.
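
For example, a minimal playbook with a single play might look like this (an illustrative sketch; the host group and package name are placeholders):

- name: Example play
  hosts: all
  tasks:
    - name: Install a package
      package:
        name: tree
        state: present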

What is Hadoop?

Hadoop is an open-source framework from Apache used to store, process, and analyze big data, i.e., data sets that are very large in volume. It provides massive storage for any kind of data, enormous processing power, and the ability to handle virtually limitless concurrent tasks or jobs.

  • NameNode is the master node that runs on a separate node in the cluster.
  • DataNode is a slave node in Hadoop Cluster that stores the actual data as instructed by the NameNode.

🔰Ansible Document🔰

  • Ansible provides documentation of each module.
  • Use “ansible-doc” command to see any module’s description and required options.
# ansible-doc <module_name>
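
For example, to read the documentation of the ‘copy’ module used later in this article:
# ansible-doc copy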

To create any playbook, it’s important to first set a goal and then list the steps needed to complete the task.

🔰Following are the steps needed to create a playbook for Starting Hadoop Services🔰

🔹 Install JDK in Namenode & Datanode

🔹 Install Hadoop in Namenode & Datanode

🔹 Configure NameNode

🔹 Format NameNode

🔹 Start NameNode Service

🔹 Configure DataNode

🔹 Start DataNode Service

🔰Creating Inventory🔰

  • Create host groups in the inventory, as shown in the example below.
  • The namenode group contains the IP of the NameNode.
  • The datanode group contains the IPs of the DataNodes.
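
A minimal inventory for this setup might look like this (the IP addresses are placeholders; replace them with your own):

[namenode]
192.168.1.10

[datanode]
192.168.1.11
192.168.1.12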

🔰Create Variable Files🔰

  • Instead of hard-coding variables into the playbook, it’s good practice to define them in a separate YAML file.
hadoop: "hadoop-1.2.1-1.x86_64.rpm"
jdk: "jdk-8u171-linux-x64.rpm"
ip_namenode: <IP-of-master>
port_no: 9001
nn_dir: "/hdmaster"
dn_dir: "/hdslave"
  • To know the IP of the master, run this ad-hoc command against the namenode group:
# ansible namenode -m setup -a "filter=*ipv4"
  • Enter this master IP into your variable file as ip_namenode.
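
To make these variables available to the playbook, load the file with vars_files in each play (a sketch; the filename hadoop_vars.yml is illustrative):

- name: Install JDK and Hadoop in Namenode & Datanode
  hosts: all
  vars_files:
    - hadoop_vars.yml
  tasks:
    # ... tasks from the steps below ...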

🔰Create Configuration Files🔰

  • Create template files on your controller node to configure the NameNode and DataNode.
  • Create two XML files for hdfs-site.xml: one for the NameNode and one for the DataNode.
  • Create one XML file for core-site.xml, since the NameNode and DataNode share the same content. Example templates are sketched below.
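
For reference, the templates for Hadoop 1.x might look like this (a sketch; the property names follow Hadoop 1.2.1 conventions, and the Jinja2 variables come from the variable file above). hdfs-nn.xml for the NameNode (hdfs-dn.xml is analogous, with dfs.data.dir and {{ dn_dir }}):

<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>{{ nn_dir }}</value>
  </property>
</configuration>

core.xml, shared by the NameNode and DataNode:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://{{ ip_namenode }}:{{ port_no }}</value>
  </property>
</configuration>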

🔰Creating Ansible Playbook🔰

This playbook contains three plays

  • First play — Install JDK and Hadoop in both Namenode & Datanode.
  • Second play — Configure NameNode
  • Third play — Configure DataNode


➖ Play 1: Install JDK and Hadoop in both Namenode & Datanode

- name: Install JDK and Hadoop in Namenode & Datanode
  hosts: all
  tasks:


👉Step-1 : Copy and Install JDK

  • Download the JDK and Hadoop software onto the controller node.
  • Using the ‘copy’ module, copy the software to all managed nodes.
    - name: Copying JDK rpm file
      copy:
        src: "/root/{{ jdk }}"
        dest: "/root/"
    - name: Installing JDK
      command: "rpm -i {{ jdk }}"
      ignore_errors: yes


👉Step-2 : Copy and Install Hadoop

  • Install the software using the ‘command’ module.
  • Use the ‘ignore_errors’ option: if the software is already installed on a managed node, this task would fail and stop the playbook.
    - name: Copying Hadoop rpm file
      copy:
        src: "/root/{{ hadoop }}"
        dest: "/root/"
    - name: Installing Hadoop
      command: "rpm -i {{ hadoop }} --force"
      ignore_errors: yes
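
If you prefer an idempotent alternative (a sketch, not part of the original playbook), the ‘yum’ module can install a local rpm directly and skips the install when the package is already present, avoiding ignore_errors:

    - name: Installing Hadoop (idempotent alternative)
      yum:
        name: "/root/{{ hadoop }}"
        state: present
        disable_gpg_check: yes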


➖ Play 2: Configure NameNode

- name: Configure Namenode
  hosts: namenode
  tasks:


👉Step-1 : Create NameNode folder

  • Create a directory on the NameNode.
    - name: Creating NameNode folder
      file:
        dest: "{{ nn_dir }}"
        state: directory


👉Step-2 : Configure hdfs-site.xml

  • Push the hdfs-site.xml template from the controller node into /etc/hadoop/ on the NameNode.
    - name: Configuring hdfs-site.xml in NameNode
      template:
        src: "/root/playbooks/hdfs-nn.xml"
        dest: "/etc/hadoop/hdfs-site.xml"


👉Step-3 : Configure core-site.xml

  • Push the core-site.xml template from the controller node into /etc/hadoop/ on the NameNode.
    - name: Configuring core-site.xml in NameNode
      template:
        src: "/root/playbooks/core.xml"
        dest: "/etc/hadoop/core-site.xml"


👉Step-4 : Format the NameNode

  • Format the NameNode directory using the ‘shell’ module; ‘echo Y’ answers the confirmation prompt automatically.
    - name: Formatting Namenode Folder
      shell: "echo Y | hadoop namenode -format"


👉Step-5 : Stopping Firewall Service

  • Stop the firewall on the NameNode system so that the DataNodes can connect.
    - name: Stopping Firewall Service
      service:
        name: "firewalld"
        state: stopped


👉Step-6 : Start NameNode Service

  • Start the NameNode using the ‘command’ module.
    - name: Starting NameNode Service
      command: "hadoop-daemon.sh start namenode"


👉Step-7 : Check Status Using jps

  • Run the ‘jps’ command on the NameNode using the “command” module.
  • Save the output of the above command in a ‘result’ variable using the “register” keyword.
  • Then print the value of the variable using the “debug” module.
    - name: "Running JPS command to check services in NameNode"
      command: "jps"
      register: result
    - name: "Printing JPS Output"
      debug:
        var: result
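
If the NameNode started correctly, the registered output should include a NameNode entry alongside Jps.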


➖ Play 3: Configure DataNode

- name: Configure Datanode
  hosts: datanode
  tasks:


👉Step-1 : Create DataNode folder

  • Create a directory on each DataNode.
    - name: Creating DataNode folder
      file:
        dest: "{{ dn_dir }}"
        state: directory


👉Step-2 : Configure hdfs-site.xml

  • Push the hdfs-site.xml template from the controller node into /etc/hadoop/ on the DataNodes.
    - name: Configuring hdfs-site.xml in DataNode
      template:
        src: "/root/playbooks/hdfs-dn.xml"
        dest: "/etc/hadoop/hdfs-site.xml"


👉Step-3 : Configure core-site.xml

  • Push the core-site.xml template from the controller node into /etc/hadoop/ on the DataNodes.
    - name: Configuring core-site.xml in DataNode
      template:
        src: "/root/playbooks/core.xml"
        dest: "/etc/hadoop/core-site.xml"


👉Step-4 : Stopping Firewall Service

  • Stop the firewall on the DataNode systems.
    - name: Stopping Firewall Service
      service:
        name: "firewalld"
        state: stopped


👉Step-5 : Start DataNode Service

  • Start the DataNode using the ‘command’ module.
    - name: Starting DataNode Service
      command: "hadoop-daemon.sh start datanode"


👉Step-6 : Check Status Using jps

  • Run the ‘jps’ command on each DataNode using the “command” module.
  • Save the output of the above command in a ‘result’ variable using the “register” keyword.
  • Then print the value of the variable using the “debug” module.
    - name: "Running JPS command to check services in DataNode"
      command: "jps"
      register: result
    - name: "Printing JPS Output"
      debug:
        var: result
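
Similarly, a healthy DataNode should show a DataNode entry in the jps output.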


🔰Complete Playbook🔰

To see the complete playbook, variable files, templates files — check out my GitHub Repository.

🔰Running Ansible Playbook🔰

  • Before running the Ansible playbook, let’s first check the connection to the managed nodes.
  • Run either command:
# ansible all -m ping
# ansible all -a id
  • You can also check the playbook for syntax errors before running it:
# ansible-playbook <Playbook_name> --syntax-check
  • Run the playbook using the following command:
# ansible-playbook <Playbook_name>
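
If your inventory file is not the default /etc/ansible/hosts, pass its path explicitly with the -i option:
# ansible-playbook -i <inventory_file> <Playbook_name>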

🔰After Running Ansible Playbook🔰

  • Let’s confirm the result by running a few commands on the managed nodes.
  • Check the Hadoop report.

(Screenshots in the original article: the Hadoop report, and jps output from the NameNode, DataNode 1, and DataNode 2.)
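
To see the cluster report from the command line (in Hadoop 1.x, the dfsadmin report lists the live DataNodes and their capacity):
# hadoop dfsadmin -report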

🔰 Conclusion 🔰

From this article, we learned how to start Hadoop services by running an Ansible playbook from a single system.

We can add as many DataNodes as required just by entering their details in the inventory; users don’t need to configure each DataNode separately.
