Automating Hadoop Services Using Ansible

Chetna Manku
7 min read · Mar 21, 2021

In this article, I’m going to show you how to start Hadoop Services using Ansible.

What is Ansible?

Ansible is an agentless automation tool that by default manages machines over the SSH protocol.

You only need to install Ansible on the controller node. The controller node then uses SSH (by default) to communicate with the managed or target nodes (the end devices you want to automate).

Refer to this article for more information on Ansible.

🔰Pre-requisites🔰

  • Ansible should be installed on your system.
  • Managed nodes (the systems on which you want to configure Hadoop) should have internet connectivity.

To know more about installation of Ansible, refer to this article.

🔰Ansible Playbooks🔰

  • Ansible playbooks are a vital part of Ansible and the core component of every Ansible configuration.
  • An Ansible playbook is a file where users write Ansible code: an organized collection of plays defining the desired configuration of servers. Playbooks are written in YAML.
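
For example, a minimal playbook with a single play might look like this (an illustrative sketch; the host group and package name are placeholders):

- name: Example play
  hosts: all
  tasks:
    - name: Install a package
      package:
        name: tree
        state: present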

What is Hadoop?

Hadoop is an open-source framework from Apache used to store, process, and analyze big data, i.e., data sets that are very large in volume. It provides massive storage for any kind of data, enormous processing power, and the ability to handle virtually limitless concurrent tasks or jobs.

  • NameNode is the master node that runs on a separate node in the cluster.
  • DataNode is a slave node in Hadoop Cluster that stores the actual data as instructed by the NameNode.

🔰Ansible Document🔰

  • Ansible provides documentation of each module.
  • Use “ansible-doc” command to see any module’s description and required options.
# ansible-doc <module_name>
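
For example, to read the documentation of the ‘copy’ module used later in this article:
# ansible-doc copy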

To create any playbook, it’s important to first set a goal and then list the steps needed to complete the task.

🔰Following are the steps needed to create a playbook for Starting Hadoop Services🔰

🔹 Install JDK in Namenode & Datanode

🔹 Install Hadoop in Namenode & Datanode

🔹 Configure NameNode

🔹 Format NameNode

🔹 Start NameNode Service

🔹 Configure DataNode

🔹 Start DataNode Service

🔰Creating Inventory🔰

  • Create host groups in the inventory, as shown in the example below.
  • The namenode group contains the IP of the NameNode.
  • The datanode group contains the IPs of the DataNodes.
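
A minimal inventory for this setup might look like this (the IP addresses are placeholders; replace them with your own):

[namenode]
192.168.1.10

[datanode]
192.168.1.11
192.168.1.12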

🔰Create Variable Files🔰

  • Instead of hard-coding variables into the playbook, it’s good practice to define them in a separate YAML file.
hadoop: "hadoop-1.2.1-1.x86_64.rpm"
jdk: "jdk-8u171-linux-x64.rpm"
ip_namenode: <IP-of-master>
port_no: 9001
nn_dir: "/hdmaster"
dn_dir: "/hdslave"
  • To know the IP of the master, run this ad-hoc command against the namenode group:
# ansible namenode -m setup -a "filter=*ipv4"
  • Enter this master IP into your variable file as ip_namenode.
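
To make these variables available to the playbook, load the file with vars_files in each play (a sketch; the filename hadoop_vars.yml is illustrative):

- name: Install JDK and Hadoop in Namenode & Datanode
  hosts: all
  vars_files:
    - hadoop_vars.yml
  tasks:
    # ... tasks from the steps below ...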

🔰Create Configuration Files🔰

  • Create template files on your controller node to configure the NameNode and DataNode.
  • Create two XML files for hdfs-site.xml: one for the NameNode and one for the DataNode.
  • Create one XML file for core-site.xml, since the NameNode and DataNode share the same content. Example templates are sketched below.
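
For reference, the templates for Hadoop 1.x might look like this (a sketch; the property names follow Hadoop 1.2.1 conventions, and the Jinja2 variables come from the variable file above). hdfs-nn.xml for the NameNode (hdfs-dn.xml is analogous, with dfs.data.dir and {{ dn_dir }}):

<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>{{ nn_dir }}</value>
  </property>
</configuration>

core.xml, shared by the NameNode and DataNode:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://{{ ip_namenode }}:{{ port_no }}</value>
  </property>
</configuration>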

🔰Creating Ansible Playbook🔰

This playbook contains three plays

  • First play — Install JDK and Hadoop in both Namenode & Datanode.
  • Second play — Configure NameNode
  • Third play — Configure DataNode


➖ Play 1: Install JDK and Hadoop in both Namenode & Datanode

- name: Install JDK and Hadoop in Namenode & Datanode
  hosts: all
  tasks:


👉Step-1 : Copy and Install JDK

  • Download the JDK and Hadoop software onto the controller node.
  • Using the ‘copy’ module, copy the software to all managed nodes.
    - name: Copying JDK rpm file
      copy:
        src: "/root/{{ jdk }}"
        dest: "/root/"
    - name: Installing JDK
      command: "rpm -i {{ jdk }}"
      ignore_errors: yes


👉Step-2 : Copy and Install Hadoop

  • Install the software using the ‘command’ module.
  • Use the ‘ignore_errors’ option: if the software is already installed on a managed node, this task would fail and stop the playbook.
    - name: Copying Hadoop rpm file
      copy:
        src: "/root/{{ hadoop }}"
        dest: "/root/"
    - name: Installing Hadoop
      command: "rpm -i {{ hadoop }} --force"
      ignore_errors: yes
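
If you prefer an idempotent alternative (a sketch, not part of the original playbook), the ‘yum’ module can install a local rpm directly and skips the install when the package is already present, avoiding ignore_errors:

    - name: Installing Hadoop (idempotent alternative)
      yum:
        name: "/root/{{ hadoop }}"
        state: present
        disable_gpg_check: yes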


➖ Play 2: Configure NameNode

- name: Configure Namenode
  hosts: namenode
  tasks:


👉Step-1 : Create NameNode folder

  • Create a directory on the NameNode.
    - name: Creating NameNode folder
      file:
        dest: "{{ nn_dir }}"
        state: directory


👉Step-2 : Configure hdfs-site.xml

  • Push the hdfs-site.xml template from the controller node into /etc/hadoop/ on the NameNode.
    - name: Configuring hdfs-site.xml in NameNode
      template:
        src: "/root/playbooks/hdfs-nn.xml"
        dest: "/etc/hadoop/hdfs-site.xml"


👉Step-3 : Configure core-site.xml

  • Push the core-site.xml template from the controller node into /etc/hadoop/ on the NameNode.
    - name: Configuring core-site.xml in NameNode
      template:
        src: "/root/playbooks/core.xml"
        dest: "/etc/hadoop/core-site.xml"


👉Step-4 : Format the NameNode

  • Format the NameNode directory using the ‘shell’ module; ‘echo Y’ answers the confirmation prompt automatically.
    - name: Formatting Namenode Folder
      shell: "echo Y | hadoop namenode -format"


👉Step-5 : Stopping Firewall Service

  • Stop the firewall on the NameNode system so that the DataNodes can connect.
    - name: Stopping Firewall Service
      service:
        name: "firewalld"
        state: stopped


👉Step-6 : Start NameNode Service

  • Start the NameNode using the ‘command’ module.
    - name: Starting NameNode Service
      command: "hadoop-daemon.sh start namenode"


👉Step-7 : Check Status Using jps

  • Run the ‘jps’ command on the NameNode using the “command” module.
  • Save the output of the above command in a ‘result’ variable using the “register” keyword.
  • Then print the value of the variable using the “debug” module.
    - name: "Running JPS command to check services in NameNode"
      command: "jps"
      register: result
    - name: "Printing JPS Output"
      debug:
        var: result
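
If the NameNode started correctly, the registered output should include a NameNode entry alongside Jps.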


➖ Play 3: Configure DataNode

- name: Configure Datanode
  hosts: datanode
  tasks:


👉Step-1 : Create DataNode folder

  • Create a directory on each DataNode.
    - name: Creating DataNode folder
      file:
        dest: "{{ dn_dir }}"
        state: directory


👉Step-2 : Configure hdfs-site.xml

  • Push the hdfs-site.xml template from the controller node into /etc/hadoop/ on the DataNodes.
    - name: Configuring hdfs-site.xml in DataNode
      template:
        src: "/root/playbooks/hdfs-dn.xml"
        dest: "/etc/hadoop/hdfs-site.xml"


👉Step-3 : Configure core-site.xml

  • Push the core-site.xml template from the controller node into /etc/hadoop/ on the DataNodes.
    - name: Configuring core-site.xml in DataNode
      template:
        src: "/root/playbooks/core.xml"
        dest: "/etc/hadoop/core-site.xml"


👉Step-4 : Stopping Firewall Service

  • Stop the firewall on the DataNode systems.
    - name: Stopping Firewall Service
      service:
        name: "firewalld"
        state: stopped


👉Step-5 : Start DataNode Service

  • Start the DataNode using the ‘command’ module.
    - name: Starting DataNode Service
      command: "hadoop-daemon.sh start datanode"


👉Step-6 : Check Status Using jps

  • Run the ‘jps’ command on each DataNode using the “command” module.
  • Save the output of the above command in a ‘result’ variable using the “register” keyword.
  • Then print the value of the variable using the “debug” module.
    - name: "Running JPS command to check services in DataNode"
      command: "jps"
      register: result
    - name: "Printing JPS Output"
      debug:
        var: result
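
Similarly, a healthy DataNode should show a DataNode entry in the jps output.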


🔰Complete Playbook🔰

To see the complete playbook, variable files, templates files — check out my GitHub Repository.

🔰Running Ansible Playbook🔰

  • Before running the Ansible playbook, let’s first check the connection to the managed nodes.
  • Run either command:
# ansible all -m ping
# ansible all -a id
  • You can also check the playbook for syntax errors before running it:
# ansible-playbook <Playbook_name> --syntax-check
  • Run the playbook using the following command:
# ansible-playbook <Playbook_name>
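
If your inventory file is not the default /etc/ansible/hosts, pass its path explicitly with the -i option:
# ansible-playbook -i <inventory_file> <Playbook_name>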

🔰After Running Ansible Playbook🔰

  • Let’s confirm the result by running a few commands on the managed nodes.
  • Check the Hadoop report.

(Screenshots in the original article: the Hadoop report, and jps output from the NameNode, DataNode 1, and DataNode 2.)
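
To see the cluster report from the command line (in Hadoop 1.x, the dfsadmin report lists the live DataNodes and their capacity):
# hadoop dfsadmin -report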

🔰 Conclusion 🔰

From this article, we learned how to start Hadoop services by running an Ansible playbook from a single system.

We can add as many DataNodes as required just by entering their details in the inventory; users don’t need to configure each DataNode separately.
