Jan 13

How To Build Optimal Hadoop Cluster

By Mladen ANTUNOVIĆ  //  Big Data, Hadoop as service

Preface

The amount of data stored in databases and files is growing every day, creating a need to build cheaper, maintainable, and scalable environments capable of storing large amounts of data ("Big Data"). Conventional RDBMS systems have become too expensive and insufficiently scalable for today's needs, so it is time to adopt new techniques that can satisfy them.

One of the technologies leading in this direction is cloud computing. There are different implementations of cloud computing, but we selected Hadoop, an Apache-licensed MapReduce framework based on Google's MapReduce framework.

In this document I will try to explain how to build a scalable Hadoop cluster in which it is possible to store, index, search, and maintain practically unlimited amounts of data.

This article will cover installation and configuration steps divided into these sections:

  • Network architecture
  • Operating System
  • Hardware requirements
  • Hadoop software installation/setup

Network Architecture

Based on the available documentation, the nodes of the cloud should be as physically close together as possible for optimal performance. The rule of thumb: the lower the network latency, the better the performance.

To reduce the amount of background traffic, a virtual private network has been created for the cloud. A second subnet is created for the application servers that serve as access points to the cloud.

The anticipated latency of the virtual private network is ~1-2 ms. At that latency, physical proximity becomes less of an issue, but we should confirm this through environment testing.

Network Architecture recommendation:

  • Dedicated TOR switches to Hadoop
  • Use dedicated core switching blades or switches
  • Ensure application servers are “close” to Hadoop
  • Consider Ethernet bonding for increased capacity
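As an illustration of the bonding recommendation, a bonded interface on RHEL/CentOS might be defined as below. The device names, addresses, and the LACP bonding mode are assumptions to be adapted to your NICs and switch configuration:

```shell
# /etc/sysconfig/network-scripts/ifcfg-bond0 -- all values are placeholders
cat > /etc/sysconfig/network-scripts/ifcfg-bond0 <<'EOF'
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes
IPADDR=10.1.1.10
NETMASK=255.255.255.0
BONDING_OPTS="mode=802.3ad miimon=100"
EOF

# enslave a physical NIC to the bond (repeat for the second NIC)
cat > /etc/sysconfig/network-scripts/ifcfg-eth0 <<'EOF'
DEVICE=eth0
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
ONBOOT=yes
EOF
```

Mode 802.3ad (LACP) aggregates both links for capacity, but requires matching configuration on the switch ports.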

Picture 1 – Hadoop Cluster Network Architecture

Operating System

We selected Linux as the operating system for our Hadoop cloud. There are numerous Linux distributions, such as Ubuntu, RedHat, and CentOS, and all of them can be used for building a Hadoop cluster. To reduce support costs and licensing fees, we will use CentOS 5.7. It is best practice to create a custom CentOS image preconfigured with the required software, so that all machines contain the same set of utilities.

Based on Cloudera's recommendations, the following settings should be applied at the OS level:

  • Filesystem
    • Ext3 file system
    • Remove atime on filesystems
    • Do not use Logical Volume Management
  • Use alternatives to manage links
  • Use configuration management system (Yum, Permission, sudoers, etc.) – Puppet
  • Reduce kernel swappiness
  • Revoke access to cloud machines from average users
  • Do NOT use Virtualization for cloud
  • At a minimum, the following Linux commands and utilities are required:
    • /etc/alternatives
    • ln, chmod, chown, chgrp, mount, umount, kill, rm, yum, mkdir
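A minimal sketch of applying a few of these settings on CentOS. The data disk device, mount point, and alternatives link are placeholders to adapt to your image:

```shell
# mount data disks with noatime to avoid access-time writes
# (device and mount point are placeholders)
echo "/dev/sdb1  /data/1  ext3  defaults,noatime  0 0" >> /etc/fstab

# reduce kernel swappiness persistently and apply it now
echo "vm.swappiness = 0" >> /etc/sysctl.conf
sysctl -p

# manage the active Hadoop configuration directory via alternatives
# (directory name is an assumption)
alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf \
    /etc/hadoop-0.20/conf.my_cluster 50
```

Baking these steps into the custom CentOS image keeps every node identical, which is exactly what the preconfigured-image practice above is for.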

Hardware requirements

As there are only two node types (namenode/jobtracker and datanode/tasktracker) in a Hadoop cluster, there should be no more than two or three different hardware configurations.

Picture 2 – Hadoop Cluster Server Roles

Basic hardware recommendation:

  • Namenode/JobTracker (2 x 1Gb/s Ethernet, 16 GB of RAM, 4xCPU, 100 GB disk)
  • Datanode (2 x 1Gb/s Ethernet, 8 GB of RAM, 4xCPU, Multiple disks with total amount of 500+ GB)

Depending on the amount of data we plan to store and process with our Hadoop cluster, the physical configuration of the data nodes can vary from our recommendation. It is highly recommended not to mix different hardware configurations within a Hadoop cluster, so as to avoid bottlenecks caused by less powerful machines.
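When sizing data nodes, remember that raw disk does not equal usable HDFS capacity: each block is replicated three times by default, and some raw space should be held back for MapReduce spills and temporary data. A quick sanity check, where the node count, disk size, and the 25% reserve are illustrative assumptions:

```shell
# usable HDFS capacity estimate; all inputs below are assumptions
nodes=10
raw_per_node_gb=500
raw_gb=$((nodes * raw_per_node_gb))

# hold back ~25% of raw capacity for MapReduce spills and temp data
after_reserve_gb=$((raw_gb * 75 / 100))

# divide by the default HDFS replication factor of 3
usable_gb=$((after_reserve_gb / 3))

echo "${usable_gb} GB usable"   # 10 x 500 GB raw -> 1250 GB usable
```

In other words, a cluster of ten 500 GB data nodes stores roughly 1.25 TB of user data, which is why the data node recommendation above calls for multiple disks totalling 500+ GB.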

Hadoop rack awareness

Hadoop has the concept of “Rack Awareness”.  As the Hadoop administrator you can manually define the rack number of each slave Data Node in your cluster.  Why would you go through the trouble of doing this?  There are two key reasons for this: Data loss prevention and network performance.
Picture 3 – Hadoop Cluster Rack Awareness
Remember that each block of data is replicated to multiple machines, so that the failure of one machine does not lose all copies of the data. Wouldn't it be unfortunate if all copies of the data happened to be located on machines in the same rack, and that rack experienced a failure such as a switch failure or power failure? To avoid this, somebody needs to know where the data nodes are located in the network topology and use that information to make an intelligent decision about where data replicas should exist in the cluster. That "somebody" is the namenode.

There is also an assumption that two machines in the same rack have more bandwidth and lower latency between each other than two machines in two different racks.  The rack switch uplink bandwidth is usually less than its downlink bandwidth.  Furthermore, in-rack latency is usually lower than cross-rack latency (but not always).

The disadvantage of rack awareness at this point is the manual work required to define it the first time, continually update it, and keep the information accurate. If the rack switch could automatically provide the namenode with the list of data nodes it has, that would be great.
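The manual mapping is supplied to the namenode through a topology script, configured via the topology.script.file.name property in core-site.xml: Hadoop invokes the script with one or more node addresses and expects one rack path per line. A minimal sketch, assuming the subnets below correspond to physical racks (the subnet-to-rack mapping is purely illustrative):

```shell
#!/bin/sh
# Hypothetical topology script: map a node address to a rack path.
# Subnet-to-rack assignments are assumptions for this example.
resolve_rack() {
  case "$1" in
    10.1.1.*) echo "/dc1/rack1" ;;
    10.1.2.*) echo "/dc1/rack2" ;;
    *)        echo "/default-rack" ;;
  esac
}

# Hadoop passes node IPs/hostnames as arguments; emit one rack per line
for node in "$@"; do
  resolve_rack "$node"
done
```

Because this script is the single source of rack truth, keeping it accurate as machines move is exactly the maintenance burden described above.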

Hadoop software installation/setup process

A Hadoop cloud can be built in multiple ways, such as:

  1. Manually downloading tarball and copying to cluster
  2. Using Yum repositories
  3. Using Automated Deployment tools such as Puppet

The manual approach is not recommended at all. It works for small clusters (up to four nodes), but it leads to maintenance and troubleshooting issues: all changes need to be applied manually, using scp or ssh, on every node.

Using a deployment tool such as Puppet is the preferred way of building and maintaining a Hadoop cluster, from the following perspectives:

  • Setup
  • Configuration
  • Maintenance
  • Scalability
  • Monitoring
  • Troubleshooting

Puppet is an automated administrative engine for Unix/Linux systems that performs administrative tasks (such as adding users, installing packages, and updating server configurations) based on a centralized specification. We will focus on Hadoop installation using Yum and Puppet.

Build Hadoop cluster using Yum/Puppet

To build a Hadoop cluster using Puppet, the following prerequisites need to be satisfied:

  • Central repository with all required Hadoop software
  • Puppet manifests created for Hadoop software deployment
  • Puppet manifests created for configuration management of Hadoop cluster
  • Framework for cluster maintenance (mainly sh or ksh scripts) to enable start/stop/restart.
  • Build entire server using Puppet (operating system plus additional software)

Note: When setting up a Hadoop cluster using Yum alone, all servers are already built: the operating system and additional software packages are installed, and the yum repository is already set up on all nodes.
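Pointing every node at the central repository is a one-file change. A sketch of such a repo definition, where the repo name and baseurl are placeholders for your internal mirror:

```shell
# Hypothetical internal Hadoop repo definition; name and baseurl are placeholders
cat > /etc/yum.repos.d/hadoop-internal.repo <<'EOF'
[hadoop-internal]
name=Internal Hadoop package mirror
baseurl=http://repo.example.internal/hadoop/cdh3/
gpgcheck=1
enabled=1
EOF
```

With the same repo file on all nodes (ideally baked into the custom OS image), the yum commands below resolve identically everywhere.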

Build Hadoop Datanode/Tasktracker

To install the datanode/tasktracker using Yum, the following commands need to be executed on every data node:

yum install hadoop-0.20-datanode -y
yum install hadoop-0.20-tasktracker -y
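Once the packages are installed, the daemons can be enabled and started via the init scripts the packages ship. The service names below follow the hadoop-0.20 packaging convention and should be verified against your distribution:

```shell
# enable and start the datanode and tasktracker daemons
# (service names assumed from the hadoop-0.20 package naming)
for svc in hadoop-0.20-datanode hadoop-0.20-tasktracker; do
  /sbin/chkconfig "$svc" on
  /sbin/service "$svc" start
done
```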

Puppet manifest for above task:

class setup_datanode {
  if ($is_datanode == true) {
    make_dfs_data_dir { $hadoop_disks: }
    make_mapred_local_dir { $hadoop_disks: }
    fix_hadoop_parent_dir_perm { $hadoop_disks: }
  }

  # fix hadoop parent dir permissions
  define fix_hadoop_parent_dir_perm() {
    ...
  }

  # make dfs data dir
  define make_dfs_data_dir() {
    ...
  }

  # make mapred local and system dir
  define make_mapred_local_dir() {
    ...
  }
} # setup_datanode

Build Hadoop Namenode (and secondary namenode)

To install the namenode using Yum, the following commands need to be executed on the namenode host:

yum install hadoop-0.20-namenode -y
yum install hadoop-0.20-secondarynamenode -y

Puppet manifest for above task:

class setup_namenode {

  if ($is_namenode == true or $is_standby_namenode == true) {
    ...
  }

  exec { "namenode-dfs-perm":
    ...
  }

  exec { "make ${nfs_namenode_dir}/dfs/name":
    ...
  }

  exec { "chgrp ${nfs_namenode_dir}/dfs/name":
    ...
  }

  if ($standby_namenode_host != "") {
    exec { "own ${nfs_standby_namenode_dir}":
      ...
    }
  }
  # /standby_namenode_hadoop

  if ($standby_namenode_host != "") {
    exec { "own ${standby_namenode_hadoop_dir}":
      ...
    }
  }
} # setup_namenode

class setup_secondary_namenode {

  if ($is_secondarynamenode == true) {
    ...
  }

  exec { "namenode-dfs-perm":
    ...
  }
} # setup_secondary_namenode

Build Hadoop JobTracker

To install the JobTracker using Yum, the following command needs to be executed on the JobTracker host:

yum install hadoop-0.20-jobtracker -y

The same Puppet manifest used for building the namenode is used here; the only difference is that on the JobTracker host the JobTracker daemon will be started. This is configured by setting is_jobtracker to true for the JobTracker host.