
Amazon Elastic MapReduce web service – Part 1



Introduction

Hadoop as a platform enables us to store and process vast amounts of data. Storage capacity and processing power are directly related to the size of our Hadoop cluster, and scaling up is as simple as adding new nodes to the cluster. The Amazon Elastic MapReduce service gives us an opportunity to easily and cost-effectively provision as much or as little capacity as we need to process our data. We are also going to leverage Amazon Simple Storage Service (Amazon S3) to store our input and output data.

In this series of articles I'm going to show how easily we can build and run a custom Hadoop MapReduce job using the Amazon Elastic MapReduce web service. The first part demonstrates the basic use of EMR for running a custom Hadoop MapReduce job.

Setting up a Hadoop cluster on Amazon ElasticMR

For the purpose of this article we're going to use the Amazon Elastic MapReduce service to provision a small one-node Hadoop cluster.

To get started we're going to need an Amazon Web Services (AWS) account so we can create a new job flow to execute our custom Hadoop job. Using the Elastic MapReduce Command Line Interface (EMR CLI), creating our cluster is as simple as issuing the following command:
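With the (now legacy) Ruby elastic-mapreduce CLI, the command would look roughly like the following; the cluster name and instance type are only illustrative, adjust them to your needs:

elastic-mapreduce --create --alive --name "Face detection cluster" --num-instances 1 --instance-type m1.medium

The --alive flag keeps the cluster running (in the WAITING state) after it starts, so we can submit our job to it later.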

 

This command will return a job flow ID which we will use when working with the cluster from the EMR CLI. Using the returned job flow ID we can check the status of our cluster at any time by running this command:
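With the same CLI, listing the active job flows shows the current state of our cluster (it should eventually reach the WAITING state):

elastic-mapreduce --list --active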

For detailed information about our job flow we can use the following command:
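For example (j-XXXXXXXXXXXXX stands for the job flow ID returned earlier):

elastic-mapreduce --describe --jobflow j-XXXXXXXXXXXXX

The returned JSON includes, among other things, the MasterPublicDnsName we will need later to connect to the master node over SSH.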

 

Once our job flow reaches the WAITING state, we're ready to submit our Hadoop job.

Note: Detailed instructions on how to install, set up and use the Amazon EMR CLI can be found in the Amazon Elastic MapReduce – Getting Started Guide.

Creating a custom Hadoop job for ElasticMR

The next step is to create our custom Hadoop MapReduce job and execute it on our EMR cloud.

For the purpose of this article we have developed a custom MapReduce job for image face detection using the Open Computer Vision Library (OpenCV) and its Java interface, JavaCV.

In my next article I will explain the details of the MapReduce job, setting up the Hadoop cluster to work with the OpenCV libraries, and reading and writing our data from Amazon S3. Here we will present only the basic steps to get a custom Hadoop job running on the EMR cloud.

The snippet below shows our Hadoop job class:

 

It is important to underline the following line:

As our job has multiple dependencies on third-party libraries (OpenCV, JavaCV, JavaCPP, …), some of which are Java libraries while others are native Linux binaries, the decision was made not to include the libraries in the Hadoop job jar; instead we will pass all libraries using the Hadoop distributed cache when submitting the job. This method may seem more complex than packaging all dependencies inside the job jar, but I have found it to be the only way to use native Linux libraries inside a Hadoop MapReduce job without installing the libraries on all nodes in the Hadoop cluster.
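To sketch the idea: a job whose driver goes through Hadoop's GenericOptionsParser (for example by implementing the Tool interface) can have its Java dependencies and native binaries placed in the distributed cache directly from the command line. The class, jar and library names below are placeholders, not the actual ones from our project:

hadoop jar facedetection-job.jar com.example.FaceDetectionJob \
  -libjars javacv.jar,javacpp.jar \
  -files libopencv_core.so,libopencv_objdetect.so \
  s3n://mybucket/input s3n://mybucket/output

The -libjars entries are added to the tasks' classpath, while -files are shipped through the distributed cache and symlinked into each task's working directory, which is typically where the native libraries can be loaded from.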

As we're using Amazon S3 for reading and writing our data, we will also use S3 to upload our job jar and all its dependencies. With the s3cmd command we can easily upload all the files we need to S3.
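For example, with placeholder bucket and file names:

s3cmd put facedetection-job.jar s3://mybucket/job/
s3cmd put lib/*.jar s3://mybucket/lib/
s3cmd put native/*.so s3://mybucket/native/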

Here's the layout of our S3 bucket after we have successfully uploaded our job and all of its dependencies to S3:
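As an illustration only (the actual names will differ), the bucket is organized roughly like this:

s3://mybucket/input/     input images
s3://mybucket/job/       the job jar (facedetection-job.jar)
s3://mybucket/lib/       JavaCV, JavaCPP and other Java dependencies
s3://mybucket/native/    native OpenCV libraries
s3://mybucket/output/    created by the job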

Running the job

Once we have our EMR cluster up and running, and the input data, job and necessary dependencies uploaded to Amazon S3, we're ready to run our job. The following article describes how to run various types of EMR job flows (Hive, Pig, streaming or custom jar): How to Create and Debug an Amazon Elastic MapReduce Job Flow.

In our case, we are not able to use the proposed method of running a custom Hadoop job jar, because we have to use the Hadoop distributed cache when running our job, which is not supported in the latest version of the EMR CLI. Because of that, we will run our job directly from our EMR cluster after we log into it using SSH.

To establish an SSH connection with our cluster we need the RSA private key (.pem) downloaded from the AWS Management Console. With the .pem file available, we connect to the Hadoop master node using the "MasterPublicDnsName" found in the job flow definition, using the following command:
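Assuming the key file is called mykey.pem (the default user on the EMR Hadoop nodes is hadoop), the command looks roughly like this:

ssh -i mykey.pem hadoop@<MasterPublicDnsName>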

Once connected, we have to set up the s3cmd tool and download our Hadoop job jar:
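A sketch of those steps, again with placeholder names (s3cmd --configure prompts for the AWS access and secret keys):

s3cmd --configure
s3cmd get s3://mybucket/job/facedetection-job.jar
s3cmd get --recursive s3://mybucket/lib/ lib/
s3cmd get --recursive s3://mybucket/native/ native/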

 

To simplify running our job, we can create a small batch script:
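A hypothetical version of such a script, matching the layout above (class, bucket and file names are placeholders):

#!/bin/bash
# Build comma-separated lists of the Java and native dependencies
LIBJARS=$(ls lib/*.jar | tr '\n' ',' | sed 's/,$//')
NATIVELIBS=$(ls native/*.so | tr '\n' ',' | sed 's/,$//')

# Submit the job, shipping the dependencies through the distributed cache
hadoop jar facedetection-job.jar com.example.FaceDetectionJob \
  -libjars "$LIBJARS" \
  -files "$NATIVELIBS" \
  s3n://mybucket/input s3n://mybucket/output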


The following is the job's output as logged to the console:

 

The result of our job is a set of text files (one file per job mapper) containing information about the images (S3 bucket, S3 key, size…) in which a face was detected, including the bounding box of each detected face.

Once we're done playing with our EMR cluster, we have to terminate it:
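With the legacy EMR CLI, termination is one more command (using the job flow ID returned when the cluster was created); the cluster can equally be terminated from the AWS Management Console:

elastic-mapreduce --terminate --jobflow j-XXXXXXXXXXXXX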

Summary

This article demonstrated some of the basic steps we have to follow to get a custom MapReduce job running on Amazon Elastic MapReduce.

The next article in this series will cover the details of our custom job for image face detection.