Before reading this post, please go through my previous post, “How MapReduce Algorithm Works”, to get some idea of the MapReduce algorithm. That post has already explained “How MapReduce performs WordCounting” theoretically.
And if you are not familiar with basic HDFS commands, please go through my post “Hadoop HDFS Basic Developer Commands” to get some basic knowledge of how to execute HDFS commands in the Cloudera environment.
In this post, we are going to develop the same WordCounting program using the Hadoop 2 MapReduce API and test it in the Cloudera environment.
MapReduce WordCounting Example
We need to write the following three programs to develop and test the MapReduce WordCount example:
- Mapper Program
- Reducer Program
- Client Program
NOTE:-
To develop MapReduce Programs, there are two versions of MR API:
- One from Hadoop 1.x (MapReduce Old API)
- Another from Hadoop 2.x (MapReduce New API)
In Hadoop 2.x, the MapReduce Old API is deprecated, so we are going to concentrate on the MapReduce New API to develop this WordCount example.
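The two APIs are easy to tell apart by package name: the old API lives under “org.apache.hadoop.mapred”, while the new API lives under “org.apache.hadoop.mapreduce”. A quick reference:

// Hadoop 1.x (MapReduce Old API) - deprecated:
import org.apache.hadoop.mapred.Mapper;

// Hadoop 2.x (MapReduce New API) - used throughout this post:
import org.apache.hadoop.mapreduce.Mapper;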
The Cloudera environment already provides an Eclipse IDE setup with the Hadoop 2.x API, so it is very easy to develop and test MapReduce programs with it.
To develop the WordCount MapReduce Application, please use the following steps:
- Open the default Eclipse IDE provided by the Cloudera environment.
- We can use an already created project or create a new Java project.
- For simplicity, I’m going to use the existing “training” Java project. It already has all the required Hadoop 2.x JARs on its classpath, so it is a ready-to-use Eclipse Java project.
- Create WordCount Mapper Program
- Create WordCount Reducer Program
- Create WordCount Client Program to test this application
Let us start developing these three programs in the next sections.
Mapper Program
Create a “WordCountMapper” Java class which extends the Mapper class as shown below:
package com.journaldev.hadoop.mrv1.wordcount;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the input line into words and emit a <word, 1> pair for each one.
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            context.write(new Text(tokenizer.nextToken()), new IntWritable(1));
        }
    }
}
Code Explanation:
- Our WordCountMapper class extends the Hadoop 2 MapReduce API class “Mapper”.
- The Mapper class is defined with the generic type Mapper<LongWritable, Text, Text, IntWritable>.
Here, in <LongWritable, Text, Text, IntWritable>:
- The first two, <LongWritable, Text>, represent the input data types of our WordCount Mapper Program.
For example:- In our example, we give it a file (a huge amount of data, in any format). The Mapper reads this file line by line and pairs each line with a unique long number (the byte offset of that line in the file), as shown below
<Unique_Long_Number, Line_Read_From_Input_File>
In Hadoop MapReduce API, it is equal to <LongWritable, Text>.
- The last two, <Text, IntWritable>, represent the output data types of our WordCount Mapper Program.
For example:- In our example, the WordCount Mapper Program produces output as shown below
<Unique_Word_From_Input_File, Word_Count>
In Hadoop MapReduce API, it is equal to <Text, IntWritable>.
- We have implemented the Mapper’s map() method and provided our Mapping Function logic in it; a small worked example follows this list.
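To make the Mapper’s data flow concrete, here is a worked example with a made-up input line (the words are hypothetical, not taken from my actual input file):

Input : <0, "Hello Hadoop Hello MapReduce">
Output: <"Hello", 1>, <"Hadoop", 1>, <"Hello", 1>, <"MapReduce", 1>

The key 0 is the byte offset of the line in the file; each word is emitted with a count of 1, and duplicate words are summed up later by the Reducer.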
Reducer Program
Create a “WordCountReducer” Java class which extends the Reducer class as shown below:
package com.journaldev.hadoop.mrv1.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum up all the counts emitted for this word by the Mappers.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
Code Explanation:
- Our WordCountReducer class extends the Hadoop 2 MapReduce API class “Reducer”.
- The Reducer class is defined with the generic type Reducer<Text, IntWritable, Text, IntWritable>.
Here, in <Text, IntWritable, Text, IntWritable>:
- The first two, <Text, IntWritable>, represent the input data types of our WordCount Reducer Program.
For example:- In our example, our Mapper Program gives <Text, IntWritable> output, which becomes the input of the Reducer Program.
<Unique_Word_From_Input_File, Word_Count>
In Hadoop MapReduce API, it is equal to <Text, IntWritable>.
- The last two, <Text, IntWritable>, represent the output data types of our WordCount Reducer Program.
For example:- In our example, the WordCount Reducer Program produces output as shown below
<Unique_Word_From_Input_File, Total_Word_Count>
In Hadoop MapReduce API, it is equal to <Text, IntWritable>.
- We have implemented the Reducer’s reduce() method and provided our Reduce Function logic in it; the worked example continues after this list.
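Continuing the made-up example from the Mapper section: after the shuffle-and-sort phase groups the Mapper output by key, reduce() receives each unique word together with the list of its counts and sums them up:

Input : <"Hadoop", [1]>, <"Hello", [1, 1]>, <"MapReduce", [1]>
Output: <"Hadoop", 1>, <"Hello", 2>, <"MapReduce", 1>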
Client Program
Create a “WordCountClient” Java class with a main() method as shown below:
package com.journaldev.hadoop.mrv1.wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountClient {

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(WordCountClient.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // args[0] = HDFS input path, args[1] = HDFS output path
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        boolean status = job.waitForCompletion(true);
        if (status) {
            System.exit(0);
        } else {
            System.exit(1);
        }
    }
}
Code Explanation:
- The Hadoop 2 MapReduce API has the “Job” class in the “org.apache.hadoop.mapreduce” package.
- The Job class is used to create a Job (a Map/Reduce job) to perform our WordCounting task.
- The Client program uses the Job object’s setter methods to configure all the MapReduce components: the Mapper, the Reducer, the input data types, the output data types, etc.
- This Job performs our WordCounting Mapping and Reducing tasks.
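One practical detail worth knowing: FileOutputFormat fails the job if the output path already exists. As a minimal sketch (this is not part of the original client program, and it assumes an extra import of org.apache.hadoop.fs.FileSystem), the Client could clear a stale output directory before submitting the job:

// A sketch: remove a stale output directory before the job starts,
// because MapReduce refuses to overwrite an existing output path.
Path outputPath = new Path(args[1]);
FileSystem fs = FileSystem.get(job.getConfiguration());
if (fs.exists(outputPath)) {
    fs.delete(outputPath, true); // 'true' means delete recursively
}
FileOutputFormat.setOutputPath(job, outputPath);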
NOTE:-
- As we discussed in my previous post, the MapReduce algorithm uses 3 functions: the Map function, the Combine function, and the Reduce function.
- By observing these 3 programs, we can see that we have developed only two functions: Map and Reduce. Then what about the Combine function?
- We have not configured a combiner here, so Hadoop simply skips the combine step; a combiner is optional and is set separately on the Job (see the sketch below).
- We will discuss “How to develop a Combine function” in my coming posts.
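As a preview, and only as a sketch: because our reduce logic is plain addition (which is associative and commutative), the WordCountReducer class itself could be reused as the combiner with one extra line in WordCountClient:

// A sketch: reuse the Reducer as a combiner to pre-sum counts on the
// map side; this works because addition is associative and commutative.
job.setCombinerClass(WordCountReducer.class);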
Now we have developed all the required components (programs). It’s time to test them.
Test MapReduce WordCounting Example
Our WordCount project’s final structure contains the three programs developed above: WordCountMapper.java, WordCountReducer.java, and WordCountClient.java.
Please use the following steps to test our MapReduce Application.
- Create our WordCount application JAR file using Eclipse IDE.
- Execute the following “hadoop” command to run our WordCounting Application
Syntax:-
hadoop jar <our-Jar-file-path> <Client-program> <Input-Path> <Output-Path>
Let us assume that we have already created the “/ram/mrv1” folder structure in the Hadoop HDFS FileSystem and copied our input file into it. (The output folder “/ram/mrv1/output” must not exist yet; the job creates it and fails if it is already there.) If you have not done that, please go through my previous post at “Hadoop HDFS Basic Developer Commands” to create it.
Example:-
hadoop jar /home/cloudera/JDWordCountMapReduceApp.jar com.journaldev.hadoop.mrv1.wordcount.WordCountClient /ram/mrv1/NASDAQ_daily_prices_C.csv /ram/mrv1/output
NOTE:-
Just for readability, I have shown this command on multiple lines. Please type it as a single line.
By going through the console log that this command prints, we can observe how the Map and Reduce jobs work to solve our WordCounting problem.
- Execute the following “hadoop” command to view the output directory content
hadoop fs -ls /ram/mrv1/output/
It shows the content of the “/ram/mrv1/output/” directory: typically a “_SUCCESS” marker file and the “part-r-00000” result file.
- Execute the following “hadoop” command to view our WordCounting Application output
hadoop fs -cat /ram/mrv1/output/part-r-00000
This command displays our WordCounting Application’s output. As my output file is too big, I am not able to show it here.
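The shape of the output is easy to describe, though: TextOutputFormat writes one word and its total count per line, separated by a tab character. A few hypothetical lines, just to illustrate the format (these are not from my actual output):

Hadoop	5
Hello	12
MapReduce	3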
NOTE:-
Here we have used some Hadoop HDFS commands to run and test our WordCounting Application. If you are not familiar with HDFS commands, please go through my “Hadoop HDFS Basic Developer Commands” post.
That’s all about the Hadoop 2.x MapReduce WordCounting example. We will develop some more useful MapReduce programs in my coming posts.
Please drop me a comment if you like my post or have any issues/suggestions.