Today I sat down to go through the simple Cloudera Hadoop WordCount Tutorial on a newly installed CDH cluster and ran into a few noob issues. This took a lot longer than it should have; hopefully this will help someone else avoid a few pitfalls.

The following is based on attempting to get "Example: WordCount v1.0" running.

Creating the files from the example, both the input files and the Java source, went fine. The problems started when I went looking for the Java dependencies, and they continued from there. This is not being done on a preconfigured image; this is a fully functional four-node cluster. I'm working as root from root's home directory (important context for later).

$ javac -cp /opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/lib/hadoop/\*:/opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/lib/hadoop-mapreduce/\* -d wordcount_classes WordCount.java
$ jar cf wordcount.jar -C wordcount_classes .
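
Sanity check: listing the jar should show the WordCount class plus its Map and Reduce inner classes that the job references later (what follows is the expected listing for the tutorial code, not output captured from my session):

$ jar tf wordcount.jar
META-INF/
META-INF/MANIFEST.MF
org/myorg/WordCount.class
org/myorg/WordCount$Map.class
org/myorg/WordCount$Reduce.class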

$ hadoop fs -mkdir /user/cloudera /user/cloudera/wordcount /user/cloudera/wordcount/input
mkdir: Permission denied: user=root, access=WRITE, inode="/user":hdfs:supergroup:drwxr-xr-x
mkdir: Permission denied: user=root, access=WRITE, inode="/user":hdfs:supergroup:drwxr-xr-x
mkdir: Permission denied: user=root, access=WRITE, inode="/user":hdfs:supergroup:drwxr-xr-x

So root is not the superuser on HDFS; that role belongs to the hdfs user…
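
An alternative fix I realized later: have the HDFS superuser create a home directory for root once, then keep working as root afterwards. A rough sketch, assuming sudo on this box can run commands as the hdfs account (untested in this session):

$ sudo -u hdfs hadoop fs -mkdir /user/root
$ sudo -u hdfs hadoop fs -chown root:root /user/root

For this run I just switched users instead: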

$ su hdfs
$ hadoop fs -mkdir /user/cloudera /user/cloudera/wordcount /user/cloudera/wordcount/input
$ hadoop fs -put file* /user/cloudera/wordcount/input
put: `file0': No such file or directory
put: `file1': No such file or directory

The su kept the shell in /root (no dash, so no new login environment), and the hdfs user can't read the input files there, so they need to move somewhere hdfs can reach them…

$ cp file* /tmp/
$ hadoop fs -put /tmp/file* /user/cloudera/wordcount/input
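
A quick -ls should confirm that file0 and file1 made it into the input directory before submitting the job:

$ hadoop fs -ls /user/cloudera/wordcount/input
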
$ hadoop jar wordcount.jar org.myorg.WordCount /user/cloudera/wordcount/input /user/cloudera/wordcount/output
.
.
.
13/09/14 01:50:32 INFO mapred.JobClient: Task Id : attempt_201309140041_0001_m_000000_0, Status : FAILED
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.myorg.WordCount$Reduce not found
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1649)
        at org.apache.hadoop.mapred.JobConf.getCombinerClass(JobConf.java:1073)
        at org.apache.hadoop.mapred.Task$CombinerRunner.create(Task.java:1354)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.init(MapTask.java:857)
        at org.apache.hadoop.mapred.MapTask.createSortingCollector(MapTask.java:376)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:406)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundExceptio
.
.
.     Lots of Java exceptions
.
.

Evidently the hdfs user can't read the jar where it sits in root's home directory, so the job can't ship the WordCount classes out to the tasks…
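
A root shell makes the problem visible; on most distributions root's home directory isn't readable by other users at all (the exact mode and message vary by distro, so take this as illustrative):

$ sudo -u hdfs ls /root
ls: cannot open directory /root: Permission denied

The fix is the same as before: copy the jar somewhere world-readable.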

$ cp wordcount.jar /tmp/

Let’s try this again…

$ hadoop jar /tmp/wordcount.jar org.myorg.WordCount /user/cloudera/wordcount/input /user/cloudera/wordcount/output
13/09/14 01:53:55 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/09/14 01:53:56 INFO mapred.JobClient: Cleaning up the staging area hdfs://mikes.cl.pirho.com:8020/user/hdfs/.staging/job_201309140041_0002
13/09/14 01:53:56 ERROR security.UserGroupInformation: PriviledgedActionException as:hdfs (auth:SIMPLE) cause:org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://mikes.cl.pirho.com:8020/user/cloudera/wordcount/output already exists
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://mikes.cl.pirho.com:8020/user/cloudera/wordcount/output already exists
        at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:117)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:986)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:945)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:945)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:919)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1368)
        at org.myorg.WordCount.main(WordCount.java:55)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:208)

Must clean up the output directory first…

$ hadoop fs -rmr /user/cloudera/wordcount/output
rmr: DEPRECATED: Please use 'rm -r' instead.
Moved: 'hdfs://mikes.cl.pirho.com:8020/user/cloudera/wordcount/output' to trash at: hdfs://mikes.cl.pirho.com:8020/user/hdfs/.Trash/Current
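
For future reference, the non-deprecated form is the same operation, and adding -skipTrash deletes the directory outright instead of moving it to .Trash:

$ hadoop fs -rm -r /user/cloudera/wordcount/output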

Now we should be able to re-run the job…

$ hadoop jar /tmp/wordcount.jar org.myorg.WordCount /user/cloudera/wordcount/input /user/cloudera/wordcount/output
13/09/14 01:56:52 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/09/14 01:56:53 INFO mapred.FileInputFormat: Total input paths to process : 2
13/09/14 01:56:53 INFO mapred.JobClient: Running job: job_201309140041_0003
13/09/14 01:56:54 INFO mapred.JobClient:  map 0% reduce 0%
13/09/14 01:57:07 INFO mapred.JobClient:  map 33% reduce 0%
13/09/14 01:57:10 INFO mapred.JobClient:  map 100% reduce 0%
13/09/14 01:57:15 INFO mapred.JobClient:  map 100% reduce 50%
13/09/14 01:57:21 INFO mapred.JobClient:  map 100% reduce 100%
13/09/14 01:57:26 INFO mapred.JobClient: Job complete: job_201309140041_0003
13/09/14 01:57:26 INFO mapred.JobClient: Counters: 33
13/09/14 01:57:26 INFO mapred.JobClient:   File System Counters
13/09/14 01:57:26 INFO mapred.JobClient:     FILE: Number of bytes read=98
13/09/14 01:57:26 INFO mapred.JobClient:     FILE: Number of bytes written=782884
13/09/14 01:57:26 INFO mapred.JobClient:     FILE: Number of read operations=0
13/09/14 01:57:26 INFO mapred.JobClient:     FILE: Number of large read operations=0
13/09/14 01:57:26 INFO mapred.JobClient:     FILE: Number of write operations=0
13/09/14 01:57:26 INFO mapred.JobClient:     HDFS: Number of bytes read=410
13/09/14 01:57:26 INFO mapred.JobClient:     HDFS: Number of bytes written=41
13/09/14 01:57:26 INFO mapred.JobClient:     HDFS: Number of read operations=8
13/09/14 01:57:26 INFO mapred.JobClient:     HDFS: Number of large read operations=0
13/09/14 01:57:26 INFO mapred.JobClient:     HDFS: Number of write operations=4
13/09/14 01:57:26 INFO mapred.JobClient:   Job Counters
13/09/14 01:57:26 INFO mapred.JobClient:     Launched map tasks=3
13/09/14 01:57:26 INFO mapred.JobClient:     Launched reduce tasks=2
13/09/14 01:57:26 INFO mapred.JobClient:     Data-local map tasks=3
13/09/14 01:57:26 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=26405
13/09/14 01:57:26 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=15541
13/09/14 01:57:26 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/09/14 01:57:26 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/09/14 01:57:26 INFO mapred.JobClient:   Map-Reduce Framework
13/09/14 01:57:26 INFO mapred.JobClient:     Map input records=2
13/09/14 01:57:26 INFO mapred.JobClient:     Map output records=8
13/09/14 01:57:26 INFO mapred.JobClient:     Map output bytes=82
13/09/14 01:57:26 INFO mapred.JobClient:     Input split bytes=357
13/09/14 01:57:26 INFO mapred.JobClient:     Combine input records=8
13/09/14 01:57:26 INFO mapred.JobClient:     Combine output records=6
13/09/14 01:57:26 INFO mapred.JobClient:     Reduce input groups=5
13/09/14 01:57:26 INFO mapred.JobClient:     Reduce shuffle bytes=165
13/09/14 01:57:26 INFO mapred.JobClient:     Reduce input records=6
13/09/14 01:57:26 INFO mapred.JobClient:     Reduce output records=5
13/09/14 01:57:26 INFO mapred.JobClient:     Spilled Records=12
13/09/14 01:57:26 INFO mapred.JobClient:     CPU time spent (ms)=5450
13/09/14 01:57:26 INFO mapred.JobClient:     Physical memory (bytes) snapshot=679362560
13/09/14 01:57:26 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=3287724032
13/09/14 01:57:26 INFO mapred.JobClient:     Total committed heap usage (bytes)=345509888
13/09/14 01:57:26 INFO mapred.JobClient:   org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
13/09/14 01:57:26 INFO mapred.JobClient:     BYTES_READ=50

Success!

Let's look at the output:

$ hadoop fs -ls /user/cloudera/wordcount/output
Found 4 items
-rw-r--r--   3 hdfs supergroup          0 2013-09-14 01:57 /user/cloudera/wordcount/output/_SUCCESS
drwxr-xr-x   - hdfs supergroup          0 2013-09-14 01:56 /user/cloudera/wordcount/output/_logs
-rw-r--r--   3 hdfs supergroup         19 2013-09-14 01:57 /user/cloudera/wordcount/output/part-00000
-rw-r--r--   3 hdfs supergroup         22 2013-09-14 01:57 /user/cloudera/wordcount/output/part-00001
$ hadoop fs -tail /user/cloudera/wordcount/output/part-00000
Goodbye 1
Hadoop  2
$ hadoop fs -tail /user/cloudera/wordcount/output/part-00001
Bye     1
Hello   2
World   2
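
Since each part file is well under the 1 KB that -tail displays, those tails are actually the complete files. A glob with -cat prints everything in one shot:

$ hadoop fs -cat /user/cloudera/wordcount/output/part-*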

I haven't even read through the code to understand what this simple little job is doing; the goal was just to get a simple job to function on my new cluster and figure out how to run jobs. Now that some of the formalities are out of the way I'll be running some load testing and some more interesting jobs.
