Today I sat down to go through the simple Cloudera Hadoop WordCount Tutorial on a newly installed CDH cluster and ran into a few noob issues. This took a lot longer than it should have; hopefully this will help someone else avoid a few pitfalls.
The following is based on attempting to get the Example: WordCount v1.0 running.
I had no issues creating the files from the example, both the input files and the Java source. The problems started when I went looking for the Java dependencies, and they continued from there. This is not being done on an image; this is a fully functional 4-node cluster. Everything is being run as root from root's home directory (this is important context for later).
$ javac -cp /opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/lib/hadoop/\*:/opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/lib/hadoop-mapreduce/\* -d wordcount_classes WordCount.java
$ jar cf wordcount.jar -C wordcount_classes .
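Worth noting: if the classes don't end up in the jar correctly, it shows up later as a ClassNotFoundException much like the one below, so a quick jar tf sanity check is cheap insurance. For the tutorial's WordCount v1.0 the listing should include something like:

$ jar tf wordcount.jar
org/myorg/WordCount.class
org/myorg/WordCount$Map.class
org/myorg/WordCount$Reduce.class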
$ hadoop fs -mkdir /user/cloudera /user/cloudera/wordcount /user/cloudera/wordcount/input
mkdir: Permission denied: user=root, access=WRITE, inode="/user":hdfs:supergroup:drwxr-xr-x
mkdir: Permission denied: user=root, access=WRITE, inode="/user":hdfs:supergroup:drwxr-xr-x
mkdir: Permission denied: user=root, access=WRITE, inode="/user":hdfs:supergroup:drwxr-xr-x
So, as root I don't have write access to the HDFS filesystem…
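In hindsight, a cleaner fix would probably have been to stay as root and let the hdfs superuser create the directory and chown it over. Something like this (a sketch I haven't run on this cluster):

$ sudo -u hdfs hadoop fs -mkdir /user/cloudera
$ sudo -u hdfs hadoop fs -chown root /user/cloudera

At the time, though, I just switched to the hdfs user: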
$ su hdfs
$ hadoop fs -mkdir /user/cloudera /user/cloudera/wordcount /user/cloudera/wordcount/input
$ hadoop fs -put file* /user/cloudera/wordcount/input
put: `file0': No such file or directory
put: `file1': No such file or directory
Since we are now running as hdfs, and that user can't read files sitting in root's home directory, we need to copy the files somewhere hdfs can access them…
$ cp file* /tmp/
$ hadoop fs -put /tmp/file* /user/cloudera/wordcount/input
$ hadoop jar wordcount.jar org.myorg.WordCount /user/cloudera/wordcount/input /user/cloudera/wordcount/output
. . .
13/09/14 01:50:32 INFO mapred.JobClient: Task Id : attempt_201309140041_0001_m_000000_0, Status : FAILED
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.myorg.WordCount$Reduce not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1649)
    at org.apache.hadoop.mapred.JobConf.getCombinerClass(JobConf.java:1073)
    at org.apache.hadoop.mapred.Task$CombinerRunner.create(Task.java:1354)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.init(MapTask.java:857)
    at org.apache.hadoop.mapred.MapTask.createSortingCollector(MapTask.java:376)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:406)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundExceptio
. . . Lots of java exceptions . .
Evidently the hdfs user doesn't have permission to read the jar from its current location (root's home directory) either…
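That tracks: on most Linux installs /root isn't readable by other users, so the hdfs user can't read a jar sitting there. A quick way to check the mode (output varies by distro):

$ ls -ld /root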
$ cp wordcount.jar /tmp/
Let’s try this again…
$ hadoop jar /tmp/wordcount.jar org.myorg.WordCount /user/cloudera/wordcount/input /user/cloudera/wordcount/output
13/09/14 01:53:55 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/09/14 01:53:56 INFO mapred.JobClient: Cleaning up the staging area hdfs://mikes.cl.pirho.com:8020/user/hdfs/.staging/job_201309140041_0002
13/09/14 01:53:56 ERROR security.UserGroupInformation: PriviledgedActionException as:hdfs (auth:SIMPLE) cause:org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://mikes.cl.pirho.com:8020/user/cloudera/wordcount/output already exists
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://mikes.cl.pirho.com:8020/user/cloudera/wordcount/output already exists
    at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:117)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:986)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:945)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:945)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:919)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1368)
    at org.myorg.WordCount.main(WordCount.java:55)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Must clean up the output directory first…
$ hadoop fs -rmr /user/cloudera/wordcount/output
rmr: DEPRECATED: Please use 'rm -r' instead.
Moved: 'hdfs://mikes.cl.pirho.com:8020/user/cloudera/wordcount/output' to trash at: hdfs://mikes.cl.pirho.com:8020/user/hdfs/.Trash/Current
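As the warning says, -rmr is deprecated in favor of rm -r. And since a trashed copy still takes up space while you iterate, -skipTrash deletes the directory outright (both are standard hadoop fs options):

$ hadoop fs -rm -r -skipTrash /user/cloudera/wordcount/output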
Now we should be able to re-run the job…
$ hadoop jar /tmp/wordcount.jar org.myorg.WordCount /user/cloudera/wordcount/input /user/cloudera/wordcount/output
13/09/14 01:56:52 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/09/14 01:56:53 INFO mapred.FileInputFormat: Total input paths to process : 2
13/09/14 01:56:53 INFO mapred.JobClient: Running job: job_201309140041_0003
13/09/14 01:56:54 INFO mapred.JobClient:  map 0% reduce 0%
13/09/14 01:57:07 INFO mapred.JobClient:  map 33% reduce 0%
13/09/14 01:57:10 INFO mapred.JobClient:  map 100% reduce 0%
13/09/14 01:57:15 INFO mapred.JobClient:  map 100% reduce 50%
13/09/14 01:57:21 INFO mapred.JobClient:  map 100% reduce 100%
13/09/14 01:57:26 INFO mapred.JobClient: Job complete: job_201309140041_0003
13/09/14 01:57:26 INFO mapred.JobClient: Counters: 33
13/09/14 01:57:26 INFO mapred.JobClient:   File System Counters
13/09/14 01:57:26 INFO mapred.JobClient:     FILE: Number of bytes read=98
13/09/14 01:57:26 INFO mapred.JobClient:     FILE: Number of bytes written=782884
13/09/14 01:57:26 INFO mapred.JobClient:     FILE: Number of read operations=0
13/09/14 01:57:26 INFO mapred.JobClient:     FILE: Number of large read operations=0
13/09/14 01:57:26 INFO mapred.JobClient:     FILE: Number of write operations=0
13/09/14 01:57:26 INFO mapred.JobClient:     HDFS: Number of bytes read=410
13/09/14 01:57:26 INFO mapred.JobClient:     HDFS: Number of bytes written=41
13/09/14 01:57:26 INFO mapred.JobClient:     HDFS: Number of read operations=8
13/09/14 01:57:26 INFO mapred.JobClient:     HDFS: Number of large read operations=0
13/09/14 01:57:26 INFO mapred.JobClient:     HDFS: Number of write operations=4
13/09/14 01:57:26 INFO mapred.JobClient:   Job Counters
13/09/14 01:57:26 INFO mapred.JobClient:     Launched map tasks=3
13/09/14 01:57:26 INFO mapred.JobClient:     Launched reduce tasks=2
13/09/14 01:57:26 INFO mapred.JobClient:     Data-local map tasks=3
13/09/14 01:57:26 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=26405
13/09/14 01:57:26 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=15541
13/09/14 01:57:26 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/09/14 01:57:26 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/09/14 01:57:26 INFO mapred.JobClient:   Map-Reduce Framework
13/09/14 01:57:26 INFO mapred.JobClient:     Map input records=2
13/09/14 01:57:26 INFO mapred.JobClient:     Map output records=8
13/09/14 01:57:26 INFO mapred.JobClient:     Map output bytes=82
13/09/14 01:57:26 INFO mapred.JobClient:     Input split bytes=357
13/09/14 01:57:26 INFO mapred.JobClient:     Combine input records=8
13/09/14 01:57:26 INFO mapred.JobClient:     Combine output records=6
13/09/14 01:57:26 INFO mapred.JobClient:     Reduce input groups=5
13/09/14 01:57:26 INFO mapred.JobClient:     Reduce shuffle bytes=165
13/09/14 01:57:26 INFO mapred.JobClient:     Reduce input records=6
13/09/14 01:57:26 INFO mapred.JobClient:     Reduce output records=5
13/09/14 01:57:26 INFO mapred.JobClient:     Spilled Records=12
13/09/14 01:57:26 INFO mapred.JobClient:     CPU time spent (ms)=5450
13/09/14 01:57:26 INFO mapred.JobClient:     Physical memory (bytes) snapshot=679362560
13/09/14 01:57:26 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=3287724032
13/09/14 01:57:26 INFO mapred.JobClient:     Total committed heap usage (bytes)=345509888
13/09/14 01:57:26 INFO mapred.JobClient:   org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
13/09/14 01:57:26 INFO mapred.JobClient:     BYTES_READ=50
Success!
Let's look at the output:
$ hadoop fs -ls /user/cloudera/wordcount/output
Found 4 items
-rw-r--r--   3 hdfs supergroup          0 2013-09-14 01:57 /user/cloudera/wordcount/output/_SUCCESS
drwxr-xr-x   - hdfs supergroup          0 2013-09-14 01:56 /user/cloudera/wordcount/output/_logs
-rw-r--r--   3 hdfs supergroup         19 2013-09-14 01:57 /user/cloudera/wordcount/output/part-00000
-rw-r--r--   3 hdfs supergroup         22 2013-09-14 01:57 /user/cloudera/wordcount/output/part-00001
$ hadoop fs -tail /user/cloudera/wordcount/output/part-00000
Goodbye 1
Hadoop  2
$ hadoop fs -tail /user/cloudera/wordcount/output/part-00001
Bye     1
Hello   2
World   2
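One note on -tail: it only prints the last kilobyte of each file. For output this small, -cat shows everything, and a shell redirect gives you a single merged local copy (the /tmp filename is just an example):

$ hadoop fs -cat /user/cloudera/wordcount/output/part-*
$ hadoop fs -cat /user/cloudera/wordcount/output/part-* > /tmp/wordcount-results.txt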
I haven't even read through the code to understand what this simple little job is doing; the goal was just to get a simple job to function on my new cluster and figure out how to run jobs. Now that some of the formalities are out of the way, I'll be running some load testing and some more interesting jobs.