How to debug hadoop mapreduce jobs from eclipse?

Ask Time:2012-03-29T05:18:52         Author:sangfroid

I'm running hadoop in a single-machine, local-only setup, and I'm looking for a nice, painless way to debug mappers and reducers in eclipse. Eclipse has no problem running mapreduce tasks. However, when I go to debug, it gives me this error :

12/03/28 14:03:23 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).

Okay, so I do some research. Apparently, I should use eclipse's remote debugging facility, and add this to my hadoop-env.sh :

-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5000

I do that and I can step through my code in eclipse. Only problem is that, because of the "suspend=y", I can't use the "hadoop" command from the command line to do things like look at the job queue; it hangs, I'm imagining because it's waiting for a debugger to attach. Also, I can't run "hbase shell" when I'm in this mode, probably for the same reason.

So basically, if I want to flip back and forth between "debug mode" and "normal mode", I need to update hadoop-env.sh and restart my machine. Major pain. So I have a few questions :

  1. Is there an easier way to debug mapreduce jobs in eclipse?

  2. How come eclipse can run my mapreduce jobs just fine, but for debugging I need to use remote debugging?

  3. Is there a way to tell hadoop to use remote debugging for mapreduce jobs, but to operate in normal mode for all other tasks? (such as "hadoop queue" or "hbase shell").

  4. Is there an easier way to switch hadoop-env.sh configurations without rebooting my machine? hadoop-env.sh is not executable by default.

  5. This is a more general question : what exactly is happening when I run hadoop in local-only mode? Are there any processes on my machine that are "always on" and executing hadoop jobs? Or does hadoop only do things when I run the "hadoop" command from the command line? What is eclipse doing when I run a mapreduce job from eclipse? I had to reference hadoop-core in my pom.xml in order to make my project work. Is eclipse submitting jobs to my installed hadoop instance, or is it somehow running it all from the hadoop-core-1.0.0.jar in my maven cache?

Here is my Main class :

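import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Main {
    public static void main(String[] args) throws Exception {
        Job job = new Job();
        // Lets hadoop locate the jar that contains this class.
        job.setJarByClass(Main.class);
        job.setJobName("FirstStage");

        // Input and output are plain local directories.
        FileInputFormat.addInputPath(job, new Path("/home/sangfroid/project/in"));
        FileOutputFormat.setOutputPath(job, new Path("/home/sangfroid/project/out"));

        job.setMapperClass(FirstStageMapper.class);
        job.setReducerClass(FirstStageReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Block until the job finishes; exit non-zero on failure.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}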

Author: sangfroid, reproduced under the CC 4.0 BY-SA copyright license with a link to the original source and this disclaimer.
Link to original article:https://stackoverflow.com/questions/9915808/how-to-debug-hadoop-mapreduce-jobs-from-eclipse
Kapil D :

The only way you can debug hadoop in eclipse is by running hadoop in local mode. The reason is that each map reduce task runs in its own JVM, so when you don't run hadoop in local mode, eclipse won't be able to debug it.

When you set hadoop to local mode, instead of using the hdfs API (which is default), the hadoop file system changes to file:///. Thus, running hadoop fs -ls will not be an hdfs command, but rather hadoop fs -ls file:///, a path to your local directory. Neither the JobTracker nor the NameNode runs.

These blog posts might help:

http://let-them-c.blogspot.com/2011/07/running-hadoop-locally-on-eclipse.html
http://let-them-c.blogspot.com/2011/07/configurations-of-running-hadoop.html
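
The local-mode settings described above can be made explicit. The following is a minimal sketch, assuming the Hadoop 1.x property names `fs.default.name` and `mapred.job.tracker`; the helper name `write_local_conf` and the scratch directory are hypothetical. It writes a throwaway conf directory that pins both settings to their local values.

```shell
# Write a throwaway conf directory that pins Hadoop 1.x to local mode.
# write_local_conf and its directory argument are hypothetical names.
write_local_conf() {
  mkdir -p "$1" || return 1

  # Use the local file system instead of HDFS.
  cat > "$1/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>file:///</value>
  </property>
</configuration>
EOF

  # Run jobs in-process instead of submitting them to a JobTracker.
  cat > "$1/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>local</value>
  </property>
</configuration>
EOF
}
```

It could then be used as `write_local_conf /tmp/hadoop-local-conf` followed by `hadoop --config /tmp/hadoop-local-conf fs -ls /tmp`, which should list the local /tmp directory rather than an HDFS path.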
2012-06-12T00:47:26
Jaime Garza :

Besides the recommended MRUnit, I like to debug with eclipse as well. I have a main program that instantiates a Configuration and executes the MapReduce job directly. I just debug with standard eclipse Debug configurations. Since I include the hadoop jars in my mvn spec, I have all of hadoop itself on my class path and have no need to run it against my installed hadoop. I always test with small data sets in local directories to make things easy. The configuration defaults behave as a standalone hadoop (the local file system is available).
2012-03-29T16:00:32
llewellyn falco :

I also like to debug via unit test with MRUnit. I will use this in combination with approvaltests, which creates an easy visualization of the Map Reduce process and makes it easy to pass in scenarios that are failing. It also runs seamlessly from eclipse.

For example:

    HadoopApprovals.verifyMapReduce(new WordCountMapper(),
        new WordCountReducer(), 0, "cat cat dog");

Will produce the output:

    [cat cat dog]
    -> maps via WordCountMapper to ->
    (cat, 1)
    (cat, 1)
    (dog, 1)

    -> reduces via WordCountReducer to ->
    (cat, 2)
    (dog, 1)

There's a video on the process here: http://t.co/leExFVrf
2012-09-20T18:31:10
Honza :

Adding args to hadoop's internal java command can be done via the HADOOP_OPTS env variable:

    export HADOOP_OPTS="-Xdebug -Xrunjdwp:transport=dt_socket,server=y,address=5005,suspend=y"
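
Exporting HADOOP_OPTS affects every later hadoop command in the same shell. A minimal sketch of one way to scope it to a single invocation follows, assuming hadoop-env.sh does not overwrite HADOOP_OPTS; the function names `jdwp_opts` and `hadoop_debug` are hypothetical, and the sketch uses the `-agentlib:jdwp` spelling from the question rather than `-Xrunjdwp`.

```shell
# Build a JDWP agent string: $1 is the port (default 5005),
# $2 is the suspend mode, y or n (default y).
jdwp_opts() {
  printf '%s\n' "-agentlib:jdwp=transport=dt_socket,server=y,suspend=${2:-y},address=${1:-5005}"
}

# Run one hadoop command with the debug agent attached. The assignment
# prefix applies only to this one process, so plain `hadoop` calls made
# afterwards in the same shell start without the agent.
hadoop_debug() {
  HADOOP_OPTS="$(jdwp_opts) ${HADOOP_OPTS:-}" hadoop "$@"
}
```

With this, something like `hadoop_debug jar myjob.jar Main` (a hypothetical jar name) would wait for the debugger on port 5005, while `hadoop queue -list` in the same terminal would not hang.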
2019-01-18T11:48:46
Jagdeep Singh :

Make changes in /bin/hadoop (hadoop-env.sh) script. Check to see what command has been fired. If the command is jar, then only add remote debug configuration.

    if [ "$COMMAND" = "jar" ] ; then
      exec "$JAVA" -Xdebug -Xrunjdwp:transport=dt_socket,server=y,address=8999 $JAVA_HEAP_MAX $HADOOP_OPTS $CLASS "$@"
    else
      exec "$JAVA" $JAVA_HEAP_MAX $HADOOP_OPTS $CLASS "$@"
    fi
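
A variation on the check above is to make debugging opt-in, so that even `hadoop jar` starts normally unless it is asked to wait for a debugger. This is only a sketch: `debug_flags_for` is a hypothetical helper, and `HADOOP_DEBUG_PORT` is a hypothetical environment variable, not a Hadoop setting.

```shell
# Print JDWP flags only when the subcommand is "jar" AND the caller has
# opted in by setting HADOOP_DEBUG_PORT (a hypothetical variable).
# Prints nothing otherwise, so the java command line stays unchanged.
debug_flags_for() {
  if [ "$1" = "jar" ] && [ -n "${HADOOP_DEBUG_PORT:-}" ]; then
    printf '%s\n' "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=${HADOOP_DEBUG_PORT}"
  fi
}
```

Inside the script, the two exec branches could then collapse into a single line such as `exec "$JAVA" $(debug_flags_for "$COMMAND") $JAVA_HEAP_MAX $HADOOP_OPTS $CLASS "$@"`, and a debug run would be started with `HADOOP_DEBUG_PORT=8999 hadoop jar ...`.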
2012-10-23T11:13:51