How to define parquet schema for ParquetOutputFormat for Hadoop job in java?

Ask Time: 2017-03-16T20:44:06    Author: Viacheslav Shalamov

I have a Hadoop job in java, which has sequence output format:

job.setOutputFormatClass(SequenceFileOutputFormat.class);

I want to use Parquet format instead. I tried to set it in the naive way:

job.setOutputFormatClass(ParquetOutputFormat.class);
ParquetOutputFormat.setOutputPath(job, output);
ParquetOutputFormat.setCompression(job, CompressionCodecName.GZIP);
ParquetOutputFormat.setCompressOutput(job, true);

But when it comes to writing the job's result to disk, the job fails:

Error: java.lang.NullPointerException: writeSupportClass should not be null
    at parquet.Preconditions.checkNotNull(Preconditions.java:38)
    at parquet.hadoop.ParquetOutputFormat.getWriteSupport(ParquetOutputFormat.java:326)

It seems that Parquet needs a schema to be set, but I couldn't find any manual or guide on how to do that in my case. My Reducer class tries to write three long values per record, using org.apache.hadoop.io.LongWritable as the key and org.apache.mahout.cf.taste.hadoop.EntityEntityWritable as the value.

How can I define a schema for that?
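For context: the NullPointerException above is thrown because ParquetOutputFormat looks up a configured WriteSupport class, and the example GroupWriteSupport shipped with parquet-hadoop in turn needs a MessageType schema. A minimal sketch of such a schema for a record of three long values might look like this (the message and field names here are illustrative assumptions, not taken from the original post):

```
message EntityEntityRecord {
  required int64 keyId;     // assumed name for the LongWritable key
  required int64 entityA;   // assumed names for the two ids
  required int64 entityB;   // carried by EntityEntityWritable
}
```

One plausible wiring, sketched rather than verified: parse this text with parquet.schema.MessageTypeParser.parseMessageType(...), register it via GroupWriteSupport.setSchema(schema, job.getConfiguration()), and set ParquetOutputFormat.setWriteSupportClass(job, GroupWriteSupport.class) alongside the setOutputFormatClass call shown earlier.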

Author: Viacheslav Shalamov, reproduced under the CC 4.0 BY-SA copyright license with a link to the original source and this disclaimer.
Link to original article:https://stackoverflow.com/questions/42834468/how-to-define-parquet-schema-for-parquetoutputformat-for-hadoop-job-in-java