I have a Hadoop job in Java that uses the sequence file output format:
job.setOutputFormatClass(SequenceFileOutputFormat.class);
I want to use Parquet format instead. I tried to set it in the naive way:
job.setOutputFormatClass(ParquetOutputFormat.class);
ParquetOutputFormat.setOutputPath(job, output);
ParquetOutputFormat.setCompression(job, CompressionCodecName.GZIP);
ParquetOutputFormat.setCompressOutput(job, true);
But when it comes to writing the job's result to disk, the job fails:
Error: java.lang.NullPointerException: writeSupportClass should not be null
at parquet.Preconditions.checkNotNull(Preconditions.java:38)
at parquet.hadoop.ParquetOutputFormat.getWriteSupport(ParquetOutputFormat.java:326)
It seems that Parquet needs a schema to be set, but I couldn't find any manual or guide on how to do that in my case.
My Reducer class writes out three long values per record, using org.apache.hadoop.io.LongWritable
as the key and org.apache.mahout.cf.taste.hadoop.EntityEntityWritable
as the value.
How can I define a schema for that?
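From the stack trace I'm guessing the missing piece is a write-support class registered via ParquetOutputFormat.setWriteSupportClass, with the schema supplied from inside it. Below is my sketch of what I think such a class would look like; note that the accessor names getAID()/getBID() on EntityEntityWritable are just my assumption, and I'm also unsure how the LongWritable key fits in, since ParquetOutputFormat appears to be a FileOutputFormat<Void, T> that only hands the value to the WriteSupport. Is this the right direction?

```java
import java.util.HashMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.mahout.cf.taste.hadoop.EntityEntityWritable;

import parquet.hadoop.api.WriteSupport;
import parquet.io.api.RecordConsumer;
import parquet.schema.MessageType;
import parquet.schema.MessageTypeParser;

// Sketch of a WriteSupport that declares a schema of two int64 columns
// for the EntityEntityWritable value. The getAID()/getBID() accessors
// are my guess at how to read the two IDs out of the writable.
public class EntityEntityWriteSupport extends WriteSupport<EntityEntityWritable> {

    private static final MessageType SCHEMA = MessageTypeParser.parseMessageType(
            "message pair {\n"
          + "  required int64 aId;\n"
          + "  required int64 bId;\n"
          + "}");

    private RecordConsumer consumer;

    @Override
    public WriteContext init(Configuration configuration) {
        // Hand the schema (and no extra metadata) to the Parquet writer.
        return new WriteContext(SCHEMA, new HashMap<String, String>());
    }

    @Override
    public void prepareForWrite(RecordConsumer recordConsumer) {
        this.consumer = recordConsumer;
    }

    @Override
    public void write(EntityEntityWritable record) {
        consumer.startMessage();
        consumer.startField("aId", 0);
        consumer.addLong(record.getAID());   // assumed accessor
        consumer.endField("aId", 0);
        consumer.startField("bId", 1);
        consumer.addLong(record.getBID());   // assumed accessor
        consumer.endField("bId", 1);
        consumer.endMessage();
    }
}
```

And then in the driver, alongside the settings I already have:

```java
job.setOutputFormatClass(ParquetOutputFormat.class);
ParquetOutputFormat.setWriteSupportClass(job, EntityEntityWriteSupport.class);
```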