I have this requirement: connect to S3 and read a Parquet file and its contents in Java.
I have done it the Hadoop way and it works:
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-aws</artifactId>
    <version>3.3.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>3.3.1</version>
</dependency>
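One mitigation I have been looking at (a sketch only; it assumes the vulnerable artifact comes in under the `log4j:log4j` coordinates) is to exclude the transitive Log4j 1.x dependency in the POM:

```xml
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-aws</artifactId>
    <version>3.3.1</version>
    <exclusions>
        <!-- Drop the transitive Log4j 1.2.17 artifact -->
        <exclusion>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
        </exclusion>
    </exclusions>
</dependency>
```

Hadoop still expects a Log4j 1.x API on the classpath at runtime, so a drop-in replacement (for example reload4j) would presumably need to be added separately.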
However, the above dependencies pull in Log4j 1.2.17, which has known vulnerabilities.
Apache Hadoop 3.3.1, the latest release, came out in June 2021, before the Log4j vulnerability issue surfaced.
Does anyone know a workaround?
Can this requirement be achieved without the Hadoop dependency?
Here is my code that does the job:
ParquetFileMetaData parquetFileMetaData = null;
String filePath = "s3a://" + bucket + parquetFilePath;
Path path = new Path(filePath);
ParquetMetadata readFooter = null;
try {
    // Read the footer to get the file-level metadata and schema
    readFooter = ParquetFileReader.readFooter(config, path, ParquetMetadataConverter.NO_FILTER);
    MessageType schema = readFooter.getFileMetaData().getSchema();
    // Open a reader for the row groups described by the footer
    ParquetFileReader parquetFileReader = new ParquetFileReader(config, path, readFooter);
    parquetFileMetaData = new ParquetFileMetaData();
    parquetFileMetaData.setSchema(schema);
    parquetFileMetaData.setParquetFileReader(parquetFileReader);
} catch (IOException e) {
    e.printStackTrace();
}
My S3 is not Amazon S3; it is an S3-compatible object store.
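For reference, these are the s3a properties typically used to point the connector at a non-AWS endpoint; a minimal sketch of the `core-site.xml` entries, with all values as placeholders:

```xml
<configuration>
    <property>
        <name>fs.s3a.endpoint</name>
        <!-- Placeholder: the URL of the S3-compatible store -->
        <value>https://my-object-store.example.com</value>
    </property>
    <property>
        <name>fs.s3a.path.style.access</name>
        <!-- Many S3-compatible stores require path-style requests -->
        <value>true</value>
    </property>
    <property>
        <name>fs.s3a.access.key</name>
        <value>ACCESS_KEY</value>
    </property>
    <property>
        <name>fs.s3a.secret.key</name>
        <value>SECRET_KEY</value>
    </property>
</configuration>
```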