Read parquet data from Azure Blob container without downloading it locally

Ask Time:2019-05-08T00:00:38         Author:Limmy

I'm using the Azure SDK, avro-parquet, and the Hadoop libraries to read a Parquet file from a Blob container. Currently, I download the blob to a temp file and then create a ParquetReader from it.

try (InputStream input = blob.openInputStream()) {
    // Copy the blob to a local temp file so Hadoop can open it by path.
    Path tmp = Files.createTempFile("tempFile", ".parquet");
    Files.copy(input, tmp, StandardCopyOption.REPLACE_EXISTING);

    InputFile file = HadoopInputFile.fromPath(
            new org.apache.hadoop.fs.Path(tmp.toFile().getPath()),
            new Configuration());

    // try-with-resources closes the reader; the explicit
    // IOUtils.closeQuietly(input) call was redundant and is dropped.
    try (ParquetReader<GenericRecord> reader =
            AvroParquetReader.<GenericRecord>builder(file).build()) {
        GenericRecord record;
        while ((record = reader.read()) != null) {
            recordList.add(record);
        }
    }

    // Remove the temporary copy once the records are in memory.
    Files.deleteIfExists(tmp);
} catch (IOException | StorageException e) {
    log.error(e.getMessage(), e);
}

I want to read this file directly from the InputStream of the Azure blob item, without downloading it to my machine. There is a way to do this for S3 (see "Read parquet data from AWS s3 bucket"), but is the same thing possible for Azure?
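
For context on why a plain InputStream is not enough: a Parquet file keeps its metadata in a footer at the end of the file, so a reader has to seek rather than read front to back. In parquet-java the extension point for this is the InputFile interface (getLength() and newStream(), which returns a SeekableInputStream). The following is only a minimal sketch of that idea using the JDK alone. RangeSource, ByteArrayRangeSource, SeekableRangeStream and ParquetTail are hypothetical names invented for illustration, and the in-memory source stands in for a ranged blob download. Wiring it to the Azure SDK and to parquet-java's own classes is left as an assumption and is not shown or tested here.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical abstraction over a remote object that supports ranged reads. */
interface RangeSource {
    long length() throws IOException;

    /** Reads up to len bytes starting at offset; returns the count, or -1 past the end. */
    int readAt(long offset, byte[] buf, int off, int len) throws IOException;
}

/** In-memory stand-in so the sketch runs without any cloud SDK. */
final class ByteArrayRangeSource implements RangeSource {
    private final byte[] data;

    ByteArrayRangeSource(byte[] data) { this.data = data; }

    @Override public long length() { return data.length; }

    @Override public int readAt(long offset, byte[] buf, int off, int len) {
        if (offset >= data.length) return -1;
        int n = (int) Math.min(len, data.length - offset);
        System.arraycopy(data, (int) offset, buf, off, n);
        return n;
    }
}

/** Seekable stream over a RangeSource, with the seek/getPos shape a Parquet reader needs. */
final class SeekableRangeStream extends InputStream {
    private final RangeSource source;
    private long pos;

    SeekableRangeStream(RangeSource source) { this.source = source; }

    long getPos() { return pos; }

    void seek(long newPos) { pos = newPos; }

    @Override public int read() throws IOException {
        byte[] one = new byte[1];
        return read(one, 0, 1) <= 0 ? -1 : one[0] & 0xFF;
    }

    @Override public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) return 0;
        int n = source.readAt(pos, buf, off, len);
        if (n > 0) pos += n;
        return n;
    }
}

final class ParquetTail {
    /** Reads only the last 8 bytes: a little-endian footer length followed by the PAR1 magic. */
    static int footerLength(RangeSource source) throws IOException {
        long length = source.length();
        if (length < 12) throw new IOException("Too short to be a Parquet file");
        byte[] tail = new byte[8];
        try (SeekableRangeStream in = new SeekableRangeStream(source)) {
            in.seek(length - 8);
            int filled = 0;
            while (filled < tail.length) {
                int n = in.read(tail, filled, tail.length - filled);
                if (n < 0) throw new EOFException("Unexpected end of data");
                filled += n;
            }
        }
        String magic = new String(tail, 4, 4, StandardCharsets.US_ASCII);
        if (!"PAR1".equals(magic)) throw new IOException("Missing PAR1 magic");
        return ByteBuffer.wrap(tail, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }
}
```

With a source like this, only the byte ranges the reader asks for are fetched, so nothing has to be written to local disk. A separate route that may also apply is pointing HadoopInputFile.fromPath at a wasb:// or wasbs:// path through the hadoop-azure module, assuming that module and the storage account credentials are configured.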

Author: Limmy. Reproduced under the CC BY-SA 4.0 license with a link to the original source and this disclaimer.
Link to original article:https://stackoverflow.com/questions/56026462/read-parquet-data-from-azure-blob-container-without-downloading-it-locally