
read parquet file from s3 using pyspark issue

Ask Time: 2019-09-26T23:45:03    Author: Moshik Mishaeli


I am trying to read parquet files from S3, but it kills my server (processing hangs for a very long time and I must reset the machine in order to continue working). There is no issue writing the parquet files to S3, and when writing and reading locally everything works perfectly. Reading small files from S3 also causes no issues. As seen in many threads, Spark's "s3a" file system client (the second config below) should be able to handle it, but in fact I get a 'NoSuchMethodError' when trying to use s3a (with the proper s3a configuration listed below):

Py4JJavaError: An error occurred while calling o155.json.
: java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)
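(A quick way to see which Hadoop build the session actually loaded, and therefore which aws-java-sdk it expects, is to query the JVM through the Py4J gateway; a minimal diagnostic sketch, assuming a plain local session:)

from pyspark.sql import SparkSession

# Diagnostic sketch: print the Hadoop version that the running session
# actually loaded, so the AWS SDK jars can be matched against it.
spark = SparkSession.builder.appName('version-check').getOrCreate()
print(spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())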

The s3 SparkSession configuration below works, but only for small files; the s3a configuration is the one that throws the 'NoSuchMethodError' above.

s3 config:

spark = SparkSession.builder.appName('JSON2parquet')\
            .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")\
            .config('fs.s3.awsAccessKeyId', 'myAccessId')\
            .config('fs.s3.awsSecretAccessKey', 'myAccessKey')\
            .config('fs.s3.impl', 'org.apache.hadoop.fs.s3native.NativeS3FileSystem')\
            .config("spark.sql.parquet.filterPushdown", "true")\
            .config("spark.sql.parquet.mergeSchema", "false")\
            .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")\
            .config("spark.speculation", "false")\
            .getOrCreate()
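
(For reference, a read against this session would look like the following; the bucket and key prefix are placeholders, not the real paths:)

df = spark.read.parquet('s3://my-bucket/path/to/parquet/')  # placeholder path
df.printSchema()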

s3a config:

spark = SparkSession.builder.appName('JSON2parquet')\
            .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")\
            .config('spark.hadoop.fs.s3a.access.key', 'myAccessId')\
            .config('spark.hadoop.fs.s3a.secret.key', 'myAccessKey')\
            .config('spark.hadoop.fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')\
            .config("spark.sql.parquet.filterPushdown", "true")\
            .config("spark.sql.parquet.mergeSchema", "false")\
            .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")\
            .config("spark.speculation", "false")\
            .getOrCreate()
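
(The same s3a settings can equivalently be applied to the live Hadoop configuration after the session starts, which can help rule out builder keys being silently ignored; a sketch with placeholder values:)

# Apply the s3a settings directly on the running session's Hadoop config.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set('fs.s3a.access.key', 'myAccessId')      # placeholder
hadoop_conf.set('fs.s3a.secret.key', 'myAccessKey')     # placeholder
hadoop_conf.set('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
df = spark.read.parquet('s3a://my-bucket/path/to/parquet/')  # placeholder path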

JARs for s3 read-write (spark.driver.extraClassPath):

hadoop-aws-2.7.3.jar,
**hadoop-common-2.7.3.jar**, -- added in order to use S3a
aws-java-sdk-s3-1.11.156.jar
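
(For context: hadoop-aws 2.7.3 was compiled against the monolithic aws-java-sdk 1.7.4, not the 1.11.x SDK that is split into per-service jars such as aws-java-sdk-s3, and that kind of mismatch is a common cause of exactly this TransferManager NoSuchMethodError. One way to get a mutually consistent set is to let Spark resolve the jars from Maven instead of pinning them by hand; a minimal sketch, assuming network access to Maven Central:)

from pyspark.sql import SparkSession

# Let Spark pull hadoop-aws and its declared aws-java-sdk dependency
# (1.7.4 for the 2.7.x line) from Maven, instead of mixing hand-picked jars.
spark = SparkSession.builder.appName('JSON2parquet')\
            .config('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:2.7.3')\
            .config('spark.hadoop.fs.s3a.access.key', 'myAccessId')\
            .config('spark.hadoop.fs.s3a.secret.key', 'myAccessKey')\
            .getOrCreate()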

Is there any other .config I can use to solve this issue?

Thanks, Mosh.

Author: Moshik Mishaeli. Reproduced under the CC 4.0 BY-SA copyright license with a link to the original source and this disclaimer.
Link to original article: https://stackoverflow.com/questions/58120433/read-parquet-file-from-s3-using-pyspark-issue