Hadoop distcp to S3 performance is very slow

Ask Time:2019-12-06T01:32:53         Author:Hemanth

I am trying to copy data from HDFS to Amazon S3 using hadoop distcp. The amount of data is 227GB, and the job has been running for more than 12 hours.

Is there a hard limit of 3500 write requests per second for an S3 bucket, and could this be causing the slowdown? Is there a workaround for this? Or could the performance be increased in any other way?

Below is my command:

hadoop distcp \
  -Dfs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider \
  -Dfs.s3a.access.key=KEY \
  -Dfs.s3a.secret.key=SECRET \
  -Dfs.s3a.session.token=TOKEN \
  -Dfs.s3a.server-side-encryption-algorithm=SSE-KMS \
  -Dfs.s3a.server-side-encryption-key=enc-key \
  -Dmapreduce.job.queuename=default \
  -Ddistcp.dynamic.split.ratio=4 \
  -Ddistcp.dynamic.recordsPerChunk=25 \
  -Ddistcp.dynamic.max.chunks.tolerable=20000 \
  -strategy dynamic -i -numListstatusThreads 40 -m 300 -update -delete \
  /data/prod/hdp/brm s3a://bucket/data/prod/hdp/brm

There are a lot of small files; the average file size is ~300KB. I had to launch the job twice. The first time, it failed with a lot of mappers throwing errors like this:

Caused by: org.apache.hadoop.fs.s3a.AWSS3IOException: getFileStatus on s3a://bucket/data/prod/hdp/brm/.distcp.tmp.attempt_1574118601834_3172_m_000000_0: com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request;

Then I realized that having more prefixes would help, so I launched a new job that went a couple of levels deeper (from /data/prod/hdp/brm to /data/prod/hdp/brm/dataout/enabled), because /data/prod/hdp/brm/dataout/enabled has about 10 directories, which I thought would increase the achievable write request rate. The job is now running without any issues, but the performance is really bad.
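
To put the numbers in perspective, here is a rough back-of-envelope estimate of the copy rate. It is only a sketch: it assumes decimal units, that the whole 227GB consists of ~300KB files, and that the job has run for exactly 12 hours. Since the job has not finished yet, the result is an upper bound on the average rate so far. The function name `estimate` is just for illustration.

```python
# Back-of-envelope estimate; all inputs are the approximate figures quoted in this post.
TOTAL_BYTES = 227 * 10**9      # 227GB, assuming decimal gigabytes
AVG_FILE_BYTES = 300 * 10**3   # ~300KB average file size
ELAPSED_SECONDS = 12 * 3600    # the job has run for at least 12 hours
S3_WRITE_LIMIT = 3500          # the per-second write request figure asked about


def estimate(total_bytes, avg_file_bytes, elapsed_seconds):
    """Return (approximate file count, upper bound on files copied per second)."""
    file_count = total_bytes / avg_file_bytes
    return file_count, file_count / elapsed_seconds


file_count, files_per_second = estimate(TOTAL_BYTES, AVG_FILE_BYTES, ELAPSED_SECONDS)
print(f"~{file_count:,.0f} files, at most ~{files_per_second:.1f} files/s")
print(f"requests per file needed to reach the limit: ~{S3_WRITE_LIMIT / files_per_second:.0f}")
```

Under these assumptions that is roughly 757,000 files at no more than about 17.5 files per second, so each file would need around 200 write requests before the job came anywhere near 3500 requests per second.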

Any help would be appreciated. Thank you.

Author:Hemanth, reproduced under the CC 4.0 BY-SA copyright license with a link to the original source and this disclaimer.
Link to original article:https://stackoverflow.com/questions/59200453/hadoop-distcp-to-s3-performance-is-very-slow