I am trying to copy data from HDFS to Amazon S3 using hadoop distcp. The dataset is 227 GB, and the job has been running for more than 12 hours.
Is there a hard limit of 3,500 write (PUT) requests per second per prefix on an S3 bucket, and could this be causing the slowdown? Is there a workaround, or could the performance be improved in some other way?
Below is my command:
hadoop distcp \
  -Dfs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider \
  -Dfs.s3a.access.key=KEY \
  -Dfs.s3a.secret.key=SECRET \
  -Dfs.s3a.session.token=TOKEN \
  -Dfs.s3a.server-side-encryption-algorithm=SSE-KMS \
  -Dfs.s3a.server-side-encryption-key=enc-key \
  -Dmapreduce.job.queuename=default \
  -Ddistcp.dynamic.split.ratio=4 \
  -Ddistcp.dynamic.recordsPerChunk=25 \
  -Ddistcp.dynamic.max.chunks.tolerable=20000 \
  -strategy dynamic -i -numListstatusThreads 40 -m 300 \
  -update -delete \
  /data/prod/hdp/brm s3a://bucket/data/prod/hdp/brm
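For what it's worth, here is a variant I am considering for the next run, with the s3a client opened up for more concurrent uploads of small objects. fs.s3a.connection.maximum and fs.s3a.threads.max are standard s3a properties, but the values below are untested guesses on my part:

hadoop distcp \
  -Dfs.s3a.connection.maximum=200 \
  -Dfs.s3a.threads.max=64 \
  ... (same credential, encryption, and distcp options and paths as above)

Would bumping these client-side limits even matter if the bottleneck is the per-prefix request rate on the S3 side?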
There are a lot of small files; the average file size is ~300 KB. I had to launch the job twice. The first run failed, with many mappers throwing errors like this:
Caused by: org.apache.hadoop.fs.s3a.AWSS3IOException: getFileStatus on s3a://bucket/data/prod/hdp/brm/.distcp.tmp.attempt_1574118601834_3172_m_000000_0: com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request;
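(For reference, I estimated the file count and average size with hdfs dfs -count, which prints the directory count, file count, and total bytes under a path:)

hdfs dfs -count /data/prod/hdp/brm
# columns: DIR_COUNT  FILE_COUNT  CONTENT_SIZE  PATHNAME
# average file size = CONTENT_SIZE / FILE_COUNT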
Then I realized that having more prefixes should help, so I launched a new job a couple of levels deeper in the tree (from /data/prod/hdp/brm down to /data/prod/hdp/brm/dataout/enabled), since /data/prod/hdp/brm/dataout/enabled has about 10 subdirectories, which I expected would spread the writes across more prefixes and raise the overall allowed request rate. The new job is running without errors, but the performance is still really bad.
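In case it is relevant, the split I have in mind for a next attempt is one distcp job per subdirectory, run in parallel, so each job writes under its own S3 prefix. This is just a sketch: DISTCP_OPTS stands in for all the credential/encryption -D flags from the full command above, and hdfs dfs -ls -C prints only the child paths:

# one distcp per subdirectory, each writing under its own S3 prefix
DISTCP_OPTS="-Dmapreduce.job.queuename=default"   # plus the credential/encryption flags from above

for dir in $(hdfs dfs -ls -C /data/prod/hdp/brm/dataout/enabled); do
  hadoop distcp $DISTCP_OPTS -update "$dir" "s3a://bucket${dir}" &
done
wait   # block until all parallel copies finish

Would running ~10 jobs in parallel like this actually help, or does distcp already spread its writes across those prefixes within a single job?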
Any help would be appreciated. Thank you.