
DistCp from Local Hadoop to Amazon S3

Asked: 2013-06-01T00:02:32 · Author: Mohamed


I'm trying to use DistCp to copy a folder from my local Hadoop cluster (CDH4) to my Amazon S3 bucket.

I use the following command:

hadoop distcp -log /tmp/distcplog-s3/ hdfs://nameserv1/tmp/data/sampledata  s3n://hdfsbackup/

hdfsbackup is the name of my Amazon S3 Bucket.

DistCp fails with an UnknownHostException. Judging by the stack trace, it is thrown while TokenCache collects delegation tokens and tries to resolve the bucket name as a host:

13/05/31 11:22:33 INFO tools.DistCp: srcPaths=[hdfs://nameserv1/tmp/data/sampledata]
13/05/31 11:22:33 INFO tools.DistCp: destPath=s3n://hdfsbackup/
        No encryption was performed by peer.
        No encryption was performed by peer.
13/05/31 11:22:35 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 54 for hadoopuser on ha-hdfs:nameserv1
13/05/31 11:22:35 INFO security.TokenCache: Got dt for hdfs://nameserv1; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:nameserv1, Ident: (HDFS_DELEGATION_TOKEN token 54 for hadoopuser)
        No encryption was performed by peer.
java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfsbackup
    at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:414)
    at org.apache.hadoop.security.SecurityUtil.buildDTServiceName(SecurityUtil.java:295)
    at org.apache.hadoop.fs.FileSystem.getCanonicalServiceName(FileSystem.java:282)
    at org.apache.hadoop.fs.FileSystem.collectDelegationTokens(FileSystem.java:503)
    at org.apache.hadoop.fs.FileSystem.addDelegationTokens(FileSystem.java:487)
    at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:130)
    at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:111)
    at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:85)
    at org.apache.hadoop.tools.DistCp.setup(DistCp.java:1046)
    at org.apache.hadoop.tools.DistCp.copy(DistCp.java:666)
    at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)
Caused by: java.net.UnknownHostException: hdfsbackup
    ... 14 more

I have the AWS access key ID and secret key configured in the core-site.xml on all nodes:

<!-- Amazon S3 -->
<property>
    <name>fs.s3.awsAccessKeyId</name>
    <value>MY-ID</value>
</property>

<property>
    <name>fs.s3.awsSecretAccessKey</name>
    <value>MY-SECRET</value>
</property>


<!-- Amazon S3N -->
<property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>MY-ID</value>
</property>

<property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>MY-SECRET</value>
</property>
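
Since DistCp runs through ToolRunner (visible in the stack trace above), the same credentials can in principle also be supplied per invocation with -D generic options, which would rule out a stale core-site.xml being picked up; I haven't verified that this changes the outcome:

hadoop distcp -Dfs.s3n.awsAccessKeyId=MY-ID -Dfs.s3n.awsSecretAccessKey=MY-SECRET hdfs://nameserv1/tmp/data/sampledata s3n://hdfsbackup/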

I'm able to copy files from HDFS using the cp command without any problem. The command below successfully copied the same HDFS folder to S3:

hadoop fs -cp hdfs://nameserv1/tmp/data/sampledata  s3n://hdfsbackup/
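
For completeness, the s3n filesystem also accepts credentials embedded directly in the URI, though I'd rather avoid that form since the secret ends up in logs and shell history (and a secret containing a slash breaks the URI):

hadoop distcp hdfs://nameserv1/tmp/data/sampledata s3n://MY-ID:MY-SECRET@hdfsbackup/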

I know there is an Amazon S3-optimized DistCp (s3distcp) available, but I don't want to use it because it doesn't support the update/overwrite options.
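
Concretely, the incremental behaviour I want to keep is along these lines (same source and destination as above; untested here, since plain DistCp is what fails):

hadoop distcp -update hdfs://nameserv1/tmp/data/sampledata s3n://hdfsbackup/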

Author: Mohamed. Reproduced under the CC 4.0 BY-SA license, with a link to the original source and this disclaimer.
Link to original article:https://stackoverflow.com/questions/16861458/distcp-from-local-hadoop-to-amazon-s3