
Hadoop distcp - possible to keep each file identical (retain file size)?

Ask Time: 2017-06-18T16:18:08         Author: pl0u


When I run a simple distcp command:

hadoop distcp s3://src-bucket/src-dir s3://dest-bucket/dest-dir 

I get a slight discrepancy in the total size (in bytes) of src-dir and dest-dir:

>aws s3 ls --summarize s3://src-bucket/src-dir/
...
Total Objects: 12290
   Total Size: 64911104881181

>aws s3 ls --summarize s3://dest-bucket/dest-dir/
...
Total Objects: 12290
   Total Size: 64901040284124

My questions are:

  1. What could have introduced this discrepancy? Is the content of my dest dir still the same as the original?
  2. Most importantly - are there parameters I can set to ensure each file looks exactly the same as its src counterpart (i.e. same file size)?
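
To narrow down which objects account for the difference, one rough approach (assuming the aws CLI has list access to both buckets) is to diff the sorted per-object sizes of the two prefixes:

# Keep only the size column (field 3 of `aws s3 ls` output) and sort numerically.
aws s3 ls --recursive s3://src-bucket/src-dir/ | awk '{print $3}' | sort -n > src-sizes.txt
aws s3 ls --recursive s3://dest-bucket/dest-dir/ | awk '{print $3}' | sort -n > dest-sizes.txt

# Sizes appearing on only one side belong to objects with no same-size
# counterpart on the other side.
diff src-sizes.txt dest-sizes.txt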

Author: pl0u. Reproduced under the CC BY-SA 4.0 license with a link to the original source and this disclaimer.
Link to original article: https://stackoverflow.com/questions/44613038/hadoop-distcp-possible-to-keep-each-file-identical-retain-file-size
Chris Nauroth:

  What could have introduced this discrepancy? Is the content of my dest dir still the same as the original?

Is it possible that there was concurrent write activity happening in src-dir at the same time that the DistCp was running? For example, was there a file open for write in src-dir by some other application, and was that application writing content to the file while the DistCp ran?

Eventual-consistency effects at S3 can also come into play, particularly around updates of existing objects. If an application overwrites an existing object, then there is a window of time afterward in which applications reading that object might see the old version or the new version. More details on this are available in the AWS documentation of the Amazon S3 Data Consistency Model.

  Most importantly - are there parameters I can set to ensure each file looks exactly the same as its src counterpart (i.e. same file size)?

In general, DistCp performs a CRC check of each source file against the new copy at the destination to confirm that it was copied correctly. I noticed you are using the S3 file system instead of HDFS, though. For S3, like many of the alternative file systems, there is a limitation that this CRC verification cannot be performed.

As an added note, the S3FileSystem (URIs with s3:// for the scheme) is effectively deprecated, unmaintained by the Apache Hadoop community, and poorly supported. If possible, we recommend that users migrate to the S3AFileSystem (URIs with s3a:// for the scheme) for improved features, performance, and support. There are more details in the Integration with Amazon Web Services documentation.

If you cannot find an explanation for the behavior you are seeing with s3://, then it is possible there is a bug lurking there, and you might be better served trying s3a://. (If you have existing data that was already written using s3://, though, you'd need to figure out some kind of migration for that data first, such as by copying from an s3:// URI to an equivalent s3a:// URI.)
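
As a rough sketch of that migration and of a size-verifying re-copy (the bucket and path names here are placeholders; this assumes the hadoop-aws S3A connector is on the classpath with credentials configured):

# One-time migration off the deprecated s3:// connector.
hadoop distcp s3://legacy-bucket/data s3a://legacy-bucket/data-migrated

# Re-sync: -update skips files that already match at the destination and
# re-copies the rest; between object stores, where checksums cannot be
# compared, the comparison falls back to file size.
hadoop distcp -update s3a://src-bucket/src-dir s3a://dest-bucket/dest-dir

-update is a standard DistCp flag; the size-based fallback is what would catch mismatches like the one shown in the question.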
Answer Time: 2017-06-19T17:25:11