
Hadoop streaming accessing files in a directory

Ask Time:2014-07-04T22:03:25         Author:schoon


I wish to access a directory in Hadoop (via Python streaming) and loop through its image files, calculating a hash of each in my mapper. Does the following logic make sense (and, instead of hard-coding the path, can I pass the directory to Hadoop as e.g. -input)?

import glob
import pHash  # pHash bindings providing imagehash()

lotsdir = 'hdfs://localhost:54310/user/hduser/randomimages/'
path = lotsdir + '*.*'
files = glob.glob(path)
files.sort()

imagehashes = {}
for fname in files:
    imagehashes[fname] = pHash.imagehash(fname)
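On the -input question: glob.glob reads the local filesystem and will not expand an hdfs:// pattern, but a common streaming workaround is to pass a plain text file listing the image paths as -input, so each mapper receives one path per line on stdin. A minimal sketch of such a mapper follows; hash_record and the use of md5 in place of pHash.imagehash are illustrative assumptions, not part of the original post:

```python
import hashlib
import subprocess
import sys

def hash_record(path, data):
    """Return a tab-separated 'path<TAB>hash' line for one image's bytes.

    md5 is used here only as a stand-in for pHash.imagehash, which
    operates on a local file path rather than raw bytes.
    """
    return "%s\t%s" % (path, hashlib.md5(data).hexdigest())

def run_mapper(stdin=sys.stdin, stdout=sys.stdout):
    # Hadoop streaming feeds the mapper the *contents* of the -input
    # file(s) on stdin; here each line is assumed to be one HDFS path.
    for line in stdin:
        path = line.strip()
        if not path:
            continue
        # Pull the raw bytes out of HDFS; 'hadoop fs -cat' is one
        # straightforward (if not the fastest) way from a streaming job.
        data = subprocess.check_output(["hadoop", "fs", "-cat", path])
        stdout.write(hash_record(path, data) + "\n")
```

Such a script would be submitted in the usual streaming way, e.g. with -input pointing at the path-listing file, -mapper pointing at the script, and -file shipping it to the cluster.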

Author: schoon, reproduced under the CC 4.0 BY-SA license with a link to the original source and this disclaimer.
Link to original article: https://stackoverflow.com/questions/24576064/hadoop-streaming-accessing-files-in-a-directory
Yann:

Yes, the logic makes sense.

However, you will very likely run into a performance issue, since your input files are not in text format and therefore will not be split properly on HDFS.

Fortunately, Hadoop provides several ways to address this. For instance, you could either:

- convert your image files into a SequenceFile and store them in HDFS, or
- write your own InputFormat, OutputFormat, and RecordReader so that they are split properly.
2015-06-08T07:30:24