I wish to access a directory in Hadoop (via Python streaming) and loop through its image files, calculating hashes of each in my mapper.
Does the following logic make sense, and can I pass the directory to Hadoop (e.g. via -input) instead of hard-coding it?
import glob
import pHash  # pHash bindings assumed installed

lotsdir = 'hdfs://localhost:54310/user/hduser/randomimages/'
path = lotsdir + '*.*'
files = glob.glob(path)
files.sort()

imagehashes = {}
for fname in files:
    imagehashes[fname] = pHash.imagehash(fname)
Yann:
Yes, the logic makes sense.

But you will very likely hit a performance issue, since your input files are not in text format and so will not be split properly on HDFS.

Fortunately, Hadoop provides several ways to fix that. For instance, you could either:

- convert your image files into a SequenceFile and store that in HDFS, or
- write your own InputFormat, OutputFormat and RecordReader in order to split them properly.
2015-06-08T07:30:24
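As a hedged sketch of the streaming approach implied above: instead of globbing an `hdfs://` path (which `glob` cannot read), the mapper receives one record per line on stdin when the directory is passed via `-input`. The names `image_hash`, `run_mapper`, and `read_bytes` are hypothetical, and `hashlib.md5` stands in for `pHash.imagehash` from the question:

```python
import hashlib


def image_hash(data):
    # Stand-in for pHash.imagehash from the question: an MD5 digest
    # over the raw bytes. Swap in a perceptual hash for real images.
    return hashlib.md5(data).hexdigest()


def run_mapper(lines, read_bytes):
    # Hadoop streaming hands the mapper one record per stdin line.
    # Emit "filename<TAB>hash" pairs, the tab-separated key/value
    # layout that streaming expects from a mapper.
    for line in lines:
        fname = line.strip()
        if not fname:
            continue
        yield "%s\t%s" % (fname, image_hash(read_bytes(fname)))
```

In a real job the mapper would iterate over `sys.stdin` and `read_bytes` would fetch the file contents from HDFS (e.g. via an HDFS client library); here it is just a hook so the logic can be exercised locally.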