Clustering of key-value pairs

Ask Time:2013-07-17T02:32:03         Author:user2365015

I have a very large set (millions) of key-value pairs, where each key is a unique id and each value is a string (two or more keys may have identical strings). I have to partition these pairs into groups: group 1 contains some id-string pairs, group 2 contains some other pairs, and so on. The grouping must be based on the similarity between the strings, i.e. the values of the pairs. I have already implemented Levenshtein distance between the strings and grouped together the pairs whose distance is less than a threshold. I implemented it the traditional (very bad) way: comparing each string with every other string.
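For reference, the brute-force approach described above might look roughly like the sketch below. This is not my actual code: the function names (`levenshtein`, `group_pairs`) are made up for illustration, and merging groups transitively with a union-find is an assumption about how "grouping" is done. The only shortcut it takes is collapsing identical strings before comparing, which is safe because identical strings have distance 0.

```python
from collections import defaultdict
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping only two rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def group_pairs(pairs: dict[str, str], threshold: int) -> list[list[str]]:
    """Group ids whose strings are within `threshold` edits (hypothetical helper)."""
    # Collapse identical strings so each distinct value is compared once.
    ids_by_value = defaultdict(list)
    for key, value in pairs.items():
        ids_by_value[value].append(key)
    values = list(ids_by_value)

    # Union-find over distinct values (assumed: groups merge transitively).
    parent = list(range(len(values)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Still quadratic in the number of distinct strings.
    for i, j in combinations(range(len(values)), 2):
        if levenshtein(values[i], values[j]) < threshold:
            parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i, value in enumerate(values):
        groups[find(i)].extend(ids_by_value[value])
    return [sorted(ids) for ids in groups.values()]
```

With millions of distinct strings the pairwise loop is the bottleneck, which is the part I am trying to avoid.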

I need some tips on how to optimize this. Can I actually group key-value pairs together using Map-Reduce in Hadoop? As I understand it, the inputs to the map and reduce functions are individual and independent, and hence can't be 'grouped' together. Is this a k-means clustering problem? Can you suggest other faster, more efficient techniques? Thanks.

Author:user2365015, reproduced under the CC 4.0 BY-SA copyright license with a link to the original source and this disclaimer.
Link to original article:https://stackoverflow.com/questions/17684347/clustering-of-key-value-pairs