The best academic resource I've found on the subject is Chapter 3 of Mining of Massive Datasets, which gives an awesome overview of locality sensitive hashing and minhashing.
Very briefly, the idea is to take several strings, vectorize those strings, then pass a sliding window over the resulting vectors. If two vectors have the same value in the same window position, mark them as candidates for more fine-grained similarity analysis.
There's a great implementation in the Python datasketch library (pip install datasketch
). Here's an example that shows you can catch fuzzy string similarity:
from datasketch import MinHash, MinHashLSH
from nltk import ngrams
data = ['minhash is a probabilistic data structure for estimating the similarity between datasets',
'finhash dis fa frobabilistic fata ftructure for festimating the fimilarity fetween fatasets',
'weights controls the relative importance between minizing false positive',
'wfights cfntrols the rflative ifportance befween minizing fflse posftive',
]
# Create an MinHashLSH index optimized for Jaccard threshold 0.5,
# that accepts MinHash objects with 128 permutations functions
lsh = MinHashLSH(threshold=0.4, num_perm=128)
# Create MinHash objects
minhashes = {}
for c, i in enumerate(data):
minhash = MinHash(num_perm=128)
for d in ngrams(i, 3):
minhash.update("".join(d).encode('utf-8'))
lsh.insert(c, minhash)
minhashes[c] = minhash
for i in xrange(len(minhashes.keys())):
result = lsh.query(minhashes[i])
print "Candidates with Jaccard similarity > 0.4 for input", i, ":", result
This returns:
Candidates with Jaccard similarity > 0.4 for input 0 : [0, 1]
Candidates with Jaccard similarity > 0.4 for input 1 : [0, 1]
Candidates with Jaccard similarity > 0.4 for input 2 : [2, 3]
Candidates with Jaccard similarity > 0.4 for input 3 : [2, 3]
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…