Since you didn't indicate whether you want word- or character-level n-grams, I'll assume the former, without loss of generality. I also assume you start with a list of tokens, represented by strings. You can easily write the n-gram extraction yourself:
def ngrams(tokens, MIN_N, MAX_N):
    """Yield every n-gram of length MIN_N through MAX_N as a token slice."""
    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            yield tokens[i:j]
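For example, with a hypothetical five-token input (the sample sentence is made up for illustration):

tokens = "the quick brown fox jumps".split()
for gram in ngrams(tokens, 2, 3):
    print(gram)
# ['the', 'quick'], ['the', 'quick', 'brown'], ['quick', 'brown'], ...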
Then replace the yield with the actual action you want to take on each n-gram (add it to a dict, store it in a database, whatever) to get rid of the generator overhead.
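For instance, the dict-counting variant might look like this in pure Python (a sketch; the function name ngram_counts and the space-joined keys are my choices, and the Cython version below is essentially this same code plus type declarations):

from collections import defaultdict

def ngram_counts(tokens, MIN_N, MAX_N):
    """Count n-grams directly instead of yielding them."""
    count = defaultdict(int)
    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            # The space-joined token slice serves as the dict key.
            count[" ".join(tokens[i:j])] += 1
    return count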
Finally, if it's really not fast enough, convert the above to Cython and compile it. Example using a defaultdict instead of yield:
from collections import defaultdict

def ngrams(tokens, int MIN_N, int MAX_N):
    """Count n-grams of length MIN_N through MAX_N, keyed by their space-joined form."""
    cdef Py_ssize_t i, j, n_tokens

    count = defaultdict(int)
    join_spaces = " ".join  # bind the method once to avoid repeated attribute lookups

    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            count[join_spaces(tokens[i:j])] += 1
    return count
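To compile it, put the function in a .pyx file and use a standard Cython build script; the file name ngrams.pyx here is just a placeholder:

# setup.py: minimal build script for the Cython module
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("ngrams.pyx"))

After running python setup.py build_ext --inplace, the compiled module imports like any other Python module.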