The Problem
I've been playing around with different ways (in Python 2.7) to extract a list of (word, frequency) tuples from a corpus, or list of strings, and comparing their efficiency. As far as I can tell, in the normal case with an unsorted list, the Counter
method from the collections
module is superior to anything I've come up with or found elsewhere, but it doesn't seem to take much advantage of the benefits of a pre-sorted list and I've come up with methods that beat it easily in this special case. So, in short, is there any built-in way to inform Counter
of the fact that a list is already sorted to further speed it up?
(The next section is on unsorted lists where Counter works magic; you may want to skip to towards the end where it looses its charm when dealing with sorted lists.)
Unsorted input lists
One approach that doesn't work
The naive approach would be to use sorted([(word, corpus.count(word)) for word in set(corpus)])
, but that one reliably gets you into runtime problems as soon as your corpus is a few thousand items long - not surprisingly since you're running through the entire list of n words m many times, where m is the number of unique words.
Sorting the list + local search
So what I tried to do instead before I found Counter
was make sure that all searches are strictly local by first sorting the input list (I also have to remove digits and punctuation marks and convert all entries into lower case to avoid duplicates like 'foo', 'Foo', and 'foo:').
#Natural Language Toolkit, for access to corpus; any other source for a long text will do, though.
import nltk
# nltk corpora come as a class of their own, as I udnerstand it presenting to the
# outside as a unique list but underlyingly represented as several lists, with no more
# than one ever loaded into memory at any one time, which is good for memory issues
# but rather not so for speed so let's disable this special feature by converting it
# back into a conventional list:
corpus = list(nltk.corpus.gutenberg.words())
import string
drop = string.punctuation+string.digits
def wordcount5(corpus, Case=False, lower=False, StrippedSorted=False):
'''function for extracting word frequencies out of a corpus. Returns an alphabetic list
of tuples consisting of words contained in the corpus with their frequencies.
Default is case-insensitive, but if you need separate entries for upper and lower case
spellings of the same words, set option Case=True. If your input list is already sorted
and stripped of punctuation marks/digits and/or all lower case, you can accelerate the
operation by a factor of 5 or so by declaring so through the options "Sorted" and "lower".'''
# you can ignore the following 6 lines for now, they're only relevant with a pre-processed input list
if lower or Case:
if StrippedSorted:
sortedc = corpus
else:
sortedc = sorted([word.replace('--',' ').strip(drop)
for word in sorted(corpus)])
# here we sort and purge the input list in the default case:
else:
sortedc = sorted([word.lower().replace('--',' ').strip(drop)
for word in sorted(corpus)])
# start iterating over the (sorted) input list:
scindex = 0
# create a list:
freqs = []
# identify the first token:
currentword = sortedc[0]
length = len(sortedc)
while scindex < length:
wordcount = 0
# increment a local counter while the tokens == currentword
while scindex < length and sortedc[scindex] == currentword:
scindex += 1
wordcount += 1
# store the current word and final score when a) a new word appears or
# b) the end of the list is reached
freqs.append((currentword, wordcount))
# if a): update currentword with the current token
if scindex < length:
currentword = sortedc[scindex]
return freqs
Finding collections.Counter
This is much better, but still not quite as fast as using the Counter class from the collections module, which creates a dictionary of {word: frequency of word} entries (we still have to do the same stripping and lowering, but no sorting):
from collections import Counter
cnt = Counter()
for word in [token.lower().strip(drop) for token in corpus]:
cnt[word] += 1
# optionally, if we want to have the same output format as before,
# we can do the following which negligibly adds in runtime:
wordfreqs = sorted([(word, cnt[word]) for word in cnt])
On the Gutenberg corpus with appr. 2 million entries, the Counter method is roughly 30% faster on my machine (5 seconds as opposed to 7.2), which is mostly explained through the sorting routine which eats around 2.1 seconds (if you don't have and don't want to install the nltk package (Natural Language Toolkit) which offers access to this corpus, any other adequately long text appropriately split into a list of strings at word level will show you the same.)
Comparing performance
With my idiosyncratic method of timing using the tautology as a conditional to delay execution, this gives us for the counter method:
import time
>>> if 1:
... start = time.time()
... cnt = Counter()
... for word in [token.lower().strip(drop) for token in corpus if token not in [" ", ""]]:
... cnt[word] += 1
... time.time()-start
... cntgbfreqs = sorted([(word, cnt[word]) for word in cnt])
... time.time()-start
...
4.999882936477661
5.191655874252319
(We see that the last step, that of formatting the results as a list of tuples, takes up less than 5% of the total time.)
Compared to my function:
>>> if 1:
... start = time.time()
... gbfreqs = wordcount5(corpus)
... time.time()-start
...
7.261770963668823
Sorted input lists - when Counter
'fails'
However, as you may have noticed, my function allows to specify that the input is already sorted, stripped of punctuational garbage, and converted to lowercase. If we already have created such a converted version of the list for some other operations, using it (and declaring so) can very much speed up the operation of my wordcount5
:
>>> sorted_corpus = sorted([token.lower().strip(drop) for token in corpus if token not in [" ", ""]])
>>> if 1:
... start = time.time()
... strippedgbfreqs2 = wordcount5(sorted_corpus, lower=True, StrippedSorted=True)
... time.time()-start
...
0.9050078392028809
Here, we've reduced runtime by a factor of appr. 8 by not having to sort the corpus and convert the items. Of course the latter is also true when feeding Counter
with this new list, so expectably it's also a bit faster, but it doesn't seem to take advantage of the fact that it is sorted and now takes twice as long as my function where it was 30% faster before:
>>> if 1:
... start = time.time()
... cnt = Counter()
... for word in sorted_corpus:
... cnt[word] += 1
... time.time()-start
... strippedgbfreqs = [(word, cnt[word])for word in cnt]
... time.time()-start
...
1.9455058574676514
2.0096349716186523
Of course, we can use the same logic I used in wordcount5
- incrementing a local counter until we run into a new word and only then storing the last word with the current state of the counter, and resetting the counter to 0 for the next word - only using Counter
as storage, but the inherent efficiency of the Counter
method seems lost, and performance is within the range of my function's for creating a dictionary, with the extra burden of converting to a list of tuples now looking more troublesome than it used to when we were processing the raw corpus:
>>> def countertest():
... start = time.time()
... sortedcnt = Counter()
... c = 0
... length = len(sorted_corpus)
... while c < length:
... wcount = 0
... word = sorted_corpus[c]
... while c < length and sorted_corpus[c] == word:
... wcount+=1
... c+=1
... sortedcnt[word] = wcount
... if c < length:
... word = sorted_corpus[c]
... print time.time()-start
... return sorted([(word, sortedcnt[word]) for word in sortedcnt])
... print time.time()-start
...
>>> strippedbgcnt = countertest()
0.920727014542
1.08029007912
(The similarity of the results is not really surprising since we're in effect disabling Counter
's own methods and abusing it as a store for values obtained with the very same methodology as before.)
Therefore, my question: Is there a more idiomatic way to inform Counter
that its input list is already sorted and to make it thus keep the current key in memory rather than looking it up anew every time it - predictably - encounters the next token of the same word? In other words, is it possible to improve performance on a pre-sorted list further by combining the inherent efficiency of the Counter
/dictionary
class with the obvious benefits of a sorted list, or am I already scratching on a hard limit with .9 seconds for tallying a list of 2M entries?
There probably isn't a whole lot of room for improvement - I get times of around .55 seconds when doing the simplest thing I can think of that still requires iterating through the same list and checking each individual value, and .25 for set(corpus)
without a count, but maybe there's some itertools magic out there that would help to get close to those figures?
(Note: I'm a relative novice to Python and to programming in general, so excuse if I've missed something obvious.)
Edit Dec. 1:
Another thing, besides the sorting itself, which makes all of my methods above slow, is converting every single one of 2M strings into lowercase and stripping them of whatever punctuation or digits they may include. I have tried before to shortcut that by counting the unprocessed strings and only then converting the results and removing duplicates while adding up their counts, but I must have done something wrong for it made things ever so slightly slower. I therefore reverted to the previous versions, converting everything in the raw corpus, and now can't quite reconstruct what I did there.
If I try it now, I do get an improvement from converting the strings last. I'm still doing it by looping over a list (of results). What I did was write a couple of functions that would between them convert the keys in the output of J.F. Sebastian's winning default_dict method (of format [("word", int), ("Word", int)], ("word2", int),...]
) into lowercase and strip them of punctuation, and collapse the counts for all keys that were left identical after that operation (code below). The advantage is that we're now handling a list of roughly 50k entries as opposed to the > 2M in the corpus. This way I'm now at 1.25 seconds for going from the corpus (as a list) to a case insensitive wor