Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
248 views
in Technique[技术] by (71.8m points)

python - Is there a way to make collections.Counter (Python2.7) aware that its input list is sorted?

The Problem

I've been playing around with different ways (in Python 2.7) to extract a list of (word, frequency) tuples from a corpus, or list of strings, and comparing their efficiency. As far as I can tell, in the normal case with an unsorted list, the Countermethod from the collections module is superior to anything I've come up with or found elsewhere, but it doesn't seem to take much advantage of the benefits of a pre-sorted list and I've come up with methods that beat it easily in this special case. So, in short, is there any built-in way to inform Counter of the fact that a list is already sorted to further speed it up?

(The next section is on unsorted lists where Counter works magic; you may want to skip to towards the end where it looses its charm when dealing with sorted lists.)

Unsorted input lists

One approach that doesn't work

The naive approach would be to use sorted([(word, corpus.count(word)) for word in set(corpus)]), but that one reliably gets you into runtime problems as soon as your corpus is a few thousand items long - not surprisingly since you're running through the entire list of n words m many times, where m is the number of unique words.

Sorting the list + local search

So what I tried to do instead before I found Counter was make sure that all searches are strictly local by first sorting the input list (I also have to remove digits and punctuation marks and convert all entries into lower case to avoid duplicates like 'foo', 'Foo', and 'foo:').

#Natural Language Toolkit, for access to corpus; any other source for a long text will do, though.
import nltk 

# nltk corpora come as a class of their own, as I udnerstand it presenting to the
# outside as a unique list but underlyingly represented as several lists, with no more
# than one ever loaded into memory at any one time, which is good for memory issues 
# but rather not so for speed so let's disable this special feature by converting it
# back into a conventional list:
corpus = list(nltk.corpus.gutenberg.words()) 

import string
drop = string.punctuation+string.digits  

def wordcount5(corpus, Case=False, lower=False, StrippedSorted=False):
    '''function for extracting word frequencies out of a corpus. Returns an alphabetic list
    of tuples consisting of words contained in the corpus with their frequencies.  
    Default is case-insensitive, but if you need separate entries for upper and lower case 
    spellings of the same words, set option Case=True. If your input list is already sorted
    and stripped of punctuation marks/digits and/or all lower case, you can accelerate the 
    operation by a factor of 5 or so by declaring so through the options "Sorted" and "lower".'''
    # you can ignore the following 6 lines for now, they're only relevant with a pre-processed input list
    if lower or Case:
        if StrippedSorted:
            sortedc = corpus 
        else:    
            sortedc = sorted([word.replace('--',' ').strip(drop)
                   for word in sorted(corpus)])
    # here we sort and purge the input list in the default case:
    else:
            sortedc = sorted([word.lower().replace('--',' ').strip(drop)
                   for word in sorted(corpus)])
    # start iterating over the (sorted) input list:
    scindex = 0
    # create a list:
    freqs = []
    # identify the first token:
    currentword = sortedc[0]
    length = len(sortedc)
    while scindex < length:
        wordcount = 0
        # increment a local counter while the tokens == currentword
        while scindex < length and sortedc[scindex] == currentword:
            scindex += 1
            wordcount += 1
        # store the current word and final score when a) a new word appears or
        # b) the end of the list is reached
        freqs.append((currentword, wordcount))
        # if a): update currentword with the current token
        if scindex < length:
            currentword = sortedc[scindex]
    return freqs

Finding collections.Counter

This is much better, but still not quite as fast as using the Counter class from the collections module, which creates a dictionary of {word: frequency of word} entries (we still have to do the same stripping and lowering, but no sorting):

from collections import Counter
cnt = Counter()
for word in [token.lower().strip(drop) for token in corpus]:
    cnt[word] += 1
# optionally, if we want to have the same output format as before,
# we can do the following which negligibly adds in runtime:
wordfreqs = sorted([(word, cnt[word]) for word in cnt])

On the Gutenberg corpus with appr. 2 million entries, the Counter method is roughly 30% faster on my machine (5 seconds as opposed to 7.2), which is mostly explained through the sorting routine which eats around 2.1 seconds (if you don't have and don't want to install the nltk package (Natural Language Toolkit) which offers access to this corpus, any other adequately long text appropriately split into a list of strings at word level will show you the same.)

Comparing performance

With my idiosyncratic method of timing using the tautology as a conditional to delay execution, this gives us for the counter method:

import time
>>> if 1:
...     start = time.time()
...     cnt = Counter()
...     for word in [token.lower().strip(drop) for token in corpus if token not in [" ", ""]]:
...         cnt[word] += 1
...     time.time()-start
...     cntgbfreqs = sorted([(word, cnt[word]) for word in cnt])
...     time.time()-start
... 
4.999882936477661
5.191655874252319

(We see that the last step, that of formatting the results as a list of tuples, takes up less than 5% of the total time.)

Compared to my function:

>>> if 1:
...     start = time.time()
...     gbfreqs = wordcount5(corpus)
...     time.time()-start
... 
7.261770963668823

Sorted input lists - when Counter 'fails'

However, as you may have noticed, my function allows to specify that the input is already sorted, stripped of punctuational garbage, and converted to lowercase. If we already have created such a converted version of the list for some other operations, using it (and declaring so) can very much speed up the operation of my wordcount5:

>>> sorted_corpus = sorted([token.lower().strip(drop) for token in corpus if token not in [" ", ""]])
>>> if 1:
...     start = time.time()
...     strippedgbfreqs2 = wordcount5(sorted_corpus, lower=True, StrippedSorted=True)
...     time.time()-start
... 
0.9050078392028809

Here, we've reduced runtime by a factor of appr. 8 by not having to sort the corpus and convert the items. Of course the latter is also true when feeding Counter with this new list, so expectably it's also a bit faster, but it doesn't seem to take advantage of the fact that it is sorted and now takes twice as long as my function where it was 30% faster before:

>>> if 1:
...     start = time.time()
...     cnt = Counter()
...     for word in sorted_corpus:
...         cnt[word] += 1
...     time.time()-start
...     strippedgbfreqs = [(word, cnt[word])for word in cnt]
...     time.time()-start
... 
1.9455058574676514
2.0096349716186523

Of course, we can use the same logic I used in wordcount5 - incrementing a local counter until we run into a new word and only then storing the last word with the current state of the counter, and resetting the counter to 0 for the next word - only using Counter as storage, but the inherent efficiency of the Counter method seems lost, and performance is within the range of my function's for creating a dictionary, with the extra burden of converting to a list of tuples now looking more troublesome than it used to when we were processing the raw corpus:

>>> def countertest():
...     start = time.time()
...     sortedcnt = Counter()
...     c = 0
...     length = len(sorted_corpus)
...     while c < length:
...         wcount = 0
...         word = sorted_corpus[c]
...         while c < length and sorted_corpus[c] == word:
...             wcount+=1
...             c+=1
...         sortedcnt[word] = wcount
...         if c < length:
...             word = sorted_corpus[c]
...     print time.time()-start
...     return sorted([(word, sortedcnt[word]) for word in sortedcnt])
...     print time.time()-start
... 
>>> strippedbgcnt = countertest()
0.920727014542
1.08029007912

(The similarity of the results is not really surprising since we're in effect disabling Counter's own methods and abusing it as a store for values obtained with the very same methodology as before.)

Therefore, my question: Is there a more idiomatic way to inform Counter that its input list is already sorted and to make it thus keep the current key in memory rather than looking it up anew every time it - predictably - encounters the next token of the same word? In other words, is it possible to improve performance on a pre-sorted list further by combining the inherent efficiency of the Counter/dictionary class with the obvious benefits of a sorted list, or am I already scratching on a hard limit with .9 seconds for tallying a list of 2M entries?

There probably isn't a whole lot of room for improvement - I get times of around .55 seconds when doing the simplest thing I can think of that still requires iterating through the same list and checking each individual value, and .25 for set(corpus) without a count, but maybe there's some itertools magic out there that would help to get close to those figures?

(Note: I'm a relative novice to Python and to programming in general, so excuse if I've missed something obvious.)

Edit Dec. 1:

Another thing, besides the sorting itself, which makes all of my methods above slow, is converting every single one of 2M strings into lowercase and stripping them of whatever punctuation or digits they may include. I have tried before to shortcut that by counting the unprocessed strings and only then converting the results and removing duplicates while adding up their counts, but I must have done something wrong for it made things ever so slightly slower. I therefore reverted to the previous versions, converting everything in the raw corpus, and now can't quite reconstruct what I did there.

If I try it now, I do get an improvement from converting the strings last. I'm still doing it by looping over a list (of results). What I did was write a couple of functions that would between them convert the keys in the output of J.F. Sebastian's winning default_dict method (of format [("word", int), ("Word", int)], ("word2", int),...]) into lowercase and strip them of punctuation, and collapse the counts for all keys that were left identical after that operation (code below). The advantage is that we're now handling a list of roughly 50k entries as opposed to the > 2M in the corpus. This way I'm now at 1.25 seconds for going from the corpus (as a list) to a case insensitive wor


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Given a sorted list of words as you mention, have you tried the traditional Pythonic approach of itertools.groupby?

from itertools import groupby
some_data = ['a', 'a', 'b', 'c', 'c', 'c']
count = dict( (k, sum(1 for i in v)) for k, v in groupby(some_data) ) # or
count = {k:sum(1 for i in v) for k, v in groupby(some_data)}
# {'a': 2, 'c': 3, 'b': 1}

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...