Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
479 views
in Technique[技术] by (71.8m points)

python - Why is collections.Counter much slower than ''.count?

I have a simple task: To count how many times every letter occurs in a string. I've used a Counter() for it, but on one forum I saw information that using dict() / Counter() is much slower than using string.count() for every letter. I thought that it would interate through the string only once, and the string.count() solution would have to iterate through it four times (in this case). Why is Counter() so slow?

>>> timeit.timeit('x.count("A");x.count("G");x.count("C");x.count("T")', setup="x='GAAAAAGTCGTAGGGTTCCTTCACTCGAGGAATGCTGCGACAGTAAAGGAGGCCACGTGGTTGAGAGTTCCTAAGCATTCGTATGTACACCCGGACTCGATGCACTCAAACGTGCTTAAGGGTAAAGAAGGTCGAGAGGTATACTGGGGCACTCCCCTTAGAATTATATCTTGGTCAACTACAATATGGATGGAAATTCTAAGCCGAAAACGACCCGCTAGCGGATTGTGTATGTATCACAACGGTTTCGGTTCATACGCAAAATCATCCCATTTCAAGGCCACTCAAGGACATGACGCCGTGCAACTCCGAGGACATCCCTCAGCGATTGATGCAACCTGGTCATCTAATAATCCTTAGAACGGATGTGCCCTCTACTGGGAGAGCCGGCTAGACTGGCATCTCGCGTTGTTCGTACGAGCTCCGGGCGCCCGGGCGGTGTACGTTGATGTACAGCCTAAGAGCTTTCCACCTATGCTACGAACTAATTTCCCGTCCATCGTTCCTCGGACTGAGGTCAAAGTAACCCGGAAGTACATGGATCAGATACACTCACAGTCCCCTTTAATGACTGAGCTGGACGCTATTGATTGCTTTATAAGTGTTATGGTGAACTCGAAGACTTAGCTAGGAATTTCGCTATACCCGGGTAATGAGCTTAATACCTCACAGCATGTACGCTCTGAATATATGTAGCGATGCTAGCGGAACGTAAGCGTGAGCGTTATGCAGGGCTCCGCACCTCGTGGCCACTCGCCCAATGCCCGAGTTTTTGAGCAATGCCATGCCCTCCAGGTGAAGCGTGCTGAATATGTTCCGCCTCCGCACACCTACCCTACGGGCCTTACGCCATAGCTGAGGATACGCGAGTTGGTTAGCGATTACGTCATTCCAGGTGGTCGTTC'", number=10000)
0.07911698750407936
>>> timeit.timeit('Counter(x)', setup="from collections import Counter;x='GAAAAAGTCGTAGGGTTCCTTCACTCGAGGAATGCTGCGACAGTAAAGGAGGCCACGTGGTTGAGAGTTCCTAAGCATTCGTATGTACACCCGGACTCGATGCACTCAAACGTGCTTAAGGGTAAAGAAGGTCGAGAGGTATACTGGGGCACTCCCCTTAGAATTATATCTTGGTCAACTACAATATGGATGGAAATTCTAAGCCGAAAACGACCCGCTAGCGGATTGTGTATGTATCACAACGGTTTCGGTTCATACGCAAAATCATCCCATTTCAAGGCCACTCAAGGACATGACGCCGTGCAACTCCGAGGACATCCCTCAGCGATTGATGCAACCTGGTCATCTAATAATCCTTAGAACGGATGTGCCCTCTACTGGGAGAGCCGGCTAGACTGGCATCTCGCGTTGTTCGTACGAGCTCCGGGCGCCCGGGCGGTGTACGTTGATGTACAGCCTAAGAGCTTTCCACCTATGCTACGAACTAATTTCCCGTCCATCGTTCCTCGGACTGAGGTCAAAGTAACCCGGAAGTACATGGATCAGATACACTCACAGTCCCCTTTAATGACTGAGCTGGACGCTATTGATTGCTTTATAAGTGTTATGGTGAACTCGAAGACTTAGCTAGGAATTTCGCTATACCCGGGTAATGAGCTTAATACCTCACAGCATGTACGCTCTGAATATATGTAGCGATGCTAGCGGAACGTAAGCGTGAGCGTTATGCAGGGCTCCGCACCTCGTGGCCACTCGCCCAATGCCCGAGTTTTTGAGCAATGCCATGCCCTCCAGGTGAAGCGTGCTGAATATGTTCCGCCTCCGCACACCTACCCTACGGGCCTTACGCCATAGCTGAGGATACGCGAGTTGGTTAGCGATTACGTCATTCCAGGTGGTCGTTC'", number=10000)
2.1727447831030844
>>> 2.1727447831030844 / 0.07911698750407936
27.462430656767047
>>> 
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Counter() allows you to count any hashable objects, not just substrings. Both solutions are O(n)-time. Your measurements show that the overhead of iterating and hashing individual characters by Counter() is greater than running s.count() 4 times.

Counter() can use C helper to count elements but it seems it doesn't special case strings and uses general algorithm applicable for any other iterable i.e., processing a single character involves multiple Python C API calls to advance the iterator, get previous value (a lookup in the hash table), increment counter, set new value (a lookup in the hash table):

    while (1) {
        key = PyIter_Next(it);
        if (key == NULL)
            break;
        oldval = PyObject_GetItem(mapping, key);
        if (oldval == NULL) {
            if (!PyErr_Occurred() || !PyErr_ExceptionMatches(PyExc_KeyError))
                break;
            PyErr_Clear();
            Py_INCREF(one);
            newval = one;
        } else {
            newval = PyNumber_Add(oldval, one);
            Py_DECREF(oldval);
            if (newval == NULL)
                break;
        }
        if (PyObject_SetItem(mapping, key, newval) == -1)
            break;
        Py_CLEAR(newval);
        Py_DECREF(key);
    }

Compare it to FASTSEARCH() overhead for bytestrings:

    for (i = 0; i < n; i++)
        if (s[i] == p[0]) {
           count++;
           if (count == maxcount)
              return maxcount;
        }
    return count;

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...