The most succinct approach is to use the tools Python gives you.

    from future_builtins import map  # Only on Python 2

    from collections import Counter
    from itertools import chain

    def countInFile(filename):
        with open(filename) as f:
            return Counter(chain.from_iterable(map(str.split, f)))

That's it. map(str.split, f) creates a generator that yields a list of words for each line. Wrapping it in chain.from_iterable converts that to a single generator that produces one word at a time. Counter takes an input iterable and counts all unique values in it. At the end, you return a dict-like object (a Counter) that stores all unique words and their counts, and during creation you only hold one line of data at a time plus the running counts, not the whole file at once.
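To make the stages concrete, here is a minimal sketch that runs the same pipeline on an in-memory stand-in for a file. It assumes Python 3 (where map is already lazy), and the sample text is made up for illustration.

```python
import io
from collections import Counter
from itertools import chain

# Hypothetical sample data; io.StringIO iterates line by line like an open file.
f = io.StringIO("the quick brown fox\nthe lazy dog\n")

per_line = map(str.split, f)           # lazily yields one list of words per line
words = chain.from_iterable(per_line)  # flattens those lists into one word at a time
counts = Counter(words)                # consumes the stream and tallies each word

print(counts["the"])  # 2
print(counts["cat"])  # 0, because a Counter reports missing keys as zero
```

Nothing is read until Counter starts pulling from the chain, which is why only one line is held in memory at a time.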
In theory, on Python 2.7 and 3.1, you might do slightly better by looping over the chained results yourself and counting with a dict or collections.defaultdict(int) (because Counter is implemented in Python there, which can make it slower in some cases), but letting Counter do the work is simpler and more self-documenting (the whole goal is counting, so use a Counter). Beyond that, on CPython (the reference interpreter) 3.2 and higher, Counter has a C-level accelerator for counting iterable inputs that will run faster than anything you could write in pure Python.
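For comparison, the manual alternative mentioned above might look roughly like this. It is only a sketch: the function name count_words_manual is made up here, and it takes any iterable of lines (an open file works) so it can be tried without a file on disk.

```python
from collections import defaultdict
from itertools import chain

def count_words_manual(lines):
    """Count words by hand with a defaultdict instead of a Counter."""
    counts = defaultdict(int)  # missing words start at 0
    for word in chain.from_iterable(map(str.split, lines)):
        counts[word] += 1
    return counts

result = count_words_manual(["a b a", "b a"])
print(result["a"], result["b"])  # 3 2
```

Whether this beats Counter depends on the interpreter version, so measure it on your own data before preferring it.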
Update: You seem to want punctuation stripped and case-insensitivity, so here's a variant of my earlier code that does that:

    from string import punctuation

    def countInFile(filename):
        with open(filename) as f:
            # Python 2 only: str.translate(None, chars) deletes every char in chars.
            linewords = (line.translate(None, punctuation).lower().split() for line in f)
            return Counter(chain.from_iterable(linewords))

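Note that the two-argument translate(None, punctuation) form only exists on Python 2 str. On Python 3, str.translate takes a single table, which you can build with str.maketrans('', '', punctuation). A sketch of the same idea for Python 3 follows; the name count_in_lines and the _STRIP_PUNCT constant are just illustrative, and the function takes any iterable of lines so you can pass it an open file.

```python
from collections import Counter
from itertools import chain
from string import punctuation

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans('', '', punctuation)

def count_in_lines(lines):
    """Count words case-insensitively, ignoring ASCII punctuation."""
    linewords = (line.translate(_STRIP_PUNCT).lower().split() for line in lines)
    return Counter(chain.from_iterable(linewords))

counts = count_in_lines(["Hello, world!", "hello again."])
print(counts["hello"])  # 2
```

string.punctuation covers ASCII punctuation only, so characters such as curly quotes would still be left attached to words.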
Your code runs much more slowly because it's creating and destroying many small Counter and set objects, rather than .update-ing a single Counter once per line (which, while slightly slower than what I gave in the updated code block, would be at least algorithmically similar in scaling factor).
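For reference, the "single Counter updated once per line" approach would look something like the sketch below. The function name is made up, and it accepts any iterable of lines (such as an open file).

```python
from collections import Counter

def count_words_per_line_update(lines):
    """Keep one Counter and feed it each line's words via .update()."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())  # adds to the existing tallies in place
    return counts

result = count_words_per_line_update(["a b a", "b a"])
print(result["a"], result["b"])  # 3 2
```

Only one Counter is ever allocated here, so the work stays proportional to the number of words rather than paying for a new object on each line.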