Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


nlp - Difference between Python's collections.Counter and nltk.probability.FreqDist

I want to calculate the term frequencies of words in a text corpus. I've been using NLTK's word_tokenize followed by probability.FreqDist for some time to get this done. word_tokenize returns a list, which FreqDist converts to a frequency distribution. However, I recently came across the Counter class in collections (collections.Counter), which seems to do exactly the same thing. Both FreqDist and Counter have a most_common(n) method that returns the n most common words. Does anyone know if there's a difference between the two? Is one faster than the other? Are there cases where one would work and the other wouldn't?



1 Reply


nltk.probability.FreqDist is a subclass of collections.Counter.

From the docs:

A frequency distribution for the outcomes of an experiment. A frequency distribution records the number of times each outcome of an experiment has occurred. For example, a frequency distribution could be used to record the frequency of each word type in a document. Formally, a frequency distribution can be defined as a function mapping from each sample to the number of times that sample occurred as an outcome.

The inheritance is stated explicitly in the code, and there is essentially no difference in how a Counter and a FreqDist are initialized; see https://github.com/nltk/nltk/blob/develop/nltk/probability.py#L106
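To illustrate what subclassing means for construction, here is a minimal sketch. MiniFreqDist is a hypothetical stand-in written for this example, not NLTK's actual source; it assumes only that the subclass constructor hands its samples straight to Counter.

```python
from collections import Counter


class MiniFreqDist(Counter):
    """Hypothetical stand-in for a Counter subclass such as FreqDist."""

    def __init__(self, samples=None):
        # Delegate all of the counting to the parent class.
        Counter.__init__(self, samples)


tokens = "the cat sat on the mat near the door".split()

plain = Counter(tokens)
sub = MiniFreqDist(tokens)

print(sub["the"])                 # count for a single word
print(dict(sub) == dict(plain))   # both hold the same counts
print(isinstance(sub, Counter))   # the subclass is still a Counter
```

Because the subclass adds nothing to the counting step, both objects end up holding identical counts.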

So speed-wise, creating a Counter and creating a FreqDist should be about the same. Any difference should be insignificant, but it's worth noting where small overheads could come from:

  • defining the class when the module is first loaded by the interpreter
  • the extra call through FreqDist's .__init__() before Counter does the counting

The major difference is the set of methods that FreqDist provides for statistical / probabilistic Natural Language Processing (NLP), e.g. finding hapaxes. The full list of attributes that FreqDist adds to Counter is as follows (output from a Python 2 session; the exact names vary with the NLTK version):

>>> from collections import Counter
>>> from nltk import FreqDist
>>> x = FreqDist()
>>> y = Counter()
>>> set(dir(x)).difference(set(dir(y)))
set(['plot', 'hapaxes', '_cumulative_frequencies', 'r_Nr', 'pprint', 'N', 'unicode_repr', 'B', 'tabulate', 'pformat', 'max', 'Nr', 'freq', '__unicode__'])
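For a sense of what some of these extras compute, the sketch below reproduces a few of them with a plain Counter. The helper names are made up for this illustration and are not NLTK's code; the definitions assume the usual meanings of N (total number of samples), B (number of distinct samples), freq (relative frequency of a sample), hapaxes (samples seen exactly once), and max (the most frequent sample).

```python
from collections import Counter

tokens = "the cat sat on the mat near the door".split()
counts = Counter(tokens)


def total_samples(c):
    """Rough equivalent of N(): total number of counted samples."""
    return sum(c.values())


def distinct_samples(c):
    """Rough equivalent of B(): number of distinct samples."""
    return len(c)


def relative_freq(c, sample):
    """Rough equivalent of freq(): count divided by the total."""
    n = total_samples(c)
    return c[sample] / n if n else 0.0


def hapaxes(c):
    """Samples that occur exactly once."""
    return [s for s, k in c.items() if k == 1]


# Rough equivalent of max(): the single most frequent sample.
most_frequent = counts.most_common(1)[0][0]

print(total_samples(counts), distinct_samples(counts))
print(relative_freq(counts, "the"))
print(hapaxes(counts))
print(most_frequent)
```

None of this is hard to write by hand, which is why a plain Counter is often enough when only a statistic or two is needed.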

FreqDist.most_common() actually uses the parent method from Counter, so retrieving the sorted most_common list is equally fast for both types.
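A quick way to see that both types return the same ranking is sketched below. It assumes NLTK is installed; the fallback to Counter is only there so the snippet still runs without it, in which case the comparison is trivially true.

```python
from collections import Counter

try:
    from nltk import FreqDist  # assumes NLTK is installed
except ImportError:
    FreqDist = Counter  # fallback so the sketch runs without NLTK

tokens = "a b a c a b d".split()

top_counter = Counter(tokens).most_common(2)
top_freqdist = FreqDist(tokens).most_common(2)

print(top_counter)                  # the two most frequent tokens
print(top_counter == top_freqdist)  # same ranking from both types
```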

Personally, when I just want to retrieve counts, I use collections.Counter. But when I need to do some statistical manipulation, I either use nltk.FreqDist or I would dump the Counter into a pandas.DataFrame (see Transform a Counter object into a Pandas DataFrame).



...