Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others



python - Understanding min_df and max_df in scikit CountVectorizer

I have five text files that I input to a CountVectorizer. When specifying min_df and max_df on the CountVectorizer instance, what exactly does the min/max document frequency mean? Is it the frequency of a word within its particular text file, or the frequency of the word across the entire corpus (the 5 txt files)?

How is it different when min_df and max_df are provided as integers or as floats?

The documentation doesn't seem to provide a thorough explanation, nor does it supply an example demonstrating the use of min_df and/or max_df. Could someone provide an explanation or an example demonstrating min_df or max_df?



1 Reply


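Document frequency is the number of documents in the corpus that contain a term, not how often the term occurs inside any one file. When given as a float, the threshold is a proportion of the documents; when given as an integer, it is an absolute document count.

max_df is used for removing terms that appear too frequently, also known as "corpus-specific stop words". For example: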

  • max_df = 0.50 means "ignore terms that appear in more than 50% of the documents".
  • max_df = 25 means "ignore terms that appear in more than 25 documents".

The default max_df is 1.0, which means "ignore terms that appear in more than 100% of the documents". Thus, the default setting does not ignore any terms.
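
As a rough illustration of the max_df behaviour described above, here is a minimal sketch. The five-document corpus and the helper name `vocabulary` are made up for the example and stand in for the five text files from the question; only `CountVectorizer` and its `max_df` parameter come from scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus: one string per "text file".
# "the" occurs in 5 of 5 documents, "cat" in 2 of 5, every other word in 1.
docs = [
    "the cat sat",
    "the dog barked",
    "the cat purred",
    "the bird sang",
    "the fish swam",
]


def vocabulary(**kwargs):
    """Fit a CountVectorizer on docs and return the kept terms, sorted."""
    return sorted(CountVectorizer(**kwargs).fit(docs).vocabulary_)


vocab_default = vocabulary()          # max_df=1.0: no term is dropped
vocab_float = vocabulary(max_df=0.5)  # drop terms in more than 50% of the docs
vocab_int = vocabulary(max_df=2)      # drop terms in more than 2 docs

print(vocab_default)
print(vocab_float)
print(vocab_int)
```

With this corpus, both max_df=0.5 and max_df=2 should drop only "the", since it is the only term found in more than 2 (or more than 2.5) of the five documents, while "cat" stays because it appears in exactly 2.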


min_df is used for removing terms that appear too infrequently. For example:

  • min_df = 0.01 means "ignore terms that appear in less than 1% of the documents".
  • min_df = 5 means "ignore terms that appear in fewer than 5 documents".

The default min_df is 1, which means "ignore terms that appear in fewer than 1 document". Thus, the default setting does not ignore any terms.
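
The min_df side can be sketched the same way. Again the corpus and the helper name `kept_terms` are assumptions for illustration only. The first document repeats "apple" three times to show that repetitions inside a single document do not raise its document frequency.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus. Document frequencies:
# apple 1, banana 4, cherry 3, date 2, elderberry 1.
docs = [
    "apple apple apple banana",
    "banana cherry",
    "banana cherry date",
    "cherry date",
    "banana elderberry",
]


def kept_terms(**kwargs):
    """Fit a CountVectorizer on docs and return the kept terms, sorted."""
    return sorted(CountVectorizer(**kwargs).fit(docs).vocabulary_)


kept_default = kept_terms()          # min_df=1: every term is kept
kept_int = kept_terms(min_df=2)      # keep terms found in at least 2 docs
kept_float = kept_terms(min_df=0.5)  # keep terms found in at least 50% of docs

print(kept_default)
print(kept_int)
print(kept_float)
```

Here min_df=2 drops "apple" even though it occurs three times, because all three occurrences are in one document. With min_df=0.5 the threshold becomes 2.5 documents, so "date" (2 documents) is dropped as well.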

