These are questions on how to calculate & reduce overfitting in machine learning. I think many new to machine learning will have the same questions, so I tried to be clear with my examples and questions in hope that answers here can help others.
I have a very small sample of texts and I'm trying to predict values associated with them. I've used sklearn to calculate tf-idf, and insert those into a regression model for prediction. This gives me 26 samples with 6323 features - not a lot.. I know:
>> count_vectorizer = CountVectorizer(min_n=1, max_n=1)
>> term_freq = count_vectorizer.fit_transform(texts)
>> transformer = TfidfTransformer()
>> X = transformer.fit_transform(term_freq)
>> print X.shape
(26, 6323)
Inserting those 26 samples of 6323 features (X) and associated scores (y), into a LinearRegression
model, gives good predictions. These are obtained using leave-one-out cross validation, from cross_validation.LeaveOneOut(X.shape[0], indices=True)
:
using ngrams (n=1):
human machine points-off %error
8.67 8.27 0.40 1.98
8.00 7.33 0.67 3.34
... ... ... ...
5.00 6.61 1.61 8.06
9.00 7.50 1.50 7.50
mean: 7.59 7.64 1.29 6.47
std : 1.94 0.56 1.38 6.91
Pretty good! Using ngrams (n=300) instead of unigrams (n=1), similar results occur, which is obviously not right. No 300-words occur in any of the texts, so the prediction should fail, but it doesn't:
using ngrams (n=300):
human machine points-off %error
8.67 7.55 1.12 5.60
8.00 7.57 0.43 2.13
... ... ... ...
mean: 7.59 7.59 1.52 7.59
std : 1.94 0.08 1.32 6.61
Question 1: This might mean that the prediction model is overfitting the data. I only know this because I chose an extreme value for the ngrams (n=300) which I KNOW can't produce good results. But if I didn't have this knowledge, how would you normally tell that the model is over-fitting? In other words, if a reasonable measure (n=1) were used, how would you know that the good prediction was a result of being overfit vs. the model just working well?
Question 2: What is the best way of preventing over-fitting (in this situation) to be sure that the prediction results are good or not?
Question 3: If LeaveOneOut
cross validation is used, how can the model possibly over-fit with good results? Over-fitting means the prediction accuracy will suffer - so why doesn't it suffer on the prediction for the text being left out? The only reason I can think of: in a tf-idf sparse matrix of mainly 0s, there is strong overlap between texts because so many terms are 0s - the regression then thinks the texts correlate highly.
Please answer any of the questions even if you don't know them all. Thanks!
See Question&Answers more detail:
os