Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
426 views
in Technique[技术] by (71.8m points)

machine learning - Order between using validation, training and test sets

I am trying to understand the process of model evaluation and validation in machine learning. Specifically, in which order and how the training, validation and test sets must be used.

Let's say I have a dataset and I want to use linear regression. I am hesitating among various polynomial degrees (hyper-parameters).

In this wikipedia article, it seems to imply that the sequence should be:

  1. Split data into training set, validation set and test set
  2. Use the training set to fit the model (find the best parameters: coefficients of the polynomial).
  3. Afterwards, use the validation set to find the best hyper-parameters (in this case, polynomial degree) (wikipedia article says: "Successively, the fitted model is used to predict the responses for the observations in a second dataset called the validation dataset")
  4. Finally, use the test set to score the model fitted with the training set.

However, this seems strange to me: how can you fit your model with the training set if you haven't chosen yet your hyper-parameters (polynomial degree in this case)?

I see three alternative approachs, I am not sure if they would be correct.

First approach

  1. Split data into training set, validation set and test set
  2. For each polynomial degree, fit the model with the training set and give it a score using the validation set.
  3. For the polynomial degree with the best score, fit the model with the training set.
  4. Evaluate with the test set

Second approach

  1. Split data into training set, validation set and test set
  2. For each polynomial degree, use cross-validation only on the validation set to fit and score the model
  3. For the polynomial degree with the best score, fit the model with the training set.
  4. Evaluate with the test set

Third approach

  1. Split data into only two sets: the training/validation set and the test set
  2. For each polynomial degree, use cross-validation only on the training/validation set to fit and score the model
  3. For the polynomial degree with the best score, fit the model with the training/validation set.
  4. Evaluate with the test set

So the question is:

  • Is the wikipedia article wrong or am I missing something?
  • Are the three approaches I envisage correct? Which one would be preferrable? Would there be another approach better than these three?
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

The Wikipedia article is not wrong; according to my own experience, this is a frequent point of confusion among newcomers to ML.

There are two separate ways of approaching the problem:

  • Either you use an explicit validation set to do hyperparameter search & tuning
  • Or you use cross-validation

So, the standard point is that you always put aside a portion of your data as test set; this is used for no other reason than assessing the performance of your model in the end (i.e. not back-and-forth and multiple assessments, because in that case you are using your test set as a validation set, which is bad practice).

After you have done that, you choose if you will cut another portion of your remaining data to use as a separate validation set, or if you will proceed with cross-validation (in which case, no separate and fixed validation set is required).

So, essentially, both your first and third approaches are valid (and mutually exclusive, i.e. you should choose which one you will go with). The second one, as you describe it (CV only in the validation set?), is certainly not (as said, when you choose to go with CV you don't assign a separate validation set). Apart from a brief mention of cross-validation, what the Wikipedia article actually describes is your first approach.

Questions of which approach is "better" cannot of course be answered at that level of generality; both approaches are indeed valid, and are used depending on the circumstances. Very loosely speaking, I would say that in most "traditional" (i.e. non deep learning) ML settings, most people choose to go with cross-validation; but there are cases where this is not practical (most deep learning settings, again loosely speaking), and people are going with a separate validation set instead.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...