There is no probability estimate in the training
Sure, but we pretend there is for modeling purposes. We want to maximize the probability of, as you call it, “reality”—if the observed response (the discrete value you refer to) is a 0, we want to predict that with probability 1; similarly, if the response is a 1, we want to predict that with probability 1.
Fitting the model to one data point, getting the right answer with probability 1, would be easy. Of course, we have more than one data point. We have to balance concerns between these. We want the predicted value sigmoid(weights * features)
to be close to the true response (0 or 1) for all of the data points, but there may not be a way to set the parameters of the model to achieve this. (That is, the data may not be linearly separable.)
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…