When calculating the standard deviation, it matters whether you are estimating the standard deviation of an entire population from a smaller sample of that population, or calculating the standard deviation of the entire population itself.
If it is a smaller sample of a larger population, you need what is called the sample standard deviation. As it turns out, when you divide the sum of squared differences from the mean by the number of observations, you end up with a biased estimator of the population variance. We correct for that by dividing by one less than the number of observations. We control for this with the argument ddof=1 for sample standard deviation or ddof=0 for population standard deviation.
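As a rough illustration of what ddof changes, here is a minimal sketch that computes both versions by hand and compares them with the library calls. The values list is made-up toy data, not part of the wine dataset.

```python
import numpy as np
import pandas as pd

# Made-up toy data, chosen so the numbers come out round.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(values)
mean = sum(values) / n
sum_sq = sum((v - mean) ** 2 for v in values)

# Divide by n for the population version, by n - 1 for the sample version.
population_std = (sum_sq / n) ** 0.5
sample_std = (sum_sq / (n - 1)) ** 0.5

# np.std defaults to ddof=0; pandas .std() defaults to ddof=1.
assert np.isclose(np.std(values), population_std)
assert np.isclose(np.std(values, ddof=1), sample_std)
assert np.isclose(pd.Series(values).std(), sample_std)
assert np.isclose(pd.Series(values).std(ddof=0), population_std)
```

The divisor is n - ddof, so ddof=0 divides by n and ddof=1 divides by n - 1.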
Truth is, it doesn't matter much if your sample size is large. But you will see small differences.
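To see how quickly the difference shrinks, the sketch below compares the two versions on random samples of increasing size. The seed and the sample sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

ratios = {}
for n in (5, 50, 5000):
    sample = rng.normal(size=n)
    # Ratio of sample (ddof=1) to population (ddof=0) standard deviation.
    ratios[n] = np.std(sample, ddof=1) / np.std(sample, ddof=0)
    print(n, ratios[n])
```

Whatever the data, this ratio equals sqrt(n / (n - 1)), so it drifts toward 1 as n grows.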
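Note that pandas.DataFrame.std defaults to ddof=1 while np.std defaults to ddof=0. To get matching results, use the degrees of freedom argument in your pandas.DataFrame.std call: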
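import numpy as np
import pandas as pd

data = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv")
# Drop the non-numeric column; the positional axis argument no longer works in pandas 2.x
numeric_data = data.drop(columns="color")
# pandas, switched to the population standard deviation
numeric_data1 = ((numeric_data - numeric_data.mean()) /
                 numeric_data.std(ddof=0))  # <<<
# NumPy, which already uses ddof=0 by default
numeric_data2 = ((numeric_data - np.mean(numeric_data, axis=0)) /
                 np.std(numeric_data, axis=0))
np.isclose(numeric_data1, numeric_data2).all()  # -> True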
Or in the np.std call:
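import numpy as np
import pandas as pd

data = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv")
# Drop the non-numeric column; the positional axis argument no longer works in pandas 2.x
numeric_data = data.drop(columns="color")
# pandas, which already uses ddof=1 by default
numeric_data1 = ((numeric_data - numeric_data.mean()) /
                 numeric_data.std())
# NumPy, switched to the sample standard deviation
numeric_data2 = ((numeric_data - np.mean(numeric_data, axis=0)) /
                 np.std(numeric_data, axis=0, ddof=1))  # <<<
np.isclose(numeric_data1, numeric_data2).all()  # -> True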