This is what I am trying to explain:
>>> pd.Series([7,20,22,22]).std()
7.2284161474004804
>>> np.std([7,20,22,22])
6.2599920127744575
Answer: the difference is explained by Bessel's correction. By default pandas divides by N-1, whereas numpy divides by N, in the denominator of the standard deviation formula. I wish pandas used the same convention as numpy.
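To make the two conventions concrete, here is a minimal sketch using the same four prices. It assumes only that both libraries accept a `ddof` ("delta degrees of freedom") argument, which is the documented way to choose the denominator N - ddof; the variable names are just for illustration.

```python
import numpy as np
import pandas as pd

prices = [7, 20, 22, 22]

# pandas defaults to ddof=1: sample standard deviation, N-1 in the denominator
sample_std = pd.Series(prices).std()

# numpy defaults to ddof=0: population standard deviation, N in the denominator
population_std = np.std(prices)

# Passing ddof explicitly makes either library follow the other convention
pandas_population = pd.Series(prices).std(ddof=0)
numpy_sample = np.std(prices, ddof=1)

print(sample_std, numpy_sample)           # both use N-1
print(population_std, pandas_population)  # both use N
```

Neither value is "wrong": they answer slightly different questions (an estimate from a sample versus the spread of the full population), so it is worth deciding which one the analysis actually needs.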
There is a related discussion here, but their suggestions do not work either.
I have data about many different restaurants. Here is my dataframe (imagine more than one restaurant, but the effect is reproduced with just one):
>>> df
    restaurant_id  price
id
1           10407      7
3           10407     20
6           10407     22
13          10407     22
Question: r.mi.groupby('restaurant_id')['price'].mean() returns the price means for each restaurant. I want to get the standard deviations. However, r.mi.groupby('restaurant_id')['price'].std() returns wrong values.
As you can see, for simplicity I have extracted just one restaurant with four items. I want to find the standard deviation of the price. Just to make sure:
>>> np.mean([7,20,22,22])
17.75
>>> np.std([7,20,22,22])
6.2599920127744575
We can get the same (correct) values with
>>> np.mean(df)
restaurant_id    10407.00
price               17.75
dtype: float64
>>> np.std(df)
restaurant_id    0.000000
price            6.259992
dtype: float64
(Of course, disregard the mean restaurant id.) Obviously, np.std(df) is not a solution when I have more than one restaurant. So I am using groupby.
>>> df.groupby('restaurant_id').agg('std')
                  price
restaurant_id
10407          7.228416
What?! 7.228416 is not 6.259992.
Let's try again.
>>> df.groupby('restaurant_id').std()
Same thing.
>>> df.groupby('restaurant_id')['price'].std()
Same thing.
>>> df.groupby('restaurant_id').apply(lambda x: x.std())
Same thing.
However, this works:
for id, group in df.groupby('restaurant_id'):
    print(id, np.std(group['price']))
Question: is there a proper way to aggregate the dataframe so that I get a new series with the standard deviation for each restaurant?
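One way this could be done, sketched under the assumption that the population standard deviation (N in the denominator, as in np.std) is what is wanted: the groupby `std` method takes a `ddof` argument, so passing `ddof=0` should give one value per restaurant. The dataframe below is rebuilt from the example above so the snippet runs on its own.

```python
import pandas as pd

# Rebuild the example dataframe from the question (a single restaurant)
df = pd.DataFrame(
    {"restaurant_id": [10407, 10407, 10407, 10407], "price": [7, 20, 22, 22]},
    index=pd.Index([1, 3, 6, 13], name="id"),
)

# ddof=0 divides by N, matching np.std's default
std_by_restaurant = df.groupby("restaurant_id")["price"].std(ddof=0)
print(std_by_restaurant)
```

The result is a Series indexed by restaurant_id, so it can be joined back to other per-restaurant aggregates such as the means.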