I have a dataframe df:

app   browser   loadtime
A     safari        1500
A     safari        1650
A     Chrome        2800
B     IE            3150
B     safari        3300
C     Chrome        2650
...   ...            ...
I need to compute the upper-outlier threshold of the load time per app using the Q3 + 3*IQR rule of thumb, and then filter df, keeping only the rows where loadtime is less than the threshold for that app.
This is how I proceed:
- I compute the upper-outlier threshold using the Q3 + 3*IQR rule of thumb:
import numpy as np

def upper_outlier(x):
    # Q3 + 3*(Q3 - Q1)
    return np.percentile(x, 75) + 3 * (np.percentile(x, 75) - np.percentile(x, 25))
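As a quick sanity check (my own worked example, not part of the original post): for app A the sample load times are 1500, 1650 and 2800. With numpy's default linear interpolation of percentiles, Q1 = 1575 and Q3 = 2225, so the threshold is 2225 + 3*(2225 - 1575) = 4175.

upper_outlier([1500, 1650, 2800])   # -> 4175.0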
## Find the upper-outlier threshold per app
df_grouped = df.groupby("app")["loadtime"].agg([("upper_outlier", upper_outlier)])
This way, for each app, I have the corresponding upper-outlier threshold.
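With the six sample rows above, df_grouped would look roughly like this (app C has a single row, so its IQR is 0 and its threshold equals its only load time):

     upper_outlier
app
A           4175.0
B           3487.5
C           2650.0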
- I filter df using df_grouped:
df_new = pd.DataFrame()
for app in df.app.unique():
    df_new = pd.concat(
        [df_new, df.loc[(df.app == app) & (df.loadtime < df_grouped.loc[app, "upper_outlier"])]],
        axis=0,
    ).reset_index(drop=True)
The for loop takes a long time because I have a lot of data. Is there a cleaner, more pythonic way of doing this?
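For context, a minimal vectorized sketch (my own, not from the original post, and assuming df and upper_outlier are defined as above) would broadcast the per-app threshold back onto the rows with groupby().transform() and filter with a single boolean mask instead of looping:

# Sketch: one threshold value per row, aligned with df's index
threshold = df.groupby("app")["loadtime"].transform(upper_outlier)
df_new = df[df["loadtime"] < threshold].reset_index(drop=True)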
question from:
https://stackoverflow.com/questions/66049765/filter-a-dataframe-based-on-a-specific-value-for-each-category-in-pandas