The pandas.DataFrame.query()
method is of great usage for (pre/post)-filtering data when loading or plotting. It comes particularly handy for method chaining.
I find myself often wanting to apply the same logic to a pandas.Series
, e.g. after having done a method such as df.value_counts
which returns a pandas.Series
.
Example
Lets assume there is a huge table with the columns Player, Game, Points
and I want to plot a histogram of the players with more than 14 times 3 points. I first have to sum the points of each player (groupby -> agg
) which will return a Series of ~1000 players and their overall points. Applying the .query
logic it would look something like this:
df = pd.DataFrame({
'Points': [random.choice([1,3]) for x in range(100)],
'Player': [random.choice(["A","B","C"]) for x in range(100)]})
(df
.query("Points == 3")
.Player.values_count()
.query("> 14")
.hist())
The only solutions I find force me to do an unnecessary assignment and break the method chaining:
(points_series = df
.query("Points == 3")
.groupby("Player").size()
points_series[points_series > 100].hist()
Method chaining as well as the query method help to keep the code legible meanwhile the subsetting-filtering can get messy quite quickly.
# just to make my point :)
series_bestplayers_under_100[series_prefiltered_under_100 > 0].shape
Please help me out of my dilemma! Thanks
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…