Suppose I have the following DataFrame:
import pandas as pd
df = pd.DataFrame({"SKU": ["Coke", "Coke", "Coke", "Bread", "Bread", "Bread", "cake", "cake", "cake"],
                   "campaign": ["buy1get1", "$19", "event", "buy1get1", "$19", "event", "buy1get1", "$19", "event"],
                   "score": [0.9, 0.8, 0.4, 0.7, 0.6, 0.3, 0.5, 0.7, 0.5]})
     SKU  campaign  score
0   Coke  buy1get1    0.9
1   Coke       $19    0.8
2   Coke     event    0.4
3  Bread  buy1get1    0.7
4  Bread       $19    0.6
5  Bread     event    0.3
6   cake  buy1get1    0.5
7   cake       $19    0.7
8   cake     event    0.5
I want to get the best-scoring product (SKU) for each campaign.
My first attempt is:
df.sort_values("score").groupby("campaign", as_index=False).last()
which produces the following output:
   campaign   SKU  score
0       $19  Coke    0.8
1  buy1get1  Coke    0.9
2     event  cake    0.5
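But what I actually want is the result below, where each SKU appears at most once. Coke is already used in the buy1get1 campaign, where it has its highest score, so it should not be picked again for $19.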
   campaign    SKU  score
0       $19   cake    0.7
1     event  Bread    0.3
2  buy1get1   Coke    0.9
Logic:
- For campaign $19, skip the top-scoring SKU and take the second-highest one, because Coke is already used for campaign buy1get1, where it scores higher (0.9 > 0.8).
- That gives cake for campaign $19, so cake can no longer be used for campaign "event". Coke is taken as well, so "event" falls through to its third-highest score: Bread.
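To make the logic above concrete, here is a minimal sketch of it as a greedy pass: sort all rows by score in descending order, then accept a row only if neither its SKU nor its campaign has been taken yet. The helper name `greedy_assign` is hypothetical, and the sketch assumes that "highest score first, skip anything already used" is the intended rule. It is not necessarily the assignment with the highest total score.

```python
import pandas as pd

df = pd.DataFrame({
    "SKU": ["Coke", "Coke", "Coke", "Bread", "Bread", "Bread", "cake", "cake", "cake"],
    "campaign": ["buy1get1", "$19", "event"] * 3,
    "score": [0.9, 0.8, 0.4, 0.7, 0.6, 0.3, 0.5, 0.7, 0.5],
})


def greedy_assign(frame):
    """Hypothetical helper: one unused SKU per campaign, highest scores first."""
    used_skus, used_campaigns, picked = set(), set(), []
    # A stable sort keeps tie handling reproducible between runs.
    ranked = frame.sort_values("score", ascending=False, kind="stable")
    for row in ranked.itertuples(index=False):
        # Skip the row if its SKU or its campaign was already assigned.
        if row.SKU in used_skus or row.campaign in used_campaigns:
            continue
        used_skus.add(row.SKU)
        used_campaigns.add(row.campaign)
        picked.append(row)
    return pd.DataFrame(picked, columns=frame.columns)


result = greedy_assign(df)
print(result)
```

On the sample data this picks Coke for buy1get1, cake for $19, and Bread for event. It still loops in Python, but only once over the sorted rows, so the sort should dominate the cost on a large data set.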
I have thought of a few approaches, but none of them is efficient or Pythonic.
I will need to handle a large data set.
Instead of resorting to an inefficient loop or something equally messy,
is there a better way to solve this kind of problem?
Any opinions or pointers would be much appreciated.
question from:
https://stackoverflow.com/questions/65932090/python-pandas-group-by-column-a-and-prevent-duplicated-existence-on-column-b