Python pandas - group by column A and prevent duplicated existence on column b?

posted Oct 7, 2021 in Technique by 深蓝 (71.8m points)

Suppose I have a dataframe

import pandas as pd

df = pd.DataFrame({"SKU": ["Coke", "Coke", "Coke", "Bread", "Bread", "Bread", "cake", "cake", "cake"],
                   "campaign": ["buy1get1", "$19", "event", "buy1get1", "$19", "event", "buy1get1", "$19", "event"],
                   "score": [0.9, 0.8, 0.4, 0.7, 0.6, 0.3, 0.5, 0.7, 0.5]})

    SKU    campaign score
0   Coke    buy1get1    0.9
1   Coke    $19         0.8
2   Coke    event       0.4
3   Bread   buy1get1    0.7
4   Bread   $19         0.6
5   Bread   event       0.3
6   cake    buy1get1    0.5
7   cake    $19         0.7
8   cake    event       0.5

I want to get the best-scoring product for each campaign. My first attempt is

df.sort_values("score").groupby("campaign", as_index=False).last()

which gives the following output:

    campaign    SKU score
0   $19         Coke    0.8
1   buy1get1    Coke    0.9
2   event       cake    0.5

But what I want is the following, because Coke is already used in the buy1get1 campaign, where it has a higher score:

    campaign    SKU     score
0   $19         cake     0.7
1   event       Bread    0.3
2   buy1get1    Coke     0.9

Logic:

For campaign $19, fall back to the second-largest score, because Coke is already used for campaign buy1get1 with a higher score (0.9 > 0.8).
That gives cake for campaign $19, which means cake can no longer be used for campaign "event". So for event we fall back to the third-largest score: Bread.
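
The logic above can be sketched as a plain greedy loop, which is the baseline the question hopes to improve on. This is only a minimal reference sketch, assuming that "best" means taking rows in descending score order and skipping any row whose SKU or campaign has already been assigned; the helper name greedy_assign is hypothetical and not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({
    "SKU": ["Coke", "Coke", "Coke", "Bread", "Bread", "Bread", "cake", "cake", "cake"],
    "campaign": ["buy1get1", "$19", "event"] * 3,
    "score": [0.9, 0.8, 0.4, 0.7, 0.6, 0.3, 0.5, 0.7, 0.5],
})

def greedy_assign(frame):
    """Hypothetical helper: assign each campaign one SKU, highest scores first."""
    used_skus, used_campaigns, picked = set(), set(), []
    # Stable sort so that tied scores keep their original relative order
    ordered = frame.sort_values("score", ascending=False, kind="stable")
    for row in ordered.itertuples(index=False):
        # Skip rows whose SKU or campaign has already been assigned
        if row.SKU in used_skus or row.campaign in used_campaigns:
            continue
        used_skus.add(row.SKU)
        used_campaigns.add(row.campaign)
        picked.append(row)
    return pd.DataFrame(picked, columns=frame.columns)

result = greedy_assign(df)
print(result)
```

On the sample data this assigns Coke to buy1get1, cake to $19 and Bread to event, matching the desired output. It can serve as a reference to compare faster approaches against.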

I have thought of a few approaches, but none of them is efficient or Pythonic, and I will need to handle a large data set.

Is there a better way to solve this kind of problem than an inefficient loop or something equally messy?

Your opinion and information will be much appreciated.

question from:https://stackoverflow.com/questions/65932090/python-pandas-group-by-column-a-and-prevent-duplicated-existence-on-column-b

1 Reply

深蓝 · Answer 1 · 2021-10-06T18:59:55+0000

One idea is to find the SKU that currently wins each campaign (the last value per group after sorting), keep only the best-scoring row of each of those SKUs in the original DataFrame, and then run your solution again on the filtered data:

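# Sort ascending so that last() picks the highest score in each group
df1 = df.sort_values("score")

# SKU that currently wins each campaign
last = df1.groupby("campaign")['SKU'].last()
# Keep every row of non-winning SKUs; for winning SKUs keep only their best-scoring row
mask = ~df1['SKU'].isin(last) | ~df1['SKU'].duplicated(keep='last')

df = df1[mask].groupby("campaign", as_index=False).last()
print(df)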
   campaign    SKU  score
0       $19   cake    0.7
1  buy1get1   Coke    0.9
2     event  Bread    0.3
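
The mask resolves the sample data in a single pass, but it may be worth checking that the result really has unique SKUs on other inputs. The following is a minimal sketch of such a check. The wrapper name one_pass and the A/B/C data are hypothetical and not from the question; the data is chosen so that one SKU initially wins every campaign.

```python
import pandas as pd

def one_pass(frame):
    """Hypothetical wrapper: apply the mask-based filter from the answer once."""
    df1 = frame.sort_values("score")
    last = df1.groupby("campaign")["SKU"].last()
    mask = ~df1["SKU"].isin(last) | ~df1["SKU"].duplicated(keep="last")
    return df1[mask].groupby("campaign", as_index=False).last()

# Hypothetical data: A has the top score in every campaign
other = pd.DataFrame({
    "SKU": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "campaign": ["x", "y", "z"] * 3,
    "score": [0.9, 0.8, 0.7, 0.5, 0.6, 0.65, 0.1, 0.2, 0.3],
})

result = one_pass(other)
print(result)
# False for this input: after A is restricted to x, B wins both y and z
print(result["SKU"].is_unique)
```

If the check fails, one option is to apply the filter again to the remaining rows, or to fall back to an explicit greedy loop.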
