Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Python pandas - group by column A and prevent duplicate values in column B?

Suppose I have a DataFrame:

import pandas as pd

df = pd.DataFrame({"SKU": ["Coke", "Coke", "Coke", "Bread", "Bread", "Bread", "cake", "cake", "cake"],
                   "campaign": ["buy1get1", "$19", "event", "buy1get1", "$19", "event", "buy1get1", "$19", "event"],
                   "score": [0.9, 0.8, 0.4, 0.7, 0.6, 0.3, 0.5, 0.7, 0.5]})

    SKU    campaign score
0   Coke    buy1get1    0.9
1   Coke    $19         0.8
2   Coke    event       0.4
3   Bread   buy1get1    0.7
4   Bread   $19         0.6
5   Bread   event       0.3
6   cake    buy1get1    0.5
7   cake    $19         0.7
8   cake    event       0.5

I want to get the best-scoring product for each campaign. A first attempt would be:

df.sort_values("score").groupby("campaign", as_index=False).last()

This gives the following output:

    campaign    SKU     score
0   $19         Coke    0.8
1   buy1get1    Coke    0.9
2   event       cake    0.5

But what I want is the following, because Coke is already used in the buy1get1 campaign, where it has a higher score:

    campaign    SKU     score
0   $19         cake     0.7
1   event       Bread    0.3
2   buy1get1    Coke     0.9

Logic:

  1. For campaign $19, go to the second-largest score, because Coke is already used for campaign buy1get1 with a higher score (0.9 > 0.8).
  2. That gives cake for campaign $19, which means cake can no longer be used for campaign "event". For event we therefore go to the third-largest score: Bread.
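
The two steps above amount to a greedy pass over the rows from highest to lowest score, skipping any row whose SKU or campaign is already taken. Below is a minimal sketch of that reading, not a tuned solution: the function name greedy_assign is hypothetical, it assumes the DataFrame has a unique index, and it breaks score ties by original row order.

```python
import pandas as pd

df = pd.DataFrame({
    "SKU": ["Coke", "Coke", "Coke", "Bread", "Bread", "Bread", "cake", "cake", "cake"],
    "campaign": ["buy1get1", "$19", "event"] * 3,
    "score": [0.9, 0.8, 0.4, 0.7, 0.6, 0.3, 0.5, 0.7, 0.5],
})


def greedy_assign(frame):
    """Hypothetical helper: pick rows by descending score, using each SKU and campaign once."""
    # A stable sort keeps the original row order among equal scores.
    ordered = frame.sort_values("score", ascending=False, kind="stable")
    used_skus, used_campaigns, keep = set(), set(), []
    for idx, sku, campaign in zip(ordered.index, ordered["SKU"], ordered["campaign"]):
        if sku in used_skus or campaign in used_campaigns:
            continue  # this SKU or campaign was already claimed by a higher score
        used_skus.add(sku)
        used_campaigns.add(campaign)
        keep.append(idx)
    # Assumes index labels are unique, as with the default RangeIndex.
    return frame.loc[keep].reset_index(drop=True)


result = greedy_assign(df)
print(result)
```

On the sample data this selects Coke for buy1get1, cake for $19 and Bread for event, matching the desired output. It still loops in Python, but only once over the sorted rows.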

I have thought of a few ways to do this, but none of them are efficient or pythonic, and I need to handle a large data set.

Is there a better way to solve this kind of problem than an inefficient loop or something equally messy?

Any opinions and information would be much appreciated.

Question from: https://stackoverflow.com/questions/65932090/python-pandas-group-by-column-a-and-prevent-duplicated-existence-on-column-b

1 Reply

One idea is to find the products that have the top score in some campaign, keep only the best-scoring row for each of those products in the sorted DataFrame (rows for all other products are left untouched), and then run your solution again on the filtered data:

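# Sort ascending so that last() picks the highest score in each group
df1 = df.sort_values("score")

# SKU with the top score in each campaign
last = df1.groupby("campaign")['SKU'].last()
# Keep all rows of SKUs that top no campaign; for the others keep only their best row
mask = ~df1['SKU'].isin(last) | ~df1['SKU'].duplicated(keep='last')

df = df1[mask].groupby("campaign", as_index=False).last()
print(df)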
   campaign    SKU  score
0       $19   cake    0.7
1  buy1get1   Coke    0.9
2     event  Bread    0.3
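
The mask above resolves one round of conflicts. On other data, dropping rows can leave a product that was never a top scorer as the new best in two campaigns, so it may be worth checking the result. Here is a small sketch of such a check; the helper name is_valid_assignment is hypothetical, and the two frames simply restate the outputs shown in this thread.

```python
import pandas as pd


def is_valid_assignment(result, sku_col="SKU", campaign_col="campaign"):
    """Hypothetical check: True if no SKU and no campaign appears more than once."""
    return bool(result[sku_col].is_unique and result[campaign_col].is_unique)


# Output of the filtered groupby shown above
answer_output = pd.DataFrame({
    "campaign": ["$19", "buy1get1", "event"],
    "SKU": ["cake", "Coke", "Bread"],
    "score": [0.7, 0.9, 0.3],
})

# Output of the plain sort_values/groupby/last attempt from the question
naive_output = pd.DataFrame({
    "campaign": ["$19", "buy1get1", "event"],
    "SKU": ["Coke", "Coke", "cake"],
    "score": [0.8, 0.9, 0.5],
})

print(is_valid_assignment(answer_output))  # True
print(is_valid_assignment(naive_output))   # False
```

If the check fails on real data, the filtering step would have to be repeated or replaced by a different strategy.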
