Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Select rows from a pandas DataFrame based on another column's percentage of occurrence

I have a data set like this:

1A1HI_R071_PH_INSPECT_VIS_1_2_201231_025816.JPG 1A  1A1HI
1A1JK_R071_PH_INSPECT_VIS_1_2_210115_121554.JPG 1A  1A1JK
1P3G6_R071_PH_INSPECT_VIS_2_2_201231_034741.JPG 1P  1P3G6
1P3GC_R071_PH_INSPECT_VIS_3_2_201107_140047.JPG 1P  1P3GC
M10L0_R071_PH_INSPECT_VIS_6_2_201121_071741.JPG M1  M10L0
M10L4_R071_PH_INSPECT_VIS_8_2_201201_142646.JPG M1  M10L4
S5148_R071_PH_INSPECT_VIS_1_2_201127_042210.JPG S5  S5148
S516U_R071_PH_INSPECT_VIS_5_2_201222_074443.JPG S5  S516U
V929S_R071_PH_INSPECT_VIS_8_2_201120_144633.JPG V9  V929S
V92B0_R071_PH_INSPECT_VIS_4_2_201121_095537.JPG V9  V92B0
V92B0_R071_PH_INSPECT_VIS_4_2_201121_095539.JPG V9  V92B0
V92EM_R071_PH_INSPECT_VIS_2_2_210105_133406.JPG V9  V92EM
W405K_R071_PH_INSPECT_VIS_11_2_201021_230940.JPG    W4  
W405O_R071_PH_INSPECT_VIS_2_2_201206_095433.JPG W4  W405O
W40EW_R071_PH_INSPECT_VIS_3_3_201219_120634.JPG W4  W40EW
W40EW_R071_PH_INSPECT_VIS_5_3_201220_072010.JPG W4  W40EW
W40EW_R071_PH_INSPECT_VIS_5_3_201220_072019.JPG W4  W40EW
X103K_R071_PH_INSPECT_VIS_2_3_210112_185054.JPG X1  X103K
1A1HI_R071_PH_INSPECT_VIS_1_4_201231_025833.JPG 1A  1A1HI
1A1RE_R071_PH_INSPECT_VIS_1_4_201227_153637.JPG 1A  1A1RE
1P3G6_R071_PH_INSPECT_VIS_2_4_201231_034806.JPG 1P  1P3G6
1P3GC_R071_PH_INSPECT_VIS_3_4_201107_140102.JPG 1P  1P3GC
1P3HO_R071_PH_INSPECT_VIS_6_4_201214_113511.JPG 1P  1P3HO
1P3HQ_R071_PH_INSPECT_VIS_5_4_201207_191653.JPG 1P  1P3HQ
5X6X6_R071_PH_INSPECT_VIS_3_4_201211_142453.JPG 5X  5X6X6
A70NG_R071_PH_INSPECT_VIS_5_4_201025_182537.JPG A7  A70NG
M10L0_R071_PH_INSPECT_VIS_6_4_201121_071750.JPG M1  M10L0
M10L4_R071_PH_INSPECT_VIS_8_4_201201_142701.JPG M1  M10L4
V929S_R071_PH_INSPECT_VIS_8_4_201120_144651.JPG V9  V929S
V92EM_R071_PH_INSPECT_VIS_2_4_210105_133438.JPG V9  V92EM
W405O_R071_PH_INSPECT_VIS_2_4_201206_095500.JPG W4  W405O
W4078_R071_PH_INSPECT_VIS_5_4_201215_153919.JPG W4  W4078
W40BK_R071_PH_INSPECT_VIS_2_4_210113_175802.JPG W4  W40BK
W40EW_R071_PH_INSPECT_VIS_5_4_201220_072024.JPG W4  W40EW
1A1HI_R071_PH_INSPECT_VIS_1_5_201231_025836.JPG 1A  1A1HI
1A1JK_R071_PH_INSPECT_VIS_1_5_210115_121617.JPG 1A  1A1JK
1A1RE_R071_PH_INSPECT_VIS_1_5_201227_153639.JPG 1A  1A1RE
1P3G6_R071_PH_INSPECT_VIS_2_5_201231_034809.JPG 1P  1P3G6
1P3GC_R071_PH_INSPECT_VIS_3_5_201107_140105.JPG 1P  1P3GC

The first column is the image name and the second column is the product name.

The dataset contains this many products: [image not available]

How can I select images (1st column) based on the percentage of occurrence of the values in the 2nd column?

For example, I need to select 170 random rows (images) whose 2nd column is 1P, 156 random rows whose 2nd column is 1A, and so on, so that I get 20% of the images in each product category to build a training set.

Question from: https://stackoverflow.com/questions/65878207/select-rows-from-pandas-data-frame-based-on-another-column-percentage-of-occur


1 Reply


You can use DataFrameGroupBy.sample (available from pandas 1.1) to sample rows within each category.

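import pandas

frac = 0.2  # fraction of rows to sample from each category (20%)

# Small example dataframe
df = pandas.DataFrame({
    'image_id': [1, 2, 3, 4, 5, 6, 7],
    'product_category': ['A', 'A', 'A', 'A', 'A', 'B', 'B'],
})

# Draw a random 20% of the rows within each product category
df.groupby('product_category').sample(frac=frac)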

However, please note that a category returns no rows when its fraction times its row count rounds down to 0. In the example above, category B has only 2 rows, so 20% of it rounds to 0 rows.
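If every category has to be represented in the training set, one possible workaround is to compute the per-category sample size yourself and enforce a minimum. The sketch below is only an illustration: the helper name sample_per_category and its parameters min_rows and seed are hypothetical, not part of pandas, and it assumes the remaining rows should become the hold-out set.

```python
import pandas as pd


def sample_per_category(df, column, frac, min_rows=1, seed=0):
    """Sample about `frac` of each group in `column`, at least `min_rows` rows.

    Hypothetical helper for illustration; not a pandas function.
    """
    parts = []
    for _, group in df.groupby(column):
        # Target size for this group, raised to the minimum if it rounds too low
        k = max(min_rows, round(frac * len(group)))
        # Never request more rows than the group actually has
        k = min(k, len(group))
        parts.append(group.sample(n=k, random_state=seed))
    return pd.concat(parts)


df = pd.DataFrame({
    'image_id': [1, 2, 3, 4, 5, 6, 7],
    'product_category': ['A', 'A', 'A', 'A', 'A', 'B', 'B'],
})

# Training set: roughly 20% per category, with at least one row each
train = sample_per_category(df, 'product_category', frac=0.2)

# Everything that was not sampled (assumes the index labels are unique)
rest = df.drop(train.index)
```

Passing a fixed seed makes the selection repeatable between runs. Dropping the sampled index labels from the original dataframe leaves the rows that were not picked, which can serve as the validation or test set.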

