Note: this question is indeed a duplicate of "Split pandas dataframe string entry to separate rows", but the answer provided here is more generic and informative, so, with all due respect, I chose not to delete the thread.
I have a 'dataset' with the following format:
id | value | ...
--------|-------|------
a | 156 | ...
b,c | 457 | ...
e,g,f,h | 346 | ...
... | ... | ...
and I would like to normalize it by duplicating the values for each id:
id | value | ...
--------|-------|------
a | 156 | ...
b | 457 | ...
c | 457 | ...
e | 346 | ...
g | 346 | ...
f | 346 | ...
h | 346 | ...
... | ... | ...
What I'm doing is applying the split-apply-combine principle of pandas using .groupby, which yields a tuple (groupby value, pd.DataFrame()) for each group.
I created a column to group by that simply counts the ids in the row:
df['count_ids'] = df['id'].str.split(',').str.len()
id | value | count_ids
--------|-------|------
a | 156 | 1
b,c | 457 | 2
e,g,f,h | 346 | 4
... | ... | ...
The way I'm duplicating the rows is as follows:
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
pd.concat([group] * count_ids)
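Put together, the groupby-based idea described above might look roughly like the sketch below. It assumes the two columns from the example table; the function name expand_by_group and the sample frame are hypothetical, and this is only one way to finish the approach, not necessarily the best one.

```python
import pandas as pd

# Sample data mirroring the table in the question (assumed column names).
df = pd.DataFrame({"id": ["a", "b,c", "e,g,f,h"], "value": [156, 457, 346]})


def expand_by_group(frame):
    """Hypothetical helper: one output row per id, via groupby on the id count."""
    counts = frame["id"].str.split(",").str.len()
    pieces = []
    # Each group holds the rows that contain exactly n ids.
    for n, group in frame.groupby(counts):
        ids = group["id"].str.split(",")
        for i in range(n):
            piece = group.copy()
            piece["id"] = ids.str[i]  # keep only the i-th id of every row
            pieces.append(piece)
    # A stable sort on the original index puts the copies of each row back together
    # while preserving the order of the ids within that row.
    return pd.concat(pieces).sort_index(kind="mergesort").reset_index(drop=True)


result = expand_by_group(df)
print(result)
```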
I'm slowly progressing, but it is really complex, and I would appreciate any best practice or recommendation you can share for this type of problem.
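For comparison, a shorter route that avoids groupby altogether is to split the id column into lists and let pandas expand them. This is a minimal sketch, assuming pandas 0.25 or later (where DataFrame.explode is available) and the same sample data as in the tables above; the variable names are illustrative only.

```python
import pandas as pd

# Sample data mirroring the table in the question (assumed column names).
df = pd.DataFrame({"id": ["a", "b,c", "e,g,f,h"], "value": [156, 457, 346]})

normalized = (
    df.assign(id=df["id"].str.split(","))  # "b,c" -> ["b", "c"]
      .explode("id")                       # one row per list element, other columns repeated
      .reset_index(drop=True)              # drop the duplicated original index
)
print(normalized)
```

Any other columns are carried along unchanged, so this should work regardless of how many extra columns the dataset has.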