#Method 1:
You can use Series.str.split with explode, followed by groupby.value_counts:
(df.assign(text=df['text'].str.split()).explode("text")
.groupby("category",sort=False)['text'].value_counts())
category    text
soccer      game          2
            soccer        2
            good          1
            is            1
basketball  game          2
            basketball    1
volleyball  sport         2
            volleyball    1
Name: text, dtype: int64
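To try Method 1 end to end, here is a minimal self-contained sketch. The DataFrame below is hypothetical sample data (not the data from the question); only the column names `category` and `text` are taken from the snippet above.

```python
import pandas as pd

# Hypothetical sample data; only the column names match the snippet above.
df = pd.DataFrame({
    "category": ["soccer", "soccer", "basketball"],
    "text": ["soccer is good", "soccer game", "basketball game game"],
})

counts = (
    df.assign(text=df["text"].str.split())   # each cell becomes a list of words
      .explode("text")                        # one row per word
      .groupby("category", sort=False)["text"]
      .value_counts()                         # word frequency within each category
)

print(counts)
```

The result is a Series with a (category, word) MultiIndex, so a single count can be looked up with, for example, `counts.loc[("basketball", "game")]`.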
#Method 2:
For older versions of pandas (before 0.25, where explode is not available), use np.concatenate and Index.repeat with df.join:
(There are other methods listed here)
s = df['text'].str.split()
(df[['category']].join(pd.Series(np.concatenate(s),
index=df.index.repeat(s.str.len()),name='text'))
.groupby("category",sort=False)['text'].value_counts())
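A self-contained sketch of the same idea, again on hypothetical sample data. The intermediate name `words` is introduced here only for readability, and `s.tolist()` is used so that np.concatenate receives a plain list of lists.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names match the snippet above.
df = pd.DataFrame({
    "category": ["soccer", "soccer", "basketball"],
    "text": ["soccer is good", "soccer game", "basketball game game"],
})

s = df["text"].str.split()

# Flatten all word lists into one array and repeat each row label
# once per word, so every word keeps the index of its source row.
words = pd.Series(
    np.concatenate(s.tolist()),
    index=df.index.repeat(s.str.len()),
    name="text",
)

counts = (
    df[["category"]].join(words)             # attach the category to each word
      .groupby("category", sort=False)["text"]
      .value_counts()
)

print(counts)
```

Because `words` reuses the original row labels, the join lines each word up with its category without any explicit merge key.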
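#Method 3: using MultiLabelBinarizer from sklearn
Note that MultiLabelBinarizer only records whether a word is present in a row, so a word repeated within a single row is counted once: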
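import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

s = df['text'].str.split()
mlb = MultiLabelBinarizer()
mlb.fit(s)
# One 0/1 indicator column per word, summed within each category.
# index=df.index keeps the rows aligned with df['category'] for the groupby.
out = pd.DataFrame(mlb.transform(s), columns=mlb.classes_, index=df.index).groupby(df['category']).sum()
# Hide zero counts, then reshape to a (category, word) Series
out.replace(0, np.nan).stack().astype(int)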
category
basketball  basketball    1
            game          2
soccer      game          2
            good          1
            is            1
            soccer        2
volleyball  sport         1
            volleyball    1
dtype: int32
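If a wide category-by-word table like `out` is wanted without scikit-learn, one possible sketch is to unstack the Method 1 result. This is a pandas-only alternative, not the MultiLabelBinarizer approach itself, and it counts every occurrence of a word rather than its presence per row. The sample data and the name `wide` are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; only the column names match the snippets above.
df = pd.DataFrame({
    "category": ["soccer", "soccer", "basketball"],
    "text": ["soccer is good", "soccer game", "basketball game game"],
})

wide = (
    df.assign(text=df["text"].str.split())
      .explode("text")
      .groupby("category")["text"]
      .value_counts()
      .unstack(fill_value=0)   # words become columns; missing pairs become 0
)

print(wide)
```

Here "game" appears twice in a single basketball row, so `wide.loc["basketball", "game"]` is 2, whereas an indicator-based count would report 1 for that row.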