Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


pandas - All Python Permutations and Combinations of Pinyin (Mandarin Romanization)

I'm looking to make a list of toneless pinyin combinations/permutations.

import pandas as pd
data = pd.read_csv('chinese_tones.txt', sep=" ", header=None)
data.columns = ["pinyin", "character"]
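# Strip the tone digits; the pattern must be the regex \d+, not the literal text 'd+'
data['pinyin'] = data['pinyin'].str.replace(r'\d+', '', regex=True)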

The current format of the data is:

| pinyin | character |
|--------|-----------|
| cang   | 仓        |
| cang   | 藏        |
| cao    | 操        |
| cao    | 曹        |
| cao    | 草        |

The expected result would be a list like:

cangcang
cangcao
caocang
caocao

I can dedupe and clean the list myself. I just want every combination of two pinyin syllables, in every order, including a syllable paired with itself.



1 Reply


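You can call `drop_duplicates` on the pinyin column and then use an outer addition (`np.add.outer`) to concatenate every syllable with every syllable, which gives all combinations in both orders. Here `df` is the DataFrame that the question calls `data`.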

import numpy as np
import pandas as pd

s = df['pinyin'].drop_duplicates().to_numpy()
pd.Series(np.add.outer(s, s).ravel())

#0    cangcang
#1     cangcao
#2     caocang
#3      caocao
#dtype: object
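
To make the outer addition easier to try out, here is a minimal self-contained sketch. The `df` below is a hypothetical stand-in typed in by hand from the question's sample rows, not the real file, and the `itertools.product` version is added only for comparison. It assumes the array has object dtype, so that `+` means string concatenation; passing `dtype=object` to `to_numpy` makes that explicit.

```python
from itertools import product

import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's data, typed in by hand.
df = pd.DataFrame({
    "pinyin": ["cang", "cang", "cao", "cao", "cao"],
    "character": ["仓", "藏", "操", "曹", "草"],
})

# Unique syllables as an object array, so + concatenates Python strings.
s = df["pinyin"].drop_duplicates().to_numpy(dtype=object)

# Outer addition: every syllable joined with every syllable, in both orders.
outer = np.add.outer(s, s).ravel().tolist()

# The same ordered pairs built in plain Python, in the same row-major order.
pairs = [a + b for a, b in product(s, repeat=2)]

print(outer)
print(pairs == outer)
```

For n unique syllables both versions produce n * n strings, so the result grows quickly with the size of the syllable list.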

If you also want the original single syllables in the list, prepend `s` to the result of the outer addition.

pd.Series(s.tolist() + np.add.outer(s, s).ravel().tolist())
#0        cang
#1         cao
#2    cangcang
#3     cangcao
#4     caocang
#5      caocao
#dtype: object

If you also want to keep the two individual syllables next to each combination, you can do something similar with a merge instead of dropping down to NumPy. Call `drop_duplicates` again, assign a temporary constant key so that the merge pairs every row with every other row (and with itself), then add the strings.

s = df[['pinyin']].drop_duplicates().assign(key=1)
res = s.merge(s, on='key').drop(columns='key')
res['combined'] = res['pinyin_x'] + res['pinyin_y']

#  pinyin_x pinyin_y  combined
#0     cang     cang  cangcang
#1     cang      cao   cangcao
#2      cao     cang   caocang
#3      cao      cao    caocao
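
The temporary key is one way to pair every row with every row. As a possible alternative, assuming pandas 1.2 or later, `merge` accepts `how='cross'`, which produces the same Cartesian product without the helper column. The sketch below uses a hypothetical hand-typed `df` in place of the real file.

```python
import pandas as pd

# Hypothetical stand-in for the question's pinyin column.
df = pd.DataFrame({"pinyin": ["cang", "cang", "cao", "cao", "cao"]})

# Keep one row per unique syllable.
s = df[["pinyin"]].drop_duplicates()

# Cross join: each row of s is paired with each row of s (pandas >= 1.2).
# The overlapping column name gets the default suffixes _x and _y.
res = s.merge(s, how="cross")

# Concatenate the two syllables of every pair.
res["combined"] = res["pinyin_x"] + res["pinyin_y"]

print(res)
```

The result has the same `pinyin_x`, `pinyin_y`, and `combined` columns as the key-based merge, so the choice between the two mostly depends on the pandas version available.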
