You may use:
df['col_b_PY'] = df['col_a'].str.extract(r"([a-zA-Z'-]+\s+PY)\b")
df['col_c_LG'] = df['col_a'].str.extract(r"([a-zA-Z'-]+\s+LG)\b")
Or, to extract all matches and join them with a space:
df['col_b_PY'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+PY)\b").unstack().apply(lambda x: ' '.join(x.dropna()), axis=1)
df['col_c_LG'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+LG)\b").unstack().apply(lambda x: ' '.join(x.dropna()), axis=1)
Note that the regex pattern must contain a capturing group so that extract
has something to return. As the pandas documentation puts it:
"Extract capture groups in the regex pat as columns in a DataFrame."
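To illustrate the capturing-group requirement, here is a minimal sketch. The sample Series and the variable names (s, with_group, raised) are made up for this illustration; expand=False is used only so that a single group comes back as a Series.

```python
import pandas as pd

# Hypothetical sample data, not from the question.
s = pd.Series(["Python PY is a language LG", "no tags here"])

# With a capturing group, extract returns the captured text
# (NaN for rows where nothing matches).
with_group = s.str.extract(r"([a-zA-Z'-]+\s+PY)\b", expand=False)

# Without a capturing group, pandas has nothing to extract and raises ValueError.
try:
    s.str.extract(r"[a-zA-Z'-]+\s+PY\b")
    raised = False
except ValueError:
    raised = True

print(with_group.tolist(), raised)
```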
Note that the \b word boundary is necessary to match PY / LG as a whole word.
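A small sketch of what the word boundary changes, using the standard re module on a made-up sentence (the names text, loose and strict are only for illustration):

```python
import re

# Hypothetical sample text: "PYTHON" starts with "PY" but is a different word.
text = "learn PYTHON today, then write Python PY code"

# Without \b, the PY at the start of "PYTHON" is matched too.
loose = re.findall(r"([a-zA-Z'-]+\s+PY)", text)

# With \b, PY must end at a word boundary, so only the standalone tag matches.
strict = re.findall(r"([a-zA-Z'-]+\s+PY)\b", text)

print(loose)
print(strict)
```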
Also, if you want a match to start only at a letter, you may change the patterns to
r"([a-zA-Z][a-zA-Z'-]*\s+PY)\b"
r"([a-zA-Z][a-zA-Z'-]*\s+LG)\b"
   ^^^^^^^^          ^
where [a-zA-Z] matches a single letter and [a-zA-Z'-]* matches 0 or more letters, apostrophes or hyphens.
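As a quick check of the difference, here is a sketch with the re module on a made-up string that begins with hyphens (text, any_start and letter_start are illustrative names):

```python
import re

# Hypothetical input whose "word" begins with hyphens.
text = "--well PY"

# The first pattern lets the match start at a hyphen or apostrophe.
any_start = re.search(r"([a-zA-Z'-]+\s+PY)\b", text).group(1)

# The revised pattern forces the match to start at a letter.
letter_start = re.search(r"([a-zA-Z][a-zA-Z'-]*\s+PY)\b", text).group(1)

print(repr(any_start), repr(letter_start))
```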
Python 3.7 with Pandas 0.24.2:
import pandas as pd

# Widen the console display so all columns are printed on one line
pd.set_option('display.width', 1000)
pd.set_option('display.max_columns', 500)
df = pd.DataFrame({
    'col_a': ['Python PY is a general-purpose language LG',
              'Programming language LG in Python PY',
              'Its easier LG to understand PY',
              'The syntax of the language LG is clean PY',
              'Python PY is a general purpose PY language LG']
})
# Collect every "<word> PY" / "<word> LG" match per row and join them with a space
df['col_b_PY'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+PY)\b").unstack().apply(lambda x: ' '.join(x.dropna()), axis=1)
df['col_c_LG'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+LG)\b").unstack().apply(lambda x: ' '.join(x.dropna()), axis=1)
print(df)
Output:
                                           col_a              col_b_PY     col_c_LG
0     Python PY is a general-purpose language LG             Python PY  language LG
1           Programming language LG in Python PY             Python PY  language LG
2                 Its easier LG to understand PY         understand PY    easier LG
3      The syntax of the language LG is clean PY              clean PY  language LG
4  Python PY is a general purpose PY language LG  Python PY purpose PY  language LG