Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python 3.x - How to separate a string with 2 uppercases and a space with regex in pandas dataframe?

I have a dataframe column, teams, where I am trying to split the team name, 'CubsWhite Sox', into two parts, 'Cubs' and 'White Sox'.

import pandas as pd
import re
data = [{'teams':'CubsWhite Sox','area':'Chicago','league': 'MLB'}, {'teams': 'Red Sox','area':'Boston', 'league': 'MLB'}, {'teams': 'Blue Jay','area':'Toronto', 'league': 'MLB'}] 

df = pd.DataFrame(data) 
df

So far, I have only managed to get this result:

df["team"] = df.apply(lambda x: re.findall(r"[A-Z][^A-Z]*(?:\s[A-Z][^A-Z]*)", x["teams"]), axis=1)
df
    teams           area    league   team
0   CubsWhite Sox   Chicago MLB      [White Sox]
1   Red Sox         Boston  MLB      [Red Sox]
2   Blue Jay        Toronto MLB      [Blue Jay]

Also, there are two spaces after White, Red, and Blue, as I discovered here:

df["team"] = df.apply(lambda x: re.findall(r"[A-Z0-9][^A-Z]*", x["teams"]), axis=1)
df
    teams           area    league  team
0   CubsWhite Sox   Chicago MLB     [Cubs, White , Sox]
1   Red Sox         Boston  MLB     [Red , Sox]
2   Blue Jay        Toronto MLB     [Blue , Jay]

which I can easily remove with

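# regex=True is required in recent pandas versions; runs of spaces are collapsed to a single space
df['teams'] = df['teams'].str.replace(r' +', ' ', regex=True)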

Could you please help me split these team names like this, using re.findall?

df
    teams           area    league  team
0   CubsWhite Sox   Chicago MLB     [Cubs, White Sox]
1   Red Sox         Boston  MLB     [Red Sox]
2   Blue Jay        Toronto MLB     [Blue Jay]

Thank you!

Question from: https://stackoverflow.com/questions/65835567/how-to-separate-a-string-with-2-uppercases-and-a-space-with-regex-in-pandas-data

1 Reply

You can use

df['team'] = df['teams'].str.findall(r'[A-Z][a-z]*(?:\s+[A-Z][a-z]*)?')

See the regex demo. Details:

  • [A-Z][a-z]* - an uppercase letter followed by zero or more lowercase letters
  • (?:\s+[A-Z][a-z]*)? - an optional non-capturing group that matches
    • \s+ - one or more whitespace characters
    • [A-Z][a-z]* - an uppercase letter followed by zero or more lowercase letters.

Pandas test:

>>> df['teams'].str.findall(r'[A-Z][a-z]*(?:\s+[A-Z][a-z]*)?')
0    [Cubs, White Sox]
1            [Red Sox]
2           [Blue Jay]
Name: teams, dtype: object
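
Since the question asks for re.findall specifically, here is a minimal sketch of the same pattern used with the re module directly. The helper name split_teams and the constant TEAM_PATTERN are made up for illustration; the pattern is the one from the answer, and whitespace is assumed to be matched with \s.

```python
import re

# Same pattern as above: a capitalized word, optionally followed by
# whitespace and a second capitalized word. The group is non-capturing,
# so findall returns whole matches rather than group contents.
TEAM_PATTERN = re.compile(r"[A-Z][a-z]*(?:\s+[A-Z][a-z]*)?")


def split_teams(text):
    """Return the team names found in a concatenated string (hypothetical helper)."""
    return TEAM_PATTERN.findall(text)


for name in ["CubsWhite Sox", "Red Sox", "Blue Jay"]:
    print(name, "->", split_teams(name))
```

The same helper could be applied to the column with df['teams'].apply(split_teams). Note that the pattern allows at most two words per name, so a three-word name such as 'Tampa Bay Rays' would come back as two items ('Tampa Bay' and 'Rays').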
