python - How to extract specific content in a pandas dataframe with a regex?

Question

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

Consider the following pandas dataframe:

In [114]:

df['movie_title'].head()

Out[114]:

0     Toy Story (1995)
1     GoldenEye (1995)
2    Four Rooms (1995)
3    Get Shorty (1995)
4       Copycat (1995)
...
Name: movie_title, dtype: object

Update: I would like to extract just the titles of the movies with a regular expression. So, let's use the following regex: ([^\d\W]+). I tried the following:

df_3['movie_title'] = df_3['movie_title'].str.extract(r'([^\d\W]+)')
df_3['movie_title']

However, I get the following:

0       NaN
1       NaN
2       NaN
3       NaN
4       NaN
5       NaN
6       NaN
7       NaN
8       NaN

Any idea how to extract specific features from text in a pandas dataframe? More specifically, how can I extract just the titles of the movies into a completely new dataframe? For instance, the desired output should be:

Out[114]:

0     Toy Story
1     GoldenEye
2    Four Rooms
3    Get Shorty
4       Copycat
...
Name: movie_title, dtype: object

1 Reply

深蓝 · Answer 1 · 2021-10-23T18:42:31+0000

You can try str.extract and strip, but str.split is better, because movie names can contain numbers too. Another solution is to remove the content in parentheses with a regex via str.replace and then strip leading and trailing whitespace:

# convert the column to strings
df['movie_title'] = df['movie_title'].astype(str)

# letters and spaces only: this also drops numbers that are part of a title
df['titles'] = df['movie_title'].str.extract('([a-zA-Z ]+)', expand=False).str.strip()
# split on the first '(' and keep the part before it
df['titles1'] = df['movie_title'].str.split('(', n=1).str[0].str.strip()
# remove the parenthesized part with a regex
df['titles2'] = df['movie_title'].str.replace(r'\([^)]*\)', '', regex=True).str.strip()
print(df)
          movie_title      titles      titles1      titles2
0  Toy Story 2 (1995)   Toy Story  Toy Story 2  Toy Story 2
1    GoldenEye (1995)   GoldenEye    GoldenEye    GoldenEye
2   Four Rooms (1995)  Four Rooms   Four Rooms   Four Rooms
3   Get Shorty (1995)  Get Shorty   Get Shorty   Get Shorty
4      Copycat (1995)     Copycat      Copycat      Copycat
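
If you would rather stay with str.extract, one possible variation (not part of the solutions above) is to anchor the pattern on the trailing year, so that digits inside the title survive. This is only a sketch: it assumes every value ends with a four-digit year in parentheses, and the group names title and year, as well as the variable parts, are made up for the example. Rows that do not match would come back as NaN.

```python
import pandas as pd

# Sample data matching the output shown above.
df = pd.DataFrame({
    "movie_title": [
        "Toy Story 2 (1995)",
        "GoldenEye (1995)",
        "Four Rooms (1995)",
        "Get Shorty (1995)",
        "Copycat (1995)",
    ]
})

# Lazy title group, followed by a four-digit year in parentheses
# at the very end of the string.
pattern = r"^(?P<title>.*?)\s*\((?P<year>\d{4})\)\s*$"

# With named groups, str.extract returns a new DataFrame
# with one column per group.
parts = df["movie_title"].str.extract(pattern)

# Safe here because every row matches; NaN rows would make this cast fail.
parts["year"] = parts["year"].astype(int)

print(parts)
```

Because str.extract returns a separate DataFrame here, this also covers the request to get the titles into a completely new dataframe without overwriting movie_title.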
