Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Pandas dataframe: split long regex into multiple lines

I'm using Pandas for some data cleanup, and I have a very long regex which I would like to split into multiple lines. The following works fine in Pandas because it is all on one line:

df['REMARKS'] = df['REMARKS'].replace(to_replace=r'(?=[^\])}]*([\[({]|$))(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL)(?:\s*(?:,\s*)?(?:(?:or|and)\s+)?(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL))*', value=r'<\g<0>>', regex=True)

However, it is difficult to manage. I've tried the following verbose method which works in regular Python:

df['REMARKS'] = df['REMARKS'].replace(to_replace=r"""(?=[^\])}]*([\[({]|$))
                                                      (?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL)
                                                      (?:\s*(?:,\s*)?(?:(?:or|and)\s+)?
                                                      (?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL))*""", value=r'<\g<0>>', regex=True)

This does not work in Pandas, though. Any ideas what I'm missing?

Here is some sample text for testing:

GR, MDT, CMR, HLDS, NEXT, NGI @ 25273, COMPTG

FIT 13.72 ON 9-7/8 LNR, LWD[GR,RES,APWD,SONVIS], MDTS (PRESS & SAMP) ROT SWC, TSTG BOP

LWD[GR,RES,APWD,SONVIS], GR, RES, NGI, PPC @ 31937, MDTS (PRESS & SAMP) TKG ROT SWC

LWD[GR,RES] @ 12586, IND, FDC, CNL, GR @ 12586, SWC, RAN CSG, PF 12240-12252, RR (ADDED INFO)

Thanks!



1 Reply


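One option is to split the pattern into a list of raw strings and join them when you call replace. Pandas compiles a string pattern without re.VERBOSE, so the newlines and indentation in the triple-quoted version are matched literally; joining the pieces keeps that whitespace out of the pattern: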

RegEx = [r'(?=[^\])}]*([\[({]|$))(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL)',
         r'(?:\s*(?:,\s*)?(?:(?:or|and)\s+)?',
         r'(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL))*']

df['REMARKS'] = df['REMARKS'].replace(to_replace=''.join(RegEx), value=r'<\g<0>>', regex=True)
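
A closely related variant, offered only as a sketch and not as part of the original answer: since the keyword alternation appears twice, it can be stored once and the pattern assembled from commented pieces with +. The names KEYWORDS and PATTERN are made up for illustration, and the check below uses plain re.sub on the first sample line rather than a DataFrame.

```python
import re

# Keyword alternation used in both halves of the pattern (illustrative name).
KEYWORDS = r'(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL)'

# The pieces are concatenated into one string, so no newlines or
# indentation end up inside the final pattern.
PATTERN = (
    r'(?=[^\])}]*([\[({]|$))'                # only match outside brackets
    + KEYWORDS                               # first keyword
    + r'(?:\s*(?:,\s*)?(?:(?:or|and)\s+)?'   # optional separator
    + KEYWORDS + r')*'                       # any further keywords
)

sample = 'GR, MDT, CMR, HLDS, NEXT, NGI @ 25273, COMPTG'
result = re.sub(PATTERN, r'<\g<0>>', sample)
print(result)  # <GR, MDT, CMR, HLDS, NEXT, NGI> @ 25273, COMPTG
```

The resulting string can be passed to replace as to_replace with regex=True, exactly like the joined list.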

Alternatively, compile the pattern yourself with re.VERBOSE and pass the compiled regex to replace:

import re

s = r"""(?=[^\])}]*([\[({]|$))(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL)
         (?:\s*(?:,\s*)?(?:(?:or|and)\s+)?
         (?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL))*"""

df['REMARKS'] = df['REMARKS'].replace(to_replace=re.compile(s, re.VERBOSE), value=r'<\g<0>>')
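
A third possibility, sketched here under a few assumptions: the inline flag (?x), placed at the very start of the pattern, switches on verbose mode from inside the string itself, so a plain string with regex=True should work without calling re.compile. The small DataFrame below reuses two of the sample remarks from the question; dtype=object is an assumption made to keep the column a plain object column, and the variable names are illustrative.

```python
import pandas as pd

# Two of the sample remarks from the question, kept as an object column.
df = pd.DataFrame({'REMARKS': [
    'GR, MDT, CMR, HLDS, NEXT, NGI @ 25273, COMPTG',
    'LWD[GR,RES] @ 12586, IND, FDC, CNL, GR @ 12586, SWC, RAN CSG, PF 12240-12252, RR (ADDED INFO)',
]}, dtype=object)

# (?x) must come first in the pattern. It enables verbose mode, so the
# newlines, indentation and trailing comments below are ignored.
verbose_pattern = r"""(?x)
    (?=[^\])}]*([\[({]|$))                                    # skip text inside brackets
    (?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL)     # first keyword
    (?:\s*(?:,\s*)?(?:(?:or|and)\s+)?                         # optional separator
    (?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL))*   # further keywords
"""

out = df['REMARKS'].replace(to_replace=verbose_pattern, value=r'<\g<0>>', regex=True)
print(out.tolist())

# Without the flag the same text is matched literally, whitespace included,
# so nothing in the column matches and it comes back unchanged.
plain_multiline = verbose_pattern.replace('(?x)', '', 1)
unchanged = df['REMARKS'].replace(to_replace=plain_multiline, value=r'<\g<0>>', regex=True)
print(unchanged.equals(df['REMARKS']))
```

The last two lines reproduce the behaviour described in the question: the multi-line pattern is valid, but without verbose mode it only matches text that contains the same newlines and spaces.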
