Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Splitting a string via Regex and Maxsplit returns multiple splits

I have a list of strings taken from chat log data, and I am trying to find the optimal method of splitting the speaker from the content of speech. Two examples are as follows:

mystr = ['bob123 (5:09:49 PM): hi how are you', 
         'jane_r16 (12/01/2020 1:39:12 A.M.) : What day is it today?']

Note that, while they are broadly similar, there are some stylistic differences I need to account for (inclusion of dates, period marks, extra spaces etc.). I require a way to standardize and split such strings, and others like these, into something like the following list:

mystrList = [['bob123','hi how are you'],['jane_r16','What day is it today']]

Given that I do not need the times, numbers, or most punctuation, I thought a reasonable first step would be to remove anything non-essential. After doing so, I have the following:

myCleanstr = ['bob(): hi how are you','janer() : What day is it today?']

Doing this leaves a fairly distinctive character sequence in each string, (): (possibly with spaces before the colon), that is unlikely to appear elsewhere in the same string. My next thought was to use this as a delimiter and split each string with a regex:

mystr_split = [re.split(r'\(\)( ){,2}:', i, maxsplit=1, flags=re.I) for i in myCleanstr]

Here, my intention was the following:

  • \(\) Find an opening parenthesis followed by a closing parenthesis
  • ( ){,2} Then find zero, one, or two spaces
  • : Then find a colon

However, in both instances, I receive three objects per string. I get the correct speaker ID, and speech content. But, in the first string I get an additional NoneType Object, and in the second string I get an additional string filled with a single white-space.

I had assumed that including maxsplit=1 would mean that the process would end after the first split has been found, but this doesn't appear to be the case. Rather than filter my results on the content I need I would like to understand why it is performing as it is.



1 Reply


You can use

^(\S+)\s*\([^()]*\)\s*:\s*(.+)

Or, if the name can have whitespaces:

^(\S[^(]*?)\s*\([^()]*\)\s*:\s*(.+)

See the regex demo #1 and regex demo #2. The regex matches:

  • ^ - start of string
  • (\S+) - Group 1: one or more non-whitespace chars (in the second pattern, Group 1 is (\S[^(]*?) instead: one non-whitespace char followed by the part below)
  • [^(]*? - zero or more chars other than a ( char, as few as possible (second pattern only)
  • \s* - zero or more whitespaces
  • \( - a ( char
  • [^()]* - zero or more chars other than ( and )
  • \) - a ) char
  • \s*:\s* - a colon enclosed with zero or more whitespaces
  • (.+) - Group 2: any one or more chars other than line break chars, as many as possible (the whole rest of the line).
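
As an illustration of the second pattern, here is a minimal sketch that writes it out in verbose mode so each piece from the list above sits next to its comment. The name SPEAKER_LINE and the sample line with a two-word speaker name are made up for this example and are not part of the original data.

```python
import re

# Hypothetical name; the same regex as the second pattern, laid out with re.VERBOSE
SPEAKER_LINE = re.compile(r"""
    ^(\S[^(]*?)   # Group 1: speaker name, may contain spaces, matched lazily
    \s*           # optional whitespace before the timestamp
    \([^()]*\)    # the parenthesised timestamp, matched but not captured
    \s*:\s*       # colon with optional whitespace on either side
    (.+)          # Group 2: the rest of the line
""", re.VERBOSE)

# Made-up input line, used only to show a name containing a space
m = SPEAKER_LINE.search('Bob Smith (5:09:49 PM): hi how are you')
print(m.groups())
# => ('Bob Smith', 'hi how are you')
```

Because Group 1 is lazy and followed by \s*, trailing spaces before the opening parenthesis are not included in the captured name.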

See the Python demo:

import re

mystr = ['bob123 (5:09:49 PM): hi how are you',
         'jane_r16 (12/01/2020 1:39:12 A.M.) : What day is it today?']

result = []
for s in mystr:
    # Group 1 is the speaker, group 2 the message; the parenthesised timestamp is skipped
    m = re.search(r'^(\S+)\s*\([^()]*\)\s*:\s*(.+)', s)
    if m:
        result.append(list(m.groups()))
print(result)
# => [['bob123', 'hi how are you'], ['jane_r16', 'What day is it today?']]
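
As for why the original re.split call returned three items: maxsplit=1 does limit the work to one split, but re.split also returns the text of every capturing group in the pattern, and ( ) is a capturing group. When the group matches nothing, its slot is None; when it matches a space, its slot is that space. The sketch below assumes the cleaned strings from the question and compares the capturing pattern with one that has no group; the variable names are illustrative only.

```python
import re

clean = ['bob(): hi how are you', 'janer() : What day is it today?']

# Capturing group "( )": re.split inserts the group's text (or None) into the result
with_group = [re.split(r'\(\)( ){,2}:', s, maxsplit=1) for s in clean]
print(with_group)
# => [['bob', None, ' hi how are you'], ['janer', ' ', ' What day is it today?']]

# No capturing group: only the text on either side of the delimiter is returned.
# The trailing \s* also consumes the space after the colon.
no_group = [re.split(r'\(\) {,2}:\s*', s, maxsplit=1) for s in clean]
print(no_group)
# => [['bob', 'hi how are you'], ['janer', 'What day is it today?']]
```

If grouping is still needed, a non-capturing group such as (?: ){,2} behaves the same way as the second pattern here.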
