Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

0 votes
371 views
in Technique[技术] by (71.8m points)

python - Iterate over list of strings to pull out substrings

I have a long list of different strings that all contain some information about a specific port across the globe. However, each port name is different and is contained in a different location within the string. What I want to do is loop over all of the strings, find the word 'Port' and then store the next two substrings after 'Port'. For example:

'Strong winds may disrupt operations at the Port of Rotterdam on July 5'

I find 'Port' and now want 'of Rotterdam' to be added onto 'Port' as a complete string, like 'Port of Rotterdam'. I thought there could be some way to split up each longer string by doing parts = my_str.split(' '). Then:

for i in parts:
    if i == 'Port':
        new_str = i

However, I am not sure how to add on the next two substrings. Ideas?



1 Reply

0 votes
by (71.8m points)

Take a look at list.index, which returns the position of the first matching item and raises ValueError when there is none:

parts = my_str.split(' ')
try:
    port_index = parts.index('Port')
except ValueError:
    pass  # 'Port' not found in this string
else:
    # The slice end is exclusive: +3 keeps 'Port' plus the next two words
    port_name = ' '.join(parts[port_index:port_index + 3])
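
Since the question is about a whole list of strings, the same lookup can be wrapped in a helper and applied to each item. This is only a minimal sketch, assuming the name is always 'Port' plus the next two words; port_name_from and sentences are hypothetical names, not part of the original answer:

```python
def port_name_from(sentence):
    """Return 'Port' plus the next two words, or None if 'Port' is absent."""
    parts = sentence.split(' ')
    try:
        port_index = parts.index('Port')
    except ValueError:
        return None  # no standalone 'Port' word in this sentence
    # Slice end is exclusive, so +3 keeps 'Port' and the two words after it.
    return ' '.join(parts[port_index:port_index + 3])


# Hypothetical input list for illustration.
sentences = [
    'Strong winds may disrupt operations at the Port of Rotterdam on July 5',
    'No harbour is mentioned in this sentence',
]
port_names = [port_name_from(s) for s in sentences]
print(port_names)  # ['Port of Rotterdam', None]
```

Note that a fixed two-word window returns 'Port Said on' for a name like Port Said, which is why the more careful variants below look at capitalization instead.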

You can of course do more advanced processing. For example, grab a sequence of capitalized words, optionally preceded by a single 'of':

def find_name(sentence):
    """
    Get the port name or None.
    """
    parts = sentence.split(' ')
    try:
        start = parts.index('Port')
    except ValueError:
        return None
    else:
        if start == len(parts) - 1:
            return None

    end = start + 1
    if parts[end] == 'of':
        end = end + 1
    while end < len(parts) and parts[end][0].isupper():
        end += 1

    if end == start + 1 or (end == start + 2 and parts[start + 1] == 'of'):
        return None

    return ' '.join(parts[start:end])

You can do the same thing with a regular expression:

import re

pattern = re.compile(r'Port(?:\s+of)?(\s+[A-Z]\S+)+')
match = pattern.search(my_str)
if match:  # search() returns None when nothing matches
    print(match.group())

This regex only recognizes the ASCII capitals A-Z, so a name that starts with an accented or non-Latin uppercase letter will not match. For such port names you may want to investigate Unicode-aware matching.
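
One possible way around the A-Z limitation, using only the standard library, is to let the regex find 'Port' and the optional 'of', and then check capitalization in Python with str.isupper(), which understands non-ASCII letters. This is a sketch under my own assumptions, not the approach the answer links to; find_port_name and WORD are hypothetical names, and the added \b word boundaries are my choice:

```python
import re

# Any run of non-whitespace characters counts as a word here.
WORD = re.compile(r'\S+')


def find_port_name(sentence):
    """Return 'Port [of] Name ...' or None, accepting any Unicode capital."""
    head = re.search(r'\bPort\b(?:\s+of\b)?', sentence)
    if head is None:
        return None
    end = head.end()
    # Consume following words while their first character is uppercase
    # according to Unicode (str.isupper), not just A-Z.
    for word in WORD.finditer(sentence, head.end()):
        if not word.group()[0].isupper():
            break
        end = word.end()
    if end == head.end():
        return None  # nothing capitalized followed 'Port' / 'Port of'
    return sentence[head.start():end]


print(find_port_name('Delays expected at the Port of Ålesund on Monday'))
# Port of Ålesund
```

Like the original pattern, this keeps any punctuation attached to the last word, so a trailing period or comma would need to be stripped separately.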

Both of the solutions here will work correctly for the following three test cases:

'Strong winds may disrupt operations at the Port of Rotterdam on July 5'
'Strong winds may disrupt operations at the Port of Fos-sur-Mer on July 5'
'Strong winds may disrupt operations at Port Said on July 5'
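
A quick way to check that claim is to run both approaches over the three sentences. The snippet below is a condensed restatement of the two solutions (with the regex whitespace escapes written out), not a verbatim copy; the cases dictionary and the expected names are my reading of the intended results:

```python
import re


def find_name(sentence):
    """Word-splitting approach: return the port name or None."""
    parts = sentence.split(' ')
    try:
        start = parts.index('Port')
    except ValueError:
        return None
    end = start + 1
    # Skip a single optional 'of' directly after 'Port'.
    if end < len(parts) and parts[end] == 'of':
        end += 1
    name_start = end
    # Extend over consecutive capitalized words.
    while end < len(parts) and parts[end][:1].isupper():
        end += 1
    if end == name_start:
        return None  # no capitalized word followed 'Port' / 'Port of'
    return ' '.join(parts[start:end])


# Regex approach.
pattern = re.compile(r'Port(?:\s+of)?(\s+[A-Z]\S+)+')

# Assumed expected results for the three test sentences.
cases = {
    'Strong winds may disrupt operations at the Port of Rotterdam on July 5':
        'Port of Rotterdam',
    'Strong winds may disrupt operations at the Port of Fos-sur-Mer on July 5':
        'Port of Fos-sur-Mer',
    'Strong winds may disrupt operations at Port Said on July 5':
        'Port Said',
}

for sentence, expected in cases.items():
    match = pattern.search(sentence)
    regex_name = match.group() if match else None
    assert find_name(sentence) == expected
    assert regex_name == expected
print('all cases pass')
```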

You can likely improve the search further, but this should give you the tools to get a solid start. At some point, if the sentences become complex enough, you may want to use natural language processing of some kind. For example, look into the nltk package.

