Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Extracting specific values for a header in different lines using regex

I have a text string with multiple lines; each line contains a mix of letters, numbers, and spaces.

Here is what a couple of the lines look like:

WEIGHT                         VOLUME                    CHARGEABLE                PACKAGES
                                                             
398.000 KG                     4.999 M3                  833.500 KG                12 PLT
                                                                                         
MAWB                                    HAWB
    / MH616 /                                                                                         
8947806753                             ABC20018830
  

The output I am looking for is a dict with the above headers as keys and the values beneath them as values.

{
 "WEIGHT": "398.000 KG",
 "VOLUME": "4.999 M3",
 "CHARGEABLE": "833.500 KG",
 "PACKAGES": "12 PLT",
 "MAWB": "8947806753",
 "HAWB": "ABC20018830"
}

I am not sure how to fetch the value for a particular field when it sits on a different line. If the value is on the same line as the field name, I can fetch it with a pattern, but here the value is directly underneath the field name, on a later line.


1 Reply


You can use a regex to easily split the text into a list containing all the fields:

import re

# Triple quotes are required for a string literal that spans several lines
a = """WEIGHT                         VOLUME                    CHARGEABLE                PACKAGES
                                                                         398.000 KG                     4.999 M3                  833.500 KG                12 PLT
                                                                                         MAWB                                    HAWB
    / MH616 /                                                                                           8947806753                             ABC20018830
"""

# Split on 4 (or more) whitespace characters (leaves the units with the numbers);
# strip() first so the trailing newline does not end up in the last field
data = re.split(r'\s{4,}', a.strip())
print(data)

['WEIGHT', 'VOLUME', 'CHARGEABLE', 'PACKAGES', '398.000 KG', '4.999 M3', '833.500 KG', '12 PLT', 'MAWB', 'HAWB', '/ MH616 /', '8947806753', 'ABC20018830']

Since the keys and values are mixed in one flat list, there probably isn't an easy way to tell them apart automatically from the list alone. However, if they are always in the same positions, you can pick them out manually, e.g.:

b = {
    # WEIGHT
    data[0]: data[4],
    # VOLUME
    data[1]: data[5]
}
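
If the layout is regular, the pairing can also be done line by line instead of by fixed list positions. The following is only a sketch under two assumptions that do not come from the question: header names consist solely of uppercase letters, and each header row is followed (possibly after unrelated lines such as "/ MH616 /") by a value row with the same number of columns. The helper names split_fields, is_header and pair_headers_with_values are made up for this example.

```python
import re

text = """\
WEIGHT                         VOLUME                    CHARGEABLE                PACKAGES

398.000 KG                     4.999 M3                  833.500 KG                12 PLT

MAWB                                    HAWB
    / MH616 /
8947806753                             ABC20018830
"""

def split_fields(line):
    # Columns are separated by runs of 2+ whitespace characters,
    # so a single space (as in "398.000 KG") stays inside a field.
    return re.split(r"\s{2,}", line.strip())

def is_header(fields):
    # Assumption: a header row contains only uppercase-letter names.
    return all(re.fullmatch(r"[A-Z]+", f) for f in fields)

def pair_headers_with_values(text):
    # Drop blank lines, then split every remaining line into columns.
    rows = [split_fields(line) for line in text.splitlines() if line.strip()]
    result = {}
    for i, fields in enumerate(rows):
        if not is_header(fields):
            continue
        # Use the first following non-header row with the same column count.
        for candidate in rows[i + 1:]:
            if is_header(candidate):
                break  # reached the next header row without finding values
            if len(candidate) == len(fields):
                result.update(zip(fields, candidate))
                break
    return result

result = pair_headers_with_values(text)
print(result)
```

Rows that do not match the column count of the preceding header, such as "/ MH616 /", are skipped. If a real document has empty cells or values that are themselves all-uppercase words, this heuristic will mispair and a column-offset based approach would be safer.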


...