Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Pyparsing: Parsing semi-JSON nested plaintext data to a list

I have a bunch of nested data in a format that loosely resembles JSON:

company="My Company"
phone="555-5555"
people=
{
    person=
    {
        name="Bob"
        location="Seattle"
        settings=
        {
            size=1
            color="red"
        }
    }
    person=
    {
        name="Joe"
        location="Seattle"
        settings=
        {
            size=2
            color="blue"
        }
    }
}
places=
{
    ...
}

There are many different parameters with varying levels of depth--this is just a very small subset.

It may also be worth noting that a new sub-array always starts with an equals sign, followed by a line break, followed by the opening bracket (as seen above).

Is there any simple looping or recursion technique for converting this data to a system-friendly data format such as arrays or JSON? I want to avoid hard-coding the names of properties. I am looking for something that will work in Python, Java, or PHP. Pseudo-code is fine, too.

I appreciate any help.

EDIT: I discovered the Pyparsing library for Python and it looks like it could be a big help. I can't find any examples for how to use Pyparsing to parse nested structures of unknown depth. Can anyone shed light on Pyparsing in terms of the data I described above?

EDIT 2: Okay, here is a working solution in Pyparsing:

from pyparsing import (Dict, Forward, Group, OneOrMore, Regex, Suppress, Word,
                       ZeroOrMore, alphanums, alphas, quotedString, removeQuotes)

def parse_file(fileName):

    #get the input text file
    with open(fileName, "r") as file:
        inputText = file.read()

    #define the elements of our data pattern
    name = Word(alphas, alphanums+"_")
    EQ,LBRACE,RBRACE = map(Suppress, "={}")
    value = Forward() #this tells pyparsing that values can be recursive
    entry = Group(name + EQ + value) #this is the basic name-value pair

    #define data types that might be in the values
    real = Regex(r"[+-]?\d+\.\d*").setParseAction(lambda x: float(x[0]))
    integer = Regex(r"[+-]?\d+").setParseAction(lambda x: int(x[0]))
    quotedString.setParseAction(removeQuotes)

    #declare the overall structure of a nested data element
    struct = Dict(LBRACE + ZeroOrMore(entry) + RBRACE) #we will turn the output into a Dictionary

    #declare the types that might be contained in our data value - string, real, int, or the struct we declared
    value << (quotedString | struct | real | integer)

    #parse our input text and return it as a Dictionary
    result = Dict(OneOrMore(entry)).parseString(inputText)
    return result.dump()

This works, but when I try to write the results to a file with json.dump(result), the contents of the file are wrapped in double quotes. Also, there are \n characters between many of the data pairs. I tried suppressing them in the code above with LineEnd().suppress(), but I must not be using it correctly.



Okay, I came up with a final solution that actually transforms this data into a JSON-friendly dict, as I originally wanted. It first uses Pyparsing to convert the data into a series of nested lists, then loops through the lists and transforms them into JSON. This lets me work around the issue that Pyparsing's toDict() method could not handle an object that has two properties with the same name. To tell whether a list is a plain list or a property/value pair, the prependPropertyToken method adds the string __property__ in front of property names when Pyparsing detects them.

    def parse_file(self,fileName):
            
            #get the input text file
            file = open(fileName, "r")
            inputText = file.read()


            #define data types that might be in the values
            real = Regex(r"[+-]?\d+\.\d*").setParseAction(lambda x: float(x[0]))
            integer = Regex(r"[+-]?\d+").setParseAction(lambda x: int(x[0]))
            yes = CaselessKeyword("yes").setParseAction(replaceWith(True))
            no = CaselessKeyword("no").setParseAction(replaceWith(False))
            quotedString.setParseAction(removeQuotes)
            unquotedString = Word(alphanums+"_-?\"")
            comment = Suppress("#") + Suppress(restOfLine)
            EQ,LBRACE,RBRACE = map(Suppress, "={}")
            
            data = (real | integer | yes | no | quotedString | unquotedString)
            
            #define structures
            value = Forward()
            object = Forward() 
            
            dataList = Group(OneOrMore(data))
            simpleArray = (LBRACE + dataList + RBRACE)
            
            propertyName = Word(alphanums+"_-.").setParseAction(self.prependPropertyToken)
            property = dictOf(propertyName + EQ, value)
            properties = Dict(property)
            
            object << (LBRACE + properties + RBRACE)
            value << (data | object | simpleArray)
            
            dataset = properties.ignore(comment)
            
            #parse it
            result = dataset.parseString(inputText)
            
            #turn it into a JSON-like object
            dict = self.convert_to_dict(result.asList())
            return json.dumps(dict)
            
    
    
    def convert_to_dict(self, inputList):
            dict = {}
            for item in inputList:
                    #determine the key and value to be inserted into the dict
                    dictval = None
                    key = None
                    
                    if isinstance(item, list):
                            try:
                                    key = item[0].replace("__property__","")
                                    if isinstance(item[1], list):
                                            try:
                                                    if item[1][0].startswith("__property__"):
                                                            dictval = self.convert_to_dict(item)
                                                    else:
                                                            dictval = item[1]
                                            except AttributeError:
                                                    dictval = item[1]
                                    else:
                                            dictval = item[1]
                            except IndexError:
                                    dictval = None
                    #determine whether to insert the value into the key or to merge the value with existing values at this key
                    if key:
                            if key in dict:
                                    if isinstance(dict[key], list):
                                            dict[key].append(dictval)
                                    else:
                                            old = dict[key]
                                            new = [old]
                                            new.append(dictval)
                                            dict[key] = new
                            else:
                                    dict[key] = dictval
            return dict

    
                    
    def prependPropertyToken(self,t):
            return "__property__" + t[0]


1 Reply


Parsing an arbitrarily nested structure can be done with pyparsing by using the Forward class to define a placeholder for the nested part. In this case, you are just parsing simple name-value pairs, where the value could itself be a nested structure containing name-value pairs.

name :: word of alphanumeric characters
entry :: name '=' value
struct :: '{' entry* '}'
value :: real | integer | quotedstring | struct

This translates to pyparsing almost verbatim. To define value, which can recursively contain values, we first create a Forward() placeholder, which can be used as part of the definition of entry. Then once we have defined all the possible types of values, we use the '<<' operator to insert this definition into the value expression:

EQ,LBRACE,RBRACE = map(Suppress,"={}")

name = Word(alphas, alphanums+"_")
value = Forward()
entry = Group(name + EQ + value)

real = Regex(r"[+-]?\d+\.\d*").setParseAction(lambda x: float(x[0]))
integer = Regex(r"[+-]?\d+").setParseAction(lambda x: int(x[0]))
quotedString.setParseAction(removeQuotes)

struct = Group(LBRACE + ZeroOrMore(entry) + RBRACE)
value << (quotedString | struct | real | integer)

The parse actions on real and integer will convert these elements from strings to float or ints at parse time, so that the values can be used as their actual types immediately after parsing (no need to post-process to do string-to-other-type conversion).

Your sample is a collection of one or more entries, so we use that to parse the total input:

result = OneOrMore(entry).parseString(sample)
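
Put together as a self-contained script, the pieces above might look like the following sketch. The short sample text and the name string_value are made up for illustration, and a copy of quotedString is used so that the shared pyparsing object is not modified; neither choice comes from the original answer.

```python
from pyparsing import (Forward, Group, OneOrMore, Regex, Suppress, Word,
                       ZeroOrMore, alphanums, alphas, quotedString, removeQuotes)

EQ, LBRACE, RBRACE = map(Suppress, "={}")

name = Word(alphas, alphanums + "_")
value = Forward()                      # placeholder for the recursive part
entry = Group(name + EQ + value)

# Parse actions convert the matched text to float/int at parse time.
real = Regex(r"[+-]?\d+\.\d*").setParseAction(lambda t: float(t[0]))
integer = Regex(r"[+-]?\d+").setParseAction(lambda t: int(t[0]))
# string_value is an illustrative name: a private copy of quotedString.
string_value = quotedString.copy().setParseAction(removeQuotes)

struct = Group(LBRACE + ZeroOrMore(entry) + RBRACE)
value << (string_value | struct | real | integer)

# Hypothetical sample in the same format as the question's data.
sample = '''
name="Bob"
ratio=0.5
settings=
{
    size=1
    color="red"
}
'''

parsed = OneOrMore(entry).parseString(sample).asList()
print(parsed)
# [['name', 'Bob'], ['ratio', 0.5], ['settings', [['size', 1], ['color', 'red']]]]
```

Because real requires a decimal point, a bare 1 falls through to integer and comes back as an int rather than a float.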

We can access the parsed data as a nested list, but it is not so pretty to display. This code uses pprint to pretty-print a formatted nested list:

from pprint import pprint
pprint(result.asList())

Giving:

[['company', 'My Company'],
 ['phone', '555-5555'],
 ['people',
  [['person',
    [['name', 'Bob'],
     ['location', 'Seattle'],
     ['settings', [['size', 1], ['color', 'red']]]]],
   ['person',
    [['name', 'Joe'],
     ['location', 'Seattle'],
     ['settings', [['size', 2], ['color', 'blue']]]]]]]]

Notice that all the strings are just strings with no enclosing quotation marks, and the ints are actual ints.
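
If JSON is the end goal, a nested list in this shape can be folded into a dict with a few lines of plain Python. The sketch below is not part of pyparsing; pairs_to_dict is a hypothetical helper, and it assumes every nested list is a list of [name, value] pairs, which holds for the grammar above. Repeated names such as 'person' are collected into a list so that none of them is lost.

```python
import json

def pairs_to_dict(pairs):
    """Fold a list of [name, value] pairs into a dict.

    Nested lists are converted recursively; a name that occurs more
    than once ends up holding a list of all its values.
    """
    out = {}
    for name, value in pairs:
        if isinstance(value, list):
            value = pairs_to_dict(value)
        if name in out:
            # Second or later occurrence: promote to a list, then append.
            if not isinstance(out[name], list):
                out[name] = [out[name]]
            out[name].append(value)
        else:
            out[name] = value
    return out

# The nested list printed above, written out as a literal.
nested = [['company', 'My Company'],
          ['phone', '555-5555'],
          ['people',
           [['person',
             [['name', 'Bob'],
              ['location', 'Seattle'],
              ['settings', [['size', 1], ['color', 'red']]]]],
            ['person',
             [['name', 'Joe'],
              ['location', 'Seattle'],
              ['settings', [['size', 2], ['color', 'blue']]]]]]]]

as_dict = pairs_to_dict(nested)
print(json.dumps(as_dict, indent=2))
```

One consequence of this design is that a name holds a single value when it occurs once and a list when it occurs several times, so code that reads the result has to handle both cases.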

We can do just a little better than this, by recognizing that the entry format actually defines a name-value pair suitable for accessing like a Python dict. Our parser can do this with just a few minor changes:

Change the struct definition to:

struct = Dict(LBRACE + ZeroOrMore(entry) + RBRACE)

and the overall parser to:

result = Dict(OneOrMore(entry)).parseString(sample)

The Dict class treats the parsed contents as a name followed by a value, which can be done recursively. With these changes, we can now access the data in result like elements in a dict:

print(result['phone'])

or like attributes in an object:

print(result.company)
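
As a runnable illustration of both access styles, here is a minimal sketch of the Dict variant. The sample text is hypothetical and deliberately has no repeated names, and string_value is an illustrative copy of quotedString rather than something from the original answer.

```python
from pyparsing import (Dict, Forward, Group, OneOrMore, Regex, Suppress, Word,
                       ZeroOrMore, alphanums, alphas, quotedString, removeQuotes)

EQ, LBRACE, RBRACE = map(Suppress, "={}")

name = Word(alphas, alphanums + "_")
value = Forward()
entry = Group(name + EQ + value)

real = Regex(r"[+-]?\d+\.\d*").setParseAction(lambda t: float(t[0]))
integer = Regex(r"[+-]?\d+").setParseAction(lambda t: int(t[0]))
string_value = quotedString.copy().setParseAction(removeQuotes)

# Dict instead of Group: each entry's name becomes a key in the results.
struct = Dict(LBRACE + ZeroOrMore(entry) + RBRACE)
value << (string_value | struct | real | integer)

# Hypothetical sample without repeated names.
sample = '''
company="My Company"
phone="555-5555"
settings=
{
    size=1
    color="red"
}
'''

result = Dict(OneOrMore(entry)).parseString(sample)

print(result["phone"])          # key-style access
print(result.company)           # attribute-style access
print(result.settings.size)     # nested structures work the same way
```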

Use the dump() method to view the contents of a structure or substructure:

for person in result.people:
    print(person.dump())
    print()

prints:

['person', ['name', 'Bob'], ['location', 'Seattle'], ['settings', ['size', 1], ['color', 'red']]]
- location: Seattle
- name: Bob
- settings: [['size', 1], ['color', 'red']]
  - color: red
  - size: 1

['person', ['name', 'Joe'], ['location', 'Seattle'], ['settings', ['size', 2], ['color', 'blue']]]
- location: Seattle
- name: Joe
- settings: [['size', 2], ['color', 'blue']]
  - color: blue
  - size: 2
