Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python 3.x - Parsing XML with invalid nodes

I am parsing a very large XML file. When a node fails to parse, I want to keep looping and process the remaining nodes.

version 1

for event, element in etree.iterparse(file):
    if element.tag == "tag1":
        pass  # Doing some stuff

With the first version I get an exception:

ParseError: not well-formed (invalid token): line 319851

So, in order to process the remaining nodes, I wrote a second version:

version 2

xml_parser = etree.iterparse(file)

while True:
    try:
        event, element = next(xml_parser)

        if element.tag == "tag1":
            pass  # Doing some stuff
    # If there are no more elements to iterate, break the loop
    except StopIteration:
        break
    # On any other exception, keep looping
    except Exception as e:
        pass

In that case the script enters an infinite loop.

So I tried going to the specific line by opening the XML as a text file:

with open(file) as fp:
    for i, line in enumerate(fp):
        # Print the lines around the reported error position
        if 319850 <= i <= 319853:
            print(i, line)
        if i == 319853:
            break

I get:

319850    <tag1> <tag11><![CDATA[ foo bar

319851    ]]></tag11></tag1>

319852    <tag1> <tag11><![CDATA[ foo bar]]></tag11></tag1>

319853    <tag1> <tag11><![CDATA[ foo bar]]></tag11></tag1>

So it seems that the line is split by a line break inside the CDATA section. I assume that is an XML error, but why does my second version not work? In my second version, lines 319850 and 319851 are not valid XML, so they should be skipped and the loop should move on to the next nodes/lines.

What am I doing wrong here? If you have a better approach, please let me know.

UPDATE

The XML file contains an invalid character, 0x0B (vertical tab), shown as 'x0b' below. So it looks like:

<tag1> <tag11><![CDATA[ foo bar 'x0b']]></tag11></tag1>
<tag1> <tag11><![CDATA[ foo bar]]></tag11></tag1>
<tag1> <tag11><![CDATA[ foo bar]]></tag11></tag1>
1 Reply

I took the lines that seem to be causing trouble and put them into a slightly bigger XML file for testing. This is it.

<whole>
<tag1>
<tag11>one</tag11>
<tag11><![CDATA[ foo bar
]]></tag11>
<tag11>two</tag11>
<tag11>three</tag11>
</tag1>
<tag1> <tag11><![CDATA[ foo bar
]]></tag11></tag1>
<tag1> <tag11><![CDATA[ foo bar]]></tag11></tag1>
<tag1> <tag11><![CDATA[ foo bar]]></tag11></tag1>
<tag1>
<tag11>three</tag11>
<tag11>four</tag11>
<tag11>five</tag11>
<tag11>six</tag11>
</tag1>
</whole>

Then I ran the following code; its output is shown right after the loop.

>>> import os
>>> os.chdir('c:/scratch')
>>> from lxml import etree
>>> context = etree.iterparse('temp.xml')
>>> for action, elem in context:
...     print (action, elem.tag, elem.sourceline)
...     
end tag11 3
end tag11 4
end tag11 6
end tag11 7
end tag1 2
end tag11 9
end tag1 9
end tag11 11
end tag1 11
end tag11 12
end tag1 12
end tag11 14
end tag11 15
end tag11 16
end tag11 17
end tag1 13
end whole 1

In short, there seems to be nothing wrong with those lines.

You could try printing the line number at which each tag was found (elem.sourceline, as in the code above) to narrow down where in the XML the trouble starts. (This is an edit based on knowledge that I have newly acquired on SO.)
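If the parser in the question is the standard library's xml.etree.ElementTree (an assumption, based on the wording of the ParseError message), elements have no sourceline attribute, but the exception itself carries a position. The following is a minimal sketch of that idea, not a definitive implementation; the function name locate_parse_error and the sample data are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def locate_parse_error(source):
    """Parse `source` and report where parsing stopped.

    Returns (number_of_elements_parsed, position), where position is the
    (line, column) tuple from ParseError, or None if no error occurred.
    """
    parsed = 0
    try:
        for _event, elem in ET.iterparse(source):
            parsed += 1
            elem.clear()  # keep memory use low on very large files
    except ET.ParseError as err:
        # err.position is a (line, column) tuple set by the parser
        return parsed, err.position
    return parsed, None


# Hypothetical sample: the third line contains a vertical tab (0x0B),
# which is not allowed in XML 1.0.
sample = (
    b"<whole>\n"
    b"<tag1><tag11>ok</tag11></tag1>\n"
    b"<tag1><tag11>bad \x0b</tag11></tag1>\n"
    b"</whole>"
)
parsed, position = locate_parse_error(io.BytesIO(sample))
print(parsed, position)
```

The try block wraps the whole for loop, so the loop ends on the first error instead of retrying a parser that can no longer make progress.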

I would also suggest using the plain "for action, elem in context:" loop shown in the documentation, rather than calling next() inside a "while True" loop, to avoid the infinite loop. That is what I did in the code above.
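Given the invalid 0x0B character mentioned in the question's update, one possible approach is to strip forbidden control bytes while reading, so that the plain for loop can run over every node. The sketch below assumes the standard library's xml.etree.ElementTree and an ASCII-compatible encoding such as UTF-8; CleanedReader and tag1_texts are hypothetical names introduced for illustration, and dropping bytes silently changes the data, which may or may not be acceptable.

```python
import io
import re
import xml.etree.ElementTree as ET

# Control bytes that XML 1.0 does not allow: everything below 0x20
# except tab (0x09), line feed (0x0A) and carriage return (0x0D).
INVALID_XML_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


class CleanedReader:
    """Hypothetical wrapper that strips invalid control bytes on read."""

    def __init__(self, raw):
        self._raw = raw  # any binary file-like object

    def read(self, size=-1):
        while True:
            chunk = self._raw.read(size)
            if not chunk:
                return chunk  # real end of input
            cleaned = INVALID_XML_BYTES.sub(b"", chunk)
            if cleaned:
                return cleaned  # never signal EOF before the real end


def tag1_texts(raw):
    """Yield the text content of every <tag1>, using a plain for loop."""
    for _event, elem in ET.iterparse(CleanedReader(raw)):
        if elem.tag == "tag1":
            yield "".join(elem.itertext())
            elem.clear()  # free memory; matters for very large files


# Sample modelled on the lines shown in the question's update.
sample = (
    b"<whole>\n"
    b"<tag1> <tag11><![CDATA[ foo bar \x0b]]></tag11></tag1>\n"
    b"<tag1> <tag11><![CDATA[ foo bar]]></tag11></tag1>\n"
    b"<tag1> <tag11><![CDATA[ foo bar]]></tag11></tag1>\n"
    b"</whole>"
)
texts = list(tag1_texts(io.BytesIO(sample)))
print(len(texts))
```

For a real file, something like open(path, "rb") could be passed in place of io.BytesIO(sample). If the answer's lxml is used instead, its parsers also offer a recover option, which is a different trade-off and is not shown here.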

