I am parsing a very large XML file. When a node fails to parse, I want to keep looping and process the remaining nodes.
version 1
    for event, element in etree.iterparse(file):
        if element.tag == "tag1":
            pass  # Doing some stuff
With the first version I get an exception:
ParseError: not well-formed (invalid token): line 319851
So, in order to process the remaining nodes, I wrote a second version:
version 2
    xml_parser = etree.iterparse(file)
    while True:
        try:
            event, element = next(xml_parser)
            if element.tag == "tag1":
                pass  # Doing some stuff
        # If there are no more elements to iterate, break the loop
        except StopIteration:
            break
        # On any other exception, keep looping
        except Exception as e:
            pass
In that case the script enters an infinite loop.
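For context, a well-formedness error is fatal in XML, so a parser is not expected to resume at the next node once it has raised. The following is only a minimal sketch, assuming the standard library's xml.etree.ElementTree (the library actually in use is not stated here, and lxml may behave differently). The helper name parse_until_error is hypothetical. It catches ParseError explicitly, keeps what was parsed before the error, and returns the error position instead of retrying:

```python
import io
import xml.etree.ElementTree as etree  # assumption: stdlib ElementTree


def parse_until_error(source):
    """Hypothetical helper: collect <tag11> texts until the parser fails.

    Returns (texts, position), where position is None on success or the
    (line, column) reported by ParseError when the input is not well-formed.
    """
    texts = []
    try:
        for event, element in etree.iterparse(source):
            if element.tag == "tag1":
                texts.append(element.findtext("tag11"))
    except etree.ParseError as exc:
        # The parser cannot continue past this point, so stop here
        # instead of calling next() again.
        return texts, exc.position
    return texts, None


# Small in-memory document with an invalid character on its second line.
sample = (
    b"<root><tag1><tag11>ok</tag11></tag1>\n"
    b"<tag1><tag11>bad \x0b</tag11></tag1></root>"
)
texts, position = parse_until_error(io.BytesIO(sample))
print(texts, position)
```

Catching the specific exception and stopping avoids the situation where every further call to next() fails again and the loop never ends.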
So I tried to inspect the specific lines by opening the XML as a text file:
    with open(file) as fp:
        for i, line in enumerate(fp):
            if i == 319850:
                print(319850, line)
            if i == 319851:
                print(319851, line)
            if i == 319852:
                print(319852, line)
            if i == 319853:
                print(319853, line)
                break
I get:
319850 <tag1> <tag11><![CDATA[ foo bar
319851 ]]></tag11></tag1>
319852 <tag1> <tag11><![CDATA[ foo bar]]></tag11></tag1>
319853 <tag1> <tag11><![CDATA[ foo bar]]></tag11></tag1>
So it seems that the line is cut by a "\n". That is an XML error, but why does my second version not work? In my second version, lines 319850 and 319851 are not valid XML, so they should be skipped and the loop should move on to the next nodes/lines.
What am I doing wrong here?
If you have a better approach, please let me know.
UPDATE
The XML file contains an invalid character, '\x0b'. So it looks like:
<tag1> <tag11><![CDATA[ foo bar '\x0b']]></tag11></tag1>
<tag1> <tag11><![CDATA[ foo bar]]></tag11></tag1>
<tag1> <tag11><![CDATA[ foo bar]]></tag11></tag1>
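Since the cause is a control character that XML 1.0 does not allow, one possible approach is to strip such characters before the bytes reach the parser. The sketch below is only an illustration under stated assumptions: it uses the standard library's xml.etree.ElementTree, it assumes an ASCII-compatible encoding such as UTF-8, and the names CleanedReader and iter_tag11_texts are hypothetical. The file is still read in chunks, so it does not need to fit in memory.

```python
import io
import re
import xml.etree.ElementTree as etree  # assumption: stdlib ElementTree

# Control characters forbidden in XML 1.0 (tab, LF and CR remain allowed).
# Assumes an ASCII-compatible encoding such as UTF-8.
INVALID_XML_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


class CleanedReader:
    """Hypothetical file-like wrapper that drops invalid bytes from each chunk."""

    def __init__(self, fp):
        self._fp = fp

    def read(self, size=-1):
        while True:
            chunk = self._fp.read(size)
            cleaned = INVALID_XML_BYTES.sub(b"", chunk)
            # Return on real end of file, or once some valid data is left;
            # an empty result would otherwise look like end of file.
            if cleaned or not chunk:
                return cleaned


def iter_tag11_texts(fp):
    """Yield the <tag11> text of every <tag1> element in a binary file object."""
    for event, element in etree.iterparse(CleanedReader(fp)):
        if element.tag == "tag1":
            yield element.findtext("tag11")
            element.clear()  # free memory on very large files


# In-memory stand-in for the real file, which would be opened with open(path, "rb").
sample = (
    b"<root>"
    b"<tag1> <tag11><![CDATA[ foo bar \x0b]]></tag11></tag1>"
    b"<tag1> <tag11><![CDATA[ foo bar]]></tag11></tag1>"
    b"</root>"
)
for text in iter_tag11_texts(io.BytesIO(sample)):
    print(repr(text))
```

Filtering at the byte level keeps the parsing loop unchanged. The trade-off is that the invalid characters are silently discarded, which may or may not be acceptable for the data.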