python - xml.etree.ElementTree.ParseError: not well-formed

posted Jan 31, 2022 in Technique[技术] by 深蓝 (71.8m points)

I have the following code:

from xml.etree import ElementTree

file_path = 'some_file_path'

document = ElementTree.parse(file_path, ElementTree.XMLParser(encoding='utf-8'))

If my XML looks like the following it gives me the error: "xml.etree.ElementTree.ParseError: not well-formed"

<?xml version="1.0" encoding="utf-8" ?>
<pages>
<page id="1">
<textbox id="0">
<textline bbox="53.999,778.980,130.925,789.888">
<text font="GCCBBY+TT228t00" bbox="60.598,778.980,64.594,789.888" size="10.908">H</text>
<text font="GCCBBY+TT228t00" bbox="64.558,778.980,70.558,789.888" size="10.908">-</text>
<text>
</text>
</textline>
</textbox>
</page>
</pages>

In Sublime Text or Notepad++ I see highlighted characters such as ACK, DC4, or STX, which seem to be the culprit (one of them appears as a "-" in the second "text" node of the XML above). If I remove these characters, parsing works. What are these characters, and how can I fix this?

1 Reply

深蓝 · Answer 1 · 2022-01-31T07:21:35+0000

I ran your code as follows, and it works fine:

from xml.etree import ElementTree
from io import StringIO  # Python 3; the separate StringIO module only exists in Python 2


xml_content = """<?xml version="1.0" encoding="utf-8" ?>
<pages>
<page id="1">
<textbox id="0">
<textline bbox="53.999,778.980,130.925,789.888">
<text font="GCCBBY+TT228t00" bbox="60.598,778.980,64.594,789.888" size="10.908">H</text>
<text font="GCCBBY+TT228t00" bbox="64.558,778.980,70.558,789.888" size="10.908">-</text>
<text>
</text>
</textline>
</textbox>
</page>
</pages>"""

print("parsing xml document")
# using StringIO to simulate reading from file  
document = ElementTree.parse(StringIO(xml_content), ElementTree.XMLParser(encoding='utf-8')) 

for elem in document.iter():
  print(elem.tag)

And the output is as expected:

parsing xml document
pages
page
textbox
textline
text
text
text

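So the XML as pasted here is well-formed, and the problem is in the file itself. The ACK, DC4, and STX characters that your editor highlights are ASCII control characters, which XML 1.0 does not allow, so the parser rejects them. They were lost when you copied the content out of Notepad++ (one of them shows up as "-" in your snippet), which is why the pasted version parses. Remove or replace them before parsing, or fix whatever tool produces the file.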
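If the control characters really are in the file on disk, one option is to strip them before handing the data to the parser. The following is only a minimal sketch: it assumes the file is UTF-8 (as its XML declaration says) and that silently dropping those characters is acceptable for your data. The helper name parse_without_control_chars and the regex are made up for illustration and are not part of ElementTree.

```python
import re
from xml.etree import ElementTree

# Bytes that XML 1.0 forbids: every C0 control character except
# tab (0x09), line feed (0x0A) and carriage return (0x0D).
# In UTF-8 these byte values never occur inside a multi-byte sequence,
# so removing them at the byte level is safe under that assumption.
_INVALID_XML_BYTES = re.compile(rb'[\x00-\x08\x0b\x0c\x0e-\x1f]')


def parse_without_control_chars(file_path):
    """Hypothetical helper: read the file, drop forbidden bytes, parse."""
    with open(file_path, 'rb') as f:
        raw = f.read()
    cleaned = _INVALID_XML_BYTES.sub(b'', raw)
    root = ElementTree.fromstring(cleaned)
    # Wrap the root so the result has the same type as ElementTree.parse()
    return ElementTree.ElementTree(root)
```

It would be called as document = parse_without_control_chars(file_path) in place of the ElementTree.parse(...) line from the question. If the characters carry meaning (for example, a PDF extractor emitting glyphs it could not map), replacing them with a placeholder instead of deleting them may be the better choice.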
