Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
652 views
in Technique[技术] by (71.8m points)

python - Remove whitespaces in XML string

How can I remove the whitespaces and line breaks in an XML string in Python 2.6? I tried the following packages:

etree: This snippet keeps the original whitespaces:

xmlStr = '''<root>
    <head></head>
    <content></content>
</root>'''

xmlElement = xml.etree.ElementTree.XML(xmlStr)
xmlStr = xml.etree.ElementTree.tostring(xmlElement, 'UTF-8')
print xmlStr

I can not use Python 2.7 which would provide the method parameter.

minidom: just the same:

xmlDocument = xml.dom.minidom.parseString(xmlStr)
xmlStr = xmlDocument.toprettyxml(indent='', newl='', encoding='UTF-8')
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

The easiest solution is probably using lxml, where you can set a parser option to ignore white space between elements:

>>> from lxml import etree
>>> parser = etree.XMLParser(remove_blank_text=True)
>>> xml_str = '''<root>
>>>     <head></head>
>>>     <content></content>
>>> </root>'''
>>> elem = etree.XML(xml_str, parser=parser)
>>> print etree.tostring(elem)
<root><head/><content/></root>

This will probably be enough for your needs, but some warnings to be on the safe side:

This will just remove whitespace nodes between elements, and try not to remove whitespace nodes inside elements with mixed content:

>>> elem = etree.XML('<p> spam <a>ham</a> <a>eggs</a></p>', parser=parser)
>>> print etree.tostring(elem)
<p> spam <a>ham</a> <a>eggs</a></p>

Leading or trailing whitespace from textnodes will not be removed. It will however still in some circumstances remove whitespace nodes from mixed content: if the parser has not encountered non-whitespace nodes at that level yet.

>>> elem = etree.XML('<p><a> ham</a> <a>eggs</a></p>', parser=parser)
>>> print etree.tostring(elem)
<p><a> ham</a><a>eggs</a></p>

If you don't want that, you can use xml:space="preserve", which will be respected. Another option would be using a dtd and use etree.XMLParser(load_dtd=True), where the parser will use the dtd to determine which whitespace nodes are significant or not.

Other than that, you will have to write your own code to remove the whitespace you don't want (iterating descendants, and where appropriate, set .text and .tail properties that contain only whitespace to None or empty string)


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...