Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Writing with lxml emitting no whitespace even when pretty_print=True

I'm using the lxml library to read an XML template, insert or change some elements, and save the resulting XML. One of the elements is created on the fly using etree.Element and etree.SubElement:

from lxml import etree

tree = etree.parse(r'xml_archive\templates\metadata_template_pts.xml')
root = tree.getroot()

# themekeys is a list of keyword strings defined elsewhere
stream = []
for element in root.iter():
    # Skip comments and processing instructions, whose .tag is not a string
    if isinstance(element.tag, str):
        stream.append(element.tag)

        # Find "keywords" element and insert a new "theme" element
        if element.tag == 'keywords' and 'theme' not in stream:
            theme = etree.Element('theme')
            etree.SubElement(theme, 'themekt').text = 'None'
            for tk in themekeys:
                etree.SubElement(theme, 'themekey').text = tk
            element.insert(0, theme)

The new element prints to the screen nicely with print(etree.tostring(theme, pretty_print=True).decode()):

<theme>
  <themekt>None</themekt>
  <themekey>Hydrogeology</themekey>
  <themekey>Stratigraphy</themekey>
  <themekey>Floridan aquifer system</themekey>
  <themekey>Geology</themekey>
  <themekey>Regional Groundwater Availability Study</themekey>
  <themekey>USGS</themekey>
  <themekey>United States Geological Survey</themekey>
  <themekey>thickness</themekey>
  <themekey>altitude</themekey>
  <themekey>extent</themekey>
  <themekey>regions</themekey>
  <themekey>upper confining unit</themekey>
  <themekey>FAS</themekey>
  <themekey>base</themekey>
  <themekey>geologic units</themekey>
  <themekey>geology</themekey>
  <themekey>extent</themekey>
  <themekey>inlandWaters</themekey>
</theme>

However, when using etree.ElementTree(root).write(out_xml_file, method='xml', pretty_print=True) to write out the xml, this element gets flattened in the output file:

<theme><themekt>None</themekt><themekey>Hydrogeology</themekey><themekey>Stratigraphy</themekey><themekey>Floridan aquifer system</themekey><themekey>Geology</themekey><themekey>Regional Groundwater Availability Study</themekey><themekey>USGS</themekey><themekey>United States Geological Survey</themekey><themekey>thickness</themekey><themekey>altitude</themekey><themekey>extent</themekey><themekey>regions</themekey><themekey>upper confining unit</themekey><themekey>FAS</themekey><themekey>base</themekey><themekey>geologic units</themekey><themekey>geology</themekey><themekey>extent</themekey><themekey>inlandWaters</themekey></theme>

The rest of the file is written nicely, but this particular element is causing (purely aesthetic) trouble. Any ideas of what I'm doing wrong?


Below is a snippet of markup from the template xml file (save this as "template.xml" to run with code snippet at bottom). The flattening of tags only occurs when I parse an existing file and insert a new element, not when the xml is created from scratch using lxml.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="fgdc_classic.xsl"?>
<metadata xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://water.usgs.gov/GIS/metadata/usgswrd/fgdc-std-001-1998.xsd">
    <keywords>
        <theme>
            <themekt>ISO 19115 Topic Categories</themekt>
            <themekey>environment</themekey>
            <themekey>geoscientificInformation</themekey>
            <themekey>inlandWaters</themekey>
        </theme>
        <place>
            <placekt>None</placekt>
            <placekey>Florida</placekey>
            <placekey>Georgia</placekey>
            <placekey>Alabama</placekey>
            <placekey>South Carolina</placekey>
        </place>
    </keywords>

</metadata>

Below is a snippet of code to be used with the snippet of markup (above):

from lxml import etree

# Create new theme element to insert into root
themekeys = ['Hydrogeology', 'Stratigraphy', 'inlandWaters']

tree = etree.parse(r'template.xml')
root = tree.getroot()

stream = []
for element in root.iter():
    # Skip comments and processing instructions, whose .tag is not a string
    if isinstance(element.tag, str):
        stream.append(element.tag)

        # Edit theme keywords
        if element.tag == 'keywords':
            theme = etree.Element('theme')
            etree.SubElement(theme, 'themekt').text = 'None'
            for tk in themekeys:
                etree.SubElement(theme, 'themekey').text = tk
            element.insert(0, theme)

# Write XML to new file
out_xml_file = 'test.xml'
etree.ElementTree(root).write(out_xml_file, method='xml', pretty_print=True)
with open(out_xml_file, 'r') as f:
    lines = f.readlines()

with open(out_xml_file, 'w') as f:
    f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    for line in lines:
        f.write(line)

1 Reply

If you replace this line:

tree = etree.parse(r'template.xml')

with these lines:

parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(r'template.xml', parser)

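then it will work as expected. The trick is to use an XMLParser with the remove_blank_text option set to True. By default, the parser keeps the whitespace-only text nodes from the template, and the serializer will not add indentation inside an element that already contains text, because it cannot tell formatting apart from data. With remove_blank_text=True, that ignorable whitespace is dropped at parse time, so it no longer disrupts the subsequent pretty-printing.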
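
As a minimal, self-contained sketch of the difference, the snippet below parses the same markup twice, once with the default parser and once with remove_blank_text=True, and inserts a new theme element each time. The in-memory TEMPLATE and the helper names build_theme and render are made up for illustration and are not part of the question's code; it assumes Python 3 with lxml installed.

```python
from lxml import etree

# Hypothetical in-memory stand-in for template.xml (trimmed from the question).
TEMPLATE = b"""<metadata>
    <keywords>
        <place>
            <placekt>None</placekt>
            <placekey>Florida</placekey>
        </place>
    </keywords>
</metadata>
"""


def build_theme(themekeys):
    """Create a <theme> element with one <themekt> and one <themekey> per key."""
    theme = etree.Element('theme')
    etree.SubElement(theme, 'themekt').text = 'None'
    for key in themekeys:
        etree.SubElement(theme, 'themekey').text = key
    return theme


def render(parser=None):
    """Parse the template, insert a new <theme>, and return pretty-printed XML."""
    root = etree.fromstring(TEMPLATE, parser)
    root.find('keywords').insert(0, build_theme(['Hydrogeology', 'Stratigraphy']))
    return etree.tostring(root, pretty_print=True).decode()


# Default parser: whitespace-only text nodes from the template are kept.
flattened = render()

# Parser that drops ignorable whitespace, so pretty_print can re-indent the tree.
indented = render(etree.XMLParser(remove_blank_text=True))

print(flattened)
print(indented)
```

With the default parser, the new theme element should come out on a single line, as in the question, while the rest of the markup keeps the template's original indentation. With remove_blank_text=True, the whole tree is re-indented by the serializer.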

