python - Removing an element from a parsed XML tree disrupts iteration

Question

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I want to parse an xml file, then process the result tree by removing selected elements. My problem is that removing an element disrupts the loop that iterates over the elements.

Consider the following xml data:

<results>
    <group>
        <a />
        <b />
        <c />
    </group>
</results>

and the code:

import xml.etree.ElementTree as ET

def showGroup(group,s):
    print(s + '  len=' + str(len(group)))
    print('<group>' )
    for e in group:
        print('   <' + e.tag + '>')
    print('</group>\n')

def processGroup(group):
    for e in group:
        if e.tag != 'a':
            group.remove(e)
            showGroup(group,'removed <' + e.tag + '>')

tree = ET.parse('x.xml')
root = tree.getroot()

for group in root:
    processGroup(group)

I expected the for loop to process elements <a>, <b>, and <c> in order. In particular:

processing <a> should not remove any element
processing <b> should remove <b>
processing <c> should remove <c>

I expected the resulting tree to have a single element inside <group> (the <a> element), and that len(group) would return 1.

Instead, after processing <b>, the for loop decides the end test has been met, and it does not process element <c>. If it did, <c> would be removed. Instead, I am left with a tree with elements <a> and <c>, and len(group) returns 2.

What do I need to do to process all three elements while removing selected elements? PS: any comments on style or better ways to do something are welcome.

Update: an ugly hack "fixes" the problem at the cost of some efficiency, if there is no code after removing the element. But in my real program, there is a lot of code after the pruning loop.

for e in group:
    if e.tag != 'a':
        group.remove(e)
        showGroup(group,'removed <' + e.tag + '>')
        processGroup(group)

I assume that if the for loop is disrupted, then starting again with the group at the beginning might solve the problem. Recursion is a tidy way of doing that - at the expense of reprocessing all elements that have already been checked but not removed.

I am not satisfied with this solution.

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:57:43+0000

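The issue is that you are removing elements from the container you are iterating over. The loop advances by position, so when you remove an element the remaining ones shift left by one and the element that slides into the vacated slot is never visited. Here, removing <b> moves <c> into the position the loop has just finished with, so the loop reaches the end without ever seeing <c>.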
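
To see the skipping in isolation, here is a minimal sketch. It parses the XML from the question with ET.fromstring instead of reading a file (an assumption made only to keep the example self-contained), and the helper name visited_tags_while_removing is made up for illustration. It records which tags the loop actually visits:

```python
import xml.etree.ElementTree as ET

# Same structure as the question's file, inlined as a string (assumption).
XML = "<results><group><a /><b /><c /></group></results>"

def visited_tags_while_removing(group):
    """Remove every child except <a> while iterating the live element."""
    visited = []
    for e in group:              # iterates over the live child list
        visited.append(e.tag)
        if e.tag != 'a':
            group.remove(e)      # children after e shift one position left
    return visited

group = ET.fromstring(XML)[0]
visited = visited_tags_while_removing(group)
remaining = [e.tag for e in group]

print(visited)    # ['a', 'b']
print(remaining)  # ['a', 'c']
```

The loop never reaches <c>, which matches the len=2 result described in the question.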

A simple solution is to iterate over a shallow copy of the group's children, or to iterate in reverse with reversed:

copy:

def processGroup(group):
    # group[:] builds a list of the current children (a shallow copy),
    # so we iterate over the copy while removing from the original.
    for e in group[:]:
        if e.tag != 'a':
            group.remove(e)
            showGroup(group,'removed <' + e.tag + '>')

reversed:

def processGroup(group):
    # Start at the end and work backwards. Removing an element only
    # shifts the elements after it, which we have already visited,
    # so the elements still to come keep their positions.
    for e in reversed(group):
        if e.tag != 'a':
            group.remove(e)
            showGroup(group,'removed <' + e.tag + '>')
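
One difference between the two versions is the order of removal. The following sketch traces the reversed variant on the question's XML, parsed from an inline string (an assumption, to keep it self-contained); the name visit_in_reverse is hypothetical:

```python
import xml.etree.ElementTree as ET

# Same structure as the question's file, inlined as a string (assumption).
XML = "<results><group><a /><b /><c /></group></results>"

def visit_in_reverse(group):
    """Remove every child except <a>, walking the children back to front."""
    visited = []
    for e in reversed(group):
        visited.append(e.tag)
        if e.tag != 'a':
            group.remove(e)      # only already-visited positions shift
    return visited

group = ET.fromstring(XML)[0]
visited = visit_in_reverse(group)
remaining = [e.tag for e in group]

print(visited)    # ['c', 'b', 'a']
print(remaining)  # ['a']
```

Every child is visited, but <c> is removed before <b>, so any per-element output appears in the opposite order from the copy version.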

using the copy logic:

In [7]: tree = ET.parse('test.xml')

In [8]: root = tree.getroot()

In [9]: for group in root:
   ...:     processGroup(group)
   ...:
removed <b>  len=2
<group>
   <a>
   <c>
</group>

removed <c>  len=1
<group>
   <a>
</group>

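You can also use ET.tostring in place of the for loop in showGroup. Passing encoding='unicode' makes it return a str instead of bytes, so print shows the markup directly:

import xml.etree.ElementTree as ET

def show_group(group, s):
    print(s + '  len=' + str(len(group)))
    # encoding='unicode' returns str; the default returns bytes (b'...')
    print(ET.tostring(group, encoding='unicode'))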


def process_group(group):
    for e in group[:]:
        if e.tag != 'a':
            group.remove(e)
            show_group(group, 'removed <' + e.tag + '>')

tree = ET.parse('test.xml')
root = tree.getroot()

for group in root.findall(".//group"):
    process_group(group)
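
Another option, not part of the answer above, is to separate the scan from the mutation: build a plain list of the children to drop first, then remove them once the scan is finished. The sketch below assumes the XML from the question, parsed from an inline string so the example is self-contained; keep_only is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Same structure as the question's file, inlined as a string (assumption).
XML = "<results><group><a /><b /><c /></group></results>"

def keep_only(group, tag):
    """Hypothetical helper: drop every direct child whose tag differs from `tag`."""
    # Decide what to remove first; this list does not change as we mutate.
    doomed = [e for e in group if e.tag != tag]
    # Mutate the element only after the scan is complete.
    for e in doomed:
        group.remove(e)
    return [e.tag for e in doomed]

root = ET.fromstring(XML)
for group in root.findall(".//group"):
    removed = keep_only(group, 'a')

print(removed)                    # ['b', 'c']
print([e.tag for e in root[0]])   # ['a']
```

This keeps the removals in document order (<b> before <c>), like the copy version. If nothing needs to happen per removed element, slice assignment such as group[:] = [e for e in group if e.tag == 'a'] should have the same effect in one step.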
