python - Why is lxml.etree.iterparse() eating up all my memory?

Question

Welcome To Ask or Share your Answers For Others

python - Why is lxml.etree.iterparse() eating up all my memory?

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

python - Why is lxml.etree.iterparse() eating up all my memory?

This eventually consumes all my available memory and then the process is killed. I've tried changing the tag from schedule to 'smaller' tags but that didn't make a difference.

What am I doing wrong / how can I process this large file with iterparse()?

import lxml.etree

for schedule in lxml.etree.iterparse('really-big-file.xml', tag='schedule'):
    print "why does this consume all my memory?"

I can easily cut it up and process it in smaller chunks but that's uglier than I'd like.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-17T00:20:11+0000

As iterparse iterates over the entire file a tree is built and no elements are freed. The advantage of doing this is that the elements remember who their parent is, and you can form XPaths that refer to ancestor elements. The disadvantage is that it can consume a lot of memory.

In order to free some memory as you parse, use Liza Daly's fast_iter:

def fast_iter(context, func, *args, **kwargs):
    """
    http://lxml.de/parsing.html#modifying-the-tree
    Based on Liza Daly's fast_iter
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    See also http://effbot.org/zone/element-iterparse.htm
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context

which you could then use like this:

def process_element(elem):
    print "why does this consume all my memory?"
context = lxml.etree.iterparse('really-big-file.xml', tag='schedule', events = ('end', ))
fast_iter(context, process_element)

I highly recommend the article on which the above fast_iter is based; it should be especially interesting to you if you are dealing with large XML files.

The fast_iter presented above is a slightly modified version of the one shown in the article. This one is more aggressive about deleting previous ancestors, thus saves more memory. Here you'll find a script which demonstrates the difference.

Categories

python - Why is lxml.etree.iterparse() eating up all my memory?

python - Why is lxml.etree.iterparse() eating up all my memory?

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags