python - using lxml and iterparse() to parse a big (+- 1Gb) XML file

posted Oct 17, 2021 in Technique by 深蓝 (71.8m points)

I have to parse a 1 GB XML file with a structure like the one below and extract the text within the "Author" and "Content" tags:

<Database>
    <BlogPost>
        <Date>MM/DD/YY</Date>
        <Author>Last Name, Name</Author>
        <Content>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.</Content>
    </BlogPost>

    <BlogPost>
        <Date>MM/DD/YY</Date>
        <Author>Last Name, Name</Author>
        <Content>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.</Content>
    </BlogPost>

    [...]

    <BlogPost>
        <Date>MM/DD/YY</Date>
        <Author>Last Name, Name</Author>
        <Content>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.</Content>
    </BlogPost>
</Database>

So far I've tried two things: i) reading the whole file and going through it with .find(xmltag), and ii) parsing the XML file with lxml and iterparse(). I got the first option to work, but it is very slow. I haven't managed to get the second option off the ground.

Here's part of what I have:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    if element.tag == "BlogPost":
        print element.text
    else:
        print 'Finished'

The result of that is only blank spaces, with no text in them.

I must be doing something wrong, but I can't grasp it. Also, in case it wasn't obvious enough, I am quite new to Python and this is the first time I'm using lxml. Please help!


1 Reply

深蓝 · Answer 1 · 2021-10-16T22:32:20+0000

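from lxml import etree

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    # each child is one of Date, Author, Content
    for child in element:
        print(child.tag, child.text)
    # clear once per BlogPost, after all of its children have been read
    element.clear()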

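Your loop printed blanks because the .text of a BlogPost element is only the whitespace before its first child; the text you want lives in the child elements. The final clear() discards each BlogPost's children once you are done with them, which stops you from using too much memory.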
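To tie this back to the original goal of pulling out only "Author" and "Content", here is a minimal sketch. It makes several assumptions that are not in the thread: the sample data is made up, iter_posts is a hypothetical helper name, and the tag filter is done inside the loop (rather than with lxml's tag= argument) so that the same code also runs on the standard-library parser if lxml is not installed.

```python
import io

try:
    from lxml import etree  # the library used in this thread
except ImportError:
    # Assumption: fall back to the stdlib parser when lxml is unavailable.
    import xml.etree.ElementTree as etree

# Tiny stand-in for the 1 GB file; the values are invented for illustration.
SAMPLE = b"""<Database>
    <BlogPost>
        <Date>01/02/21</Date>
        <Author>Doe, Jane</Author>
        <Content>First post.</Content>
    </BlogPost>
    <BlogPost>
        <Date>01/03/21</Date>
        <Author>Roe, Richard</Author>
        <Content>Second post.</Content>
    </BlogPost>
</Database>"""


def iter_posts(source):
    """Yield (author, content) for each BlogPost, one post at a time."""
    # An "end" event fires once an element and all its children are parsed.
    for event, element in etree.iterparse(source, events=("end",)):
        if element.tag != "BlogPost":
            continue  # skip the end events of Date, Author, Content, Database
        # findtext returns the text of the first matching child, or None.
        author = element.findtext("Author")
        content = element.findtext("Content")
        yield author, content
        element.clear()  # drop this post's children so they can be freed


posts = list(iter_posts(io.BytesIO(SAMPLE)))
for author, content in posts:
    print(author, "|", content)
```

For the real file you would pass path_to_file instead of the BytesIO object and process each pair as it is yielded, rather than collecting everything into a list.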

[Update] To get "everything between ... as a string", I guess you want one of:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    # encoding="unicode" returns str rather than bytes on Python 3
    print(etree.tostring(element, encoding="unicode"))
    element.clear()

or

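for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    # without encoding="unicode", tostring returns bytes and ''.join fails on Python 3
    print(''.join(etree.tostring(child, encoding="unicode") for child in element))
    element.clear()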

or perhaps even:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    # child.text is None for an empty child, so substitute ''
    print(''.join(child.text or '' for child in element))
    element.clear()
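To see how the three variants differ, here is a small sketch run on a single in-memory BlogPost instead of the big file. The sample values and the empty Tags child are hypothetical, added only to show what happens when a child has no text; the import fallback to the standard library is also an assumption.

```python
try:
    from lxml import etree  # the library used in this thread
except ImportError:
    # Assumption: fall back to the stdlib parser when lxml is unavailable.
    import xml.etree.ElementTree as etree

# One made-up post; <Tags/> is a hypothetical empty child (its .text is None).
post = etree.fromstring(
    "<BlogPost><Date>01/02/21</Date><Author>Doe, Jane</Author>"
    "<Content>First post.</Content><Tags/></BlogPost>"
)

# Variant 1: the whole element, including the <BlogPost> wrapper.
whole = etree.tostring(post, encoding="unicode")

# Variant 2: only the children, markup included, without the wrapper.
inner = "".join(etree.tostring(child, encoding="unicode") for child in post)

# Variant 3: only the text of each direct child, with all markup dropped.
text_only = "".join(child.text or "" for child in post)

print(whole)
print(inner)
print(text_only)
```

The third variant runs the values together with no separator, so joining on a delimiter such as ' | ' may be more useful in practice.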
