Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.2k views
in Technique[技术] by (71.8m points)

xml - Parsing a large .bz2 file (40 GB) with lxml iterparse in python. Error that does not appear with uncompressed file

I am trying to parse OpenStreetMap's planet.osm, compressed in bz2 format. Because it is already 41G, I don't want to decompress the file completely.

So I figured out how to parse portions of the planet.osm file using bz2 and lxml, using the following code

from lxml import etree as et
from bz2 import BZ2File

path = "where/my/fileis.osm.bz2"
with BZ2File(path) as xml_file:
    parser = et.iterparse(xml_file, events=('end',))
    for events, elem in parser:

        if elem.tag == "tag":
            continue
        if elem.tag == "node":
            (do something)


    ## Do some cleaning
    # Get rid of that element
    elem.clear()

    # Also eliminate now-empty references from the root node to node        
    while elem.getprevious() is not None:
        del elem.getparent()[0]

which works perfectly with the Geofabrick extracts. However, when I try to parse the planet-latest.osm.bz2 with the same script I get the error:

xml.etree.XMLSyntaxError: Specification mandate value for attribute num_change, line 3684, column 60

Here are the things I tried:

  • Check the planet-latest.osm.bz2 md5sum
  • Check the planet-latest.osm where the script with bz2 stops. There is no apparent error, and the attribute is called "num_changes", not "num_change" as indicated in the error
  • Also I did something stupid, but the error puzzled me: I opened the planet-latest.osm.bz2 in mode 'rb' [c = BZ2File('file.osm.bz2', 'rb')] and then passed c.read() to iterparse(), which returned me an error saying (very long string) cannot be opened. Strange thing, (very long string) ends right where the "Specification mandate value" error refers to...

Then I tried to decompress first the planet.osm.gz2 usin a simple

bzcat planet.osm.gz2 > planet.osm

And ran the parser directly on planet.osm. And... it worked! I am very puzzled by this, and could not find any pointer to why this may happen and how to solve this. My guess would be there is something going on between the decompression and the parsing, but I am not sure. Please help me understand!

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

It turns out that the problem is with the compressed planet.osm file.

As indicated on the OSM Wiki, the planet file is compressed as a multistream file, and the bz2 python module cannot read multistream files. However, the bz2 documentation indicates an alternative module that can read such files, bz2file. I used it and it works perfectly!

So the code should read:

from lxml import etree as et
from bz2file import BZ2File

path = "where/my/fileis.osm.bz2"
with BZ2File(path) as xml_file:
    parser = et.iterparse(xml_file, events=('end',))
    for events, elem in parser:

        if elem.tag == "tag":
            continue
        if elem.tag == "node":
            (do something)


    ## Do some cleaning
    # Get rid of that element
    elem.clear()

    # Also eliminate now-empty references from the root node to node        
    while elem.getprevious() is not None:
        del elem.getparent()[0]

Also, doing some research on using the PBF format (as advised in the comments), I stumbled upon imposm.parser, a python module that implements a generic parser for OSM data (in pbf or xml format). You may want to have a look at this!


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...