Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

c# - Reading very large .xml.bz2 files

I'd like to parse Wikimedia's .xml.bz2 dumps without extracting the entire file and without performing any XML validation:

var filename = "enwiki-20160820-pages-articles.xml.bz2";

var settings = new XmlReaderSettings()
{
    ValidationType = ValidationType.None,
    ConformanceLevel = ConformanceLevel.Auto // Fragment ?
};

using (var stream = File.Open(filename, FileMode.Open))
using (var bz2 = new BZip2InputStream(stream))
using (var xml = XmlTextReader.Create(bz2, settings))
{
    xml.ReadToFollowing("page");
    // ...
}

The BZip2InputStream itself works: if I wrap it in a StreamReader, I can read the XML line by line. But when I use XmlTextReader, the read fails with:

System.Xml.XmlException: 'Unexpected end of file has occurred. The following elements are not closed: mediawiki. Line 58, position 1.'

The bzip2 stream is not at EOF. Is it possible to open an XmlTextReader on top of a BZip2 stream, or is there some other way to do this?

1 Reply

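This should work. I used a combination of XmlReader and LINQ to XML: the reader streams through the file, and each doc element is loaded into an XElement that you can parse as needed. Note that this example reads an uncompressed abstract dump directly from a URL rather than a .bz2 file.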

using System;
using System.Xml;
using System.Xml.Linq;

namespace ConsoleApplication29
{
    class Program
    {
        const string URL = @"https://dumps.wikimedia.org/enwiki/20160820/enwiki-20160820-abstract26.xml";

        static void Main(string[] args)
        {
            // XmlReader streams the document, so the whole file is never held in memory.
            using (XmlReader reader = XmlReader.Create(URL))
            {
                while (!reader.EOF)
                {
                    // Skip ahead unless the reader is already positioned on a <doc> element.
                    if (reader.Name != "doc")
                    {
                        reader.ReadToFollowing("doc");
                    }
                    if (!reader.EOF)
                    {
                        // ReadFrom consumes one <doc> element and leaves the reader on the node after it.
                        XElement doc = (XElement)XNode.ReadFrom(reader);
                        // Process doc here.
                    }
                }
            }
        }
    }
}
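
The same pattern can be applied to any Stream instead of a URL. The following is only a minimal sketch: the XmlStreaming class, the StreamElements helper, and the sample XML are hypothetical names and data made up for illustration, and a MemoryStream stands in for the input. In principle a decompressing stream such as the question's BZip2InputStream could be passed in its place, though that is not tested here.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

static class XmlStreaming
{
    // Hypothetical helper: lazily yields each element with the given local name.
    public static IEnumerable<XElement> StreamElements(Stream input, string elementName)
    {
        var settings = new XmlReaderSettings { IgnoreWhitespace = true };

        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.LocalName == elementName)
                {
                    // ReadFrom consumes the element and leaves the reader on the node after it,
                    // so no extra Read() is needed in this branch.
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }
}

class Program
{
    static void Main()
    {
        // Made-up sample data standing in for a dump; a real dump is far larger.
        const string sample =
            "<mediawiki><siteinfo/>" +
            "<page><title>A</title></page>" +
            "<page><title>B</title></page>" +
            "</mediawiki>";

        using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(sample)))
        {
            // Prints A, then B.
            foreach (XElement page in XmlStreaming.StreamElements(stream, "page"))
            {
                Console.WriteLine((string)page.Element("title"));
            }
        }
    }
}
```

Because the helper uses yield return, only one element is materialized at a time. If the document declares a default namespace, child lookups such as page.Element("title") need an XNamespace-qualified name; the sample above has no namespace, so the plain name is enough.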
