I'd like to parse Wikimedia's .xml.bz2 dumps without extracting the entire file or performing any XML validation:
var filename = "enwiki-20160820-pages-articles.xml.bz2";
var settings = new XmlReaderSettings()
{
    ValidationType = ValidationType.None,
    ConformanceLevel = ConformanceLevel.Auto // Fragment ?
};
using (var stream = File.Open(filename, FileMode.Open))
using (var bz2 = new BZip2InputStream(stream))
using (var xml = XmlTextReader.Create(bz2, settings))
{
    xml.ReadToFollowing("page");
    // ...
}
The BZip2InputStream works: if I use a StreamReader, I can read the XML line by line. But when I use XmlTextReader, it fails when I try to perform the read:
System.Xml.XmlException: 'Unexpected end of file has occurred. The following elements are not closed: mediawiki. Line 58, position 1.'
The bzip stream is not at EOF. Is it possible to open an XmlTextReader on top of a BZip2 stream? Or is there some other means to do this?
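
One direction that may be worth checking, offered as a sketch rather than a confirmed diagnosis: a .bz2 file can consist of several bzip2 streams concatenated together, and a decompressor that stops after the first stream would report end-of-data while the file itself is not at EOF, which would match the error above. The example below assumes that this is the cause, and it swaps in the SharpCompress library in place of BZip2InputStream. The `BZip2Stream` constructor with a `decompressConcatenated` flag is written from memory of that library's API and should be checked against the installed version. The five-title limit is only for illustration.

```csharp
using System;
using System.IO;
using System.Xml;
using SharpCompress.Compressors;        // assumption: SharpCompress NuGet package
using SharpCompress.Compressors.BZip2;  // assumption: provides BZip2Stream

class Program
{
    static void Main()
    {
        var filename = "enwiki-20160820-pages-articles.xml.bz2";
        var settings = new XmlReaderSettings
        {
            ValidationType = ValidationType.None
        };

        using (var stream = File.OpenRead(filename))
        // Third argument (decompressConcatenated: true) asks the decompressor to
        // continue into the next bzip2 stream instead of stopping after the first.
        // Assumption: this constructor overload exists in the installed version.
        using (var bz2 = new BZip2Stream(stream, CompressionMode.Decompress, true))
        using (var xml = XmlReader.Create(bz2, settings))
        {
            // Print the first few page titles to confirm that the reader
            // gets past the point where the original code failed.
            var count = 0;
            while (count < 5 && xml.ReadToFollowing("page"))
            {
                if (xml.ReadToFollowing("title"))
                {
                    Console.WriteLine(xml.ReadElementContentAsString());
                    count++;
                }
            }
        }
    }
}
```

If titles print without the "Unexpected end of file" exception, that would suggest the concatenated-stream explanation applies to this dump. If the exception still occurs, the cause lies elsewhere.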