
XML processing in Spark

Scenario: My input will be multiple small XMLs, and I am supposed to read these XMLs as RDDs, perform a join with another dataset to form an RDD, and send the output as XML.

Is it possible to read XML using Spark and load the data as an RDD? If it is possible, how will the XML be read?

Sample XML:

<root>
    <users>
        <user>
            <account>1234</account>
            <name>name_1</name>
            <number>34233</number>
        </user>
        <user>
            <account>58789</account>
            <name>name_2</name>
            <number>54697</number>
        </user>
    </users>
</root>

How will this be loaded into the RDD?

1 Reply


Yes, it is possible, but the details will differ depending on the approach you take.

  • If the files are small, as you've mentioned, the simplest solution is to load your data using SparkContext.wholeTextFiles. It loads the data as an RDD[(String, String)], where the first element is the path and the second is the file content. Then you parse each file individually, as you would in local mode (see the first sketch after this list).
  • For larger files, you can use Hadoop input formats.
    • If the structure is simple, you can split records using textinputformat.record.delimiter. You can find a simple example here. The input there is not XML, but it should give you an idea of how to proceed (see the second sketch after this list).
    • Otherwise, Mahout provides an XmlInputFormat.
  • Finally, it is possible to read the file using SparkContext.textFile and adjust later for records that span partition boundaries. Conceptually, it means something similar to creating a sliding window or partitioning records into groups of fixed size:

    • use mapPartitionsWithIndex to identify records broken between partitions and collect the broken fragments
    • use a second mapPartitionsWithIndex to repair the broken records
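
For the first approach, a minimal sketch, assuming a hypothetical hdfs:///data/xml/ input directory and a hypothetical User case class, with scala.xml doing the per-file parsing:

import scala.xml.XML

case class User(account: String, name: String, number: String)

// wholeTextFiles yields (filePath, fileContent) pairs, so each whole
// file must fit in memory; that is fine for many small XMLs
val users = sc.wholeTextFiles("hdfs:///data/xml/*.xml").flatMap {
  case (_, content) =>
    val root = XML.loadString(content)
    (root \\ "user").map { u =>
      User((u \ "account").text, (u \ "name").text, (u \ "number").text)
    }
}

// users is an RDD[User]; key it by account to join with another RDD:
// users.keyBy(_.account).join(otherRdd)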

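For the delimiter-based variant, a sketch, assuming records can be split on the closing </user> tag (the path hdfs:///data/users.xml is a placeholder):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import scala.xml.XML

val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "</user>")

val records = sc.newAPIHadoopFile(
    "hdfs:///data/users.xml",
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text],
    conf)
  .map { case (_, text) => text.toString }   // copy out of Hadoop's reused Text object
  .filter(_.contains("<user>"))              // drop the trailing </users></root> fragment
  .map { s =>
    // drop anything before the opening tag, then re-append the stripped delimiter
    XML.loadString(s.substring(s.indexOf("<user>")) + "</user>")
  }
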
Edit:

There is also the relatively new spark-xml package, which allows you to extract specific records by tag:

val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "foo")
  .load("bar.xml")
