
XML processing in Spark

Scenario: My input will be multiple small XMLs, and I am supposed to read these XMLs as RDDs, perform a join with another dataset to form an RDD, and send the output as XML.

Is it possible to read XML using Spark and load the data as an RDD? If it is possible, how will the XML be read?

Sample XML:

<root>
    <users>
        <user>
            <account>1234</account>
            <name>name_1</name>
            <number>34233</number>
        </user>
        <user>
            <account>58789</account>
            <name>name_2</name>
            <number>54697</number>
        </user>
    </users>
</root>

How will this be loaded into the RDD?

1 Reply


Yes, it is possible, but the details will differ depending on the approach you take.

  • If the files are small, as you've mentioned, the simplest solution is to load your data using SparkContext.wholeTextFiles. It loads the data as an RDD[(String, String)], where the first element is the path and the second is the file content. Then you parse each file individually, as you would in local mode (see the first sketch after this list).
  • For larger files, you can use Hadoop input formats.
    • If the structure is simple, you can split records using textinputformat.record.delimiter. You can find a simple example here. The input there is not XML, but it should give you an idea of how to proceed (see the second sketch after this list).
    • Otherwise, Mahout provides an XmlInputFormat.
  • Finally, it is possible to read the file using SparkContext.textFile and adjust later for records that span partition boundaries. Conceptually, it means something similar to creating a sliding window or partitioning records into groups of fixed size:

    • use mapPartitionsWithIndex to identify records broken between partitions and collect the broken fragments
    • use a second mapPartitionsWithIndex to repair the broken records
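
For the first approach, a minimal sketch, assuming a hypothetical hdfs:///data/xml/ input directory and a hypothetical User case class, with scala.xml doing the per-file parsing:

import scala.xml.XML

case class User(account: String, name: String, number: String)

// wholeTextFiles yields (filePath, fileContent) pairs, so each whole
// file must fit in memory; that is fine for many small XMLs
val users = sc.wholeTextFiles("hdfs:///data/xml/*.xml").flatMap {
  case (_, content) =>
    val root = XML.loadString(content)
    (root \\ "user").map { u =>
      User((u \ "account").text, (u \ "name").text, (u \ "number").text)
    }
}

// users is an RDD[User]; key it by account to join with another RDD:
// users.keyBy(_.account).join(otherRdd)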

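For the delimiter-based variant, a sketch, assuming records can be split on the closing </user> tag (the path hdfs:///data/users.xml is a placeholder):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import scala.xml.XML

val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "</user>")

val records = sc.newAPIHadoopFile(
    "hdfs:///data/users.xml",
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text],
    conf)
  .map { case (_, text) => text.toString }   // copy out of Hadoop's reused Text object
  .filter(_.contains("<user>"))              // drop the trailing </users></root> fragment
  .map { s =>
    // drop anything before the opening tag, then re-append the stripped delimiter
    XML.loadString(s.substring(s.indexOf("<user>")) + "</user>")
  }
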
Edit:

There is also the relatively new spark-xml package, which allows you to extract specific records by tag:

val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "foo")
  .load("bar.xml")
