Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
578 views
in Technique[技术] by (71.8m points)

scala - How to access sub-entities in JSON file?

I have a json file look like this:

{
  "employeeDetails":{
    "name": "xxxx",
    "num":"415"
  },
  "work":[
    {
      "monthYear":"01/2007",
      "workdate":"1|2|3|....|31",
      "workhours":"8|8|8....|8"
    },
    {
      "monthYear":"02/2007",
      "workdate":"1|2|3|....|31",
      "workhours":"8|8|8....|8"
    }
  ]
}

I have to get the workdate, workhours from this json data.

I tried like this:

import org.apache.spark.{SparkConf, SparkContext}

object JSON2 {
  def main (args: Array[String]) {
    val spark =
      SparkSession.builder()
        .appName("SQL-JSON")
        .master("local[4]")
        .getOrCreate()

    import spark.implicits._

    val employees = spark.read.json("sample.json")
    employees.printSchema()
    employees.select("employeeDetails").show()
  }
}

I am getting exception like this:

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`employeeDetails`' given input columns: [_corrupt_record];;
'Project ['employeeDetails]
+- Relation[_corrupt_record#0] json

I am new to Spark.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

given input columns: [_corrupt_record];;

The reason is that Spark supports JSON files in which "Each line must contain a separate, self-contained valid JSON object."

Quoting JSON Datasets:

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON. As a consequence, a regular multi-line JSON file will most often fail.

In case a JSON file is incorrect for Spark it will store it under _corrupt_record (that you can change using columnNameOfCorruptRecord option).

scala> spark.read.json("employee.json").printSchema
root
 |-- _corrupt_record: string (nullable = true)

And your file is incorrect not only bacause it's a multi-line JSON, but also because jq (a lightweight and flexible command-line JSON processor) says so.

$ cat incorrect.json
{
  "employeeDetails":{
    "name": "xxxx",
    "num:"415"
  }
  "work":[
  {
    "monthYear":"01/2007"
    "workdate":"1|2|3|....|31",
    "workhours":"8|8|8....|8"
  },
  {
    "monthYear":"02/2007"
    "workdate":"1|2|3|....|31",
    "workhours":"8|8|8....|8"
  }
  ],
}
$ cat incorrect.json | jq
parse error: Expected separator between values at line 4, column 14

Once you fix the JSON file, use the following trick to load the multi-line JSON file.

scala> spark.version
res5: String = 2.1.1

val employees = spark.read.json(sc.wholeTextFiles("employee.json").values)
scala> employees.printSchema
root
 |-- employeeDetails: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- num: string (nullable = true)
 |-- work: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- monthYear: string (nullable = true)
 |    |    |-- workdate: string (nullable = true)
 |    |    |-- workhours: string (nullable = true)

scala> employees.select("employeeDetails").show()
+---------------+
|employeeDetails|
+---------------+
|     [xxxx,415]|
+---------------+

Spark >= 2.2

As of Spark 2.2 (released quite recently and highly recommended to use), you should use multiLine option instead. multiLine option was added in SPARK-20980 Rename the option wholeFile to multiLine for JSON and CSV.

scala> spark.version
res0: String = 2.2.0

scala> spark.read.option("multiLine", true).json("employee.json").printSchema
root
 |-- employeeDetails: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- num: string (nullable = true)
 |-- work: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- monthYear: string (nullable = true)
 |    |    |-- workdate: string (nullable = true)
 |    |    |-- workhours: string (nullable = true)

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...