
Read multiline JSON in Apache Spark

I was trying to use a JSON file as a small DB. After registering a temporary table on the DataFrame, I queried it with SQL and got an exception. Here is my code:

val df = sqlCtx.read.json("/path/to/user.json")
df.registerTempTable("user_tt")

val info = sqlCtx.sql("SELECT name FROM user_tt")
info.show()

df.printSchema() result:

root
 |-- _corrupt_record: string (nullable = true)

My JSON file:

{
  "id": 1,
  "name": "Morty",
  "age": 21
}

Exception:

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns: [_corrupt_record];

How can I fix it?

UPD

The _corrupt_record column contains:

+--------------------+
|     _corrupt_record|
+--------------------+
|                   {|
|            "id": 1,|
|    "name": "Morty",|
|           "age": 21|
|                   }|
+--------------------+

UPD2

It's weird, but when I rewrite my JSON as a one-liner, everything works fine.

{"id": 1, "name": "Morty", "age": 21}

So the problem is the newlines.

UPD3

I found the following sentence in the docs:

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.

It isn't convenient to keep JSON in that format. Is there any workaround to get rid of the multi-line structure, or to convert the JSON to a one-liner?
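
For reference, the layout the docs describe is JSON Lines: every record is a complete JSON document on its own line. A file holding my record plus a second, hypothetical one would look like:

{"id": 1, "name": "Morty", "age": 21}
{"id": 2, "name": "Rick", "age": 70}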


1 Reply

Spark >= 2.2

Spark 2.2 introduced the wholeFile option (later renamed multiLine), which can be used to load JSON (not JSONL) files:

spark.read
  .option("multiLine", true).option("mode", "PERMISSIVE")
  .json("/path/to/user.json")

See:

  • SPARK-18352 - Parse normal, multi-line JSON files (not just JSON Lines).
  • SPARK-20980 - Rename the option wholeFile to multiLine for JSON and CSV.

Spark < 2.2

Well, using JSONL-formatted data may be inconvenient, but I will argue that this is not an issue with the API but with the format itself. JSON is simply not designed to be processed in parallel in distributed systems.

It provides no schema, and without making some very specific assumptions about its formatting and shape it is almost impossible to correctly identify top-level documents. Arguably this is about the worst possible format to use in systems like Apache Spark. It is also quite tricky, and typically impractical, to write valid JSON in distributed systems.

That being said, if individual files are valid JSON documents (either a single document or an array of documents), you can always try wholeTextFiles:

spark.read.json(sc.wholeTextFiles("/path/to/user.json").values)
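
If you then want to get rid of the multi-line structure for good, a sketch of one workaround (the output path is hypothetical) is to load the file once this way and write it back out, since DataFrameWriter.json emits JSON Lines, i.e. one self-contained document per line:

// Load the whole multi-line file as a single JSON document...
val df = spark.read.json(sc.wholeTextFiles("/path/to/user.json").values)

// ...and persist it as JSON Lines, readable later with a plain spark.read.json.
df.write.json("/path/to/user_jsonl")  // hypothetical output path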
