Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


scala - How to handle multi-line rows in Spark?

I have a DataFrame that contains some multi-line observations:

+--------------------+----------------+
|col1                |col2            |
+--------------------+----------------+
|something1          |somethingelse1  |
|something2          |somethingelse2  |
|something3          |somethingelse3  |
|something4          |somethingelse4  |
|multiline

 row                |somethings      |
|something           |somethingall    |
+--------------------+----------------+

I want to save this DataFrame in CSV (or plain-text) format, using the following:

df
 .write
 .format("csv")
 .save("s3://../adf/")

But when I check the file, the multi-line observations are separated across multiple lines. What I want is for each 'multiline' observation to stay on a single row in the txt/csv file. I also tried saving it as a text file:

df
.as[(String,String)]
.rdd
.saveAsTextFile("s3://../adf")

but the same output was observed.

I can imagine that one way is to replace the newline characters with something else and then apply the reverse replacement when loading the data back. But is there a way to save it in the desired form without applying any kind of transformation to the data?


1 Reply


Assuming the multi-line data is properly quoted, you can parse multi-line CSV data using the univocity parser with the multiLine option:

sparkSession.read
  .option("parserLib", "univocity")
  .option("multiLine", "true")
  .csv(file)
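
As a rough sketch of how this read option pairs with the write side of the question: if the field containing newlines is written inside quotes, the record spans several physical lines in the file but remains one logical record, and reading it back with multiLine should restore the original row. The example assumes Spark 2.2 or later. The object name, helper names, local master, and /tmp path are illustrative and not from the original post, and quoteAll is set only to make the quoting explicit.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object MultiLineCsvRoundTrip {

  // Write the DataFrame as CSV with every field quoted, so that embedded
  // newlines stay inside a quoted field instead of ending the record.
  def writeQuoted(df: DataFrame, path: String): Unit =
    df.write
      .mode("overwrite")
      .option("header", "true")
      .option("quoteAll", "true")
      .csv(path)

  // Read the CSV back, letting a quoted field span several physical lines.
  def readMultiLine(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")
      .option("multiLine", "true")
      .csv(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("multiline-csv-round-trip")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()
    import spark.implicits._

    // Sample data modelled on the table in the question.
    val df = Seq(
      ("something1", "somethingelse1"),
      ("multiline\n\n row", "somethings"),
      ("something", "somethingall")
    ).toDF("col1", "col2")

    val path = "/tmp/multiline-csv-demo" // hypothetical output location

    writeQuoted(df, path)
    readMultiLine(spark, path).show(truncate = false)

    spark.stop()
  }
}
```

The file on disk still contains line breaks inside the quoted field; only a reader that understands the quoting (such as the multiLine read above) sees it as a single row.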

Note that the multiLine option requires each file to be read in its entirety by a single executor, so it may not work if your data is too large. The standard text file reader splits the file into lines before doing any other parsing, which prevents you from working with records that contain newlines unless there is a different record delimiter you can use. If there is not, you may need to implement a custom TextInputFormat to handle multi-line records.
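
For the case where a different record delimiter is available, one possible approach is to set Hadoop's textinputformat.record.delimiter property and read the file through the stock TextInputFormat, which avoids writing a custom input format. This is only a sketch: the object name, the readRecords helper, and the "\u0001\n" terminator in the usage note are hypothetical, and it assumes the files were written with that terminator after every record.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object CustomDelimiterReader {

  // Read a text file whose records end with `delimiter` rather than a bare
  // newline, so that newlines inside a record are preserved.
  def readRecords(sc: SparkContext, path: String, delimiter: String): RDD[String] = {
    // Copy the context's Hadoop configuration so the setting stays local to this read.
    val conf = new Configuration(sc.hadoopConfiguration)
    conf.set("textinputformat.record.delimiter", delimiter)

    sc.newAPIHadoopFile(
        path,
        classOf[TextInputFormat],
        classOf[LongWritable],
        classOf[Text],
        conf)
      // Hadoop reuses Text instances, so convert to String right away.
      .map { case (_, text) => text.toString }
  }
}
```

A call such as CustomDelimiterReader.readRecords(sc, path, "\u0001\n") would then yield one string per record, which can be split into columns afterwards.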

