Spark: Reading files using different delimiter than new line

Question

Welcome To Ask or Share your Answers For Others

Spark: Reading files using different delimiter than new line

1 Reply

深蓝 · Answer 1 · 2021-10-16T23:30:27+0000

You can use textinputformat.record.delimiter to set the delimiter for TextInputFormat, E.g.,

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "X")
val input = sc.newAPIHadoopFile("file_path", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
val lines = input.map { case (_, text) => text.toString}
println(lines.collect)

For example, my input is a file containing one line aXbXcXd. The above code will output

Array(a, b, c, d)

Categories

Spark: Reading files using different delimiter than new line

Spark: Reading files using different delimiter than new line

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags