You can use textinputformat.record.delimiter to set the record delimiter for TextInputFormat, e.g.:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Copy the context's Hadoop configuration so the delimiter applies only to this read.
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "X")
// Each record is a (byte offset, text) pair; keep only the text.
val input = sc.newAPIHadoopFile("file_path", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
val lines = input.map { case (_, text) => text.toString }
// Format the array explicitly; println on a bare Array prints its object reference.
println(lines.collect().mkString("Array(", ", ", ")"))
For example, if the input is a file containing the single line aXbXcXd, the code above outputs:
Array(a, b, c, d)
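One thing to watch for: with a custom delimiter, a newline is no longer treated as a record separator, so a file that ends with a trailing newline leaves that newline inside the last record. A minimal sketch of a reusable helper that handles this, assuming the same spark-shell `sc` as above. The name readDelimited is hypothetical (it is not part of Spark), and the trim and empty-record filter are assumptions that only make sense if surrounding whitespace and empty records carry no meaning in your data:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper, not a Spark API: reads `path` as records separated by `delimiter`.
def readDelimited(sc: SparkContext, path: String, delimiter: String): RDD[String] = {
  // Work on a copy so the delimiter setting does not affect other reads.
  val conf = new Configuration(sc.hadoopConfiguration)
  conf.set("textinputformat.record.delimiter", delimiter)
  sc.newAPIHadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
    // Assumption: leading/trailing whitespace (e.g. a final "\n") is not meaningful.
    .map { case (_, text) => text.toString.trim }
    // Assumption: empty records (e.g. after a trailing delimiter) can be dropped.
    .filter(_.nonEmpty)
}

val records = readDelimited(sc, "file_path", "X")
println(records.collect().mkString(", "))
```

The delimiter does not have to be a single character; a string such as "\n\n" should work the same way if you want blank-line-separated paragraphs as records.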