I have a simple 7 GB file in which each line contains two columns separated by |. I created an RDD from this file, but when I apply a map or filter transformation to it, I get a "Too many bytes before newline" exception.
Below is sample data from my file.
116010100000000007|33448
116010100000000014|13520
116010100000000021|97132
116010100000000049|82891
116010100000000049|82890
116010100000000056|93014
116010100000000063|43434
116010100000000063|43434
Here is the code:
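// Read the HDFS file as an RDD of lines
val input = sparkContext.textFile("hdfsfilePath")
// Split on the literal '|' character (a String "|" would be treated as a regex) and keep rows whose second column exceeds 15000
input.filter(x => x.split('|')(1).toInt > 15000).saveAsTextFile("hdfs://output file path")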
Below is the exception I am getting:
java.io.IOException: Too many bytes before newline: 2147483648
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:249)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:94)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:136)
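
A side note on the filter, separate from the exception itself: in Scala (as in Java), `split` with a String argument treats that argument as a regular expression, and `|` is the regex alternation operator, so a bare "|" pattern does not split on the pipe character. The following is a minimal sketch in plain Scala, without Spark, that contrasts the two forms on one of the sample rows. The object name `SplitPipeDemo` and the helper `secondColumn` are hypothetical names introduced only for this illustration.

```scala
object SplitPipeDemo {
  // Hypothetical helper: returns the numeric second column of a "key|value" line.
  // Splitting on the Char '|' treats the pipe as a literal separator.
  def secondColumn(line: String): Int = line.split('|')(1).toInt

  def main(args: Array[String]): Unit = {
    val line = "116010100000000014|13520" // sample row from the data above

    // String argument: "|" is compiled as a regex (an empty alternation),
    // so the line is broken up between individual characters.
    val regexSplit = line.split("|")
    println(s"regex split, element 1: ${regexSplit(1)}")

    // Char argument: the pipe is taken literally, giving the two columns.
    val charSplit = line.split('|')
    println(s"char split, element 1: ${charSplit(1)}")

    // The same predicate as in the filter, applied to a few sample rows.
    val rows = Seq("116010100000000007|33448", line, "116010100000000021|97132")
    rows.filter(r => secondColumn(r) > 15000).foreach(println)
  }
}
```

Escaping the pattern, as in `split("\\|")`, is an equivalent alternative to the Char form.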