I receive a fixed-width .txt source file from which I need to extract 20K columns.
Since I could not find a library for processing fixed-width files in Spark, I wrote my own code to extract the fields from fixed-width text files.
The code reads the text file as an RDD with
sparkContext.textFile("abc.txt")
and then reads a JSON schema to get the column names and the width of each column.
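To make the width-to-position step concrete, here is a minimal sketch in plain Scala (no Spark needed). It assumes the JSON schema has already been parsed into (name, width) pairs; that pair layout and the names schema, names, widths and starts are hypothetical and not taken from the actual job.

```scala
// Hypothetical schema entries: (column name, column width).
val schema: Vector[(String, Int)] = Vector(("id", 3), ("name", 5), ("city", 4))

val names: Vector[String] = schema.map(_._1)
val widths: Vector[Int] = schema.map(_._2)

// Running sum of the widths: starts(k) is the offset where column k begins.
// scanLeft also yields the total record length as a last element, so drop it.
val starts: Vector[Int] = widths.scanLeft(0)(_ + _).init

println(starts) // prints Vector(0, 3, 8)
```

A Vector is used here because it gives effectively constant-time access by index, which matters when the same widths are looked up for every line.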
The representative code:
import spark.implicits._
val rdd1 = spark.sparkContext.textFile("file1")
// Cuts one fixed-width line into fields, using the column widths in colLength
def SubstrSting(line: String, colLength: Seq[Int]): Seq[String] = {
  var now = 0
  val collector = new Array[String](colLength.length)
  val recordlength = line.length
  for (k <- 0 until colLength.length) {
    collector(k) = line.substring(now, now + colLength(k))
    now = now + colLength(k)
  }
  collector.toSeq
}
val StringArray = rdd1.map(SubstrSting(_, ColLengthSeq))
// ColLengthSeq holds the column lengths, read from a separate schema file
StringArray.toDF("StringCol")
.select((0 until ColCount).map(j => $"StringCol"(j).as(column_seq(j))): _*)
.write.mode("overwrite").parquet("c"home")
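The slicing logic can be checked locally on a single line without Spark. The sketch below mirrors the substring loop above; the function name sliceLine and the sample record are made up for illustration.

```scala
// Cuts `line` into consecutive fields of the given widths, left to right.
def sliceLine(line: String, widths: IndexedSeq[Int]): Seq[String] = {
  var now = 0
  val out = new Array[String](widths.length)
  for (k <- widths.indices) {
    out(k) = line.substring(now, now + widths(k))
    now += widths(k)
  }
  out.toSeq
}

// Made-up sample record: 3 + 5 + 4 = 12 characters.
val sample = "001AliceRome"
val fields = sliceLine(sample, Vector(3, 5, 4))
println(fields.mkString("|")) // prints 001|Alice|Rome
```

Note that substring throws a StringIndexOutOfBoundsException when a line is shorter than the sum of the widths, so short or truncated records would need separate handling.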
This code works fine for files with a small number of columns, but it takes a lot of time and resources with 20K columns.
The run time grows as the number of columns increases.
Has anyone faced this issue with a large number of columns?
I need suggestions on performance tuning: how can I tune this job or code?