scala - Spark Fixed Width File Import Large number of columns causing high Execution time

Question

posted Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I receive a fixed-width .txt source file from which I need to extract 20K columns. Because there are few libraries for processing fixed-width files with Spark, I wrote my own code to extract the fields from fixed-width text files.

The code reads the text file as an RDD with

sparkContext.textFile("abc.txt")

then reads a JSON schema to get the column names and the width of each column.

In a function, I take the fixed-length line and, using each column's start and end position, call substring to fill an Array of fields.
I map that function over the RDD.
I convert the resulting RDD to a DataFrame, assign the column names, and write it to Parquet.
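
As a rough illustration of the slicing step described above, here is a minimal plain-Scala sketch that needs no Spark. The names FixedWidthSlice, offsets and slice are hypothetical and not taken from the original code. It assumes every line is at least as long as the sum of the column widths.

```scala
object FixedWidthSlice {
  // Running start offsets: widths (3, 2, 4) give offsets (0, 3, 5, 9).
  def offsets(widths: Array[Int]): Array[Int] = widths.scanLeft(0)(_ + _)

  // Cut one line into fields. The offsets are computed once by the caller
  // and reused for every line, so no running position is kept per line.
  def slice(line: String, offs: Array[Int]): Array[String] = {
    val fields = new Array[String](offs.length - 1)
    var k = 0
    while (k < fields.length) {
      fields(k) = line.substring(offs(k), offs(k + 1))
      k += 1
    }
    fields
  }
}
```

For example, FixedWidthSlice.slice("abcdefghi", FixedWidthSlice.offsets(Array(3, 2, 4))) yields the fields "abc", "de" and "fghi". The sketch uses an Array for the offsets because indexing an Array takes constant time, whereas calling apply on a List-backed Seq is linear in the index.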

Representative code:

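import spark.implicits._  // needed for toDF and the $"..." column syntax

val rdd1 = spark.sparkContext.textFile("file1")

// Split one fixed-width line into its fields using the column widths.
def SubstrSting(line: String, colLength: Seq[Int]): Seq[String] = {
  var now = 0  // start position of the current column
  val collector = new Array[String](colLength.length)
  for (k <- colLength.indices) {
    collector(k) = line.substring(now, now + colLength(k))
    now = now + colLength(k)
  }
  collector.toSeq
}

// ColLengthSeq holds the column lengths, read from a separate schema file.
val StringArray = rdd1.map(SubstrSting(_, ColLengthSeq))

// One array column first, then one output column per array element.
StringArray.toDF("StringCol")
  .select((0 until ColCount).map(j => $"StringCol"(j) as column_seq(j)): _*)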
  .write.mode("overwrite").parquet("c"home")

This code works fine for files with a small number of columns, but it takes a lot of time and resources with 20K columns. As the number of columns increases, so does the execution time.

Has anyone faced this issue with a large number of columns? I need suggestions on performance tuning: how can I tune this job or code?
