Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


Apache Spark: how to fill a column that has empty values in a DataFrame using Java

I have to copy values from DataFrame1 into a column of DataFrame2 that currently holds empty values. Basically, I am updating a column in DataFrame2.

Both DataFrames have 2 common columns.

Is there a way to do this in Java? Or is there a different approach?

Sample Input :

1) File1.csv

BILL_ID,BILL_NBR_TYPE_CD,BILL_NBR,VERSION,PRIM_SW
0501841898,BIN     ,404154,1000,Y
0681220958,BIN     ,735332,1000,Y
5992410180,BIN     ,454680,1000,Y
6995270884,SREBIN  ,1000252750295575,1000,Y

Here BILL_ID is the system ID and BILL_NBR is the external ID.

2) File2.csv

TXN_ID,TXN_TYPE,BILL_ID,BILL_NBR_TYPE_CD,BILL_NBR
01234, ABC     ,"     ",BIN     ,404154
22365, XYZ     ,"     ",BIN     ,735332
45890, LKJ     ,"     ",BIN     ,454680
23456, MPK     ,"     ",SREBIN  ,1000252750295575

Sample Output

As shown below, the BILL_ID values should be populated in File2.csv:

01234, ABC     ,0501841898,BIN     ,404154
22365, XYZ     ,0681220958,BIN     ,735332
45890, LKJ     ,5992410180,BIN     ,454680
23456, MPK     ,6995270884,SREBIN  ,1000252750295575

I have created two DataFrames and loaded the data from both files into them, but I am not sure how to proceed from here.

EDIT

Basically, I want clarity on the three steps below:

  1. How do I get the BILL_NBR and BILL_NBR_TYPE_CD values from File2.csv?

     For this step I have written: file2Df.select("BILL_NBR_TYPE_CD","BILL_NBR");

  2. How do I get the BILL_ID values from File1.csv based on the values retrieved in step 1?

  3. How do I update the BILL_ID values in File2.csv accordingly?

I am new to Spark and would appreciate any pointers.


1 Reply


You need to join the two DataFrames on their common columns, BILL_NBR and BILL_NBR_TYPE_CD.

Assumption: there is a one-to-one relationship between the BILL_NBR and BILL_ID columns.

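Assuming your DataFrames for File1.csv and File2.csv are named file1DF and file2DF respectively, the following should work for you (Dataset and Row come from org.apache.spark.sql):

// Leave the blank BILL_ID out of the File2 side so the result has a single BILL_ID column.
Dataset<Row> file1Cols = file1DF.select("BILL_ID", "BILL_NBR", "BILL_NBR_TYPE_CD");
Dataset<Row> file2Cols = file2DF.select("TXN_ID", "TXN_TYPE", "BILL_NBR_TYPE_CD", "BILL_NBR");
// Join on both common columns.
Dataset<Row> joined = file2Cols.join(file1Cols, file2Cols.col("BILL_NBR").equalTo(file1Cols.col("BILL_NBR"))
        .and(file2Cols.col("BILL_NBR_TYPE_CD").equalTo(file1Cols.col("BILL_NBR_TYPE_CD"))));

The joined result still carries BILL_NBR and BILL_NBR_TYPE_CD from both sides, so select only the columns you need (for example file1Cols.col("BILL_ID") together with the file2Cols columns) before writing the output.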

Note: I have not been able to run the code above, so please let me know if you hit any compile-time or runtime errors.
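
To see what the join is doing, here is a minimal sketch of the same lookup in plain Java collections, without Spark. The class name BillIdLookup and the methods key and fillBillIds are made up for this illustration, and the rows are plain String arrays in the column order of the sample files. Two assumptions differ from the Spark code: the key values are trimmed before comparison (the sample files pad BILL_NBR_TYPE_CD with spaces), and a File2 row with no match keeps its blank BILL_ID, which behaves like a left join rather than an inner join.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BillIdLookup {

    // Composite key over the two common columns.
    // Trimming is an assumption, made because the sample data pads the type code.
    // The "|" separator assumes neither value contains that character.
    static String key(String typeCd, String billNbr) {
        return typeCd.trim() + "|" + billNbr.trim();
    }

    // file1Rows columns: BILL_ID, BILL_NBR_TYPE_CD, BILL_NBR, VERSION, PRIM_SW
    // file2Rows columns: TXN_ID, TXN_TYPE, BILL_ID, BILL_NBR_TYPE_CD, BILL_NBR
    static List<String[]> fillBillIds(List<String[]> file1Rows, List<String[]> file2Rows) {
        // Steps 1 and 2: index File1 by (BILL_NBR_TYPE_CD, BILL_NBR) -> BILL_ID.
        Map<String, String> billIdByKey = new HashMap<>();
        for (String[] row : file1Rows) {
            billIdByKey.put(key(row[1], row[2]), row[0]);
        }

        // Step 3: copy each File2 row and fill in BILL_ID when a match exists.
        List<String[]> result = new ArrayList<>();
        for (String[] row : file2Rows) {
            String[] copy = row.clone();
            String billId = billIdByKey.get(key(row[3], row[4]));
            if (billId != null) {
                copy[2] = billId;
            }
            result.add(copy);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> file1 = new ArrayList<>();
        file1.add(new String[] {"0501841898", "BIN     ", "404154", "1000", "Y"});
        file1.add(new String[] {"6995270884", "SREBIN  ", "1000252750295575", "1000", "Y"});

        List<String[]> file2 = new ArrayList<>();
        file2.add(new String[] {"01234", " ABC     ", "     ", "BIN     ", "404154"});
        file2.add(new String[] {"23456", " MPK     ", "     ", "SREBIN  ", "1000252750295575"});

        for (String[] row : fillBillIds(file1, file2)) {
            System.out.println(String.join(",", row));
        }
    }
}
```

Because BILL_ID is kept as a String throughout, leading zeros such as the one in 0501841898 are preserved. If BILL_NBR were not unique per type code in File1, the map would silently keep the last BILL_ID seen, which is why the one-to-one assumption above matters.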

