I am using Spark SQL 2.4.1.
How can I perform different joins depending on the value of a column?
Sample data
import spark.implicits._

val data = List(
  ("20", "score", "school", 14, 12),
  ("21", "score", "school", 13, 13),
  ("22", "rate",  "school", 11, 14),
  ("21", "rate",  "school", 13, 12)
)
val df = data.toDF("id", "code", "entity", "value1", "value2")
+---+-----+------+------+------+
| id| code|entity|value1|value2|
+---+-----+------+------+------+
| 20|score|school|    14|    12|
| 21|score|school|    13|    13|
| 22| rate|school|    11|    14|
| 21| rate|school|    13|    12|
+---+-----+------+------+------+
Based on the value of the "code" column, I need to join with different lookup tables.
val data1 = List(
  ("22", 11, "A"),
  ("22", 14, "B"),
  ("20", 13, "C"),
  ("21", 12, "C"),
  ("21", 13, "D")
)
val rateDs = data1.toDF("id", "map_code", "map_val")
val scoreDs = // scoreTable
If the "code" column value is "rate", I need to join with rateDs.
If the "code" column value is "score", I need to join with scoreDs.
How can I handle this kind of thing in Spark? Is there an optimal way to achieve it?
Expected result for the "rate" rows:
+---+-----+------+------+------+
| id| code|entity|value1|value2|
+---+-----+------+------+------+
| 22| rate|school|     A|     B|
| 21| rate|school|     D|     C|
+---+-----+------+------+------+
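
One approach I have considered (a sketch only, not necessarily the optimum; it assumes scoreDs has the same (id, map_code, map_val) shape as rateDs, and the helper name lookupValues is mine): split df by the code value, join each subset with its own lookup table once per value column, then union the pieces back together.

import org.apache.spark.sql.DataFrame
import spark.implicits._

// Join one subset against its lookup table twice: once to resolve value1,
// once to resolve value2, matching on id plus the original numeric value.
def lookupValues(src: DataFrame, lookup: DataFrame): DataFrame = {
  val m1 = lookup.select($"id".as("id1"), $"map_code".as("mc1"), $"map_val".as("mv1"))
  val m2 = lookup.select($"id".as("id2"), $"map_code".as("mc2"), $"map_val".as("mv2"))
  src
    .join(m1, $"id" === $"id1" && $"value1" === $"mc1", "left")
    .join(m2, $"id" === $"id2" && $"value2" === $"mc2", "left")
    .select($"id", $"code", $"entity", $"mv1".as("value1"), $"mv2".as("value2"))
}

// Route each code value to its lookup table, then union the results.
val result = lookupValues(df.filter($"code" === "rate"), rateDs)
  .union(lookupValues(df.filter($"code" === "score"), scoreDs))

result.filter($"code" === "rate").show()

If there are many distinct code values, the same pattern should generalize to folding over a Map from code value to lookup table, which keeps each join small instead of building one large conditional join.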