sql - Spark Dataframe Nested Case When Statement

Question

Welcome To Ask or Share your Answers For Others

sql - Spark Dataframe Nested Case When Statement

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

sql - Spark Dataframe Nested Case When Statement

I need to implement the below SQL logic in Spark DataFrame

SELECT KEY,
    CASE WHEN tc in ('a','b') THEN 'Y'
         WHEN tc in ('a') AND amt > 0 THEN 'N'
         ELSE NULL END REASON,
FROM dataset1;

My input DataFrame is as below:

val dataset1 = Seq((66, "a", "4"), (67, "a", "0"), (70, "b", "4"), (71, "d", "4")).toDF("KEY", "tc", "amt")

dataset1.show()

+---+---+---+
|KEY| tc|amt|
+---+---+---+
| 66|  a|  4|
| 67|  a|  0|
| 70|  b|  4|
| 71|  d|  4|
+---+---+---+

I have implement the nested case when statement as:

dataset1.withColumn("REASON", when(col("tc").isin("a", "b"), "Y")
  .otherwise(when(col("tc").equalTo("a") && col("amt").geq(0), "N")
    .otherwise(null))).show()

+---+---+---+------+
|KEY| tc|amt|REASON|
+---+---+---+------+
| 66|  a|  4|     Y|
| 67|  a|  0|     Y|
| 70|  b|  4|     Y|
| 71|  d|  4|  null|
+---+---+---+------+

Readability of the above logic with "otherwise" statement is little messy if the nested when statements goes further.

Is there any better way of implementing nested case when statements in Spark DataFrames?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T19:36:20+0000

There is no nesting here, therefore there is no need for otherwise. All you need is chained when:

import spark.implicits._

when($"tc" isin ("a", "b"), "Y")
  .when($"tc" === "a" && $"amt" >= 0, "N")

ELSE NULL is implicit so you can omit it completely.

Pattern you use, is more more applicable for folding over a data structure:

val cases = Seq(
  ($"tc" isin ("a", "b"), "Y"),
  ($"tc" === "a" && $"amt" >= 0, "N")
)

where when - otherwise naturally follows recursion pattern and null provides the base case.

cases.foldLeft(lit(null)) {
  case (acc, (expr, value)) => when(expr, value).otherwise(acc)
}

Please note, that it is impossible to reach "N" outcome, with this chain of conditions. If tc is equal to "a" it will be captured by the first clause. If it is not, it will fail to satisfy both predicates and default to NULL. You should rather:

when($"tc" === "a" && $"amt" >= 0, "N")
 .when($"tc" isin ("a", "b"), "Y")

Categories

sql - Spark Dataframe Nested Case When Statement

sql - Spark Dataframe Nested Case When Statement

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags