python - Where do you need to use lit() in Pyspark SQL?

Question

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I'm trying to make sense of where you need to use a lit value, which is defined as a literal column in the documentation.

Take for example this udf, which returns the element at a given index of an array column:

def find_index(column, index):
    return column[index]

If I were to pass an integer into this I would get an error. I would need to pass a lit(n) value into the udf to get the correct index of an array.

Is there a place I can better learn the hard and fast rules of when to use lit and possibly col as well?

1 Reply

深蓝 · Answer 1 · 2021-10-23T18:29:32+0000

To keep it simple, you need a Column (it can be one created using lit, but that is not the only option) in two cases: when the JVM counterpart expects a column and there is no internal conversion in the Python wrapper, or when you want to call a Column-specific method.

In the first case the only strict rule is the one that applies to UDFs. A UDF (Python or JVM) can be called only with arguments of Column type. The same typically applies to functions from pyspark.sql.functions. In other cases it is always best to check the documentation and docstrings first and, if that is not sufficient, the docs of the corresponding Scala counterpart.

In the second case the rules are simple. If, for example, you want to compare a column to a value, then the value has to be on the RHS:

col("foo") > 0  # OK

or the value has to be wrapped with lit:

lit(0) < col("foo")  # OK

In Python many operators (<, ==, <=, &, |, +, -, *, /) can take a non-Column object on the LHS:

0 < col("foo")

but such applications are not supported in Scala.

It goes without saying that you have to use lit if you want to access any of the pyspark.sql.Column methods while treating a standard Python scalar as a constant column. For example, you'll need

c = lit(1)

not

c = 1

to call

c.between(0, 3)  # type: pyspark.sql.Column
