A null value represents "no value" or "nothing"; it is not even an empty string or zero. It is used to signal that nothing useful exists.
NaN stands for "Not a Number". It is usually the result of a mathematical operation that doesn't make sense, e.g. 0.0/0.0.
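To make the distinction concrete, here is a minimal sketch (the DataFrame and its column names a and b are made up for illustration) that builds a frame containing both: Python's None becomes null, while float("nan") becomes NaN.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# None maps to null, float("nan") maps to NaN; column "a" is inferred as double
df = spark.createDataFrame(
    [(1.0, "x"), (None, "y"), (float("nan"), "z")],
    ["a", "b"],
)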
One possible way to handle null values is to remove them with:
df.na.drop()
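Note that na.drop() treats NaN like null, so the same call removes both. It also takes parameters to control what gets dropped; a sketch using the example df from above:
df.na.drop(how="all")       # drop a row only if all of its columns are null/NaN
df.na.drop(subset=["a"])    # only consider column "a" when deciding
df.na.drop(thresh=2)        # keep rows that have at least 2 non-null values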
Or you can replace them with an actual value (here I used 0) with:
df.na.fill(0)
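Like drop, fill replaces NaN as well as null in numeric columns, and it only touches columns whose type matches the given value. It also accepts a subset or a per-column dict; continuing with the example df:
df.na.fill(0.0, subset=["a"])             # fill only column "a"
df.na.fill({"a": 0.0, "b": "unknown"})    # different value per column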
Another way would be to select the rows where a specific column is null for further processing:
from pyspark.sql.functions import col
df.where(col("a").isNull())
df.where(col("a").isNotNull())
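Since the filter is just a Column expression, it composes with the rest of the DataFrame API. For instance, one common pattern (a sketch, not the only way) counts the missing values in every column with count and when:
from pyspark.sql.functions import col, count, when

# when() yields a value only where the column is null; count() skips nulls,
# so each result column holds the number of null rows in that column
df.select(
    [count(when(col(c).isNull(), c)).alias(c) for c in df.columns]
).show()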
Rows with NaN can be selected with the analogous isnan function:
from pyspark.sql.functions import isnan
df.where(isnan(col("a")))
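Keep in mind that the two checks do not overlap: isNull does not match NaN, and isnan does not match null. To catch both in one filter, combine the conditions:
from pyspark.sql.functions import col, isnan

df.where(col("a").isNull() | isnan(col("a")))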