
Spark DataFrame: count distinct values of every column

The question is pretty much in the title: Is there an efficient way to count the distinct values in every column in a DataFrame?

The describe() method provides only the count, not the distinct count, and I wonder if there is a way to get the distinct count for all (or some selected) columns.
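For context, here is a minimal sketch (with a hypothetical two-column DataFrame) of what describe() does and does not report:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A small hypothetical DataFrame, just for illustration.
df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 2)], ["k", "v"])

# describe() reports count, mean, stddev, min and max per column --
# there is no row for the number of distinct values.
df.describe().show()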


1 Reply


In PySpark you could do something like this, using countDistinct():

from pyspark.sql.functions import col, countDistinct

# One countDistinct aggregate per column, aliased to the column name,
# so all columns are counted in a single aggregation.
df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns))
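Applied to the hypothetical DataFrame sketched in the question above, collecting the single result row gives one distinct count per column:

row = df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns)).first()
print(row.asDict())  # e.g. {'k': 2, 'v': 2} for the sample data above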

Similarly in Scala:

import org.apache.spark.sql.functions.{col, countDistinct}

// Build one countDistinct column per DataFrame column and select them all.
df.select(df.columns.map(c => countDistinct(col(c)).alias(c)): _*)

If you want to speed things up at the cost of some accuracy, you could also use approxCountDistinct() (deprecated since Spark 2.1 in favor of approx_count_distinct()).
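A minimal sketch of the approximate variant, assuming the same hypothetical df; the optional rsd argument caps the relative standard deviation of the HyperLogLog++ estimate (default 0.05):

from pyspark.sql.functions import approx_count_distinct, col

# Approximate distinct counts for every column; a tighter rsd means a
# more accurate (and more expensive) estimate.
df.agg(*(approx_count_distinct(col(c), rsd=0.01).alias(c)
         for c in df.columns)).show()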

