Here is one possible approach for dropping all columns that contain NULL values; see here for the source of the snippet that counts NULL values per column.
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data
df = pd.DataFrame({'x1': ['a', '1', '2'],
                   'x2': ['b', None, '2'],
                   'x3': ['c', '0', '3']})
df = spark.createDataFrame(df)
df.show()
def drop_null_columns(df):
    """
    Drops all columns that contain null values.

    :param df: a PySpark DataFrame
    :return: the DataFrame without the null-containing columns
    """
    # Count the null values in each column in a single pass over the data
    null_counts = df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c)
                             for c in df.columns]).collect()[0].asDict()
    to_drop = [k for k, v in null_counts.items() if v > 0]
    return df.drop(*to_drop)
# Drops column x2, because it contains null values
drop_null_columns(df).show()
Before:
+---+----+---+
| x1| x2| x3|
+---+----+---+
| a| b| c|
| 1|null| 0|
| 2| 2| 3|
+---+----+---+
After:
+---+---+
| x1| x3|
+---+---+
| a| c|
| 1| 0|
| 2| 3|
+---+---+
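If you instead only want to drop columns that are *entirely* null (rather than any column containing at least one null), a small variation of the same counting trick works. This is a sketch of my own, not part of the snippet above; it reuses the same F import and Spark session, and drop_fully_null_columns is just a hypothetical name:

def drop_fully_null_columns(df):
    """
    Drops only the columns in which every value is null.
    Variant of drop_null_columns above.
    """
    total = df.count()
    # Same per-column null count as before
    null_counts = df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c)
                             for c in df.columns]).collect()[0].asDict()
    # Drop a column only if all of its rows are null
    to_drop = [k for k, v in null_counts.items() if v == total]
    return df.drop(*to_drop)

With the sample data above this version keeps x2, since that column also contains non-null values.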
Hope this helps!