I was trying to print total elements in each partitions in a DataFrame using spark 2.2
from pyspark.sql.functions import *
from pyspark.sql import SparkSession
def count_elements(splitIndex, iterator):
n = sum(1 for _ in iterator)
yield (splitIndex, n)
spark = SparkSession.builder.appName("tmp").getOrCreate()
num_parts = 3
df = spark.read.json("/tmp/tmp/gon_s.json").repartition(num_parts)
print("df has partitions."+ str(df.rdd.getNumPartitions()))
print("Elements across partitions is:" + str(df.rdd.mapPartitionsWithIndex(lambda ind, x: count_elements(ind, x)).take(3)))
The Code above kept failing with following error
n = sum(1 for _ in iterator)
File "/home/dev/wk/pyenv/py3/lib/python3.5/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 40, in _
jc = getattr(sc._jvm.functions, name)(col._jc if isinstance(col, Column) else col)
AttributeError: 'NoneType' object has no attribute '_jvm'
after removing the import below
from pyspark.sql.functions import *
Code works fine
skewed_large_df has partitions.3
The distribution of elements across partitions is:[(0, 1), (1, 2), (2, 2)]
What is it causing this error and how can I fix it?
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…