Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


pandas - toPandas() error using pyspark: 'int' object is not iterable

I have a PySpark dataframe that I am trying to convert to pandas using toPandas(), but I run into the error shown below.

I tried the following, but got the same error each time:
1) limiting the data to just a few records
2) calling collect() explicitly (which I believe toPandas() uses internally)

I searched many posts on SO, but AFAIK none covers this toPandas() issue.

Snapshot of my dataframe:

>>sc.version 
2.3.0.2.6.5.0-292

>>print(type(df4),len(df4.columns),df4.count())
(<class 'pyspark.sql.dataframe.DataFrame'>, 13, 296327)

>>df4.printSchema()
 root
  |-- id: string (nullable = true)
  |-- gender: string (nullable = true)
  |-- race: string (nullable = true)
  |-- age: double (nullable = true)
  |-- status: integer (nullable = true)
  |-- height: decimal(6,2) (nullable = true)
  |-- city: string (nullable = true)
  |-- county: string (nullable = true)
  |-- zipcode: string (nullable = true)
  |-- health: double (nullable = true)
  |-- physical_inactivity: double (nullable = true)
  |-- exercise: double (nullable = true)
  |-- weight: double (nullable = true)

>>df4.limit(2).show()
+------+------+------+----+-------+-------+---------+-------+-------+------+-------------------+--------+------------+
|id    |gender|race  |age |status |height | city    |county |zipcode|health|physical_inactivity|exercise|weight      |
+------+------+------+----+-------+-------+---------+-------+-------+------+-------------------+--------+------------+
| 90001|  MALE| WHITE|61.0|      0|  70.51|DALEADALE|FIELD  |  29671|  null|               29.0|    49.0|       162.0|
| 90005|  MALE| WHITE|82.0|      0|  71.00|DALEBDALE|FIELD  |  36658|  16.0|               null|    49.0|       195.0|
+------+------+------+----+-------+-------+---------+-------+-------+------+-------------------+--------+------------+
*Some features had to be masked due to data privacy concerns.

Error:

>>df4.limit(10).toPandas()

'int' object is not iterable
Traceback (most recent call last):
  File "/repo/python2libs/pyspark/sql/dataframe.py", line 1968, in toPandas
    pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
  File "/repo/python2libs/pyspark/sql/dataframe.py", line 467, in collect
    return list(_load_from_socket(sock_info, BatchedSerializer(PickleSerializer())))
  File "/repo/python2libs/pyspark/rdd.py", line 142, in _load_from_socket
    port, auth_secret = sock_info
TypeError: 'int' object is not iterable

1 Reply


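Our custom repository of libraries contained its own pyspark package, which clashed with the pyspark provided by the Spark cluster. For some reason, having both worked in the Spark shell but not in a notebook. The traceback appears consistent with this: the failing code was loaded from /repo/python2libs/pyspark, and it expected sock_info to be a (port, auth_secret) pair but received a plain int.
Renaming the pyspark library in the custom repository resolved the issue!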
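
If you suspect a similar clash, a quick check is to see which copy of pyspark the interpreter would actually import and whether it lives under the cluster's Spark installation. The following is only a rough sketch, not something from the original setup: the helper names module_origin and is_under are made up for illustration, it assumes Python 3 (for importlib.util.find_spec), and it assumes SPARK_HOME is set in the notebook's environment. On Python 2, printing pyspark.__file__ after importing it gives the same information.

```python
import importlib.util
import os


def module_origin(name):
    """Return the file path Python would import `name` from, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


def is_under(path, root):
    """Return True if `path` is located inside the directory `root`."""
    path = os.path.realpath(path)
    root = os.path.realpath(root)
    return os.path.commonpath([path, root]) == root


if __name__ == "__main__":
    origin = module_origin("pyspark")
    spark_home = os.environ.get("SPARK_HOME")  # assumed to be set by the cluster
    print("pyspark resolved from:", origin)
    print("SPARK_HOME:", spark_home)
    if origin and spark_home:
        # False here would suggest another copy is shadowing the cluster's pyspark.
        print("bundled with cluster Spark:", is_under(origin, spark_home))
```

Running this in both the Spark shell and the notebook and comparing the printed paths should show whether the two environments pick up different copies.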

