apache spark - pyspark-java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema


posted Jan 31, 2022 in Technique[技术] by 深蓝 (71.8m points)

I'm running PySpark SQL code on the Hortonworks sandbox.

18/08/11 17:02:22 INFO spark.SparkContext: Running Spark version 1.6.3

# code
from pyspark.sql import *
from pyspark.sql.types import *
rdd1 = sc.textFile("/user/maria_dev/spark_data/products.csv")
rdd2 = rdd1.map(lambda x: x.split(","))
df1 = sqlContext.createDataFrame(rdd2, ["id", "cat_id", "name", "desc", "price", "url"])
df1.printSchema()

root
 |-- id: string (nullable = true)
 |-- cat_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- desc: string (nullable = true)
 |-- price: string (nullable = true)
 |-- url: string (nullable = true)

df1.show() 
+---+------+--------------------+----+------+--------------------+
| id|cat_id|                name|desc| price|                 url|
+---+------+--------------------+----+------+--------------------+
|  1|     2|Quest Q64 10 FT. ...|    | 59.98|http://images.acm...|
|  2|     2|Under Armour Men'...|    |129.99|http://images.acm...|
|  3|     2|Under Armour Men'...|    | 89.99|http://images.acm...|
|  4|     2|Under Armour Men'...|    | 89.99|http://images.acm...|
|  5|     2|Riddell Youth Rev...|    |199.99|http://images.acm...|

# When I try to get counts I get the following error.
df1.count()

**Caused by: java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 6 fields are required while 7 values are provided.**

# I get the same error for the following code as well
df1.registerTempTable("products_tab")
# show() returns None, so keep the DataFrame and call show() separately
df_query = sqlContext.sql("select id, name, desc from products_tab order by name, id")
df_query.show()

The desc column is empty. I'm not sure whether an empty column needs to be handled differently when creating a DataFrame and calling methods on it.

The same error occurs when running the SQL query. It seems to be triggered by the "order by" clause: if I remove it, the query runs successfully.

Please let me know if you need more information. I'd appreciate an answer on how to handle this error.

I checked whether the name field contains any commas, as suggested by Chandan Ray. It does not.

rdd1.count()
=> 1345
rdd2.count()
=> 1345
# keep only the id and name columns from rdd2
rdd_name = rdd2.map(lambda x: (x[0], x[2]))
rdd_name.count()
=> 1345
rdd_name_comma = rdd_name.filter(lambda x: "," in x[1])
rdd_name_comma.count()
==> 0
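
Because rdd2 was already produced by splitting every line on commas, none of its fields can contain a comma, so the check above returns 0 whatever the data looks like. A check on the number of fields per raw line is more telling. The following is only a sketch: the sample lines and the helper name has_wrong_field_count are made up for illustration and are not taken from products.csv. It runs in plain Python; the last comment shows how the same helper might be applied to rdd1 in the pyspark shell.

```python
EXPECTED_FIELDS = 6  # id, cat_id, name, desc, price, url


def has_wrong_field_count(line, expected=EXPECTED_FIELDS):
    """Return True when a naive comma split does not yield `expected` fields."""
    return len(line.split(",")) != expected


# Hypothetical sample lines (invented for this sketch). The third one has a
# comma inside a double-quoted name, like the kind of record that breaks the
# naive split.
sample_lines = [
    '1,2,Sample Tent,,59.98,http://example.com/a.jpg',
    '2,2,Sample Shirt,,129.99,http://example.com/b.jpg',
    '3,2,"Sample Glove, Left",,19.99,http://example.com/c.jpg',
]

bad_lines = [line for line in sample_lines if has_wrong_field_count(line)]
for line in bad_lines:
    print(len(line.split(",")), line)

# In the pyspark shell, the same helper could be applied to the raw lines:
#   rdd1.filter(has_wrong_field_count).take(5)
```

A check like this scans every line, whereas show() only displays the first rows. That would be consistent with show() succeeding while count() and "order by" fail, if the offending record sits further down the file.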


1 Reply

深蓝 · Answer 1 · 2022-01-31T07:13:31+0000

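I found the issue: it was caused by one bad record in which a comma was embedded in a string field. Even though the string was double-quoted, Python's split(",") breaks it into two columns, which gives 7 values instead of 6. I switched to the Databricks spark-csv package, which handles quoted fields: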

# from command prompt
pyspark --packages com.databricks:spark-csv_2.10:1.4.0

# on pyspark
from pyspark.sql.types import *

# Note: DecimalType() defaults to precision 10 and scale 0, so prices are
# rounded to whole numbers in the output below. DecimalType(10, 2) would
# keep the cents.
schema1 = StructType([
    StructField("id", IntegerType(), True),
    StructField("cat_id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("desc", StringType(), True),
    StructField("price", DecimalType(), True),
    StructField("url", StringType(), True)
])

df1 = sqlContext.read.format('com.databricks.spark.csv').schema(schema1).load('/user/maria_dev/spark_data/products.csv')
df1.show()
    +---+------+--------------------+----+-----+--------------------+
    | id|cat_id|                name|desc|price|                 url|
    +---+------+--------------------+----+-----+--------------------+
    |  1|     2|Quest Q64 10 FT. ...|    |   60|http://images.acm...|
    |  2|     2|Under Armour Men'...|    |  130|http://images.acm...|
    |  3|     2|Under Armour Men'...|    |   90|http://images.acm...|
    |  4|     2|Under Armour Men'...|    |   90|http://images.acm...|
    |  5|     2|Riddell Youth Rev...|    |  200|http://images.acm...|

df1.printSchema()
    root
     |-- id: integer (nullable = true)
     |-- cat_id: integer (nullable = true)
     |-- name: string (nullable = true)
     |-- desc: string (nullable = true)
     |-- price: decimal(10,0) (nullable = true)
     |-- url: string (nullable = true)

df1.count()
     1345
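
If adding a package is not an option, the original RDD approach could probably be kept by replacing the plain split with Python's standard csv module, which honours double quotes. This is a minimal sketch, not what was actually run: the sample line and the helper name parse_csv_line are invented for illustration. It assumes that quoted fields never contain line breaks (sc.textFile splits on newlines before any parsing happens), and on Python 2 non-ASCII text may need to be encoded first.

```python
import csv


def parse_csv_line(line):
    """Parse one CSV line into a list of fields, honouring double quotes."""
    return next(csv.reader([line]))


# Hypothetical record with a comma inside a double-quoted name.
bad_line = '3,2,"Sample Glove, Left",,19.99,http://example.com/c.jpg'

naive = bad_line.split(",")        # the quoted name is cut in two
parsed = parse_csv_line(bad_line)  # the quoted name stays in one field

print(len(naive))   # 7
print(len(parsed))  # 6
print(parsed[2])    # Sample Glove, Left

# In the pyspark shell this would replace the lambda used for rdd2:
#   rdd2 = rdd1.map(parse_csv_line)
```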
