I have two array columns (names, score). I need to explode both of them and use the values in names as the column names for the values in score (similar to a pivot).
+----+-------------------------+------------------------------------------+
| id | names                   | score                                    |
+----+-------------------------+------------------------------------------+
|ab01|[F1, F2, F3, F4, F5]     |[00123, 000.001, 00127, 00.0123, 111]     |
|ab02|[F1, F2, F3, F4, F5, F6] |[00124, 000.003, 00156, 00.067, 156, 254] |
|ab03|[F1, F2, F3, F4, F5]     |[00234, 000.078, 00188, 00.0144, 188]     |
|ab04|[F1, F2, F3, F4, F5]     |[00345, 000.01112, 001567, 00.0186, 555]  |
+----+-------------------------+------------------------------------------+
Expected output:
id F1 F2 F3 F4 F5 F6
ab01 00123 000.001 00127 00.0123 111 null
ab02 00124 000.003 00156 00.067 156 254
ab03 00234 000.078 00188 00.0144 188 null
ab04 00345 000.01112 001567 00.0186 555 null
I tried zipping names and score together and then exploding the result:
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

# zip the two arrays element-wise into an array of (names, score) structs
combine = F.udf(lambda x, y: list(zip(x, y)),
                ArrayType(StructType([
                    StructField("names", StringType()),
                    StructField("score", StringType())
                ])))

df2 = (df.withColumn("new", combine("names", "score"))
         .withColumn("new", F.explode("new"))
         .select("id",
                 F.col("new.names").alias("names"),
                 F.col("new.score").alias("score")))
I'm getting an error:
TypeError: zip argument #1 must support iteration
I also tried exploding using rdd flatMap()
and I still get the same error.
Is there an alternate way to achieve this?
Thanks in advance.
question from:
https://stackoverflow.com/questions/65849819/how-to-zip-two-columns-explode-them-and-finally-pivot-in-pyspark