You can explode the contents array, then group by itemId and use collect_list to gather the contentId values back into an array:
import pyspark.sql.functions as F
df.show(truncate=False)
+------------------------------+------+
|contents |itemId|
+------------------------------+------+
|[[content1, 1], [content2, 2]]|item1 |
|[[content3, 3], [content4, 4]]|item2 |
+------------------------------+------+
result = (df.select('itemId', F.explode('contents').alias('contents'))
          .groupBy('itemId')
          .agg(F.collect_list('contents.contentId').alias('contents')))
result.show()
+------+--------------------+
|itemId| contents|
+------+--------------------+
| item2|[content3, content4]|
| item1|[content1, content2]|
+------+--------------------+
Alternatively, if you have Spark 2.4 or above (higher-order functions such as transform were added in 2.4), you can use transform, which also preserves the original element order, whereas collect_list after a groupBy does not guarantee ordering:
import pyspark.sql.functions as F
result = df.select('itemId', F.expr("transform(contents, x -> x.contentId)").alias('contents'))
result.show()
+------+--------------------+
|itemId| contents|
+------+--------------------+
| item1|[content1, content2]|
| item2|[content3, content4]|
+------+--------------------+