Use conditional summation, just as you would in SQL:
from pyspark.sql import functions as F
df1 = df.select(
    F.sum(F.when(F.col("matches") == 0, 1).otherwise(0)).alias("0 matches"),
    F.sum(F.when(F.col("matches").between(1, 2), 1).otherwise(0)).alias("lessthan3"),
    F.sum(F.when(F.col("matches") >= 3, 1).otherwise(0)).alias("morethan3")
)
df1.show()
#+---------+---------+---------+
#|0 matches|lessthan3|morethan3|
#+---------+---------+---------+
#| 0| 1| 3|
#+---------+---------+---------+
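Since this mirrors SQL, you can also express it as a raw query. This is a minimal sketch assuming an active SparkSession named spark; the view name "matches_view" is just for illustration:
# register the DataFrame so it can be queried with SQL
df.createOrReplaceTempView("matches_view")
spark.sql("""
    SELECT
        SUM(CASE WHEN matches = 0 THEN 1 ELSE 0 END) AS `0 matches`,
        SUM(CASE WHEN matches BETWEEN 1 AND 2 THEN 1 ELSE 0 END) AS lessthan3,
        SUM(CASE WHEN matches >= 3 THEN 1 ELSE 0 END) AS morethan3
    FROM matches_view
""").show()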
Another way of doing this is to group by the ranges and count:
df1 = df.withColumn(
    "range",
    F.when(F.col("matches") == 0, "0 matches")
    .when(F.col("matches").between(1, 2), "lessthan3")
    .when(F.col("matches") >= 3, "morethan3")
).groupBy("range").count()
df1.show()
#+---------+-----+
#| range|count|
#+---------+-----+
#|lessthan3| 1|
#|morethan3| 3|
#+---------+-----+
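Note that groupBy only produces rows for ranges that actually occur in the data, so "0 matches" is absent here. If you want to reproduce both outputs, here is a sketch of the assumed input; the column names ("names", "matches") and the values are guesses consistent with the results shown:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# hypothetical sample: no rows with 0 matches, one row in 1-2,
# and three rows with 3 or more matches
df = spark.createDataFrame(
    [("a", 1), ("b", 3), ("c", 5), ("d", 10)],
    ["names", "matches"]
)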