Search Built-in Algorithms
Consider looking up Spark's built-in RDD-based clustering algorithms first, since they cover the most common cases and went through a rigorous validation process before release:
Clustering - RDD-based API
If you're more familiar with the DataFrame-based API, take a glance at its clustering section instead. Keep in mind that as of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode (no new features, only bug fixes); the primary ML API is now the DataFrame-based API in the spark.ml package.
Implement Yourself
Pandas UDFs
If you already have a model object, consider Pandas UDFs, since they support iterators as of Spark 3.0.0. Put simply, that means the model won't be reloaded for each row.
from pyspark.sql.functions import pandas_udf

@pandas_udf(...)  # fill in the return type, e.g. "double"
def categorize(iterator):
    model = ...  # load the model once per batch iterator, not per row
    for features in iterator:
        yield model.predict(features)

# GROUP BY in Spark SQL or window functions can also be considered,
# depending on your scenario. Just remember that DataFrames are still
# based on RDDs: they are immutable, high-level abstractions.

spark_df.withColumn("clustered_result", categorize("some_column")).show()
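To see why the iterator form matters, here is a plain-Python sketch of the same pattern without Spark. The DummyModel class and the batch contents are made up for illustration; in Spark, each element of the iterator is a pandas.Series holding one Arrow batch:

```python
load_count = 0  # track how many times the "model" gets loaded

class DummyModel:
    """Hypothetical stand-in for an expensive-to-load model."""
    def __init__(self):
        global load_count
        load_count += 1  # each construction counts as one "load"

    def predict(self, batch):
        return [x * 2 for x in batch]

def categorize(iterator):
    model = DummyModel()  # loaded once per iterator, not once per row
    for features in iterator:
        yield model.predict(features)

batches = [[1, 2, 3], [4, 5]]  # stand-in for Arrow record batches
results = list(categorize(iter(batches)))
```

After consuming both batches, load_count is 1: the model was constructed a single time even though five rows were scored, which is exactly the saving the iterator-style Pandas UDF gives you per task.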
RDD Exploring
If, unfortunately, the clustering algorithm you intend to run is not among Spark's built-in ones and has no training phase (i.e., it produces no model object), consider converting the Pandas DataFrame into an RDD and implementing the algorithm yourself. A rough process looks like the following:
pandas_df = ...
spark_df = spark.createDataFrame(pandas_df)
# ...
clustering_result = spark_df.rdd.map(lambda p: cluster_algorithm(p))
note1: This is only a rough outline; you might want to partition the whole dataset into a few RDDs based on region, then execute the clustering algorithm within each partition. Since the details of your clustering algorithm aren't entirely clear, I can only give advice based on some assumptions.
note2: an RDD implementation should be your last option
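The per-region idea in note1 can be sketched in plain Python (mirroring what rdd.mapPartitions would do after partitioning by region). Both cluster_algorithm and the data here are made-up placeholders:

```python
from itertools import groupby

def cluster_algorithm(points):
    # Toy stand-in for your real algorithm: label each point by whether x < 5
    return [(x, y, 0 if x < 5 else 1) for x, y in points]

# Hypothetical (region, point) records
data = [("north", (1.0, 2.0)), ("north", (9.0, 8.0)), ("south", (2.0, 1.0))]

# Group records by region, then cluster within each region independently
data.sort(key=lambda r: r[0])
results = {
    region: cluster_algorithm([p for _, p in rows])
    for region, rows in groupby(data, key=lambda r: r[0])
}
```

On Spark, the grouping step would be a partitionBy or repartition on the region key, and the per-group call would move into mapPartitions so each partition is clustered without shuffling its points elsewhere.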
- RDD Programming Guide
- 2017, Chen Jin, A Scalable Hierarchical Clustering Algorithm Using Spark
He who fights with dragons too long becomes a dragon himself; and if you gaze long into an abyss, the abyss gazes also into you…