
machine learning - Run ML algorithm inside map function in Spark

I have been trying for some days now to run ML algorithms inside a map function in Spark. I posted a more specific question earlier, but referencing Spark's ML algorithms inside the map gives me the following error:

AttributeError: Cannot load _jvm from SparkContext. Is SparkContext initialized?

Obviously I cannot reference SparkContext inside the apply_classifier function. My code is similar to what was suggested in the previous question I asked, but I still haven't found a solution to what I am looking for:

from pyspark.ml.classification import DecisionTreeClassifier, RandomForestClassifier

def apply_classifier(clf):
    # Build the requested Spark ML estimator inside the task; this is where the
    # AttributeError is raised, since the estimators need the driver's SparkContext/JVM.
    if clf == 0:
        clf = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", maxDepth=3)
    elif clf == 1:
        clf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=5)

classifiers = [0, 1]

sc.parallelize(classifiers).map(lambda x: apply_classifier(x)).collect()

I have tried using flatMap instead of map, but then I get "NoneType object is not iterable".

I would also like to pass a broadcast dataset (a DataFrame) as a parameter to the apply_classifier function. Finally, is it possible to do what I am trying to do? What are the alternatives?



1 Reply


is it possible to do what I am trying to do?

It is not. Apache Spark doesn't support any form of nesting, and distributed operations can be initiated only by the driver. This includes access to distributed data structures, such as a Spark DataFrame.

What are the alternatives?

This depends on many factors, such as the size of the data, the amount of available resources, and the choice of algorithm. In general, you have three options:

  • Use Spark only as a task management tool to train local, non-distributed models. It looks like you have already explored this path to some extent. For a more advanced implementation of this approach, you can check spark-sklearn.

    In general this approach is particularly useful when the data is relatively small. Its advantage is that there is no competition between multiple jobs. (See the first sketch after this list.)

  • Use standard multithreading tools to submit multiple independent jobs from a single context. You can use, for example, threading or joblib.

    While this approach is possible, I wouldn't recommend it in practice. Not all Spark components are thread-safe, and you have to be pretty careful to avoid unexpected behavior. It also gives you very little control over resource allocation. (See the second sketch after this list.)

  • Parametrize your Spark application and use an external pipeline manager (Apache Airflow, Luigi, Toil) to submit your jobs.

    While this approach has some drawbacks (it requires saving data to persistent storage), it is also the most universal and robust, and it gives you a lot of control over resource allocation. (See the third sketch after this list.)
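A minimal sketch of the first option, with Spark acting only as a task scheduler for plain scikit-learn models. The training_df DataFrame, its indexedFeatures/indexedLabel columns, and the choice of scikit-learn estimators are illustrative assumptions; the data is assumed small enough to collect on the driver and broadcast.

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Assumption: training_df is a small DataFrame with a Spark ML vector column
# "indexedFeatures" and a numeric column "indexedLabel".
rows = training_df.select("indexedFeatures", "indexedLabel").collect()
X = [r["indexedFeatures"].toArray() for r in rows]
y = [r["indexedLabel"] for r in rows]
data_bc = sc.broadcast((X, y))

def train_local(clf_id):
    # Runs on an executor: plain Python and scikit-learn only, no SparkContext needed.
    X_local, y_local = data_bc.value
    if clf_id == 0:
        model = DecisionTreeClassifier(max_depth=3)
    else:
        model = RandomForestClassifier(n_estimators=5)
    model.fit(X_local, y_local)
    return clf_id, model.score(X_local, y_local)

results = sc.parallelize([0, 1], numSlices=2).map(train_local).collect()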
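A sketch of the second option, using multiprocessing.pool.ThreadPool (one standard multithreading tool; threading or joblib would work similarly) to fit two Spark ML estimators concurrently from the driver. train_df is an assumed DataFrame with the column names from the question.

from multiprocessing.pool import ThreadPool
from pyspark.ml.classification import DecisionTreeClassifier, RandomForestClassifier

classifiers = [
    DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", maxDepth=3),
    RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=5),
]

def fit_one(estimator):
    # Each call runs on a driver thread and launches its own distributed job.
    return estimator.fit(train_df)

# Two concurrent jobs share one SparkContext; how they split resources depends
# on the scheduler configuration (e.g. spark.scheduler.mode=FAIR).
models = ThreadPool(2).map(fit_one, classifiers)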
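A sketch of the third option: a parameterized training script that an external scheduler such as Airflow or Luigi can launch once per classifier via spark-submit. The command-line flags, the Parquet input path, and the output path are placeholders.

# train.py -- launched as: spark-submit train.py --classifier dt --input <data> --output <model>
import argparse
from pyspark.sql import SparkSession
from pyspark.ml.classification import DecisionTreeClassifier, RandomForestClassifier

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--classifier", choices=["dt", "rf"], required=True)
    parser.add_argument("--input", required=True)   # persisted training data (Parquet)
    parser.add_argument("--output", required=True)  # where to save the fitted model
    args = parser.parse_args()

    spark = SparkSession.builder.appName("train-" + args.classifier).getOrCreate()
    train_df = spark.read.parquet(args.input)

    if args.classifier == "dt":
        estimator = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", maxDepth=3)
    else:
        estimator = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=5)

    estimator.fit(train_df).save(args.output)
    spark.stop()

Each run is then an independent spark-submit task that the pipeline manager can schedule, retry, and give its own resource settings.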

