I have written a Python class that implements a classifier. I would like to use Apache Spark to parallelize classification of a huge number of datapoints using this classifier.
- I'm set up on an Amazon EC2 cluster with 10 slaves, based on an AMI that comes with Python's Anaconda distribution. The AMI lets me use IPython Notebook remotely.
- I've defined the class BoTree in a file called BoTree.py on the master, in the folder /root/anaconda/lib/python2.7/, which is where all my Python modules live.
- I've checked that I can import and use BoTree.py when running command-line Spark from the master (I just have to start by writing import BoTree, and my class BoTree becomes available).
- I've used Spark's /root/spark-ec2/copy-dir.sh script to copy the /python2.7/ directory across my cluster.
- I've ssh-ed into one of the slaves, tried running ipython there, and was able to import BoTree, so I think the module has been copied across the cluster successfully (I can also see the BoTree.py file in the .../python2.7/ folder).
- On the master I've checked that I can pickle and unpickle a BoTree instance using cPickle, which I understand is PySpark's serializer (a rough sketch of that check follows this list).
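For reference, the pickling check on the master looked roughly like this (the training data here is a simplified placeholder for my real datapoints):

import cPickle
import BoTree

# Simplified placeholder for my real training data: (integer, (float, float)) tuples.
data = [(0, (0.5, 1.5)), (1, (2.0, 0.1))]

bo_tree = BoTree.train(data)

# Round-trip through cPickle, which I understand PySpark uses to serialize objects.
dumped = cPickle.dumps(bo_tree, cPickle.HIGHEST_PROTOCOL)
restored = cPickle.loads(dumped)

# The restored instance should classify a point just like the original.
print restored.classify(0.5, 1.5)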
However, when I do the following:
import BoTree
bo_tree = BoTree.train(data)
rdd = sc.parallelize(keyed_training_points) # create an RDD of 10 (integer, (float, float)) tuples
rdd = rdd.mapValues(lambda point, bt=bo_tree: bt.classify(point[0], point[1]))
out = rdd.collect()
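(In the snippet above, data and keyed_training_points stand in for my real inputs; simplified, they have these shapes, and for this example I just reuse the same points for training and classification:)

# Hypothetical stand-ins, just to show the shapes: 10 (integer, (float, float)) tuples.
keyed_training_points = [(i, (float(i), float(i) / 2.0)) for i in range(10)]
data = keyed_training_points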
Spark fails with this error (just the relevant bit, I think):
File "/root/spark/python/pyspark/worker.py", line 90, in main
command = pickleSer.loads(command.value)
File "/root/spark/python/pyspark/serializers.py", line 405, in loads
return cPickle.loads(obj)
ImportError: No module named BoroughTree
Can anyone help me? Somewhat desperate...
Thanks