I'm implementing a model in Spark as a Python class, and any time I try to map a class method to an RDD it fails. My actual code is more complicated, but this simplified version gets at the heart of the problem:
class model(object):
    def __init__(self):
        self.data = sc.textFile('path/to/data.csv')
        # other misc setup

    def run_model(self):
        self.data = self.data.map(self.transformation_function)

    def transformation_function(self, row):
        row = row.split(',')
        return row[0] + row[1]
Now, if I run the model like so (for example):
test = model()
test.run_model()
test.data.take(10)
I get the following error:
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
I've played with this a bit, and it seems to occur reliably any time I try to map a class method to an RDD from within the class. I have confirmed that the mapped function works fine if I implement it outside of a class structure, so the problem definitely has to do with the class. Is there a way to resolve this?
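For reference, this is roughly what I mean by the standalone version that does work; the names are illustrative, and it assumes a SparkContext the same way the class version does:

from pyspark import SparkContext

sc = SparkContext()  # or an existing context, e.g. from the pyspark shell

def transformation_function(row):
    # same logic as the class method, just defined at module level
    row = row.split(',')
    return row[0] + row[1]

data = sc.textFile('path/to/data.csv')
data = data.map(transformation_function)  # mapping a plain function instead of a bound method
print(data.take(10))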