I know that, in hindsight, it sounds silly, but the order of the arguments of spark-submit
is not in general interchangeable: all Spark-related arguments, including --py-file
, must be before the script to be executed:
# your case:
spark-submit --master yarn-client /home/ctsats/scripts/SO/spark_distro.py --py-files /home/ctsats/scripts/SO/external_package.py
[...]
ImportError: No module named external_package
# correct usage:
spark-submit --master yarn-client --py-files /home/ctsats/scripts/SO/external_package.py /home/ctsats/scripts/SO/spark_distro.py
[...]
[3, 6, 9, 12]
Tested with your scripts modified as follows:
spark_distro.py
from pyspark import SparkContext, SparkConf
def import_my_special_package(x):
from external_package import external
return external(x)
conf = SparkConf()
sc = SparkContext()
int_rdd = sc.parallelize([1, 2, 3, 4])
print int_rdd.map(lambda x: import_my_special_package(x)).collect()
external_package.py
def external(x):
return x*3
with the modifications arguably not changing the essence of the question...
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…