For small dependencies (one or two local files) you can pass them to --py-files as a comma-separated list. For anything bigger, or when there are more dependencies, it is better to pack them into a zip or egg file.
File udfs.py:

def my_function(*args, **kwargs):
    # code
File main.py:

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

from udfs import my_function

sc = SparkContext()
spark = SparkSession(sc)

my_udf = udf(my_function)

# createDataFrame without explicit names gives columns _1 and _2
df = spark.createDataFrame([(1, "a"), (2, "b")])
df.withColumn("my_f", my_udf("_2"))
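A side note on udf(): without an explicit return type it defaults to StringType. If my_function returns something else, the type can be passed as the second argument. A minimal sketch; the IntegerType below is only an illustrative assumption about what my_function returns:

from pyspark.sql.types import IntegerType

# udf() defaults to StringType; declare the return type explicitly when needed
# (IntegerType is an assumption about my_function's return value)
my_udf = udf(my_function, IntegerType())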
To run:

pyspark --py-files /path/to/udfs.py
# or
spark-submit --py-files /path/to/udfs.py main.py
If you have written your own Python module, or need third-party modules that don't require C compilation (I personally needed this with geoip2), it is better to create a zip or egg file.
# pip with -t installs the modules and their dependencies into the directory `src`
pip install geoip2 -t ./src

# Or from a local directory
pip install ./my_module -t ./src

# Best is to install from a requirements file
pip install -r requirements.txt -t ./src

# If you need to add some additional files
cp ./some_scripts/* ./src/

# And pack it
cd ./src
zip -r ../libs.zip .
cd ..

pyspark --py-files libs.zip
spark-submit --py-files libs.zip main.py
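For completeness, a minimal sketch of how the packaged dependency can then be imported inside a UDF once libs.zip has been shipped with --py-files. The geoip2 calls and the GeoLite2 database path are illustrative assumptions, not part of the original setup:

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

sc = SparkContext()
spark = SparkSession(sc)

def country_for_ip(ip):
    # geoip2 is resolved from libs.zip on the executors;
    # the .mmdb path is a placeholder and must exist on every worker node
    import geoip2.database
    with geoip2.database.Reader("/path/to/GeoLite2-Country.mmdb") as reader:
        return reader.country(ip).country.iso_code

country_udf = udf(country_for_ip)
df = spark.createDataFrame([("8.8.8.8",), ("1.1.1.1",)], ["ip"])
df.withColumn("country", country_udf("ip")).show()

Opening the Reader inside the function keeps the sketch self-contained; a real job would usually open it once per partition instead of once per row.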
Be careful when using pyspark --master yarn (and possibly other non-local master options): in a pyspark shell started with --py-files, the zip may not be importable right away on the driver, so put it on sys.path yourself:

>>> import sys
>>> sys.path.insert(0, '/path/to/libs.zip')  # a relative path also works: sys.path.insert(0, 'libs.zip')
>>> import MyModule  # MyModule lives inside libs.zip
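As an alternative to editing sys.path by hand, SparkContext.addPyFile can be called from the driver; it distributes the archive and makes it importable. A minimal sketch, assuming an existing SparkContext named sc as in the examples above:

sc.addPyFile('/path/to/libs.zip')  # ships libs.zip and adds it to the import path

import MyModule  # now resolvable on the driver and inside UDFs on the executors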
EDIT - To answer the question of how to get functions onto the executors without addPyFile() and --py-files:
The file with the functions must be present on each individual executor and reachable through the Python path (e.g. PYTHONPATH). Therefore, I would write a Python module, install it on the executor nodes, and make sure it is available in their environment.
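A minimal sketch of how such a module could be packaged so it can be pip-installed on every executor node; the package name my_udfs is an assumption for illustration:

# setup.py - placeholder packaging sketch
from setuptools import setup, find_packages

setup(
    name="my_udfs",
    version="0.1.0",
    packages=find_packages(),  # picks up the package that contains my_function
)

After running pip install . on each node (or baking the package into the cluster image), something like from my_udfs import my_function works on the executors without --py-files or addPyFile().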