Browsing through the source code, it looks like the Python driver takes the Python executable path from its SparkContext and puts it into the work items it creates for running Python functions, in spark/rdd.py:
def _wrap_function(sc, func, deserializer, serializer, profiler=None):
    assert deserializer, "deserializer should not be empty"
    assert serializer, "serializer should not be empty"
    command = (func, profiler, deserializer, serializer)
    pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
    return sc._jvm.PythonFunction(bytearray(pickled_command), env, includes, sc.pythonExec,
                                                                             ^^^^^^^^^^^^^
                                  sc.pythonVer, broadcast_vars, sc._javaAccumulator)
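For reference, sc.pythonExec in that snippet is the value the driver picked up when the context was created (typically from PYSPARK_PYTHON or spark.pyspark.python). A quick way to see what will be shipped to the workers is to print it from the driver. This is only a minimal sketch that pokes at PySpark internals, so treat it as illustrative rather than a stable API:

import os
from pyspark.sql import SparkSession

# Illustrative only: sc.pythonExec is an internal attribute (the one passed to
# PythonFunction above), so it may change between Spark versions.
spark = SparkSession.builder.master("local[*]").appName("exec-check").getOrCreate()
sc = spark.sparkContext

print("PYSPARK_PYTHON =", os.environ.get("PYSPARK_PYTHON"))
print("sc.pythonExec  =", sc.pythonExec)  # path that gets embedded in the work items

spark.stop()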
The Python runner, PythonRunner.scala, then uses the path stored in the first work item it receives to launch new interpreter instances:
private[spark] abstract class BasePythonRunner[IN, OUT](
    funcs: Seq[ChainedPythonFunctions],
    evalType: Int,
    argOffsets: Array[Array[Int]])
  extends Logging {
  ...
  protected val pythonExec: String = funcs.head.funcs.head.pythonExec
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  ...
  def compute(
      inputIterator: Iterator[IN],
      partitionIndex: Int,
      context: TaskContext): Iterator[OUT] = {
    ...
    val worker: Socket = env.createPythonWorker(pythonExec, envVars.asScala.toMap)
    ...
  }
  ...
}
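If you want to confirm which interpreter the workers actually end up launching, one way (not from the code above, just a common trick) is to have a task report its own sys.executable:

import sys
from pyspark.sql import SparkSession

# Each task runs inside a worker-side interpreter, so sys.executable there is
# the executable that createPythonWorker launched.
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

worker_pythons = (
    sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
      .map(lambda _: sys.executable)
      .distinct()
      .collect()
)
print("Driver python :", sys.executable)
print("Worker pythons:", worker_pythons)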
Based on that, I'm afraid it does not currently seem possible to have separate configurations for the Python executable on the master and on the workers. See also the third comment on issue SPARK-26404. Perhaps you should file an RFE with the Apache Spark project.
I'm not a Spark guru though, and there might still be a way to do it, perhaps by setting PYSPARK_PYTHON to just "python" and then making sure the system PATH is configured accordingly so that your Python executable comes first.
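A minimal sketch of that idea, assuming you can control the PATH on every node (the interpreter location below is made up for illustration, and this is most straightforward in local/client mode where the driver environment is yours to set):

import os
from pyspark.sql import SparkSession

# Workaround sketch: ship a bare executable name so that each worker resolves
# it through its own PATH. Set this before the SparkContext is created.
os.environ["PYSPARK_PYTHON"] = "python"

# On each worker you would then make sure PATH puts the desired interpreter
# first, e.g. via conf/spark-env.sh on that machine (hypothetical path):
#   export PATH=/opt/myenv/bin:$PATH

spark = SparkSession.builder.appName("path-based-python").getOrCreate()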