TL;DR There are many legitimate applications of the getOrCreate
methods but attempt to find a loophole to perform map-side joins is not one of them.
In general there is nothing deeply wrong with SparkContext.getOrCreate
. The method has its applications, and although there some caveats, most notably:
- In its simplest form it doesn't allow you to set job specific properties, and the second variant (
(SparkConf) => SparkContext
) requires passing SparkConf
around, which is hardly an improvement over keeping SparkContext
/ SparkSession
in the scope.
- It can lead to opaque code with "magic" dependency. It affects testing strategies and overall code readability.
However your question, specifically:
Now sometimes there are concerns inside a Spark Job that referring to the SparkContext can take a large object (the Spark Context) which is not serializable and try and distribute it over the network
and
Don't use SparkContext
getOrCreate
you can and should use joins instead
suggests you're actually using the method in a way that it was never intended to be used. By using SparkContext
on an executor node.
val rdd: RDD[_] = ???
rdd.map(_ => {
val sc = SparkContext.getOrCreate()
...
})
This is definitely something that you shouldn't do.
Each Spark application should have one, and only one SparkContext
initialized on the driver, and Apache Spark developers made at a lot prevent users from any attempts of using SparkContex
outside the driver. It is not because SparkContext
is large, or impossible to serialize, but because it is fundamental feature of the Spark's computing model.
As you probably know, computation in Spark is described by a directed acyclic graph of dependencies, which:
- Describes processing pipeline in a way that can be translated into actual task.
- Enables graceful recovery in case of task failures.
- Allows proper resource allocation and ensures lack of cyclic dependencies.
Let's focus on the last part. Since each executor JVM gets its own instance of SparkContext
cyclic dependencies are not an issue - RDDs
and Datasets
exist only in a scope of its parent context so you won't be able to objects belonging to the application driver.
Proper resource allocation is a different thing. Since each SparkContext
creates its own Spark application, your "main" process won't be able to account for resources used by the contexts initialized in the tasks. At the same time cluster manager won't have any indication that application or somehow interconnected. This is likely to cause deadlock-like conditions.
It is technically possible to go around it, with careful resource allocation and usage of the manager-level scheduling pools, or even a separate cluster manager with its own set or resources, but it is not something that Spark is designed for, it not supported, and overall would lead to brittle and convoluted design, where correctness depends on a configuration details, specific cluster manager choice and overall cluster utilization.