scala - How to pre-package external libraries when using Spark on a Mesos cluster

Question

posted Oct 24, 2021 in Technique by 深蓝 (71.8m points)

According to the Spark on Mesos docs, one needs to set spark.executor.uri to point to a Spark distribution:

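import org.apache.spark.SparkConf

// Point executors at the Spark distribution they should download and run
val conf = new SparkConf()
  .setMaster("mesos://HOST:5050")
  .setAppName("My app")
  .set("spark.executor.uri", "<path to spark-1.4.1.tar.gz uploaded above>")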

The docs also note that one can build a custom version of the Spark distribution.

My question is whether it is possible and desirable to pre-package external libraries such as

spark-streaming-kafka
elasticsearch-spark
spark-csv

which will be used in almost all of the job jars I'll submit via spark-submit, in order to

reduce the time sbt assembly needs to package the fat jars
reduce the size of the fat jars that need to be submitted

If so, how can this be achieved? More generally, are there any hints on how fat jar generation during the job submission process can be sped up?

The background is that I want to run some code generation for Spark jobs, submit the generated jobs right away, and show the results in a browser frontend asynchronously. The frontend part shouldn't be too complicated, but I wonder how the backend part can be achieved.

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:53:45+0000

Create a sample Maven project with all of your dependencies and then use the maven-shade-plugin. It will create a single shaded jar in your target folder.

Here is a sample pom:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com</groupId>
    <artifactId>test</artifactId>
    <version>0.0.1</version>
    <properties>
        <java.version>1.7</java.version>
        <hadoop.version>2.4.1</hadoop.version>
        <spark.version>1.4.0</spark.version>
        <version.spark-csv_2.10>1.1.0</version.spark-csv_2.10>
        <version.spark-avro_2.10>1.0.0</version.spark-avro_2.10>
    </properties>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>${java.version}</source>
                    <target>${java.version}</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <!-- <minimizeJar>true</minimizeJar> -->
                    <filters>
                        <filter>
                            <artifact>*:*</artifact>
                            <excludes>
                                <exclude>META-INF/*.SF</exclude>
                                <exclude>META-INF/*.DSA</exclude>
                                <exclude>META-INF/*.RSA</exclude>
                                <exclude>org/bdbizviz/**</exclude>
                            </excludes>
                        </filter>
                    </filters>
                    <finalName>spark-${project.version}</finalName>
                </configuration>
            </plugin>
        </plugins>
    </build>
    <dependencies>
        <dependency> <!-- Hadoop dependency -->
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
            <exclusions>
                <exclusion>
                    <artifactId>servlet-api</artifactId>
                    <groupId>javax.servlet</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>guava</artifactId>
                    <groupId>com.google.guava</groupId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>joda-time</groupId>
            <artifactId>joda-time</artifactId>
            <version>2.4</version>
        </dependency>

        <dependency> <!-- Spark Core -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency> <!-- Spark SQL -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency> <!-- Spark CSV -->
            <groupId>com.databricks</groupId>
            <artifactId>spark-csv_2.10</artifactId>
            <version>${version.spark-csv_2.10}</version>
        </dependency>
        <dependency> <!-- Spark Avro -->
            <groupId>com.databricks</groupId>
            <artifactId>spark-avro_2.10</artifactId>
            <version>${version.spark-avro_2.10}</version>
        </dependency>
        <dependency> <!-- Spark Hive -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency> <!-- Spark Hive thriftserver -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive-thriftserver_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
    </dependencies>
</project>
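
Once the shared libraries live in a pre-built jar like the one this pom produces (spark-0.0.1.jar, given the finalName above), the job projects themselves can stay small. The following is only a sketch of what a job's build.sbt might look like under that approach, not something taken from the Spark docs. The project name is hypothetical, and the versions are copied from the pom above. Marking the dependencies as "provided" keeps them on the compile classpath but out of the fat jar built by sbt assembly, which should reduce both packaging time and jar size.

```scala
// build.sbt for a single job project (hypothetical name; versions mirror the pom above)
name := "my-spark-job"

scalaVersion := "2.10.5"

val sparkVersion = "1.4.0"

// "provided" means: compile against these, but do not bundle them into the assembly jar,
// because they are expected to already be on the cluster's classpath.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql"  % sparkVersion % "provided",
  "com.databricks"   %% "spark-csv"  % "1.1.0"      % "provided"
)
```

For this to work at runtime, the pre-built jar has to be visible to both the driver and the executors. One option, assuming the jar has been copied to the same location on every Mesos agent, is to add that location to spark.driver.extraClassPath and spark.executor.extraClassPath, for example with --conf on spark-submit. Another is to place it inside the custom Spark distribution that spark.executor.uri points to.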
