I'm deploying a Spark Structured Streaming application to Google Kubernetes Engine, and while accessing a bucket using the gs:// URI scheme I'm getting the following exception:
Exception in thread "main" java.lang.NullPointerException: projectId must not be null
at com.google.cloud.hadoop.repackaged.gcs.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:897)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.createBucket(GoogleCloudStorageImpl.java:437)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorage.createBucket(GoogleCloudStorage.java:88)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.mkdirsInternal(GoogleCloudStorageFileSystem.java:456)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.mkdirs(GoogleCloudStorageFileSystem.java:444)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.mkdirs(GoogleHadoopFileSystemBase.java:911)
at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:2275)
at org.apache.spark.sql.execution.streaming.StreamExecution.<init>(StreamExecution.scala:137)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.<init>(MicroBatchExecution.scala:50)
at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:317)
at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:359)
at org.apache.spark.sql.streaming.DataStreamWriter.startQuery(DataStreamWriter.scala:466)
at org.apache.spark.sql.streaming.DataStreamWriter.startInternal(DataStreamWriter.scala:456)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:301)
at meetup.SparkStreamsApp$.delayedEndpoint$meetup$SparkStreamsApp$1(SparkStreamsApp.scala:25)
at meetup.SparkStreamsApp$delayedInit$body.apply(SparkStreamsApp.scala:7)
I'm fairly sure it's related to the service account used to access the bucket and create subdirectories in it. When I spark-submit the app locally, I provide the credentials through the GOOGLE_APPLICATION_CREDENTIALS environment variable together with the spark.hadoop.google.cloud.auth.service.account.enable=true configuration property, and everything works.
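For context, this is roughly how the working local submission looks; the key-file path and jar location are placeholders, not my exact values:

```shell
# Local run that works: credentials come from the standard
# GOOGLE_APPLICATION_CREDENTIALS environment variable (path is a placeholder)
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json

./bin/spark-submit \
  --master "local[*]" \
  --class meetup.SparkStreamsApp \
  --conf spark.hadoop.google.cloud.auth.service.account.enable=true \
  target/meetup.spark-streams-demo-0.1.0.jar $BUCKET_NAME
```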
I'm deploying the Spark application as follows:
./bin/spark-submit \
  --master k8s://$K8S_SERVER \
  --deploy-mode cluster \
  --name $POD_NAME \
  --class meetup.SparkStreamsApp \
  --conf spark.kubernetes.driver.request.cores=400m \
  --conf spark.kubernetes.executor.request.cores=100m \
  --conf spark.kubernetes.container.image=$SPARK_IMAGE \
  --conf spark.kubernetes.driver.pod.name=$POD_NAME \
  --conf spark.kubernetes.namespace=$K8S_NAMESPACE \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.submission.waitAppCompletion=false \
  --conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
  --conf spark.hadoop.google.cloud.auth.service.account.enable=true \
  --verbose \
  local:///opt/spark/jars/meetup.spark-streams-demo-0.1.0.jar $BUCKET_NAME
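For reference, one approach I've seen suggested (I haven't verified it, and all names below are placeholders) is to store the key file in a Kubernetes secret, mount it into the driver and executor pods, and point the GCS connector at both the key file and an explicit project ID:

```shell
# Create a secret holding the service account key (names are assumptions)
kubectl create secret generic spark-sa \
  --from-file=key.json=/path/to/service-account-key.json \
  -n $K8S_NAMESPACE

# Additional spark-submit flags that would mount the secret at /mnt/secrets
# in both driver and executor pods and wire it up to the GCS connector:
#   --conf spark.kubernetes.driver.secrets.spark-sa=/mnt/secrets \
#   --conf spark.kubernetes.executor.secrets.spark-sa=/mnt/secrets \
#   --conf spark.hadoop.google.cloud.auth.service.account.json.keyfile=/mnt/secrets/key.json \
#   --conf spark.hadoop.fs.gs.project.id=$GCP_PROJECT_ID
```

Is something along these lines the recommended route, or is there a cleaner way (e.g. Workload Identity)?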
How do I fix this in a proper Kubernetes / GKE way?
question from:
https://stackoverflow.com/questions/66052277/how-to-fix-nullpointerexception-projectid-must-not-be-null-in-spark-applicati