
kubernetes - How to fix "NullPointerException: projectId must not be null" in Spark application on GKE?

I'm deploying a Spark Structured Streaming application to Google Kubernetes Engine, and while accessing a bucket via the gs:// URI scheme I'm facing the following exception:

Exception in thread "main" java.lang.NullPointerException: projectId must not be null
    at com.google.cloud.hadoop.repackaged.gcs.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:897)
    at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.createBucket(GoogleCloudStorageImpl.java:437)
    at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorage.createBucket(GoogleCloudStorage.java:88)
    at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.mkdirsInternal(GoogleCloudStorageFileSystem.java:456)
    at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.mkdirs(GoogleCloudStorageFileSystem.java:444)
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.mkdirs(GoogleHadoopFileSystemBase.java:911)
    at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:2275)
    at org.apache.spark.sql.execution.streaming.StreamExecution.<init>(StreamExecution.scala:137)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution.<init>(MicroBatchExecution.scala:50)
    at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:317)
    at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:359)
    at org.apache.spark.sql.streaming.DataStreamWriter.startQuery(DataStreamWriter.scala:466)
    at org.apache.spark.sql.streaming.DataStreamWriter.startInternal(DataStreamWriter.scala:456)
    at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:301)
    at meetup.SparkStreamsApp$.delayedEndpoint$meetup$SparkStreamsApp$1(SparkStreamsApp.scala:25)
    at meetup.SparkStreamsApp$delayedInit$body.apply(SparkStreamsApp.scala:7)

I'm pretty sure it's related to the service account used to access the bucket and create subdirectories in it. When I spark-submit the app locally, I authenticate with the GOOGLE_APPLICATION_CREDENTIALS environment variable and the spark.hadoop.google.cloud.auth.service.account.enable=true configuration property, and it works fine.
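
For reference, the working local submission looks roughly like this (a minimal sketch; the key path and jar location are illustrative, not taken from the original setup):

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/serviceaccount-key.json  # key for a service account with access to the bucket
./bin/spark-submit \
  --master "local[*]" \
  --class meetup.SparkStreamsApp \
  --conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
  --conf spark.hadoop.google.cloud.auth.service.account.enable=true \
  target/meetup.spark-streams-demo-0.1.0.jar $BUCKET_NAME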

I'm deploying the Spark application as follows:

./bin/spark-submit \
  --master k8s://$K8S_SERVER \
  --deploy-mode cluster \
  --name $POD_NAME \
  --class meetup.SparkStreamsApp \
  --conf spark.kubernetes.driver.request.cores=400m \
  --conf spark.kubernetes.executor.request.cores=100m \
  --conf spark.kubernetes.container.image=$SPARK_IMAGE \
  --conf spark.kubernetes.driver.pod.name=$POD_NAME \
  --conf spark.kubernetes.namespace=$K8S_NAMESPACE \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.submission.waitAppCompletion=false \
  --conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
  --conf spark.hadoop.google.cloud.auth.service.account.enable=true \
  --verbose \
  local:///opt/spark/jars/meetup.spark-streams-demo-0.1.0.jar $BUCKET_NAME

How do I fix this in a proper Kubernetes / GKE way?

Question from: https://stackoverflow.com/questions/66052277/how-to-fix-nullpointerexception-projectid-must-not-be-null-in-spark-applicati


1 Reply


The recommended approach, per the GKE documentation, is to import the credentials as a Kubernetes Secret:

kubectl create secret generic spark-streaming-sa --from-file=spark-streaming-sa.json=/path/spark-streaming-serviceaccount-key.json

(The key=path form names the entry in the Secret spark-streaming-sa.json so it matches the file paths used in the configurations below; with a bare --from-file the entry would keep the original basename.)
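Note that Secrets are namespaced, so create it in the same namespace the Spark pods will run in ($K8S_NAMESPACE in the question's command). A quick sanity check (my addition, not part of the original answer):

kubectl describe secret spark-streaming-sa -n $K8S_NAMESPACE   # the Data section should list spark-streaming-sa.json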

Then, when you submit the job, add the following configuration properties:

--conf spark.kubernetes.driver.secrets.spark-streaming-sa=<mount path>
--conf spark.kubernetes.executor.secrets.spark-streaming-sa=<mount path>
--conf spark.kubernetes.driverEnv.GOOGLE_APPLICATION_CREDENTIALS=<mount path>/spark-streaming-sa.json
--conf spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS=<mount path>/spark-streaming-sa.json
--conf spark.hadoop.google.cloud.auth.service.account.json.keyfile=<mount path>/spark-streaming-sa.json
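
Applied to the submission command from the question, the full invocation would look roughly like this. This is a sketch: it assumes the Secret created above and uses /etc/secrets as an example mount path (the path is arbitrary; any absolute path inside the containers works):

./bin/spark-submit \
  --master k8s://$K8S_SERVER \
  --deploy-mode cluster \
  --name $POD_NAME \
  --class meetup.SparkStreamsApp \
  --conf spark.kubernetes.driver.request.cores=400m \
  --conf spark.kubernetes.executor.request.cores=100m \
  --conf spark.kubernetes.container.image=$SPARK_IMAGE \
  --conf spark.kubernetes.driver.pod.name=$POD_NAME \
  --conf spark.kubernetes.namespace=$K8S_NAMESPACE \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.submission.waitAppCompletion=false \
  --conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
  --conf spark.hadoop.google.cloud.auth.service.account.enable=true \
  --conf spark.kubernetes.driver.secrets.spark-streaming-sa=/etc/secrets \
  --conf spark.kubernetes.executor.secrets.spark-streaming-sa=/etc/secrets \
  --conf spark.kubernetes.driverEnv.GOOGLE_APPLICATION_CREDENTIALS=/etc/secrets/spark-streaming-sa.json \
  --conf spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS=/etc/secrets/spark-streaming-sa.json \
  --conf spark.hadoop.google.cloud.auth.service.account.json.keyfile=/etc/secrets/spark-streaming-sa.json \
  --verbose \
  local:///opt/spark/jars/meetup.spark-streams-demo-0.1.0.jar $BUCKET_NAME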

You can refer to the examples in the GoogleCloudPlatform/spark-on-k8s-gcp-examples repository on GitHub.

This is also described in the Secret Management section of the Spark docs page Running Spark on Kubernetes:

Kubernetes Secrets can be used to provide credentials for a Spark application to access secured services. To mount a user-specified secret into the driver container, users can use the configuration property of the form spark.kubernetes.driver.secrets.[SecretName]=<mount path>. Similarly, the configuration property of the form spark.kubernetes.executor.secrets.[SecretName]=<mount path> can be used to mount a user-specified secret into the executor containers.
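
Once the job is running, you can confirm the mount from inside the driver pod (again assuming the /etc/secrets example path; this check is mine, not from the original answer):

kubectl exec -n $K8S_NAMESPACE $POD_NAME -- ls /etc/secrets                           # should list spark-streaming-sa.json
kubectl exec -n $K8S_NAMESPACE $POD_NAME -- printenv GOOGLE_APPLICATION_CREDENTIALS  # should print /etc/secrets/spark-streaming-sa.json (if the image has printenv)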

