import - use an external library in pyspark job in a Spark cluster from google-dataproc

Question

Welcome To Ask or Share your Answers For Others

import - use an external library in pyspark job in a Spark cluster from google-dataproc

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

import - use an external library in pyspark job in a Spark cluster from google-dataproc

I have a spark cluster I created via google dataproc. I want to be able to use the csv library from databricks (see https://github.com/databricks/spark-csv). So I first tested it like this:

I started a ssh session with the master node of my cluster, then I input:

pyspark --packages com.databricks:spark-csv_2.11:1.2.0

Then it launched a pyspark shell in which I input:

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('gs:/xxxx/foo.csv')
df.show()

And it worked.

My next step is to launch this job from my main machine using the command:

gcloud beta dataproc jobs submit pyspark --cluster <my-dataproc-cluster> my_job.py

But here It does not work and I get an error. I think because I did not gave the --packages com.databricks:spark-csv_2.11:1.2.0 as an argument, but I tried 10 different ways to give it and I did not manage.

My question are:

was the databricks csv library installed after I typed pyspark --packages com.databricks:spark-csv_2.11:1.2.0
can I write a line in my job.py in order to import it?
or what params should I give to my gcloud command to import it or install it?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

Categories

import - use an external library in pyspark job in a Spark cluster from google-dataproc

import - use an external library in pyspark job in a Spark cluster from google-dataproc

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags