Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
950 views
in Technique[技术] by (71.8m points)

apache spark - How to change the location of _spark_metadata directory?

I am using Spark Structured Streaming's streaming query to write parquet files to S3 using the following code:

ds.writeStream().format("parquet").outputMode(OutputMode.Append())
                .option("queryName", "myStreamingQuery")
                .option("checkpointLocation", "s3a://my-kafka-offset-bucket-name/")
                .option("path", "s3a://my-data-output-bucket-name/")
                .partitionBy("createdat")
                .start();

I get the desired output in the s3 bucket my-data-output-bucket-name but along with the output, I get the _spark_metadata folder in it. How to get rid of it? If I can't get rid of it, how to change it's location to a different S3 bucket?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

My understanding is that it is not possible up to Spark 2.3.

  1. The name of the metadata directory is always _spark_metadata

  2. _spark_metadata directory is always at the location where path option points to

I think the only way to "fix" it is to report an issue in Apache Spark's JIRA and hope someone would pick it up.

Internals

The flow is that DataSource is requested to create the sink of a streaming query and takes the path option. With that, it goes to create a FileStreamSink. The path option simply becomes the basePath where the results are written to as well as the metadata.

You can find the initial commit quite useful to understand the purpose of the metadata directory.

In order to correctly handle partial failures while maintaining exactly once semantics, the files for each batch are written out to a unique directory and then atomically appended to a metadata log. When a parquet based DataSource is initialized for reading, we first check for this log directory and use it instead of file listing when present.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...