
google cloud platform - Scalable way to read large numbers of files with Apache Beam?

I'm writing a pipeline where I need to read the metadata files (500,000+ files) from the Sentinel2 dataset located in my Google Cloud Storage bucket with apache_beam.io.ReadFromTextWithFilename.

It works fine on a small subset, but when I run it on the full dataset it seems to block on "Read Metadata" >> ReadFromTextWithFilename(f'gs://{BUCKET}/{DATA_FOLDER}/**/*metadata.json').

It doesn't even show up in the Dataflow jobs list.

The pipeline looks like this:

import apache_beam as beam
from apache_beam.io import ReadFromTextWithFilename
from apache_beam.io.gcp.internal.clients import bigquery

with beam.Pipeline(options=pipeline_options) as pipeline:
    meta = (
        pipeline
        # Glob over every sub-folder for the per-image metadata files.
        | "Read Metadata" >> ReadFromTextWithFilename(f'gs://{BUCKET}/{DATA_FOLDER}/**/*metadata.json')
        | "Extract metadata" >> beam.ParDo(ExtractMetaData())
    )
    table_spec = bigquery.TableReference(
        datasetId="sentinel_2",
        tableId="image_labels",
    )
    (
        meta
        | "Write To BigQuery" >> beam.io.WriteToBigQuery(
            table_spec,
            schema=table_schema(),
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )

I'm wondering:

  1. Is there a smarter way to read large numbers of files?
  2. Would copying the metadata files into one folder be more performant? (How much more does it cost to traverse sub-folders, as opposed to files in a single folder?)
  3. Is the way to go to match the file names first with apache_beam.io.fileio.MatchAll and then read and extract them in one or two subsequent ParDos?
Question from: https://stackoverflow.com/questions/65842305/scalable-way-to-read-large-numbers-of-files-with-apache-beam


1 Reply


This is probably due to the pipeline running into Dataflow API limits when splitting the text source glob into a large number of sources.

The current solution is to use the ReadAllFromText transform, which should not run into this limit.
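
As a rough sketch of that approach (not part of the original answer): the file pattern goes in as a single element via beam.Create, and ReadAllFromText expands it into files at execution time rather than when the job graph is built. It reuses pipeline_options, BUCKET, DATA_FOLDER, and ExtractMetaData from the question; the with_filename flag, which preserves the (filename, line) pairs that ReadFromTextWithFilename emitted, is only available in newer Beam releases. The BigQuery write stage stays unchanged.

import apache_beam as beam
from apache_beam.io import ReadAllFromText

with beam.Pipeline(options=pipeline_options) as pipeline:
    meta = (
        pipeline
        # A single pattern element; the glob expansion happens inside the
        # transform instead of at pipeline-construction time.
        | "File pattern" >> beam.Create(
            [f'gs://{BUCKET}/{DATA_FOLDER}/**/*metadata.json'])
        # with_filename=True keeps (filename, line) tuples, matching the
        # output of ReadFromTextWithFilename (requires a recent Beam release).
        | "Read Metadata" >> ReadAllFromText(with_filename=True)
        | "Extract metadata" >> beam.ParDo(ExtractMetaData())
    )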

In the future, we hope to update the ReadFromText transform to support this case as well, using the Splittable DoFn framework.
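
Regarding question 3: matching the file names first with apache_beam.io.fileio and reading them in a ParDo is another way to keep the file expansion inside the running job. A minimal sketch under the same assumptions, with a hypothetical ReadMetadataFile DoFn for the read step, could look like the following; note that ExtractMetaData would then receive whole-file contents rather than individual lines, so it may need a small adjustment.

import apache_beam as beam
from apache_beam.io import fileio

class ReadMetadataFile(beam.DoFn):
    """Hypothetical DoFn that reads one matched file and emits (path, contents)."""
    def process(self, readable_file):
        yield (readable_file.metadata.path, readable_file.read_utf8())

with beam.Pipeline(options=pipeline_options) as pipeline:
    meta = (
        pipeline
        | "File pattern" >> beam.Create(
            [f'gs://{BUCKET}/{DATA_FOLDER}/**/*metadata.json'])
        # MatchAll expands the glob into FileMetadata records at runtime.
        | "Match files" >> fileio.MatchAll()
        # ReadMatches turns each match into a ReadableFile handle.
        | "Read matches" >> fileio.ReadMatches()
        | "Read file" >> beam.ParDo(ReadMetadataFile())
        | "Extract metadata" >> beam.ParDo(ExtractMetaData())
    )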

