
google cloud platform - Scalable way to read large numbers of files with Apache Beam?

I'm writing a pipeline where I need to read the metadata files (500,000+ files) from the Sentinel-2 dataset located in my Google Cloud Storage bucket with apache_beam.io.ReadFromTextWithFilename.

It works fine on a small subset, but when I run it on the full dataset it seems to block on "Read Metadata" >> ReadFromTextWithFilename(f'gs://{BUCKET}/{DATA_FOLDER}/**/*metadata.json').

It doesn't even show up in the Dataflow jobs list.

The pipeline looks like this:

import apache_beam as beam
from apache_beam.io import ReadFromTextWithFilename
from apache_beam.io.gcp.internal.clients import bigquery

with beam.Pipeline(options=pipeline_options) as pipeline:
    # Read every metadata file under the bucket and parse it.
    meta = (
        pipeline
        | "Read Metadata" >> ReadFromTextWithFilename(f'gs://{BUCKET}/{DATA_FOLDER}/**/*metadata.json')
        | "Extract metadata" >> beam.ParDo(ExtractMetaData())
    )
    table_spec = bigquery.TableReference(
        datasetId="sentinel_2",
        tableId="image_labels",
    )
    # Write the extracted rows to BigQuery.
    (
        meta
        | "Write To BigQuery" >> beam.io.WriteToBigQuery(
            table_spec,
            schema=table_schema(),
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )

I'm wondering:

  1. Is there a smarter way to read large numbers of files?
  2. Would copying the metadata files into one folder be more performant? (How much more does it cost to traverse sub-folders, as opposed to files in a single folder?)
  3. Is the better approach to match the file names first with apache_beam.io.fileio.MatchAll and then read and extract them in one or two subsequent ParDos?
question from: https://stackoverflow.com/questions/65842305/scalable-way-to-read-large-numbers-of-files-with-apache-beam


1 Reply


This is probably due to the pipeline running into Dataflow API limits when splitting the text source glob into a large number of sources.

The current solution is to use the ReadAllFromText transform, which should not run into this limit.
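
A minimal sketch of that approach, assuming the same BUCKET, DATA_FOLDER, pipeline_options, and ExtractMetaData from the question: the glob pattern is fed in as data via beam.Create, and ReadAllFromText expands and reads it inside the job. Note that ReadAllFromText emits only line contents; if ExtractMetaData depends on the (filename, line) pairs that ReadFromTextWithFilename produces, recent SDK releases also accept a with_filename=True argument.

import apache_beam as beam
from apache_beam.io import ReadAllFromText

with beam.Pipeline(options=pipeline_options) as pipeline:
    meta = (
        pipeline
        # Ship the glob pattern as data so that file expansion happens
        # inside the running job, not during job submission.
        | "File pattern" >> beam.Create([f'gs://{BUCKET}/{DATA_FOLDER}/**/*metadata.json'])
        # ReadAllFromText expands the pattern and reads the matched files
        # without splitting one huge text source up front.
        | "Read Metadata" >> ReadAllFromText()
        | "Extract metadata" >> beam.ParDo(ExtractMetaData())
    )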

In the future, we hope to update the ReadFromText transform to handle this case as well by using the Splittable DoFn framework.
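
Regarding question 3: the fileio transforms are also a reasonable fit and keep the filename next to the contents. A hedged sketch, assuming ExtractMetaData is adapted to take (path, whole-file contents) tuples instead of (filename, line) pairs; fileio.MatchAll is the variant that takes the patterns as a PCollection, analogous to the beam.Create example above.

import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline(options=pipeline_options) as pipeline:
    meta = (
        pipeline
        # List the matching files inside the pipeline.
        | "Match" >> fileio.MatchFiles(f'gs://{BUCKET}/{DATA_FOLDER}/**/*metadata.json')
        # Turn each match into a readable file handle.
        | "Read matches" >> fileio.ReadMatches()
        # Read each file and keep its path alongside the contents.
        | "Read file" >> beam.Map(lambda f: (f.metadata.path, f.read_utf8()))
        | "Extract metadata" >> beam.ParDo(ExtractMetaData())
    )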

