I'm writing a pipeline in which I need to read the metadata files (500,000+ files) of the Sentinel2 dataset stored in my Google Cloud Storage bucket using apache_beam.io.ReadFromTextWithFilename.
It works fine on a small subset, but when I run it on the full dataset it seems to hang on "Read Metadata" >> ReadFromTextWithFilename(f'gs://{BUCKET}/{DATA_FOLDER}/**/*metadata.json').
It doesn't even show up in the Dataflow jobs list.
The pipeline looks like this:
import apache_beam as beam
from apache_beam.io import ReadFromTextWithFilename
from apache_beam.io.gcp.internal.clients import bigquery

with beam.Pipeline(options=pipeline_options) as pipeline:
    # Read every metadata file under the bucket as (filename, line) pairs,
    # then pull out the fields I need.
    meta = (
        pipeline
        | "Read Metadata" >> ReadFromTextWithFilename(
            f'gs://{BUCKET}/{DATA_FOLDER}/**/*metadata.json')
        | "Extract metadata" >> beam.ParDo(ExtractMetaData())
    )

    table_spec = bigquery.TableReference(
        datasetId="sentinel_2",
        tableId="image_labels",
    )

    (
        meta
        | "Write To BigQuery" >> beam.io.WriteToBigQuery(
            table_spec,
            schema=table_schema(),
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
I'm wondering:
- Is there a smarter way to read large numbers of files?
- Would copying the metadata files into a single folder be more performant? (How much more expensive is it to traverse sub-folders compared to listing files in a single folder?)
- Is the way to go to first match the file names with apache_beam.io.fileio.MatchAll and then read and extract them in one or two subsequent ParDos? (Roughly like the sketch below.)
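For that last option, this is roughly what I have in mind. It is only a sketch: since I have a single glob pattern I use fileio.MatchFiles (MatchAll would take a PCollection of patterns instead), and I assume ExtractMetaData would need a small adjustment, because read_utf8() returns the whole file contents rather than the (filename, line) pairs that ReadFromTextWithFilename emits:

import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline(options=pipeline_options) as pipeline:
    meta = (
        pipeline
        # Match the glob once, then spread the reads across workers.
        | "Match Metadata Files" >> fileio.MatchFiles(
            f'gs://{BUCKET}/{DATA_FOLDER}/**/*metadata.json')
        | "Read Matches" >> fileio.ReadMatches()  # yields ReadableFile objects
        | "To (filename, contents)" >> beam.Map(
            lambda f: (f.metadata.path, f.read_utf8()))
        | "Extract metadata" >> beam.ParDo(ExtractMetaData())
    )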