Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
863 views
in Technique[技术] by (71.8m points)

python - How to get a list of elements out of a PCollection in Google Dataflow and use it in the pipeline to loop Write Transforms?

I am using Google Cloud Dataflow with the Python SDK.

I would like to :

  • Get a list of unique dates out of a master PCollection
  • Loop through the dates in that list to create filtered PCollections (each with a unique date), and write each filtered PCollection to its partition in a time-partitioned table in BigQuery.

How can I get that list ? After the following combine transform, I created a ListPCollectionView object but I cannot iterate that object :

class ToUniqueList(beam.CombineFn):

    def create_accumulator(self):
        return []

    def add_input(self, accumulator, element):
        if element not in accumulator:
            accumulator.append(element)
        return accumulator

    def merge_accumulators(self, accumulators):
        return list(set(accumulators))

    def extract_output(self, accumulator):
        return accumulator


def get_list_of_dates(pcoll):

    return (pcoll
            | 'get the list of dates' >> beam.CombineGlobally(ToUniqueList()))

Am I doing it all wrong ? What is the best way to do that ?

Thanks.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)
Waitting for answers

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...