I am using Google Cloud Dataflow with the Python SDK.
I would like to:
- Get a list of unique dates out of a master PCollection.
- Loop through the dates in that list to create filtered PCollections (each with a unique date), and write each filtered PCollection to its partition in a time-partitioned BigQuery table.
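For the second step, the plan is to address each partition with BigQuery's partition decorator syntax (`table$YYYYMMDD`). A small helper like this (the function and table names are just illustrative) would build the decorated table name from an ISO date string:

```python
from datetime import datetime

def partition_table_name(base_table, date_str):
    # Turn an ISO date (e.g. '2017-01-15') into a BigQuery
    # partition-decorated table name: 'dataset.table$20170115'.
    day = datetime.strptime(date_str, '%Y-%m-%d').date()
    return '{}${:%Y%m%d}'.format(base_table, day)

print(partition_table_name('mydataset.events', '2017-01-15'))
# mydataset.events$20170115
```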
How can I get that list? After the following combine transform, I created a ListPCollectionView object, but I cannot iterate over it:
import itertools

import apache_beam as beam


class ToUniqueList(beam.CombineFn):

    def create_accumulator(self):
        return []

    def add_input(self, accumulator, element):
        if element not in accumulator:
            accumulator.append(element)
        return accumulator

    def merge_accumulators(self, accumulators):
        # Flatten the per-accumulator lists before deduplicating;
        # set() directly over a list of lists raises TypeError.
        return list(set(itertools.chain.from_iterable(accumulators)))

    def extract_output(self, accumulator):
        return accumulator


def get_list_of_dates(pcoll):
    return (pcoll
            | 'get the list of dates' >> beam.CombineGlobally(ToUniqueList()))
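For what it's worth, the merge step can be sanity-checked in plain Python (no Beam dependency): the accumulators arrive as a list of lists, so they need flattening before deduplication, because set() over unhashable list elements fails.

```python
import itertools

# Two accumulators, as two workers might produce them.
acc_a = ['2017-01-01', '2017-01-02']
acc_b = ['2017-01-02', '2017-01-03']

# set([acc_a, acc_b]) would raise TypeError: lists are unhashable.
# Flattening first deduplicates across both accumulators.
merged = list(set(itertools.chain.from_iterable([acc_a, acc_b])))
print(sorted(merged))
# ['2017-01-01', '2017-01-02', '2017-01-03']
```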
Am I doing it all wrong? What is the best way to do this?
Thanks.