I'm trying to upload a pandas.DataFrame to Google BigQuery using the pandas.DataFrame.to_gbq() function documented here. The problem is that to_gbq() takes 2.3 minutes, while uploading the file directly to Google Cloud Storage takes less than a minute. I'm planning to upload a bunch of dataframes (~32), each of a similar size, so I want to know which alternative is faster.
This is the script that I'm using:
dataframe.to_gbq('my_dataset.my_table',
                 'my_project_id',
                 chunksize=None,  # I have tried several chunk sizes; it runs faster as one big chunk (at least for me)
                 if_exists='append',
                 verbose=False
                 )
dataframe.to_csv(str(month) + '_file.csv')  # the file size is 37.3 MB; this takes almost 2 seconds
# then the file is uploaded manually through the GCS web UI

print(dataframe.shape)
(363364, 21)
My question is, which of these is faster?

1. Uploading the DataFrame using the pandas.DataFrame.to_gbq() function
2. Saving the DataFrame as a CSV and then uploading it as a file to BigQuery using the Python API
3. Saving the DataFrame as a CSV, uploading the file to Google Cloud Storage using this procedure, and then reading it from BigQuery (see the sketch after this list)
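For completeness, this is roughly what the third alternative would look like done programmatically instead of through the GUI. It is only a sketch: the bucket name my_bucket and the file name my_file.csv are placeholders, not my actual setup.

from google.cloud import bigquery, storage

# Sketch of alternative 3: write CSV -> upload to GCS -> load into BigQuery.
# 'my_bucket' and 'my_file.csv' are placeholder names.
storage_client = storage.Client()
bucket = storage_client.bucket('my_bucket')
bucket.blob('my_file.csv').upload_from_filename('my_file.csv')  # upload the local CSV

bigquery_client = bigquery.Client()
table_ref = bigquery_client.dataset('my_dataset').table('my_table')

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.autodetect = True

# Load the file from GCS into the BigQuery table and wait for it to finish.
load_job = bigquery_client.load_table_from_uri(
    'gs://my_bucket/my_file.csv', table_ref, job_config=job_config)
load_job.result()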
Update:
Alternative 2, using pd.DataFrame.to_csv() and load_data_from_file(), takes longer than Alternative 1 (17.9 seconds more on average over 3 runs):
from google.cloud import bigquery


def load_data_from_file(dataset_id, table_id, source_file_name):
    bigquery_client = bigquery.Client()
    dataset_ref = bigquery_client.dataset(dataset_id)
    table_ref = dataset_ref.table(table_id)

    with open(source_file_name, 'rb') as source_file:
        # This example uses CSV, but you can use other formats.
        # See https://cloud.google.com/bigquery/loading-data
        job_config = bigquery.LoadJobConfig()
        job_config.source_format = bigquery.SourceFormat.CSV
        job_config.autodetect = True
        job = bigquery_client.load_table_from_file(
            source_file, table_ref, job_config=job_config)

    job.result()  # Waits for the load job to complete

    print('Loaded {} rows into {}:{}.'.format(
        job.output_rows, dataset_id, table_id))
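For reference, the comparison can be timed with something like the following sketch; it just reuses the dataset, table and file names from above.

import time

# Rough timing of alternative 2: write the CSV, then load it through the Python API.
start = time.time()
dataframe.to_csv(str(month) + '_file.csv')
load_data_from_file('my_dataset', 'my_table', str(month) + '_file.csv')
print('Alternative 2 took {:.1f} seconds'.format(time.time() - start))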