python - Uploading DataFrame to BigQuery with Array structure

I have a pandas DataFrame with 3 columns: col1 contains lists, col2 contains dictionaries, and col3 contains NaNs:

dict_ = {'col1': [['abc'], ['def', 'ghi'], []],
         'col2': [{'k1': 'v1', 'k2': 'v2'},
                  {'k1': 'v3', 'k2': 'v4'},
                  {'k1': 'v5', 'k2': 'v6'}],
         'col3': [np.nan, np.nan, np.nan]}
df = pd.DataFrame(dict_)

To upload the DataFrame to BigQuery, I create the following schema for the first and second columns:

schema = [
    bigquery.SchemaField(name="col1", field_type="STRING", mode='REPEATED'),
    bigquery.SchemaField(name="col2", field_type="RECORD", mode='NULLABLE',
                         fields=[bigquery.SchemaField(name="k1", field_type="STRING", mode='NULLABLE'),
                                 bigquery.SchemaField(name="k2", field_type="STRING", mode='NULLABLE')])
]
job_config = bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE", schema=schema)
job = client.load_table_from_dataframe(df, table, job_config=job_config)
job.result()

The DataFrame uploads successfully, but col1 ends up empty.

Table preview (screenshot): col1 shows no values.

What should I do to fix this?

Question from: https://stackoverflow.com/questions/66054651/uploading-dataframe-to-bigquery-with-array-structure


1 Reply


The load_table_from_dataframe method in the Python client library for BigQuery serializes the DataFrame to Parquet before loading it. Unfortunately, the BigQuery backend has limited support for loading the array data type from Parquet.
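One option worth trying (not part of the original answer, and dependent on your client version) is to stay on the Parquet load path but enable list inference, which asks the backend to map the Parquet LIST type onto a REPEATED column. A minimal sketch, assuming a recent google-cloud-bigquery that exposes ParquetOptions, pyarrow installed, and the df, schema, client, and table from the question:

from google.cloud import bigquery

# Sketch, not from the original answer: enable Parquet list inference so
# LIST columns in the serialized DataFrame load as REPEATED fields.
parquet_options = bigquery.ParquetOptions()
parquet_options.enable_list_inference = True

job_config = bigquery.LoadJobConfig(
    write_disposition="WRITE_TRUNCATE",
    schema=schema,
    source_format=bigquery.SourceFormat.PARQUET,
    parquet_options=parquet_options,
)
job = client.load_table_from_dataframe(df, table, job_config=job_config)
job.result()

If your installed version does not support that, the streaming approach below avoids Parquet entirely.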

As a workaround, I recommend the insert_rows_from_dataframe method, which streams the rows into the table instead of running a Parquet load job.

import pandas as pd
import numpy as np
from google.cloud import bigquery


dict_ = {'col1': [['abc'], ['def', 'ghi'], []],
         'col2': [{'k1': 'v1', 'k2': 'v2'},
                  {'k1': 'v3', 'k2': 'v4'},
                  {'k1': 'v5', 'k2': 'v6'}],
         'col3': [np.nan, np.nan, np.nan]}
df = pd.DataFrame(dict_)

client = bigquery.Client()

# Explicit schema: col1 is a REPEATED string (array), col2 a RECORD (struct).
schema = [
    bigquery.SchemaField(name="col1", field_type="STRING", mode='REPEATED'),
    bigquery.SchemaField(name="col2", field_type="RECORD", mode='NULLABLE',
                         fields=[bigquery.SchemaField(name="k1", field_type="STRING", mode='NULLABLE'),
                                 bigquery.SchemaField(name="k2", field_type="STRING", mode='NULLABLE')])
]

# Create the destination table with that schema before streaming into it.
table = bigquery.Table(
    "my-project.my_dataset.stackoverflow66054651",
    schema=schema
)
client.create_table(table)

# insert_rows_from_dataframe streams the rows instead of loading Parquet,
# so REPEATED and RECORD values survive intact. It returns one list of
# errors per chunk of rows sent.
errors = client.insert_rows_from_dataframe(table, df)
for chunk in errors:
    print(f"encountered {len(chunk)} errors: {chunk}")

loaded_df = client.query(
    # Use a query so that data is read from the streaming buffer.
    "SELECT * FROM `my-project.my_dataset.stackoverflow66054651`"
).to_dataframe()
print(loaded_df)
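Because insert_rows_from_dataframe streams the rows, they land in BigQuery's streaming buffer first: they are queryable immediately (which is why the snippet reads the table back with a SELECT), but they can take a while to show up in the console's table preview.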
