Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - How to write Parquet metadata with pyarrow?

I use pyarrow to create and analyse Parquet tables with biological information and I need to store some metadata, e.g. which sample the data comes from, how it was obtained and processed.

Parquet seems to support file-wide metadata, but I cannot find how to write it via pyarrow. The closest thing I could find is how to write row-group metadata, but this seems like overkill, since my metadata is the same for all row groups in the file.

Is there any way to write file-wide Parquet metadata with pyarrow?



1 Reply


Pyarrow maps the file-wide metadata to a field in the table's schema named metadata. Regrettably there is not (yet) documentation on this.

Both the Parquet metadata format and the Pyarrow metadata format represent metadata as a collection of key/value pairs where both key and value must be strings. This is unfortunate, as it would be more flexible if it were just a UTF-8 encoded JSON object. Furthermore, since these are std::string objects in the C++ implementation, they appear as "b strings" (bytes objects) in Python.

Pyarrow currently stores some of its own information in the metadata field. It has one built-in key b'ARROW:schema' and another built-in key b'pandas'. In the pandas case the value is a JSON object encoded with UTF-8. This allows for namespacing: the "pandas" schema can have as many fields as it needs, and they are all namespaced under "pandas". Pyarrow uses the "pandas" schema to store information about what kind of index the table has as well as what type of encoding a column uses (when there is more than one possible pandas encoding for a given data type). I am uncertain what b'ARROW:schema' represents. It appears to be encoded in some way I don't recognize and I have not really played around with it. I assume it's intended to record similar things to the "pandas" schema.

The last thing we need to know to answer your question is that all pyarrow objects are immutable, so there is no way to simply add fields to the schema. Pyarrow does have the schema utility method with_metadata, which returns a clone of a schema object carrying your own metadata, but this replaces the existing metadata and does not append to it. There is also the experimental method replace_schema_metadata on the Table object, but this also replaces and does not update. So if you want to keep the existing metadata you have to merge it in yourself. Putting this all together we get...

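# `table` is an existing pyarrow.Table
custom_metadata = {'Sample Number': '12', 'Date Obtained': 'Tuesday'}
# schema.metadata is None when the table has no metadata yet
existing_metadata = table.schema.metadata or {}
# existing keys are listed last, so they win on a name clash
merged_metadata = { **custom_metadata, **existing_metadata }
fixed_table = table.replace_schema_metadata(merged_metadata)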

Once this table is saved as a parquet file it will include the key/value metadata fields (at the file level) for Sample Number and Date Obtained.

Also, note that the replace_schema_metadata and with_metadata methods are tolerant of taking in regular Python strings (like in my example). However, they convert these to "b strings", so if you want to access fields in the schema you must use the "b string". For example, if you had just read in a table and wanted to get the sample number, you must use table.schema.metadata[b'Sample Number']; table.schema.metadata['Sample Number'] will give you a KeyError.

As you start to use this you may realize it is a pain to constantly have to be mapping Sample Number back and forth to an integer. Furthermore, if your metadata is represented in your application as a large nested object it can be a pain to map this object to a collection of string/string pairs. Also, it's a pain to constantly be remembering the "b string" keys. The solution is to do the same thing the pandas schema does. First convert your metadata to a JSON object. Then convert the JSON object to a "b string".

import json

custom_metadata_json = {'Sample Number': 12, 'Date Obtained': 'Tuesday'}
custom_metadata_bytes = json.dumps(custom_metadata_json).encode('utf8')
existing_metadata = table.schema.metadata or {}  # None when the schema has no metadata
merged_metadata = { **{'Record Metadata': custom_metadata_bytes}, **existing_metadata }
fixed_table = table.replace_schema_metadata(merged_metadata)

Now you can have as many metadata fields as you want, nested in any way you want, using any of the standard JSON types and it will all be namespaced into a single key/value pair (in this case named "Record Metadata").

