I am converting thousands of pandas dataframes with a complex data structure in one column to parquet files. The data used to generate the complex field in the dataframe comes in as a JSON string, which is decoded to a dictionary. The dictionary is like the following:
data_dict = {
    "a": [[5, 3, False], [1, 1, True]],
    "b": [[4, 2, True]],
    "c": [[34, 3, True], [37, 2, False]],
    # ...
}
The innermost list (let's call it a struct) always has the same structure, and each value in the dictionary holds an arbitrary number of these structs. The dictionary also has an arbitrary number of key:value pairs, and we know neither the number of keys nor the keys themselves in advance.
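For concreteness, this is how one of these JSON strings decodes with the standard library (the string literal here is a made-up sample in the shape described above; note that JSON booleans arrive as Python `bool`):

```python
import json

# Sample of the JSON string as it arrives from Redshift (JSON uses lowercase true/false)
raw = '{"a": [[5, 3, false], [1, 1, true]], "b": [[4, 2, true]]}'

# Decodes to a dict of str -> list of [int, int, bool] lists
decoded = json.loads(raw)
```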
I have created a UDT in pyarrow that matches the structure of these elements like so:
udt = pa.struct([
    pa.field('subfield1', pa.int64()),
    pa.field('subfield2', pa.int64()),
    pa.field('subfield3', pa.bool_())
])
The schema for the parquet file is then defined something like the following:
schema = pa.schema([
    pa.field('field1', pa.string()),
    pa.field('field2', pa.int64()),
    pa.field('field3', pa.map_(pa.string(), pa.list_(udt)))
])
Where 'field3' is the map that stores the dictionary whose values are lists of UDT structs.
The problem is that pyarrow expects the map type as a tuple of tuple pairs, the list type as a tuple, and the struct as a tuple. So I have to convert the JSON dictionary representation to the following:
(
    ("a", ((5, 3, False), (1, 1, True))),
    ("b", ((4, 2, True),)),
    ("c", ((34, 3, True), (37, 2, False)))
)
Effectively, I am changing every list to a tuple and casting the dictionary as an n-tuple of tuple pairs. I have to mutate every dictionary to this format, and then stuff this data construct back into the dataframe in order to convert it to a parquet file. I have a simple function to make this transformation (and edit one of the values on the fly), but it's inefficient to the point that I can't use it for my purposes. The function is the following:
def fun(dict_obj):
    for subfield, struct in dict_obj.items():
        for i, data in enumerate(struct):
            data[0] -= 1  # this integer changes value in production
            struct[i] = tuple(data)  # replace each inner list with a tuple in place
    return tuple((k, tuple(v)) for k, v in dict_obj.items())
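For reference, here is a self-contained run of this conversion on the sample dictionary above (with the loop variables spelled consistently):

```python
def fun(dict_obj):
    for subfield, struct in dict_obj.items():
        for i, data in enumerate(struct):
            data[0] -= 1  # placeholder for the edit made in production
            struct[i] = tuple(data)  # replace each inner list with a tuple in place
    return tuple((k, tuple(v)) for k, v in dict_obj.items())

sample = {"a": [[5, 3, False], [1, 1, True]], "b": [[4, 2, True]]}
converted = fun(sample)
# converted == (("a", ((4, 3, False), (0, 1, True))), ("b", ((3, 2, True),)))
```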
Any ideas how I can speed this up? I've tried eval() on a tuple-formatted string (it comes from a Redshift database, so I can construct it by hand), but that was about 9x slower. Unfortunately, the data has to come into Python as a pandas dataframe given our current infrastructure, and the column with the complex datatype must be a JSON-formatted string because of the limitations of Redshift. Any help is appreciated!
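For comparison, the same transformation can be written as a single nested comprehension that builds the tuples directly instead of mutating the lists in place (a sketch, assuming every inner struct has exactly the three fields shown):

```python
def fun_comprehension(dict_obj):
    # Build the nested tuple structure in one pass, decrementing the first
    # field on the fly; no intermediate list mutation
    return tuple(
        (k, tuple((a - 1, b, c) for a, b, c in v))
        for k, v in dict_obj.items()
    )
```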
question from:
https://stackoverflow.com/questions/65945129/most-efficient-way-to-convert-dictionary-of-lists-json-to-tuples-for-pyarrow