Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
python - Most efficient way to convert dictionary of lists (JSON) to tuples for pyarrow?

I am converting thousands of pandas dataframes with a complex data structure in one column to parquet files. The data used to generate the complex field in the dataframe comes in as a JSON string, which is decoded to a dictionary like the following:

data = {"a": [[5, 3, False], [1, 1, True]],
        "b": [[4, 2, True]],
        "c": [[34, 3, True], [37, 2, False]],
        ...
        }

The innermost list (let's call it a struct) always has the same structure, and each value in the dictionary holds an arbitrary number of these structs. The dictionary also has an arbitrary number of key:value pairs; neither the number of pairs nor the keys themselves are known in advance.

I have created a UDT in pyarrow that matches the structure of these elements like so:

udt = pa.struct([
    pa.field('subfield1', pa.int64()),
    pa.field('subfield2', pa.int64()),
    pa.field('subfield3', pa.bool_())
    ])

The schema for the parquet file is then defined something like the following:

schema = pa.schema([
    pa.field('field1', pa.string()),
    pa.field('field2', pa.int64()),
    pa.field('field3', pa.map_(pa.string(), pa.list_(udt)))
    ])

Where 'field3' is the map that stores the dictionary whose values are UDT structs.

The problem is that pyarrow expects the map type as a tuple of tuple pairs, the list type as a tuple, and the struct as a tuple. So I have to convert the JSON dictionary representation to the following:

(
  ("a", ((5, 3, False), (1, 1, True))),
  ("b", ((4, 2, True),)),
  ("c", ((34, 3, True), (37, 2, False)))
)
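For illustration (a sketch, not part of the original question), the example dictionary at the top can be converted into exactly this nested-tuple form with a pair of comprehensions; variable names here are illustrative only:

```python
# Convert a dict of lists-of-lists into an n-tuple of (key, tuple-of-tuples)
# pairs, the shape pyarrow's map type expects.
d = {"a": [[5, 3, False], [1, 1, True]],
     "b": [[4, 2, True]],
     "c": [[34, 3, True], [37, 2, False]]}

converted = tuple((k, tuple(tuple(s) for s in v)) for k, v in d.items())
```

Note the one-element case for "b": the result is `((4, 2, True),)` with a trailing comma, i.e. a tuple containing one struct tuple, not merely a parenthesized tuple.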

Effectively, I am changing every list to a tuple and casting the dictionary as an n-tuple of tuple pairs. I have to mutate every dictionary to this format, and then stuff this data construct back into the dataframe in order to convert it to a parquet file. I have a simple function to make this transformation (and edit one of the values on the fly), but it's inefficient to the point that I can't use it for my purposes. The function is the following:

def fun(dict_obj):
    for subfield, structs in dict_obj.items():
        for i, data in enumerate(structs):
            data[0] -= 1  # This is an integer that changes value in production
            structs[i] = tuple(data)  # replace each inner list with a tuple
    return tuple((k, tuple(v)) for k, v in dict_obj.items())

Any ideas how I can speed this up? I've tried eval() on a tuple-formatted string (the data comes from a Redshift database, so I can construct the string by hand), but that was about 9x slower. Unfortunately, the data has to come into Python as a pandas dataframe given our current infrastructure, and the column with the complex datatype must be a JSON-formatted string because of the limitations of Redshift. Any help is appreciated!
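For comparison, the same transformation can be written as a single pass that builds the tuples directly instead of mutating the input lists and then re-iterating. This is a sketch under the assumption that only the first subfield needs the in-flight edit; the function name is hypothetical:

```python
def convert(dict_obj):
    # Build the n-tuple of (key, tuple-of-struct-tuples) pairs in one pass,
    # applying the value edit (here: decrement the first subfield) inline.
    return tuple(
        (k, tuple((d[0] - 1, d[1], d[2]) for d in v))
        for k, v in dict_obj.items()
    )

example = {"a": [[5, 3, False], [1, 1, True]], "b": [[4, 2, True]]}
result = convert(example)
```

Unpacking the struct positionally (`d[0] - 1, d[1], d[2]`) avoids the intermediate `tuple(data)` copy per struct, though whether this is faster at scale would need to be measured on the real data.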

question from:https://stackoverflow.com/questions/65945129/most-efficient-way-to-convert-dictionary-of-lists-json-to-tuples-for-pyarrow


1 Reply

Waiting for answers
