python - How does one append large amounts of data to a Pandas HDFStore and get a natural unique index?

Question

Welcome To Ask or Share your Answers For Others

python - How does one append large amounts of data to a Pandas HDFStore and get a natural unique index?

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

python - How does one append large amounts of data to a Pandas HDFStore and get a natural unique index?

I'm importing large amounts of http logs (80GB+) into a Pandas HDFStore for statistical processing. Even within a single import file I need to batch the content as I load it. My tactic thus far has been to read the parsed lines into a DataFrame then store the DataFrame into the HDFStore. My goal is to have the index key unique for a single key in the DataStore but each DataFrame restarts it's own index value again. I was anticipating HDFStore.append() would have some mechanism to tell it to ignore the DataFrame index values and just keep adding to my HDFStore key's existing index values but cannot seem to find it. How do I import DataFrames and ignore the index values contained therein while having the HDFStore increment its existing index values? Sample code below batches every 10 lines. Naturally the real thing would be larger.

if hd_file_name:
        """
        HDF5 output file specified.
        """

        hdf_output = pd.HDFStore(hd_file_name, complib='blosc')
        print hdf_output

        columns = ['source', 'ip', 'unknown', 'user', 'timestamp', 'http_verb', 'path', 'protocol', 'http_result', 
                   'response_size', 'referrer', 'user_agent', 'response_time']

        source_name = str(log_file.name.rsplit('/')[-1])   # HDF5 Tables don't play nice with unicode so explicit str(). :(

        batch = []

        for count, line in enumerate(log_file,1):
            data = parse_line(line, rejected_output = reject_output)

            # Add our source file name to the beginning.
            data.insert(0, source_name )    
            batch.append(data)

            if not (count % 10):
                df = pd.DataFrame( batch, columns = columns )
                hdf_output.append(KEY_NAME, df)
                batch = []

        if (count % 10):
            df = pd.DataFrame( batch, columns = columns )
            hdf_output.append(KEY_NAME, df)

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:47:34+0000

You can do it like this. Only trick is that the first time the store table doesn't exist, so get_storer will raise.

import pandas as pd
import numpy as np
import os

files = ['test1.csv','test2.csv']
for f in files:
    pd.DataFrame(np.random.randn(10,2),columns=list('AB')).to_csv(f)

path = 'test.h5'
if os.path.exists(path):
    os.remove(path)

with pd.get_store(path) as store:
    for f in files:
        df = pd.read_csv(f,index_col=0)
        try:
            nrows = store.get_storer('foo').nrows
        except:
            nrows = 0

        df.index = pd.Series(df.index) + nrows
        store.append('foo',df)


In [10]: pd.read_hdf('test.h5','foo')
Out[10]: 
           A         B
0   0.772017  0.153381
1   0.304131  0.368573
2   0.995465  0.799655
3  -0.326959  0.923280
4  -0.808376  0.449645
5  -1.336166  0.236968
6  -0.593523 -0.359080
7  -0.098482  0.037183
8   0.315627 -1.027162
9  -1.084545 -1.922288
10  0.412407 -0.270916
11  1.835381 -0.737411
12 -0.607571  0.507790
13  0.043509 -0.294086
14 -0.465210  0.880798
15  1.181344  0.354411
16  0.501892 -0.358361
17  0.633256  0.419397
18  0.932354 -0.603932
19 -0.341135  2.453220

You actually don't necessarily need a global unique index, (unless you want one) as HDFStore (through PyTables) provides one by uniquely numbering rows. You can always add these selection parameters.

In [11]: pd.read_hdf('test.h5','foo',start=12,stop=15)
Out[11]: 
           A         B
12 -0.607571  0.507790
13  0.043509 -0.294086
14 -0.465210  0.880798

Categories

python - How does one append large amounts of data to a Pandas HDFStore and get a natural unique index?

python - How does one append large amounts of data to a Pandas HDFStore and get a natural unique index?

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags