python - How to trouble-shoot HDFStore Exception: cannot find the correct atom type

I am looking for some general guidance on what kinds of data scenarios can cause this exception. I have tried massaging my data in various ways to no avail.

I have googled this exception for days, gone through several Google Groups discussions, and come up with no way to debug HDFStore Exception: cannot find the correct atom type. I am reading in a simple CSV file of mixed data types:

Int64Index: 401125 entries, 0 to 401124
Data columns:
SalesID                     401125  non-null values
SalePrice                   401125  non-null values
MachineID                   401125  non-null values
ModelID                     401125  non-null values
datasource                  401125  non-null values
auctioneerID                380989  non-null values
YearMade                    401125  non-null values
MachineHoursCurrentMeter    142765  non-null values
UsageBand                   401125  non-null values
saledate                    401125  non-null values
fiModelDesc                 401125  non-null values
Enclosure_Type              401125  non-null values
...................................................
Stick_Length                401125  non-null values
Thumb                       401125  non-null values
Pattern_Changer             401125  non-null values
Grouser_Type                401125  non-null values
Backhoe_Mounting            401125  non-null values
Blade_Type                  401125  non-null values
Travel_Controls             401125  non-null values
Differential_Type           401125  non-null values
Steering_Controls           401125  non-null values
dtypes: float64(2), int64(6), object(45)

Code to store the dataframe:

In [30]: store = pd.HDFStore('test0.h5','w')
In [31]: for chunk in pd.read_csv('Train.csv', chunksize=10000):
   ....:     store.append('df', chunk, index=False)

Note that if I use store.put on a dataframe imported in one shot, I can store it successfully, albeit slowly (I believe this is due to the pickling for object dtypes, even though the object is just string data).
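For reference, here's a minimal sketch of that one-shot put path that does work for me (assuming the whole of Train.csv fits in memory; this is the slow workaround, not a fix):

import pandas as pd

# read the whole file at once, then write the frame in a single put();
# object (string) columns get serialized, which is slow but tolerant of mixed lengths
df = pd.read_csv('Train.csv')

store = pd.HDFStore('test0.h5', 'w')
store.put('df', df)   # stores fine, but the result is not appendable/queryable like a table
store.close()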

Are there NaN value considerations that could be throwing this exception?

Exception:

Exception: cannot find the correct atom type -> [dtype->object,items->Index([UsageBand,
saledate, fiModelDesc, fiBaseModel, fiSecondaryDesc, fiModelSeries, fiModelDescriptor,
ProductSize, fiProductClassDesc, state, ProductGroup, ProductGroupDesc, Drive_System,
Enclosure, Forks, Pad_Type, Ride_Control, Stick, Transmission, Turbocharged,
Blade_Extension, Blade_Width, Enclosure_Type, Engine_Horsepower, Hydraulics, Pushblock,
Ripper, Scarifier, Tip_Control, Tire_Size, Coupler, Coupler_System, Grouser_Tracks,
Hydraulics_Flow, Track_Type, Undercarriage_Pad_Width, Stick_Length, Thumb,
Pattern_Changer, Grouser_Type, Backhoe_Mounting, Blade_Type, Travel_Controls,
Differential_Type, Steering_Controls], dtype=object)] list index out of range

UPDATE 1

Jeff's tip about lists stored in the dataframe led me to investigate embedded commas. pandas.read_csv is correctly parsing the file, and there are indeed some embedded commas within double-quotes. So these fields are not Python lists per se, but they do have commas in the text. Here are some examples:

3     Hydraulic Excavator, Track - 12.0 to 14.0 Metric Tons
6     Hydraulic Excavator, Track - 21.0 to 24.0 Metric Tons
8       Hydraulic Excavator, Track - 3.0 to 4.0 Metric Tons
11      Track Type Tractor, Dozer - 20.0 to 75.0 Horsepower
12    Hydraulic Excavator, Track - 19.0 to 21.0 Metric Tons
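As a quick sanity check on that claim, here's a minimal sketch (with a made-up two-column inline CSV) showing that read_csv keeps a quoted comma inside a single field rather than splitting on it:

import pandas as pd
from io import StringIO

# a quoted field containing a comma should stay one value, not spill into two columns
csv_text = ('SalesID,fiProductClassDesc\n'
            '1,"Hydraulic Excavator, Track - 12.0 to 14.0 Metric Tons"\n')

df = pd.read_csv(StringIO(csv_text))
print(df.shape)                          # (1, 2): still two columns
print(df.loc[0, 'fiProductClassDesc'])   # comma preserved inside the field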

However, when I drop this column from the pd.read_csv chunks and append to my HDFStore, I still get the same exception. When I try to append each column individually, I get the following new exception:

In [6]: for chunk in pd.read_csv('Train.csv', header=0, chunksize=50000):
   ...:     for col in chunk.columns:
   ...:         store.append(col, chunk[col], data_columns=True)

Exception: cannot properly create the storer for: [_TABLE_MAP] [group->/SalesID (Group) '',
value-><class 'pandas.core.series.Series'>,table->True,append->True,
kwargs->{'data_columns': True}]

I'll continue to troubleshoot. Here's a link to several hundred records:

https://docs.google.com/spreadsheet/ccc?key=0AutqBaUiJLbPdHFvaWNEMk5hZ1NTNlVyUVduYTZTeEE&usp=sharing

UPDATE 2

OK, I tried the following on my work computer and got this result:

In [4]: store = pd.HDFStore('test0.h5','w')

In [5]: for chunk in pd.read_csv('Train.csv', chunksize=10000):
   ...:     store.append('df', chunk, index=False, data_columns=True)
   ...:

Exception: cannot find the correct atom type -> [dtype->object,items->Index([fiBaseModel],
dtype=object)] [fiBaseModel] column has a min_itemsize of [13] but itemsize [9] is required!

I think I know what's going on here. If I take the max length of the field fiBaseModel in the first chunk, I get this:

In [16]: lens = df.fiBaseModel.apply(lambda x: len(x))

In [17]: max(lens[:10000])
Out[17]: 9

And the second chunk:

In [18]: max(lens[10001:20000])
Out[18]: 13

So the store table is created with 9 bytes for this column because that's the maximum of the first chunk. When it encounters a longer text field in a subsequent chunk, it throws the exception.

My suggestion for this would be to either truncate the data in subsequent chunks (with a warning), or allow the user to specify a maximum storage size for the column and truncate anything that exceeds it. Maybe pandas can do this already; I haven't had time to dive deep into HDFStore yet.
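It turns out append already takes a min_itemsize argument for roughly this purpose, either a single width or a per-column dict. A minimal sketch with made-up widths, assuming the columns are also declared as data columns so they can be sized individually:

store = pd.HDFStore('test0.h5', 'w')
for chunk in pd.read_csv('Train.csv', chunksize=10000):
    # reserve generous string widths up front so longer values in later
    # chunks still fit; the widths here are illustrative guesses
    store.append('df', chunk, index=False, data_columns=True,
                 min_itemsize={'fiBaseModel': 50, 'fiProductClassDesc': 100})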

UPDATE 3

I'm trying to import the CSV dataset using pd.read_csv, passing a dictionary that maps every column to object via the dtype parameter. I then iterate over the file and store each chunk into the HDFStore, passing a large value for min_itemsize. I get the following exception:

AttributeError: 'NoneType' object has no attribute 'itemsize'

My simple code:

store = pd.HDFStore('test0.h5','w')
objects = dict((col,'object') for col in header)

for chunk in pd.read_csv('Train.csv', header=0, dtype=objects,
                         chunksize=10000, na_filter=False):
    store.append('df', chunk, min_itemsize=200)

I've tried to debug and inspected the items in the stack trace. This is what the table looks like at the exception:

ipdb> self.table
/df/table (Table(10000,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": StringCol(itemsize=200, shape=(53,), dflt='', pos=1)}
  byteorder := 'little'
  chunkshape := (24,)
  autoIndex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_CSI=False}

UPDATE 4

Now I'm trying to iteratively determine the length of the longest string in object columns of my dataframe. This is how I do it:

def f(x):
    # only object (string) columns are relevant; skip numeric columns
    if x.dtype != 'object':
        return
    # length of the longest string in the column, treating NaN as ''
    return len(max(x.fillna(''), key=lambda s: len(str(s))))

lengths = pd.DataFrame([chunk.apply(f) for chunk in pd.read_csv('Train.csv', chunksize=50000)])
lens = lengths.max().dropna().to_dict()

In [255]: lens
Out[255]:
{'Backhoe_Mounting': 19.0,
 'Blade_Extension': 19.0,
 'Blade_Type': 19.0,
 'Blade_Width': 19.0,
 'Coupler': 19.0,
 'Coupler_System': 19.0,
 'Differential_Type': 12.0
 ... etc... }

Once I have the dict of max string-column lengths, I try to pass it to append via the min_itemsize argument:

In [262]: for chunk in pd.read_csv('Train.csv', chunksize=50000, dtype=types):
   .....:     store.append('df', chunk, min_itemsize=lens)

Exception: cannot find the correct atom type -> [dtype->object,items->Index([UsageBand,
saledate, fiModelDesc, fiBaseModel, fiSecondaryDesc, fiModelSeries, fiModelDescriptor,
ProductSize, fiProductClassDesc, state, ProductGroup, ProductGroupDesc, Drive_System,
Enclosure, Forks, Pad_Type, Ride_Control, Stick, Transmission, Turbocharged,
Blade_Extension, Blade_Width, Enclosure_Type, Engine_Horsepower, Hydraulics, Pushblock,
Ripper, Scarifier, Tip_Control, Tire_Size, Coupler, Coupler_System, Grouser_Tracks,
Hydraulics_Flow, Track_Type, Undercarriage_Pad_Width, Stick_Length, Thumb,
Pattern_Changer, Grouser_Type, Backhoe_Mounting, Blade_Type, Travel_Controls,
Differential_Type, Steering_Controls], dtype=object)]
[values_block_2] column has a min_itemsize of [64] but itemsize [58] is required!

The offending column was passed a min_itemsize of 64, yet the exception states that an itemsize of 58 is required. Could this be a bug?

In [266]: pd.__version__
Out[266]: '0.11.0.dev-eb07c5a'

1 Reply

The link you provided worked just fine to store the frame. Column by column just means specify data_columns=True. It will process the columns individually and raise on the offending one.

To diagnose

store = pd.HDFStore('test0.h5','w')
In [31]: for chunk in pd.read_csv('Train.csv', chunksize=10000):
   ....:     store.append('df', chunk, index=False, data_columns=True)

In production, you probably want to restrict data_columns to the columns that you want to query (it could also be None, in which case you can query only on the index/columns).
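For example, a rough sketch of what that might look like (column names taken from your data; the string-style where expression assumes a reasonably recent pandas):

store = pd.HDFStore('test0.h5', 'w')
for chunk in pd.read_csv('Train.csv', chunksize=10000):
    # only these columns become individually queryable data columns
    store.append('df', chunk, index=False,
                 data_columns=['YearMade', 'ProductGroup'])

# later: pull just the rows you need instead of reading the whole table
recent = store.select('df', where='YearMade > 2000')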

Update:

You might run into another issue. read_csv converts dtypes based on what it sees in each chunk, so with a chunksize of 10,000 the append operations failed because chunks 1 and 2 had integer-looking data in some columns, and then chunk 3 had some NaN so those columns became floats. Either specify the dtypes upfront, use a larger chunksize, or run your operations twice to guarantee consistent dtypes between chunks.
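A minimal sketch of the "specify upfront" option, pinning the columns that contain missing values to float so a NaN in a later chunk can't flip the dtype mid-stream (the two column names are the ones with nulls in your info output; adjust as needed):

import numpy as np
import pandas as pd

# pin the nullable columns to float64 in every chunk so their dtype never changes
dtypes = {'auctioneerID': np.float64, 'MachineHoursCurrentMeter': np.float64}

store = pd.HDFStore('test0.h5', 'w')
for chunk in pd.read_csv('Train.csv', chunksize=10000, dtype=dtypes):
    store.append('df', chunk, index=False, data_columns=True)
store.close()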

I have updated pytables.py to have a more helpful exception in this case (as well as telling you if a column has incompatible data).

Thanks for the report!

