
Python pandas integer precision loss through implicit conversion

I am working with a lot of different CSV files, which I read in as pandas DataFrames. From each one I extract interesting indices and data, which I collect in a new DataFrame that I build line by line and then save. Each line represents the information from one file.

The original DataFrames are indexed on epoch times with millisecond precision. The times are unnecessarily precise, but I cannot change anything about that.

>>> df.index
Int64Index([1382441313687, 1382441314687, 1382441315687, 1382441316687,
        1382441317687, 1382441318687, 1382441319687, 1382441320687,
        1382441321687, 1382441322687,
        ...
        1382445583687, 1382445584687, 1382445585687, 1382445586687,
        1382445587687, 1382445588687, 1382445589687, 1382445590687,
        1382445591687, 1382445592687],
       dtype='int64', name=u'time', length=4280)

I build the new DataFrame by collecting the interesting values in a list, creating a Series from it, and appending that Series to the DataFrame.

columns = ['Start time', 'End time']
summary = pd.DataFrame(columns=columns)
for i, df in enumerate(long_list_of_dfs):
    start_time = df.index[0]
    end_time = df.index[-1]
    data = [start_time, end_time]
    new_line = pd.Series({key: val for key, val in zip(columns, data)}, name=i)
    summary = summary.append(new_line)
summary.to_csv(out_dir)

I use the saved indices from the summary to quickly index interesting points in the original DataFrames. However, while building the new DataFrame, some precision gets lost, and I end up with the following:

>>> for line in open(out_dir):
...     print(line)
,Start time,End time
0,1.38244131369e+12,1.38244559269e+12

When reading in this summary again, I cannot use these values to index the original dataframes anymore, as it results in a KeyError. This does not happen when building the dataframe directly:

>>> summary2 = pd.DataFrame({'Start time':[1382441313687], 'End time':[1382445592687]})
>>> summary2
        End time     Start time
0  1382445592687  1382441313687
>>> summary2.to_csv(out_dir)
>>> for line in open(out_dir):
...     print(line)
,Start time,End time
0,1382441313687,1382445592687

Does anyone know why this conversion happens? I know I could specify the datatypes, but I have a lot of columns with different datatypes and would rather avoid the hassle. I also feel it would be more intuitive if the values kept their original formatting.
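
For reference, this is roughly what I mean by specifying datatypes; a sketch assuming only the two time columns are affected:

# Sketch: cast the affected columns back to int64 before saving.
# Works here because the float64 values are still exact in memory
# (see the second edit below); only the CSV formatting drops digits.
summary = summary.astype({'Start time': 'int64', 'End time': 'int64'})
summary.to_csv(out_dir)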

EDIT: I want to emphasize that I build the DataFrame inside a for loop because I have many data points of interest to add per line. Also, the number of original DataFrames is rather high (~90,000 files at 20 MB each), so I want to open each file only once.

The code above is just a working example to show that, despite the data being integer, the last two digits get rounded off, presumably in the append line. The new_line Series still holds the data in its original format, down to the last two digits.

Below is the output of summary.info() for the first 10 columns. As you can see, there are columns that contain NaNs, but also some that don't. I would expect the columns without NaNs to retain their integer format.

>>> summary.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 88158 entries, 0 to 88157
Data columns (total 46 columns):
Date added            88158 non-null object
Filename              88158 non-null object
ACID                  88158 non-null int64
FLID                  88158 non-null int64
Flag                  88158 non-null object
L ESN                 86986 non-null float64
R ESN                 86986 non-null float64
Start time            88158 non-null float64
End time              88158 non-null float64
Total duration        88158 non-null float64

EDIT 2: Here is another short example showing my problem when building a DataFrame line by line with long integers.

>>> df = pd.DataFrame(columns=['a', 'b'])
>>> df.loc[len(df.index)] = [1382441313687, 1382441314687]
>>> df
              a             b
0  1.382441e+12  1.382441e+12
>>> df.loc[0, 'a']
1382441313687.0 # Correct data!
>>> df.to_csv(out_dir)
>>> for line in open(out_dir):
...     print(line)    
,a,b
0,1.38244131369e+12,1.38244131469e+12 # Not correct! 1382441313690 != 1382441313687
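
Interestingly, the in-memory float still holds the exact value (these timestamps are far below 2**53), so the digits are only lost when to_csv formats the floats. Forcing a fixed-point format seems to keep them (a hedged workaround, not a fix for the underlying dtype):

>>> df.to_csv(out_dir, float_format='%.0f')
>>> for line in open(out_dir):
...     print(line)
,a,b
0,1382441313687,1382441314687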


1 Reply

This happens because you append a Series, which has a single dtype, so if it contains even one float, all the other values get cast to float too.
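
You can see that cast with a tiny example; a single float is enough to pull the whole Series to float64:

>>> pd.Series([1382441313687, 2.5])
0    1.382441e+12
1    2.500000e+00
dtype: float64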

I only managed to reproduce your problem by slightly adapting your code.

Sample data generation

import pandas as pd

columns = ['sample_data']
columns2 = ['Start time', 'End time'] + columns
long_list_of_dfs = [
    pd.DataFrame(index=[i**2 + j for j in range(i)],
                 columns=columns,
                 data=[j**2 for j in range(i)])
    for i in range(5, 15)
]

Adapted original code

summary2 = pd.DataFrame(columns=columns2)
for i, df in enumerate(long_list_of_dfs):
    start_time = df.index[0]
    end_time = df.index[-1]
    data = [df[k].mean() for k in columns]
    new_line = pd.Series({key: val for key, val in zip(columns2, [start_time, end_time] + data)}, name=i)
    summary2 = summary2.append(new_line)
summary2.info()

result:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 3 columns):
Start time     10 non-null float64
End time       10 non-null float64
sample_data    10 non-null float64
dtypes: float64(3)
memory usage: 320.0 bytes

new_line

End time       209.0
Start time     196.0
sample_data     58.5
Name: 9, dtype: float64

So the conversion happens when the Series is created, before the appending.

Summary generator

A way to prevent this is to not build a Series for every original DataFrame, but to use a generator like the one below. This can be any method you use to generate the summary you need:

def get_summary_data(long_list_of_dfs, columns):
    for df in long_list_of_dfs:
        s = [df[k].mean() for k in columns]
        # print(df.index[0], df.index[-1], *s)
        yield (df.index[0], df.index[-1], *s)

and then building the DataFrame directly from the generator:

summary = pd.DataFrame(data=get_summary_data(long_list_of_dfs, columns), columns=columns2)

result:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
Start time     10 non-null int64
End time       10 non-null int64
sample_data    10 non-null float64
dtypes: float64(1), int64(2)
memory usage: 320.0 bytes

Summary:

   Start time  End time  sample_data
0          25        29     6.000000
1          36        41     9.166667
2          49        55    13.000000
3          64        71    17.500000
4          81        89    22.666667
5         100       109    28.500000
6         121       131    35.000000
7         144       155    42.166667
8         169       181    50.000000
9         196       209    58.500000

This DataFrame can be exported using to_csv().
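
As a quick round-trip check (file name hypothetical), the integer columns now survive the trip through CSV:

summary.to_csv('summary.csv', index=False)
check = pd.read_csv('summary.csv')
print(check['Start time'].dtype)  # int64 -- all digits preserved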

