I am working with a lot of different CSV files that I read in as pandas dataframes. From each of them I extract interesting indices and data, which I collect in a new dataframe that I build line by line and then save. Each line represents the information from one file.
The original dataframes are indexed on millisecond precision epoch times. Although the times are unnecessarily precise, I cannot change anything about it.
>>> df.index
Int64Index([1382441313687, 1382441314687, 1382441315687, 1382441316687,
1382441317687, 1382441318687, 1382441319687, 1382441320687,
1382441321687, 1382441322687,
...
1382445583687, 1382445584687, 1382445585687, 1382445586687,
1382445587687, 1382445588687, 1382445589687, 1382445590687,
1382445591687, 1382445592687],
dtype='int64', name=u'time', length=4280)
I build the new dataframe by collecting the interesting values in a list, creating a Series from it, and appending that Series to the dataframe:
columns = ['Start time', 'End time']
summary = pd.DataFrame(columns=columns)
for i, df in enumerate(long_list_of_dfs):
    start_time = df.index[0]    # first epoch timestamp (int64)
    end_time = df.index[-1]     # last epoch timestamp (int64)
    data = [start_time, end_time]
    new_line = pd.Series({key: val for key, val in zip(columns, data)})
    summary = summary.append(new_line, ignore_index=True)
summary.to_csv(out_dir)
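To make the dtype change visible: right after the loop, the last new_line Series is still int64, while the same columns in summary have been upcast to float64 (checked in a session like the ones below; output abbreviated):

>>> new_line.dtype
dtype('int64')
>>> summary[['Start time', 'End time']].dtypes
Start time    float64
End time      float64
dtype: object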
I use the saved indices from the summary to quickly index interesting points in the original dataframes. However, while building the new dataframe, some precision gets lost, and I end up with the following:
>>> for line in open(out_dir):
... print(line)
,Start time,End time
0,1.38244131369e+12,1.38244559269e+12
When reading in this summary again, I cannot use these values to index the original dataframes anymore, as it results in a KeyError. This does not happen when building the dataframe directly:
>>> summary2 = pd.DataFrame({'Start time':[1382441313687], 'End time':[1382445592687]})
>>> summary2
End time Start time
0 1382445592687 1382441313687
>>> summary2.to_csv(out_dir)
>>> for line in open(out_dir):
... print(line)
,Start time,End time
0,1382441313687,1382445592687
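For reference, the lookup I do with the read-back summary looks roughly like this (paraphrased; the read_csv call and variable names are just for illustration). The rounded value no longer matches any label in the original index:

>>> saved = pd.read_csv(out_dir, index_col=0)
>>> start = int(saved.loc[0, 'Start time'])   # 1382441313690 after the rounding
>>> df.loc[start]
...
KeyError: 1382441313690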
Does anyone know why this conversion happens? I know I could specify the datatypes explicitly, but I have a lot of columns with different datatypes and would rather avoid the hassle. I also feel it would be more intuitive if the values simply kept their original format.
EDIT
I want to emphasize that I build the dataframe inside a for loop because I have many data points of interest to add per line. Also, the number of original dataframes is rather high (~90,000 files at ~20 MB each), so I want to open each file only once.
The code above is just a working example to show that, despite the data being integer, the last two digits get rounded off, presumably in the append line. The new_line Series still holds the data in its original format, down to the last two digits.
Below is the output of summary.info() for the first 10 of the 46 columns. As you can see, some columns contain NaNs and some don't. I would expect the columns without NaNs to retain their integer format.
>>> summary.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 88158 entries, 0 to 88157
Data columns (total 46 columns):
Date added 88158 non-null object
Filename 88158 non-null object
ACID 88158 non-null int64
FLID 88158 non-null int64
Flag 88158 non-null object
L ESN 86986 non-null float64
R ESN 86986 non-null float64
Start time 88158 non-null float64
End time 88158 non-null float64
Total duration 88158 non-null float64
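My expectation comes from the usual behaviour that only columns containing NaN are forced to float64, while a purely integer column built in one go stays int64 (minimal sketch, independent of my data):

>>> import numpy as np
>>> demo = pd.DataFrame({'no_nan': [1382441313687, 1382445592687],
...                      'with_nan': [1382441313687, np.nan]})
>>> demo.dtypes
no_nan        int64
with_nan    float64
dtype: object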
EDIT2
Here is another short example to show my problem when building a dataframe line by line with long integers.
>>> df = pd.DataFrame(columns=['a', 'b'])
>>> df.loc[len(df.index)] = [1382441313687, 1382441314687]
>>> df
a b
0 1.382441e+12 1.382441e+12
>>> df.loc[0, 'a']
1382441313687.0 # Correct data!
>>> df.to_csv(out_dir)
>>> for line in open(out_dir):
... print(line)
,a,b
0,1.38244131369e+12,1.38244131469e+12 # Not correct! 1382441313690 != 1382441313687
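For completeness, this is one way to spell the "specify datatypes" route I mentioned above. It does keep all the digits in the CSV, but I would have to repeat it for every integer column (sketch only):

>>> df.astype({'a': 'int64', 'b': 'int64'}).to_csv(out_dir)
>>> for line in open(out_dir):
...     print(line)
,a,b
0,1382441313687,1382441314687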