How to parse the CSV directly into the desired DataFrame:
Pass a dict of functions to pandas.read_csv
's converters
keyword argument:
import pandas as pd
import datetime as DT
nfp = pd.read_csv("NFP.csv",
sep=r'[s,]', # 1
header=None, skiprows=1,
converters={ # 2
0: lambda x: DT.datetime.strptime(x, '%Y%m%d'),
1: lambda x: DT.time(*map(int, x.split(':')))},
names=['Date', 'Time', 'Actual', 'Consensus', 'Previous'])
print(nfp)
yields
Date Time Actual Consensus Previous
0 2014-01-10 13:30:00 74000 196000 241000
1 2013-12-06 13:30:00 241000 180000 200000
2 2013-11-08 13:30:00 200000 125000 163000
3 2013-10-22 12:30:00 163000 180000 193000
4 2013-09-06 12:30:00 193000 180000 104000
5 2013-08-02 12:30:00 104000 184000 188000
6 2013-07-05 12:30:00 188000 165000 176000
7 2013-06-07 12:30:00 176000 170000 165000
8 2013-05-03 12:30:00 165000 145000 138000
9 2013-04-05 12:30:00 138000 200000 268000
sep=r'[s,]'
tells read_csv
to split lines of the csv on the
regex pattern r'[s,]'
-- a whitespace or a comma.
- The
converters
parameter tells read_csv
to apply the given
functions to certain columns. The keys (e.g. 0 and 1) refer to the
column index, and the values are the functions to be applied.
How to split the DataFrame after performing csv_read
import pandas as pd
nfp = pd.read_csv("NFP.csv", parse_dates=[0], infer_datetime_format=True)
temp = pd.DatetimeIndex(nfp['DateTime'])
nfp['Date'] = temp.date
nfp['Time'] = temp.time
del nfp['DateTime']
print(nfp)
Which is faster?
It depends on the size of the CSV. (Thanks to Jeff for pointing this out.)
For tiny CSVs, parsing the CSV into the desired form directly is faster than using a DatetimeIndex after parsing with parse_dates=[0]
:
def using_converter():
nfp = pd.read_csv("NFP.csv", sep=r'[s,]', header=None, skiprows=1,
converters={
0: lambda x: DT.datetime.strptime(x, '%Y%m%d'),
1: lambda x: DT.time(*map(int, x.split(':')))},
names=['Date', 'Time', 'Actual', 'Consensus', 'Previous'])
return nfp
def using_index():
nfp = pd.read_csv("NFP.csv", parse_dates=[0], infer_datetime_format=True)
temp = pd.DatetimeIndex(nfp['DateTime'])
nfp['Date'] = temp.date
nfp['Time'] = temp.time
del nfp['DateTime']
return nfp
In [114]: %timeit using_index()
100 loops, best of 3: 1.71 ms per loop
In [115]: %timeit using_converter()
1000 loops, best of 3: 914 μs per loop
However, for CSVs of just a few hundred lines or more, using a DatetimeIndex is faster.
N = 20
filename = '/tmp/data'
content = '''
DateTime,Actual,Consensus,Previous
20140110 13:30:00,74000,196000,241000
20131206 13:30:00,241000,180000,200000
20131108 13:30:00,200000,125000,163000
20131022 12:30:00,163000,180000,193000
20130906 12:30:00,193000,180000,104000
20130802 12:30:00,104000,184000,188000
20130705 12:30:00,188000,165000,176000
20130607 12:30:00,176000,170000,165000
20130503 12:30:00,165000,145000,138000
20130405 12:30:00,138000,200000,268000'''
def setup(n):
header, remainder = content.split('
', 1)
with open(filename, 'w') as f:
f.write('
'.join([header]+[remainder]*n))
In [304]: setup(50)
In [305]: %timeit using_converter()
100 loops, best of 3: 9.78 ms per loop
In [306]: %timeit using_index()
100 loops, best of 3: 9.3 ms per loop
Where can I look up this kind of information?
- Sometimes you can find examples in the Pandas Cookbook.
- Sometimes web searching or searching Stackoverflow suffices.
- Spending a weekend snowed in with nothing to do but reading the
pandas documentation will surely help too.
- Install IPython. It has tab completion and if you type a
?
after a
function, it gives you the function's docstring. Those two features
really help you introspect Python objects quickly. It also tells you in what file the function is defined (if defined in pure Python) -- which leads me to...
- Reading the source code
Just keep at it. The more you know the easier it gets.
If you give it your best shot and still can't find the answer, post a question on Stackoverflow. You'll hopefully get an answer quickly, and help others searching for the same thing.