I have identified one pandas command
timeseries.loc[z, x] = y
to be responsible for most of the time spent in an iteration, and I am now looking for a faster approach. The loop covers fewer than 50k elements (the production goal is ~250k or more), but it already needs a sad 20 seconds.
Here is my code (ignore the top half; it is just the timing helper):
import time
from pprint import pprint

def populateTimeseriesTable(df, observable, timeseries):
    """
    Go through all rows of df and
    put the observable into the timeseries
    at correct row (symbol), column (tsMean).
    """
    print("len(df.index)= %d" % len(df.index))  # show number of rows

    global bf, t
    bf = time.time()                      # set 'before' to now
    t = dict([(i, 0) for i in range(5)])  # fill category timing with zeros

    def T(i):
        """
        timing helper: Add passed time to category 'i'. Then set 'before' to now.
        """
        global bf, t
        t[i] = t[i] + (time.time() - bf)
        bf = time.time()

    for i in df.index:  # this is the slow loop
        bf = time.time()

        sym = df["symbol"][i]
        T(0)

        tsMean = df["tsMean"][i]
        T(1)

        tsMean = tsFormatter(tsMean)
        T(2)

        o = df[observable][i]
        T(3)

        timeseries.loc[sym, tsMean] = o  # the step that dominates the runtime
        T(4)

    print("times needed (total = %.1f seconds) for each command:" % sum(t.values()))
    pprint(t)

    return timeseries
With (not important, not slow)
def tsFormatter(ts):
    "as human readable string, only up to whole seconds"
    return time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(ts))
--> The to-be-optimized code is in the for-loop.
(T, and t are just helper function & dict, for the timing.)
I have timed every step. The vast majority of the time:
len(df.index)= 47160
times needed (total = 20.2 seconds) for each command:
{0: 1.102,
1: 0.741,
2: 0.243,
3: 0.792,
4: 17.371}
is spent in the last step
timeseries.loc[sym, tsMean] = o
I have already downloaded and installed PyPy, but sadly it doesn't support pandas yet.
Any ideas how to speed up populating a 2D array?
Thanks!
Edit: Sorry, I hadn't mentioned that 'timeseries' is a DataFrame too:
timeseries = pd.DataFrame({"name": titles}, index=index)
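One direction that might avoid the per-cell assignment, sketched here under assumptions and not timed on the real data: format all timestamps at once, build the whole symbol-by-timestamp table with a single pivot, and then join it onto 'timeseries'. The names populate_vectorized and ts_formatter, the "price" column, and the tiny sample data are made up for illustration. The sketch also assumes that 'timeseries' is indexed by symbol and that, for a repeated (symbol, timestamp) pair, the last row should win.

```python
import time

import pandas as pd


def ts_formatter(ts):
    """Format a Unix timestamp as a human-readable UTC string (whole seconds)."""
    return time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(ts))


def populate_vectorized(df, observable):
    """Build the symbol x timestamp table in one step instead of cell by cell."""
    tmp = pd.DataFrame({
        "symbol": df["symbol"],
        "ts": df["tsMean"].map(ts_formatter),
        "value": df[observable],
    })
    # Assumption: if a (symbol, timestamp) pair occurs more than once, keep the
    # last row, mirroring what repeated .loc assignments would leave behind.
    tmp = tmp.drop_duplicates(subset=["symbol", "ts"], keep="last")
    return tmp.pivot(index="symbol", columns="ts", values="value")


# Tiny made-up input, only to show the shape of the result.
df = pd.DataFrame({
    "symbol": ["AAA", "AAA", "BBB"],
    "tsMean": [0.0, 60.0, 60.0],
    "price": [1.0, 2.0, 3.0],
})
wide = populate_vectorized(df, "price")

# Hypothetical 'timeseries' frame, assumed to be indexed by symbol.
timeseries = pd.DataFrame({"name": ["Alpha", "Beta"]}, index=["AAA", "BBB"])
result = timeseries.join(wide)  # left join on the symbol index
print(result)
```

Cells with no observation for a given symbol and timestamp come out as NaN, which is also what the loop version leaves in cells it never assigns.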