
python - Fast conversion of timestamps for duration calculation

We've got a log analyzer which parses logs on the order of 100 GB (my test file is ~20 million lines, 1.8 GB). It's taking longer than we'd like (upwards of half a day), so I profiled it with cProfile, and more than 75% of the time is spent in strptime:

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       1    0.253    0.253  560.629  560.629 <string>:1(<module>)
20000423  202.508    0.000  352.246    0.000 _strptime.py:299(_strptime)
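
(For reference, a harness along these lines reproduces that kind of profile; the entry-point name analyze is an assumption, since the original invocation isn't shown.)

import cProfile

# Hypothetical harness: run the analyzer's entry point (name assumed)
# under cProfile and sort by cumulative time, which is what surfaced
# strptime as the hot spot above.
cProfile.run("analyze('test.log')", sort="cumtime")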

We call strptime to compute the duration between consecutive log entries, currently as:

ltime = datetime.strptime(split_line[time_col].strip(), "%Y-%m-%d %H:%M:%S")
lduration = (ltime - otime).total_seconds()

where otime is the timestamp from the previous line.

The log files are formatted along the lines of:

0000 | 774 | 475      | 2017-03-29 00:06:47 | M      |        63
0001 | 774 | 475      | 2017-03-29 01:09:03 | M      |        63
0000 | 774 | 475      | 2017-03-29 01:19:50 | M      |        63
0001 | 774 | 475      | 2017-03-29 09:42:57 | M      |        63
0000 | 775 | 475      | 2017-03-29 10:24:34 | M      |        63
0001 | 775 | 475      | 2017-03-29 10:33:46 | M      |        63    
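
To make the snippets above concrete, here is a minimal sketch of the per-line loop, assuming the '|'-delimited layout shown and the timestamp in the fourth field; the loop structure is illustrative, not the analyzer's actual code:

from datetime import datetime

time_col = 3  # the timestamp is the fourth '|'-separated field

otime = None
with open("test.log") as log_file:
    for line in log_file:
        split_line = line.split("|")
        ltime = datetime.strptime(split_line[time_col].strip(),
                                  "%Y-%m-%d %H:%M:%S")
        if otime is not None:
            lduration = (ltime - otime).total_seconds()
            # ... accumulate/report lduration here ...
        otime = ltime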

It takes almost 10 minutes to run it against the test file.

Replacing strptime() with this (adapted from another question):

def to_datetime(d):
    # Parse the fixed-width "YYYY-MM-DD HH:MM:SS" string by slicing,
    # skipping strptime's format handling entirely.
    return datetime.datetime(int(d[:4]),
                             int(d[5:7]),
                             int(d[8:10]),
                             int(d[11:13]),
                             int(d[14:16]),
                             int(d[17:19]))

brings that down to just over 3 minutes.

cProfile again reports:

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       1    0.265    0.265  194.538  194.538 <string>:1(<module>)
20000423   62.688    0.000   62.688    0.000 analyzer.py:88(to_datetime)

This conversion still accounts for about a third of the analyzer's total runtime. Inlining it reduces the conversion's footprint by about 20%, but roughly 25% of the time spent processing these lines still goes to converting the timestamp into a datetime (with total_seconds() consuming another ~5% on top of that).

I may end up just writing a custom timestamp-to-seconds conversion to bypass datetime entirely, unless someone has another bright idea?
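
For illustration, one such custom conversion could cache the epoch value of each date string and add the clock fields by hand; consecutive log lines usually share a date, so strptime runs only once per distinct day. This is a hypothetical sketch, assuming naive/UTC timestamps, not code from the analyzer:

import calendar
import time

_date_cache = {}

def to_seconds(ts):
    # ts is "YYYY-MM-DD HH:MM:SS"; parse each distinct date once and
    # cache its midnight epoch, then add the clock fields arithmetically.
    date = ts[:10]
    base = _date_cache.get(date)
    if base is None:
        base = calendar.timegm(time.strptime(date, "%Y-%m-%d"))
        _date_cache[date] = base
    return base + int(ts[11:13]) * 3600 + int(ts[14:16]) * 60 + int(ts[17:19])

# Durations then become plain integer subtraction:
# lduration = to_seconds(ts) - to_seconds(prev_ts)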


1 Reply


So I kept looking, and I found a module that does a fantastic job:

Introducing ciso8601:

from ciso8601 import parse_datetime
...
ltime = parse_datetime(split_line[time_col].strip())

Which, via cProfile:

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       1    0.254    0.254  123.795  123.795 <string>:1(<module>)
20000423    4.188    0.000    4.188    0.000 {ciso8601.parse_datetime}

That is ~84x faster than the naive approach via datetime.strptime(), which is not surprising given that ciso8601 is a C module built to do exactly this.
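
ciso8601 is on PyPI (pip install ciso8601), and parse_datetime returns a plain datetime, so the existing duration arithmetic is unchanged. A quick sanity check against two timestamps from the sample log above:

from ciso8601 import parse_datetime

a = parse_datetime("2017-03-29 00:06:47")
b = parse_datetime("2017-03-29 01:09:03")
print((b - a).total_seconds())  # 3736.0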

