Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


Read files in a directory starting from a specific date using Python

I have an automated process that needs to perform some operations on files. Another process creates these files and stores them in a directory. I only need to work with the recent files, but I have to leave all of them in place and never delete them. Because of the number of files, I think the process is starting to use a lot of resources when it collects the files it needs.

My initial idea was to create another process that copies the most recent files (with an extra day, just to be sure) to another folder, but I'm wondering (or rather I'm sure, hehe) whether there's a better way to get these files without reading all of them, or whether my code can be optimized.

My main issue is that when the process reaches this part of the code, the server's CPU usage goes off the charts, and I assume that at some point the process will just break due to some OS error. I just need the names of the files needed, which are the ones whose creation date is later than that of the last file I used. Every time I perform an operation on a file, its name goes into a table in a DB, which is where I get the name of the last file. My issue isn't with the queries or the operations performed, where the CPU usage is minimal; it's just this part, where I read all the files, compare their dates, and add them to an array.

Here's my code (don't get too angry if it's horrendous). The heavy load starts after the for:

def get_ordered_files():
    valid_files = []
    epoch = datetime.datetime.utcfromtimestamp(0)
    get_last_file = check_last_successful_file()
    last_date = os.path.getctime(get_last_file)
    files = glob.glob(files_location + file_extension)
    files.sort(key=os.path.getctime, reverse=False)
    for single_file in files:
        total_days_file = datetime.datetime.fromtimestamp(os.path.getctime(single_file)) - epoch
        total_days_last = datetime.datetime.fromtimestamp(last_date) - epoch
        if total_days_file.total_seconds() > total_days_last.total_seconds():
            check_empty = get_email_account(single_file)
            if check_empty != "" and check_empty is not None:
                valid_files.append(single_file)
    return valid_files

Thank you very much for all your help (I'm using Python 3.8).


1 Reply


There are a lot of redundant operations going on in your code.

For example, the use of fromtimestamp() to calculate total_days_last inside the loop can simply be done once outside of the loop. In fact, the use of datetime functions and mucking about with epoch seems unnecessary because you can simply compare the file ctime values directly.
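As a rough illustration of that last point, the sketch below compares two timestamps both ways. The function names and the sample timestamp values are made up for the example; only the datetime arithmetic mirrors the question's loop. For ordinary cases both approaches should agree, so the conversion adds work without changing the answer.

```python
import datetime

# Same value the question builds with datetime.datetime.utcfromtimestamp(0).
EPOCH = datetime.datetime(1970, 1, 1)


def newer_via_datetime(file_ts, last_ts):
    """The question's comparison: convert both timestamps, then compare seconds."""
    file_delta = datetime.datetime.fromtimestamp(file_ts) - EPOCH
    last_delta = datetime.datetime.fromtimestamp(last_ts) - EPOCH
    return file_delta.total_seconds() > last_delta.total_seconds()


def newer_direct(file_ts, last_ts):
    """Compare the raw float timestamps, as returned by os.path.getctime()."""
    return file_ts > last_ts


# Hypothetical timestamps one day apart (seconds since the epoch).
older = 1_600_000_000.0
newer = older + 86_400.0

print(newer_via_datetime(newer, older), newer_direct(newer, older))
print(newer_via_datetime(older, newer), newer_direct(older, newer))
```

Because fromtimestamp() returns local time, the converted values can shift by an hour around a daylight-saving change, which is one more reason to compare the raw values.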

os.path.getctime() is called twice on every file: once for the sort and a second time to calculate total_days_file.

These repetitive calculations over a large number of files would be part of the performance problem.

Another issue is that, if there are a large number of files, the list files could become very large and require a lot of memory.

The test if check_empty != "" and check_empty is not None: can simply be written as if check_empty:
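A quick check of that equivalence, assuming get_email_account() only ever returns a string or None (the question doesn't show it, so that is a guess). The helper names and the sample address are invented for the illustration.

```python
def keep_verbose(check_empty):
    """The condition as written in the question."""
    return check_empty != "" and check_empty is not None


def keep_truthy(check_empty):
    """The shortened condition: rely on truthiness."""
    return bool(check_empty)


# Hypothetical return values: no account, empty account, a real account.
samples = [None, "", "user@example.com"]

for value in samples:
    print(repr(value), keep_verbose(value), keep_truthy(value))
```

The two differ for other falsy values such as 0 or an empty list, so the shortcut is only safe if the function really is limited to strings and None.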

Here is a simplified version:

def get_ordered_files():
    last_ctime = os.path.getctime(check_last_successful_file())
    files = glob.glob(files_location + file_extension)
    files.sort(key=os.path.getctime)
    return [f for f in files
                if os.path.getctime(f) > last_ctime and get_email_account(f)]

This eliminates most of the redundant code but still calls os.path.getctime() twice for each file. To avoid that we can store the ctime for each file on the first occasion it is obtained.

import glob
import os
from operator import itemgetter

pattern = os.path.join(files_location, file_extension)

def get_ordered_files():
    last_ctime = os.path.getctime(check_last_successful_file())
    files = ((filename, ctime) for filename in glob.iglob(pattern)
                if (ctime := os.path.getctime(filename)) > last_ctime and
                    get_email_account(filename))
    return (filename for filename, _ in sorted(files, key=itemgetter(1)))

Here a generator expression is assigned to files. It uses glob.iglob(), an iterator version of glob.glob() that does not build a list of every matching file up front. Each file name is stored together with its ctime value as a tuple. The generator expression filters out files that are too old and files that don't have an associated email account. Finally, sorted() orders the remaining tuples by ctime, and a second generator is returned that yields just the file names in that order. Only the files that pass the filter are held in memory for the sort. The calling code can then iterate over the generator, or call list() on it to realise it as a list. Note that the := assignment expression requires Python 3.8 or later.
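If you want to try the idea without your own helpers, here is a self-contained sketch of the same filter-then-sort approach. The names newer_files, keep, get_time and demo are invented for this example and are not part of your code. The demo uses modification times only because they can be set with os.utime(), while ctime cannot be set portably. Keep in mind that on Unix, ctime is the time of the last metadata change rather than the creation time.

```python
import glob
import os
import tempfile
from operator import itemgetter


def newer_files(pattern, last_time, keep=lambda name: True,
                get_time=os.path.getctime):
    """Return a generator of names matching `pattern` whose timestamp is
    later than `last_time`, oldest first. Each timestamp is read once."""
    candidates = (
        (name, stamp)
        for name in glob.iglob(pattern)
        if (stamp := get_time(name)) > last_time and keep(name)
    )
    # sorted() builds a list, but only of the files that passed the filter.
    return (name for name, _ in sorted(candidates, key=itemgetter(1)))


def demo():
    """Create three files with known timestamps and list those after a cutoff."""
    base = 1_600_000_000  # arbitrary reference time, purely illustrative
    with tempfile.TemporaryDirectory() as folder:
        for name, offset in [("a.txt", 0), ("b.txt", 300), ("c.txt", 200)]:
            path = os.path.join(folder, name)
            with open(path, "w"):
                pass
            os.utime(path, (base + offset, base + offset))
        found = newer_files(os.path.join(folder, "*.txt"), base + 100,
                            get_time=os.path.getmtime)
        return [os.path.basename(name) for name in found]


print(demo())
```

In your case you would pass the ctime of the last processed file as last_time and get_email_account as keep, leaving get_time at its default.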

