Before throwing more CPUs at your problem, you should invest some time in inspecting which parts of your code are slow.
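If you have never profiled Python code, the built-in cProfile module is a quick way to find such hot spots. A minimal sketch, assuming the input file is called input.fasta (a placeholder name):

import cProfile

# sort by cumulative time to see the most expensive calls first
cProfile.run('fasta2df("input.fasta")', sort='cumtime')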
In your case, you are executing the expensive conversion seq_df = pd.DataFrame(seqList) in every loop iteration. That wastes CPU time, as the result seq_df is overwritten in the next iteration.
Your code took over 15 minutes on my machine. After moving pd.DataFrame(seqList) and the print statement out of the loop, it is down to ~15 seconds:
import pandas as pd
from Bio import SeqIO

def fasta2df(infile):
    records = SeqIO.parse(infile, 'fasta')
    seqList = []
    for record in records:
        desp = record.description
        # str(record.seq) uses the public Seq API; the private _data
        # attribute holds bytes in recent Biopython versions
        seq = list(str(record.seq).upper())
        seqList.append([desp] + seq)
    # build the DataFrame once, after the loop
    seq_df = pd.DataFrame(seqList)
    seq_df.columns = ['strainName'] + list(range(1, seq_df.shape[1]))
    return seq_df
In fact, almost all of the time is spent in the line seq_df = pd.DataFrame(seqList) (about 13 seconds for me). By setting the dtype explicitly to string, we can bring it down to ~7 seconds:
def fasta2df(infile):
    records = SeqIO.parse(infile, 'fasta')
    seqList = []
    for record in records:
        desp = record.description
        seq = list(str(record.seq).upper())
        seqList.append([desp] + seq)
    # an explicit dtype lets pandas skip per-column type inference
    seq_df = pd.DataFrame(seqList, dtype="string")
    seq_df.columns = ['strainName'] + list(range(1, seq_df.shape[1]))
    return seq_df
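To verify the timings on your own machine, you can wrap the call in a simple stopwatch; again, input.fasta is a placeholder for your actual file:

import time

start = time.perf_counter()
df = fasta2df("input.fasta")
print(f"fasta2df took {time.perf_counter() - start:.1f} s")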
With this performance, I highly doubt that parallel processing can speed things up much further.