Before throwing more CPUs at your problem, you should invest some time in inspecting which parts of your code are slow.
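If you have never profiled Python code, the built-in cProfile module is a quick way to find such hot spots. A minimal sketch, assuming the input file is called input.fasta (a placeholder name):

import cProfile

# sort by cumulative time to see the most expensive calls first
cProfile.run('fasta2df("input.fasta")', sort='cumtime')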
In your case, you are executing the expensive conversion seq_df = pd.DataFrame(seqList) in every loop iteration. That wastes CPU time, as the result seq_df is overwritten in the next iteration.
Your code took over 15 minutes on my machine. After moving pd.DataFrame(seqList) and the print statement out of the loop, it is down to ~15 seconds:
import pandas as pd
from Bio import SeqIO

def fasta2df(infile):
    records = SeqIO.parse(infile, 'fasta')
    seqList = []
    for record in records:
        desp = record.description
        # str(record.seq) uses the public Seq API; the private _data
        # attribute holds bytes in recent Biopython versions
        seq = list(str(record.seq).upper())
        seqList.append([desp] + seq)
    # build the DataFrame once, after the loop
    seq_df = pd.DataFrame(seqList)
    seq_df.columns = ['strainName'] + list(range(1, seq_df.shape[1]))
    return seq_df
In fact, almost all of the time is spent in the line seq_df = pd.DataFrame(seqList) (about 13 seconds for me). By setting the dtype explicitly to string, we can bring it down to ~7 seconds:
def fasta2df(infile):
    records = SeqIO.parse(infile, 'fasta')
    seqList = []
    for record in records:
        desp = record.description
        seq = list(str(record.seq).upper())
        seqList.append([desp] + seq)
    # an explicit dtype lets pandas skip per-column type inference
    seq_df = pd.DataFrame(seqList, dtype="string")
    seq_df.columns = ['strainName'] + list(range(1, seq_df.shape[1]))
    return seq_df
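To verify the timings on your own machine, you can wrap the call in a simple stopwatch; again, input.fasta is a placeholder for your actual file:

import time

start = time.perf_counter()
df = fasta2df("input.fasta")
print(f"fasta2df took {time.perf_counter() - start:.1f} s")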
With this performance, I highly doubt that parallel processing can speed things up much further.