I'm working with fasta
files, where one sequence seems to be split into 2 separate entities.
As example:
>SRR6026797.1.1 1 length=251
GGAGTGCCAGCAGCCGCGGTAATACGTAGGTGGCGAGCGTTGTCCGGAATCACTGGGCGTAAAGGGCGCGCAGGCGGCGTGTCAAGTCGGACGTGAAAACCCCTGGCTCAACTGGGGGATGTCGTTCGAAACTGGCATGCTTGAGTGCAGGAGAGGGAAGCGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAGGAACACCAGTGGCGAAGGCGGCTTCCTGGACTGTAACTGACGCTGAGGC
>SRR6026797.1.2 1 length=251
CCGGGAGGACTACAGCGGTATCTAATCCTGTTCGCTACCCACGCTTTCGTGCCTCAGCGTCAGTTACAGTCCAGGAAGCCGCCTTCGCCACTGGTGTTCCTCCCGATATCTACGCATTTCACCGCTACACCGGGAATTCCGCTTCCCTCTCCTGCACTCAAGCATGCCAGTTTCGAACGCCATCCCCCAGTTGAGCCAGGGGTTTTCACGTCCGACTTGACACGCCGCCTGCGCGCCCTTTACGCCCACTC
>SRR6026797.2.1 2 length=251
GGAGTGCCAGCAGCCGCGGTAACACGTAGGGGGCAAGCGTTGTCCGGAATCACTGGGCGTAAAGGGTGCGTAGGCGGTATGTCAAGTCAGGTGTGAAACCCCATGGCTTAACTGTGGGCTTGCACATGAAACTGGCATGCTTGAGTGCAGGAGAGGAAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGACTGACCCTGACGCTGATGT
>SRR6026797.2.2 2 length=251
CCGGGAGGACTACAAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCACATCAGCGTCAGGGTCAGTCCAGAAAGTCGCCTTCGCCACTGGTGTTCCTCCTAATATCTACGCATTTCACCGCTACACTAGGAATTCCGCTTTCCTCTCCTGCACTCAAGCATGCCAGTTTCATGTGCAAGCCCACAGTTAAGCCATGGGGTTTCACACCTGACTTGACATACCGCCTACGCACCCTTTACGCCCAGTG
However, it should be merged as:
>SRR6026797.1 1 length=502
GGAGTGCCAGCAGCCGCGGTAATACGTAGGTGGCGAGCGTTGTCCGGAATCACTGGGCGTAAAGGGCGCGCAGGCGGCGTGTCAAGTCGGACGTGAAAACCCCTGGCTCAACTGGGGGATGTCGTTCGAAACTGGCATGCTTGAGTGCAGGAGAGGGAAGCGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAGGAACACCAGTGGCGAAGGCGGCTTCCTGGACTGTAACTGACGCTGAGGCCCGGGAGGACTACAGCGGTATCTAATCCTGTTCGCTACCCACGCTTTCGTGCCTCAGCGTCAGTTACAGTCCAGGAAGCCGCCTTCGCCACTGGTGTTCCTCCCGATATCTACGCATTTCACCGCTACACCGGGAATTCCGCTTCCCTCTCCTGCACTCAAGCATGCCAGTTTCGAACGCCATCCCCCAGTTGAGCCAGGGGTTTTCACGTCCGACTTGACACGCCGCCTGCGCGCCCTTTACGCCCACTC
>SRR6026797.2 2 length=502
GGAGTGCCAGCAGCCGCGGTAACACGTAGGGGGCAAGCGTTGTCCGGAATCACTGGGCGTAAAGGGTGCGTAGGCGGTATGTCAAGTCAGGTGTGAAACCCCATGGCTTAACTGTGGGCTTGCACATGAAACTGGCATGCTTGAGTGCAGGAGAGGAAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGACTGACCCTGACGCTGATGTCCGGGAGGACTACAAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCACATCAGCGTCAGGGTCAGTCCAGAAAGTCGCCTTCGCCACTGGTGTTCCTCCTAATATCTACGCATTTCACCGCTACACTAGGAATTCCGCTTTCCTCTCCTGCACTCAAGCATGCCAGTTTCATGTGCAAGCCCACAGTTAAGCCATGGGGTTTCACACCTGACTTGACATACCGCCTACGCACCCTTTACGCCCAGTG
I wonder, what is the proper way to produce output file with merged sequences (probably, using biopython
)
My current workaround is:
#!/usr/bin/python3
import os
from Bio import SeqIO
# pip install biopython
from collections import defaultdict
current_folder = os.path.dirname(__file__)
input_file = os.path.join(current_folder, "input.fasta")
output_file = os.path.join(current_folder, "output.fasta")
records = SeqIO.parse(input_file, "fasta")
records_map = defaultdict(str)
for record in records:
# kicking out from the end ".1" or ".2" in sequence name
proper_record_name = ".".join(record.name.split(".")[:-1])
# if such record exists, expend existing, otherwise : create the new one
records_map[proper_record_name] += str(record.seq)
with open(output_file, "w") as writer:
for key in records_map:
# placeholders based on template from original file and fixed information on the length
writer.write(f">{key} {key.split('.')[-1]} length={len(records_map[key])}
")
# write the glued sequence
writer.write(f"{records_map[key]}
")
It works, but I don't like processing such data as raw text (not sure if sequences in input_file
are ordered/organized properly ). Is there a better way?
Thanks!
question from:
https://stackoverflow.com/questions/65844537/how-to-merge-fasta-sequence-if-it-is-split-into-2-or-more-records 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…