Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
246 views
in Technique[技术] by (71.8m points)

python - How to merge fasta sequence if it is split into 2 (or more) records?

I'm working with fasta files, where one sequence seems to be split into 2 separate entities.

As example:

>SRR6026797.1.1 1 length=251
GGAGTGCCAGCAGCCGCGGTAATACGTAGGTGGCGAGCGTTGTCCGGAATCACTGGGCGTAAAGGGCGCGCAGGCGGCGTGTCAAGTCGGACGTGAAAACCCCTGGCTCAACTGGGGGATGTCGTTCGAAACTGGCATGCTTGAGTGCAGGAGAGGGAAGCGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAGGAACACCAGTGGCGAAGGCGGCTTCCTGGACTGTAACTGACGCTGAGGC
>SRR6026797.1.2 1 length=251
CCGGGAGGACTACAGCGGTATCTAATCCTGTTCGCTACCCACGCTTTCGTGCCTCAGCGTCAGTTACAGTCCAGGAAGCCGCCTTCGCCACTGGTGTTCCTCCCGATATCTACGCATTTCACCGCTACACCGGGAATTCCGCTTCCCTCTCCTGCACTCAAGCATGCCAGTTTCGAACGCCATCCCCCAGTTGAGCCAGGGGTTTTCACGTCCGACTTGACACGCCGCCTGCGCGCCCTTTACGCCCACTC
>SRR6026797.2.1 2 length=251
GGAGTGCCAGCAGCCGCGGTAACACGTAGGGGGCAAGCGTTGTCCGGAATCACTGGGCGTAAAGGGTGCGTAGGCGGTATGTCAAGTCAGGTGTGAAACCCCATGGCTTAACTGTGGGCTTGCACATGAAACTGGCATGCTTGAGTGCAGGAGAGGAAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGACTGACCCTGACGCTGATGT
>SRR6026797.2.2 2 length=251
CCGGGAGGACTACAAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCACATCAGCGTCAGGGTCAGTCCAGAAAGTCGCCTTCGCCACTGGTGTTCCTCCTAATATCTACGCATTTCACCGCTACACTAGGAATTCCGCTTTCCTCTCCTGCACTCAAGCATGCCAGTTTCATGTGCAAGCCCACAGTTAAGCCATGGGGTTTCACACCTGACTTGACATACCGCCTACGCACCCTTTACGCCCAGTG

However, it should be merged as:

>SRR6026797.1 1 length=502
GGAGTGCCAGCAGCCGCGGTAATACGTAGGTGGCGAGCGTTGTCCGGAATCACTGGGCGTAAAGGGCGCGCAGGCGGCGTGTCAAGTCGGACGTGAAAACCCCTGGCTCAACTGGGGGATGTCGTTCGAAACTGGCATGCTTGAGTGCAGGAGAGGGAAGCGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAGGAACACCAGTGGCGAAGGCGGCTTCCTGGACTGTAACTGACGCTGAGGCCCGGGAGGACTACAGCGGTATCTAATCCTGTTCGCTACCCACGCTTTCGTGCCTCAGCGTCAGTTACAGTCCAGGAAGCCGCCTTCGCCACTGGTGTTCCTCCCGATATCTACGCATTTCACCGCTACACCGGGAATTCCGCTTCCCTCTCCTGCACTCAAGCATGCCAGTTTCGAACGCCATCCCCCAGTTGAGCCAGGGGTTTTCACGTCCGACTTGACACGCCGCCTGCGCGCCCTTTACGCCCACTC
>SRR6026797.2 2 length=502
GGAGTGCCAGCAGCCGCGGTAACACGTAGGGGGCAAGCGTTGTCCGGAATCACTGGGCGTAAAGGGTGCGTAGGCGGTATGTCAAGTCAGGTGTGAAACCCCATGGCTTAACTGTGGGCTTGCACATGAAACTGGCATGCTTGAGTGCAGGAGAGGAAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGACTGACCCTGACGCTGATGTCCGGGAGGACTACAAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCACATCAGCGTCAGGGTCAGTCCAGAAAGTCGCCTTCGCCACTGGTGTTCCTCCTAATATCTACGCATTTCACCGCTACACTAGGAATTCCGCTTTCCTCTCCTGCACTCAAGCATGCCAGTTTCATGTGCAAGCCCACAGTTAAGCCATGGGGTTTCACACCTGACTTGACATACCGCCTACGCACCCTTTACGCCCAGTG

I wonder, what is the proper way to produce output file with merged sequences (probably, using biopython)

My current workaround is:

#!/usr/bin/python3

import os
from Bio import SeqIO
# pip install biopython
from collections import defaultdict

current_folder = os.path.dirname(__file__)
input_file = os.path.join(current_folder, "input.fasta")
output_file = os.path.join(current_folder, "output.fasta")

records = SeqIO.parse(input_file, "fasta")
records_map = defaultdict(str)

for record in records:
    # kicking out from the end ".1" or ".2" in sequence name
    proper_record_name = ".".join(record.name.split(".")[:-1])
    # if such record exists, expend existing, otherwise : create the new one
    records_map[proper_record_name] += str(record.seq)

with open(output_file, "w") as writer:
    for key in records_map:
        # placeholders based on template from original file and fixed information on the length
        writer.write(f">{key} {key.split('.')[-1]} length={len(records_map[key])}
")
        # write the glued sequence
        writer.write(f"{records_map[key]} 
")

It works, but I don't like processing such data as raw text (not sure if sequences in input_file are ordered/organized properly ). Is there a better way? Thanks!

question from:https://stackoverflow.com/questions/65844537/how-to-merge-fasta-sequence-if-it-is-split-into-2-or-more-records

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)
Waitting for answers

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...