Filtering a txt file in Python


posted Jan 31, 2022 in Technique[技术] by 深蓝 (71.8m points)

I have a file with many lines like this:

>ENSG00000100206|ENST00000216024|DMC1|2371|38568257;38570043|38568289;38570286
CTCAGACGTCGGGCCGACGCAAGGCCACGCGCGCGAACACACAGGTGCGGCCCCGGGCCA
CACGCACACCGTACAC
>ENSG00000001630|ENST00000003100|CYP51A1|3210|92134365|92134530
TATATCACAGTTTCTTTCTTTTTTTTTTTTTTTTTTTTGAGACAGAGTTTTGCTCTTGTT
GCCCAGGCTGGAGTACAGTGACGCAATCTCGGCTCACTGCAACCTTTGCCTCCCAGGTTC
>ENSG00000100206|ENST00000216024|DMC1|2371|38568257;38570043|38568289;38570286
TTAACTATAATCCCACTGCCTATTTTTTTATTTCTAAAAATATCATAAAAAGACACAAAA

Each line starting with > is an identifier, and the lines that follow it are its sequence; every identifier has its own sequence. In the example above, ENSG00000100206 is the name and ENST00000216024 is the isoform. My file contains some identifier lines that share the same name while everything else differs. For each name I would like to keep only the longest sequence and write the result to a new file, so that each name appears exactly once (with its longest sequence). For the example above, the result would look like this:

>ENSG00000100206|ENST00000216024|DMC1|2371|38568257;38570043|38568289;38570286
CTCAGACGTCGGGCCGACGCAAGGCCACGCGCGCGAACACACAGGTGCGGCCCCGGGCCA
CACGCACACCGTACAC
>ENSG00000001630|ENST00000003100|CYP51A1|3210|92134365|92134530
TATATCACAGTTTCTTTCTTTTTTTTTTTTTTTTTTTTGAGACAGAGTTTTGCTCTTGTT
GCCCAGGCTGGAGTACAGTGACGCAATCTCGGCTCACTGCAACCTTTGCCTCCCAGGTTC

Do you know how to do that in Python?

1 Reply

深蓝 · Answer 1 · 2022-01-31T07:26:50+0000

You can start by using Biopython to get a proper FASTA format parser: http://biopython.org/wiki/SeqIO

Then iterate over the records and do what you want with them. This not only saves you the time of writing a parser, it also keeps you from getting the parsing wrong.

Example from that very page:

from Bio import SeqIO
for record in SeqIO.parse("example.fasta", "fasta"):
    print(record.id)

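Instead of printing, build a dict keyed by the name (the part of record.id before the first |) whose value is the longest record seen so far for that name, and replace an entry only when the new record's sequence is longer. Note that a record has no length attribute; use len(record.seq) instead.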
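
The approach described above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the function names longest_per_name and filter_fasta are made up for illustration, and it assumes the name is always the first |-separated field of record.id. When two sequences for the same name have equal length, the one seen first is kept.

```python
def longest_per_name(records):
    """Return {name: record}, keeping the longest-sequence record per name.

    `records` is any iterable of objects with `.id` and `.seq`, such as the
    SeqRecord objects yielded by SeqIO.parse.
    """
    best = {}
    for record in records:
        # Assumption: the name (e.g. ENSG00000100206) is the first "|" field.
        name = record.id.split("|")[0]
        # Replace only on a strictly longer sequence, so ties keep the first record.
        if name not in best or len(record.seq) > len(best[name].seq):
            best[name] = record
    return best


def filter_fasta(in_path, out_path):
    """Write the longest record per name from in_path to out_path."""
    # Imported here so longest_per_name stays usable without Biopython installed.
    from Bio import SeqIO

    best = longest_per_name(SeqIO.parse(in_path, "fasta"))
    # Dicts preserve insertion order, so names appear in order of first occurrence.
    return SeqIO.write(list(best.values()), out_path, "fasta")
```

A call such as filter_fasta("example.fasta", "longest.fasta") would then produce the filtered file; the output file name here is only a placeholder.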
