Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
144 views
in Technique[技术] by (71.8m points)

python - Filtering text file based on values in some lines using awk

I am dealing with a text file and each record in the file is separated by blank line. I want to extract the records which meats certain criteria.

For example, my text file looks like this

#EVM predictionEVM prediction: Mode:STANDARD S-ratio: 2.52 11043-11477 orient(-) score(1246.00)
11477   11043   single- 4   6   {SNAP_model.scaffold6_size143996-snap.2;SNAP

#EVM prediction: Mode:STANDARD S-ratio: 1.00 20968-21183 orient(+) score(432.00)
20968   21183   single+ 1   3   {GeneID_mRNA_scaffold6_size143996_6;GeneID}

#EVM prediction: Mode:STANDARD S-ratio: 1.00 21940-22362 orient(-) score(846.00)
22362   21940   single- 4   6   {GeneID_mRNA_scaffold6_size143996_7;GeneID}

#EVM prediction: Mode:STANDARD S-ratio: 12.32 33363-34677 orient(+) score(21500.00)
33363   33495   initial+    1   1   {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus}
33496   33611   INTRON          {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus},{ev_type:GeMoMa/ID=model.scaffold6_size143996.rna-XM_007036272.2_R0;GeMoMa}
33612   33741   internal+   2   2   {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus},{ev_type:GeMoMa/ID=model.scaffold6_size143996.rna-XM_007036272.2_R0;GeMoMa}
33742   33842   INTRON          {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus},{ev_type:GeMoMa/ID=model.scaffold6_size143996.rna-XM_007036272.2_R0;GeMoMa}
33843   34677   terminal+   3   3   {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus}

#EVM prediction: Mode:STANDARD S-ratio: 2.41 46394-48564 orient(-) score(9677.00) noncoding_equivalent(4012.03) raw_noncoding(7194.39) offset(3182.36) 
46879   46394   terminal-   4   6   {GeneID_mRNA_scaffold6_size143996_13;GeneID}
47512   46880   INTRON          {GeneID_mRNA_scaffold6_size143996_13;GeneID}
48256   47513   internal-   4   6   {GeneID_mRNA_scaffold6_size143996_13;GeneID}
48366   48257   INTRON          {Augustus_model.g41.t1;Augustus}
48429   48367   internal-   4   6   {Augustus_model.g41.t1;Augustus}
48510   48430   INTRON          {Augustus_model.g41.t1;Augustus}
48564   48511   initial-    4   6   {Augustus_model.g41.t1;Augustus}

Now, I want to extract the records with score greater 1000. I want to remove second and third record which has sccore-432 score(432.00)and score-846 score(846.00)

I have written awk code

awk -F '[()]' '{if ($4 > 1000) print $0}' input.out

but it is giving only first line as output. i.e

#EVM predictionEVM prediction: Mode:STANDARD S-ratio: 2.52 11043-11477 orient(-) score(1246.00)
#EVM prediction: Mode:STANDARD S-ratio: 12.32 33363-34677 orient(+) score(21500.00)
#EVM prediction: Mode:STANDARD S-ratio: 2.41 46394-48564 orient(-) score(9677.00) noncoding_equivalent(4012.03) raw_noncoding(7194.39) offset(3182.36) 

But I want to extract complete record corresponding to the score greater than 1000. Please help to extract complete record

question from:https://stackoverflow.com/questions/65662237/filtering-text-file-based-on-values-in-some-lines-using-awk

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

You may use this awk with an empty RS and match function:

awk -v RS= 'match($0, /score([^)]+)/) && substr($0, RSTART+6, RLENGTH-7)+0 > 1000 {ORS = RT; print}' file

#EVM predictionEVM prediction: Mode:STANDARD S-ratio: 2.52 11043-11477 orient(-) score(1246.00)
11477   11043   single- 4   6   {SNAP_model.scaffold6_size143996-snap.2;SNAP

#EVM prediction: Mode:STANDARD S-ratio: 12.32 33363-34677 orient(+) score(21500.00)
33363   33495   initial+    1   1   {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus}
33496   33611   INTRON          {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus},{ev_type:GeMoMa/ID=model.scaffold6_size143996.rna-XM_007036272.2_R0;GeMoMa}
33612   33741   internal+   2   2   {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus},{ev_type:GeMoMa/ID=model.scaffold6_size143996.rna-XM_007036272.2_R0;GeMoMa}
33742   33842   INTRON          {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus},{ev_type:GeMoMa/ID=model.scaffold6_size143996.rna-XM_007036272.2_R0;GeMoMa}
33843   34677   terminal+   3   3   {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus}

#EVM prediction: Mode:STANDARD S-ratio: 2.41 46394-48564 orient(-) score(9677.00) noncoding_equivalent(4012.03) raw_noncoding(7194.39) offset(3182.36)
46879   46394   terminal-   4   6   {GeneID_mRNA_scaffold6_size143996_13;GeneID}
47512   46880   INTRON          {GeneID_mRNA_scaffold6_size143996_13;GeneID}
48256   47513   internal-   4   6   {GeneID_mRNA_scaffold6_size143996_13;GeneID}
48366   48257   INTRON          {Augustus_model.g41.t1;Augustus}
48429   48367   internal-   4   6   {Augustus_model.g41.t1;Augustus}
48510   48430   INTRON          {Augustus_model.g41.t1;Augustus}
48564   48511   initial-    4   6   {Augustus_model.g41.t1;Augustus}

A more readable version:

awk -v RS= '
match($0, /score([^)]+)/) && substr($0, RSTART+6, RLENGTH-7)+0 > 1000 {
   ORS = RT
   print
}' file

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...