bash - Sampling without replacement using awk

posted Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I have a lot of text files that look like this:

>ALGKAHOLAGGATACCATAGATGGCACGCCCT
>BLGKAHOLAGGATACCATAGATGGCACGCCCT
>HLGKAHOLAGGATACCATAGATGGCACGCCCT
>DLGKAHOLAGGATACCATAGATGGCACGCCCT
>ELGKAHOLAGGATACCATAGATGGCACGCCCT
>FLGKAHOLAGGATACCATAGATGGCACGCCCT
>JGGKAHOLAGGATACCATAGATGGCACGCCCT
>POGKAHOLAGGATACCATAGATGGCACGCCCT

Is there a way to do a sampling without replacement using awk?

For example, I have these 8 lines and want to randomly sample only 4 of them into a new file, without replacement. The output should look something like this:

>FLGKAHOLAGGATACCATAGATGGCACGCCCT
>POGKAHOLAGGATACCATAGATGGCACGCCCT
>ALGKAHOLAGGATACCATAGATGGCACGCCCT
>BLGKAHOLAGGATACCATAGATGGCACGCCCT

Thanks in advance

1 Reply

深蓝 · Answer 1 · 2021-10-23T18:21:37+0000

How about this to randomly sample roughly 10% of your lines?

awk 'rand()>0.9' yourfile1 yourfile2 anotherfile

I am not sure what you mean by "replacement"... there is no replacement occurring here, just random selection.

It looks at each line of each file exactly once and generates a random number in the interval [0, 1). If the number is greater than 0.9, the line is printed. In effect it rolls a 10-sided die for each line and prints the line only if the die comes up 10. Because each line is decided independently, the number of lines printed varies from run to run and averages about 10% of the input; it is not a fixed count. No line can be printed twice, unless it occurs twice in your files, of course.

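Without a seed, many awk implementations (gawk, for example) produce the same rand() sequence on every run, so you can add an srand() at the start, as suggested by @klashxx, to seed the generator from the current time: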

awk 'BEGIN{srand()} rand()>0.9' yourfile1 yourfile2 anotherfile
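If you need exactly 4 lines rather than roughly 10%, one possible approach is reservoir sampling, which still reads each line only once. This is a different technique from the one-liner above, and the following is a minimal sketch rather than a definitive implementation. The function name sample_k and the array name pool are hypothetical names chosen for illustration.

```shell
# sample_k K FILE...: print exactly K lines (or all lines, if there are fewer
# than K), chosen uniformly at random without replacement.
# "sample_k" and "pool" are illustrative names, not part of awk.
sample_k() {
    k=$1
    shift
    awk -v k="$k" '
        BEGIN { srand() }                    # seed from the current time
        NR <= k { pool[NR] = $0; next }      # fill the reservoir with the first k lines
        {
            j = int(rand() * NR) + 1         # uniform integer in 1..NR
            if (j <= k) pool[j] = $0         # keep this line with probability k/NR
        }
        END {
            n = (NR < k) ? NR : k
            for (i = 1; i <= n; i++) print pool[i]
        }
    ' "$@"
}
```

For example, sample_k 4 yourfile1 > sample1 writes 4 distinct lines of yourfile1 to sample1. When several files are given, NR keeps counting across them, so the 4 lines are drawn from all files combined; to sample each file separately, call the function once per file. The chosen set is uniform, but the output order is not shuffled. Since srand() seeds from the time in seconds, calls made within the same second may repeat the same random sequence.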
