Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.3k views
in Technique[技术] by (71.8m points)

bash - shuffle a file that is too large to fit in memory

I've got a file that's too large to fit in memory. shuf seems to run in RAM, and sort -R doesn't shuffle (identical lines end up next to each other; I need all of the lines to be shuffled). Are there any options other than rolling my own solution?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Using a form of decorate-sort-undecorate pattern and awk you can do something like:

$ seq 10 | awk 'BEGIN{srand();} {printf "%06d %s
", rand()*1000000, $0;}' | sort -n | cut -c8-
8
5
1
9
6
3
7
2
10
4

For a file, you would do:

$ awk 'BEGIN{srand();} {printf "%06d %s
", rand()*1000000, $0;}' SORTED.TXT | sort -n | cut -c8- > SHUFFLED.TXT

or cat the file at the start of the pipeline.

This works by generating a column of random numbers between 000000 and 999999 inclusive (decorate); sorting on that column (sort); then deleting the column (undecorate). That should work on platforms where sort does not understand numerics by generating a column with leading zeros for lexicographic sorting.

You can increase that randomization, if desired, in several ways:

  1. If your platform's sort understands numerical values (POSIX, GNU and BSD do) you can do awk 'BEGIN{srand();} {printf "%0.15f%s ", rand(), $0;}' FILE.TXT | sort -n | cut -f 2- to use a near double float for random representation.

  2. If you are limited to a lexicographic sort, just combine two calls to rand into one column like so: awk 'BEGIN{srand();} {printf "%06d%06d%s ", rand()*1000000,rand()*1000000, $0;}' FILE.TXT | sort -n | cut -f 2- which gives a composite 12 digits of randomization.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...