The awk '!x[$0]++'
trick is one of the most elegant solutions to de-duplicate a file or stream without sorting. However, it is inefficient in terms of memory and unsuitable for large files, since it saves all unique lines into memory.
However, a much more efficient implementation would be to save a constant-length hash representation of the lines in the array rather than the whole line. You can achieve this with Perl
in one line and it is quite similar to the awk
script.
perl -ne 'use Digest::MD5 qw(md5_base64); print unless $seen{md5_base64($_)}++' huge.txt
Here I used md5_base64
instead of md5_hex
because the base64 encoding takes 22 bytes, while the hex representation 32.
However, since the Perl implementation of hashes
still requires around 120bytes for each key, you may quickly run out of memory for your huge file.
The solution in this case is to process the file in chunks, splitting manually or using GNU Parallel with the --pipe, --keep-order and --block options (taking advantage of the fact that duplicate lines are not far apart, as you mentioned). Here is how you could do it with parallel
:
cat huge.txt | pv |
parallel --pipe --keep-order --block 100M -j4 -q
perl -ne 'use Digest::MD5 qw(md5_base64); print unless $seen{md5_base64($_)}++' > uniq.txt
The --block 100M
option tells parallel to process the input in chunks of 100MB. -j4
means start 4 processes in parallel. An important argument here is --keep-order
since you want the unique lines output to remain in the same order. I have included pv
in the pipeline to get some nice statistics while the long running process is executing.
In a benchmark I performed with a random-data 1GB file, I reached a 130MB/sec throughput with the above settings, meaning you may de-duplicate your 40GB file in 4 minutes (if you have a sufficiently fast hard disk able to write at this rate).
Other options include:
- Use an efficient trie structure to store keys and check for duplicates. For example a very efficient implementation is marisa-trie coded in C++ with wrappers in Python.
- Sort your huge file with an external merge sort or distribution/bucket sort
- Store your file in a database and use SELECT DISTINCT on an indexed column containing your lines or most efficiently md5_sums of your lines.
- Or use bloom filters
Here is an example of using the Bloom::Faster module of Perl:
perl -e 'use Bloom::Faster; my $f = new Bloom::Faster({n => 100000000, e => 0.00001}); while(<>) { print unless $f->add($_); }' huge.txt > uniq.txt
You may install Bloom::Faster
from cran
(sudo cran
and then install "Bloom::Faster"
)
Explanation:
- You have to specify the probabilistic error rate
e
and the number of available buckets n
. The memory required for each bucket is about 2.5 bytes. If your file has 100 million unique lines then you will need 100 million buckets and around 260MB of memory.
- The
$f->add($_)
function adds the hash of a line to the filter and returns true
if the key (i.e. the line here) is a duplicate.
- You can get an estimation of the number of unique lines in your file, parsing a small section of your file with
dd if=huge.txt bs=400M count=1 | awk '!a[$0]++' | wc -l
(400MB) and multiplying that number by 100 (40GB). Then set the n
option a little higher to be on the safe side.
In my benchmarks, this method achieved a 6MB/s processing rate. You may combine this approach with the GNU parallel
suggestion above to utilize multiple cores and achieve a higher throughput.