regex - Obtain patterns in one file from another using ack or awk or better way than grep?

Question

posted Oct 17, 2021 in Technique by 深蓝 (71.8m points)

Is there a way to use ack to search one file for the patterns listed in another file, the way grep's -f option does? I see that ack has an -f option, but it behaves differently from the -f in grep.

Perhaps an example will give you a better idea. Suppose I have file1:

file1:
a
c
e

And file2:

file2:
a  1
b  2
c  3
d  4
e  5

And I want to obtain all the patterns in file1 from file2 to give:

a  1
c  3
e  5

Can ack do this? If not, is there a better way to handle the job (such as awk or a hash)? I have millions of records in both files and really need an efficient way to do this. Thanks!

1 Reply

深蓝 · Answer 1 · 2021-10-17T03:05:47+0000

Here's a Perl one-liner that uses a hash to hold the set of wanted keys from file1, giving amortized O(1) lookups for each line of file2. So it will run in O(m+n) time, where m is the number of lines in your key set and n is the number of lines in the file you're testing.

perl -ne'BEGIN{open K,shift@ARGV;chomp(@a=<K>);@hash{@a}=()}m/^(\p{alpha}+)\s/&&exists$hash{$1}&&print' tkeys file2

The key set will be held in memory while file2 is tested line by line against the keys. Here tkeys is the key file (file1 in the question).

Here's the same thing using Perl's -a command line option:

perl -ane'BEGIN{open G,shift@ARGV;chomp(@a=<G>);@h{@a}=();}exists$h{$F[0]}&&print' tkeys file2

The second version is probably a little easier on the eyes. ;)

One thing to remember here is that you're more likely to be I/O bound than CPU bound, so the goal should be to minimize I/O. Holding the entire lookup key set in a hash gives amortized O(1) lookups. The advantage over some (slower) solutions is that they have to run through your key file (file1) once for each line of file2. That sort of solution is O(m*n), where m is the size of your key file and n is the size of file2. The hash approach, on the other hand, runs in O(m+n) time. For large inputs that is a difference of orders of magnitude. It benefits from eliminating linear searches through the key set, and further from reading the keys from disk only once.
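
Since the question also asks about awk, the same hash-lookup idea can be sketched there. This is only a sketch under some assumptions: the key is the first whitespace-separated field of each line (as in the -a version above), the key file is not empty (the NR == FNR idiom misbehaves if it is), and filter_by_keys is a hypothetical helper name introduced just for illustration.

```shell
# filter_by_keys KEYFILE DATAFILE   (hypothetical helper name)
# Prints each line of DATAFILE whose first whitespace-separated field
# appears as the first field of some line in KEYFILE.
filter_by_keys() {
    awk '
        NR == FNR { keys[$1]; next }   # first file: store each key in an associative array
        $1 in keys                     # second file: print lines whose first field is a stored key
    ' "$1" "$2"
}

# Demo using the sample data from the question, written to a temp directory.
tmpdir=$(mktemp -d)
printf 'a\nc\ne\n' > "$tmpdir/file1"
printf 'a  1\nb  2\nc  3\nd  4\ne  5\n' > "$tmpdir/file2"
filter_by_keys "$tmpdir/file1" "$tmpdir/file2"
rm -rf "$tmpdir"
```

On the sample files this prints the lines for a, c and e. As with the Perl versions, the key file is read once into memory and the data file is streamed line by line, so the work is O(m+n).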
