Here's a Perl one-liner that uses a hash to hold the set of wanted keys from file1, giving O(1) amortized lookups as it iterates over the lines of file2. So it runs in O(m+n) time, where m is the number of lines in your key set and n is the number of lines in the file you're testing.
perl -ne'BEGIN{open K,shift@ARGV;chomp(@a=<K>);@hash{@a}=()}m/^(\p{alpha}+)\s/&&exists$hash{$1}&&print' tkeys file2
The key set will be held in memory while file2 is tested line by line against the keys.
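If the one-liner is hard to follow, here's roughly what it does written out as a plain script (a sketch only; the file handling and variable names here are illustrative, not part of the one-liner itself):

    #!/usr/bin/env perl
    # Written-out sketch of the one-liner above.
    use strict;
    use warnings;

    my ( $key_file, $data_file ) = @ARGV;

    # Build the lookup hash once: every key from the key file becomes a hash key.
    open my $kfh, '<', $key_file or die "Can't open $key_file: $!";
    chomp( my @keys = <$kfh> );
    close $kfh;

    my %wanted;
    @wanted{@keys} = ();    # hash slice: only key existence matters, values stay undef

    # Stream the data file; print lines whose leading word is in the key set.
    open my $dfh, '<', $data_file or die "Can't open $data_file: $!";
    while ( my $line = <$dfh> ) {
        if ( $line =~ /^(\p{Alpha}+)\s/ and exists $wanted{$1} ) {
            print $line;
        }
    }
    close $dfh;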
Here's the same thing using Perl's -a command line option:
perl -ane'BEGIN{open G,shift@ARGV;chomp(@a=<G>);@h{@a}=();}exists$h{$F[0]}&&print' tkeys file2
The second version is probably a little easier on the eyes. ;)
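With -a, Perl autosplits each input line on whitespace into @F, so $F[0] is the first field and no explicit regex capture is needed. Roughly speaking, -n and -a wrap the one-liner's body in something like this (a hand-written sketch, not exact deparse output):

    while (defined($_ = <>)) {
        our @F = split ' ', $_;       # -a: autosplit on whitespace, leading whitespace skipped
        exists $h{$F[0]} && print;    # the one-liner's body
    }

If you want to see the exact expansion for your Perl version, run the one-liner through B::Deparse with -MO=Deparse.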
One thing to remember here is that you're more likely to be IO-bound than processor-bound, so the goal should be to minimize IO. Holding the entire lookup key set in a hash gives O(1) amortized lookups. The advantage this solution may have over others is that some (slower) solutions have to run through your key file (file1) once for every line of file2; that sort of solution is O(m*n), where m is the number of lines in the key file and n is the number of lines in file2. The hash approach, by contrast, runs in O(m+n) time. With, say, 10,000 keys and a million data lines, that's the difference between on the order of ten billion comparisons and roughly a million hash lookups. It wins by eliminating linear searches through the key set, and it wins again by reading the keys from disk only once.
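For contrast, here's a sketch of the kind of O(m*n) approach that paragraph is talking about, where the whole key list is rescanned for every line of file2 (illustrative only; this is the approach to avoid):

    #!/usr/bin/env perl
    # Naive O(m*n) version for comparison: linear scan of all keys per data line.
    use strict;
    use warnings;

    my ( $key_file, $data_file ) = @ARGV;

    open my $kfh, '<', $key_file or die "Can't open $key_file: $!";
    chomp( my @keys = <$kfh> );
    close $kfh;

    open my $dfh, '<', $data_file or die "Can't open $data_file: $!";
    while ( my $line = <$dfh> ) {
        my ($first) = $line =~ /^(\p{Alpha}+)\s/ or next;
        for my $key (@keys) {           # linear scan: up to m comparisons per line
            if ( $first eq $key ) {
                print $line;
                last;
            }
        }
    }
    close $dfh;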