Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

binary search in an array in Perl

I have an array of hex numbers, and I need to go over other numbers and check if they appear in that array. Right now I'm using a foreach loop that goes over the entire array each time. Is there a way to make it faster by sorting the array first and then doing a binary search on it?

The code at the moment:

sub is_bad_str {
    my ($str, @keys) = @_;
    my $flag = 0;
    my $hex_num;
    if ($str =~ m/14'h([0-9a-f]{4})/) { #'# fixes bad highlighting
        $hex_num = $1;
    }
    if (defined $hex_num) {
        foreach my $key (@keys) {
            if ($hex_num =~ /\Q$key\E/i) {
                $flag = 1;
                last;
            }
        }
    }
    if (($flag == 0) && (defined $hex_num)) {
        return 1; # Bad str
    } else {
        return 0; # Good str
    }
}
1 Reply


There are four strategies for doing efficient bulk search of a data set in Perl.

The full analysis is outlined below, but in summary: on an average random data set with a significant number of searches, the best performance is of course offered by hash lookups, followed by the much worse BST.


  1. Binary (half-interval) search of an array.

    This is obviously a standard algorithmic approach.

    Performance Costs:

    • O(N * log N) for initial sorting.
    • O(N) on average for inserting/removing data in the list once sorted. Perl arrays are NOT linked lists so it's not an O(log N).
    • O(log N) for each search.

    Implementation: the algorithm is so simple that DIY is easy. As usual, CPAN modules exist and should probably be used instead of DIY anyway: Search::Binary.
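If you do roll your own, a minimal sketch looks like this (sub and variable names are my own; string comparison is used since the keys here are hex strings):

```perl
use strict;
use warnings;

# Binary (half-interval) search over a sorted array of strings.
# Returns the index of $target in @$sorted, or -1 if absent.
sub binary_search {
    my ($sorted, $target) = @_;
    my ($lo, $hi) = (0, $#$sorted);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my $cmp = $sorted->[$mid] cmp $target;
        if    ($cmp < 0) { $lo = $mid + 1 }
        elsif ($cmp > 0) { $hi = $mid - 1 }
        else             { return $mid }
    }
    return -1;
}

# Pay O(N log N) once for the sort, then O(log N) per lookup.
my @keys = sort ('beef', '00ff', 'cafe', '1a2b');
print binary_search(\@keys, 'cafe') >= 0 ? "hit\n" : "miss\n";
```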


  2. Binary Search Trees (BSTs)

    Performance Costs:

    • O(N * log N) for initial tree construction.
    • O(log N) on average for inserting/removing data once the tree is built.
    • O(log N) for each search.


    Implementation: several flavors exist on CPAN: Tree::Binary::Search, Tree::Treap, Tree::RedBlack. The latter two have better average performance and smaller performance fluctuations, algorithmically.

    Comparison: If the data WILL change, you must use a BST to avoid re-sort costs. If your data is random and never changes once sorted, you can use simple binary search instead of a BST, but a BST can be tuned better if every last ounce of performance matters: when you know your lookup costs from the data distribution, a BST can be optimized for faster average searches than binary search over a list (see Wikipedia's "Optimal binary search tree" article), and some distributions favor one of the special trees such as Treap or Red/Black.
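For illustration only, the core of a BST is a few lines of nested hashes. This toy version (names my own) does no rebalancing, so a pathological insert order degrades it to O(N) per search, which is exactly what the Treap and Red/Black modules above avoid:

```perl
use strict;
use warnings;

# Toy unbalanced BST with hash-based nodes: { key => ..., l => ..., r => ... }.
# Use Tree::Treap or Tree::RedBlack in real code for guaranteed balance.
sub bst_insert {
    my ($node, $key) = @_;
    return { key => $key } unless $node;
    my $cmp = $key cmp $node->{key};
    if    ($cmp < 0) { $node->{l} = bst_insert($node->{l}, $key) }
    elsif ($cmp > 0) { $node->{r} = bst_insert($node->{r}, $key) }
    return $node;   # duplicate keys are ignored
}

sub bst_find {
    my ($node, $key) = @_;
    while ($node) {
        my $cmp = $key cmp $node->{key};
        return 1 if $cmp == 0;
        $node = $cmp < 0 ? $node->{l} : $node->{r};
    }
    return 0;
}

my $root;
$root = bst_insert($root, $_) for qw(beef cafe 1a2b 00ff);
print bst_find($root, '1a2b') ? "hit\n" : "miss\n";
```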


  3. Abbreviated (short circuited) scan lookups.

    These are linear scan searches on an un-sorted list that STOP searching once the item is found.

    Performance: O(N) per search for random data, but a faster O(N) (say, N/2 on average) than a full-list search like grep. No setup costs.

    Implementation: There are 3 ways to do them in Perl:

    • The smart match operator (~~). The problem is that it is ONLY available in Perl 5.10 and above.
    • Your own loop that does last; once the item is found.
    • List::MoreUtils module's first() subroutine.
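The DIY-loop flavor is a one-liner sub (names my own); the first() version reads much the same but runs in XS:

```perl
use strict;
use warnings;

# Linear scan that short circuits: stop at the first match.
sub in_list {
    my ($target, $list) = @_;
    for my $item (@$list) {
        return 1 if $item eq $target;   # found -- stop scanning
    }
    return 0;                           # scanned the whole list: absent
}

# Same idea with List::MoreUtils (not in core):
#   use List::MoreUtils qw(first);
#   my $found = defined first { $_ eq $target } @list;

my @keys = ('00ff', '1a2b', 'beef');
print in_list('1a2b', \@keys) ? "hit\n" : "miss\n";
```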

    Comparison:

    • First, between the 3 implementations above, List::MoreUtils::first is faster than the DIY loop because it's implemented in XS; so it should be used in Perl versions before 5.10. Smart match is probably just as fast, though I would benchmark the two before you pick one or the other in Perl 5.10+.

    • Second, comparing short circuited search to other methods, there are only 3 edge cases where it should be used:

      A. Memory constraints. Sorted-list search, BSTs, and hash lookups all have a memory footprint of at the very least 2*N. If you face a memory constraint (given your list size) severe enough that N vs 2*N memory becomes a non-negotiable cost barrier, then you use short circuited search and pay the performance penalty in time. This is especially true when you're processing a large data set in batches/line-by-line, so as to avoid storing the whole thing in memory in the first place.

      B. If your data is distributed and pre-sorted in such a way that a VAST majority of the searches would find their quarry in the very beginning of the list. If that's the case, it MAY outperform the fancier methods like BST or binary search despite their obviously faster O(log N) average searches. It'd still be hard to outperform hash lookups, but more on that later.

      C. Short circuited search is superior to BSTs or sorted list searches if the # of searches performed is fairly small compared to the list size, in which case the initial sorting cost of the first 2 methods (O(N log N)) would outweigh the search savings. Since the savings of BST vs. linear search are O(M * N) where M is # of searches, it follows that # of searches M must be less than O(log N) to realize the savings on average, but may be more in the second edge case where average scan cost is less than O(N) due to data distribution.


  4. Hash lookups

    Performance Costs:

    • O(N) + epsilon for initial hash creation. (It's not strictly speaking O(N) for a random large data set, due to possible key collisions. I don't know enough about Perl's hash implementation to clarify this, other than to state that collisions CAN become a concern for any hashmap.)
    • O(1) on average for inserting/removing data once the hash is built (plus the same epsilon as initial hash creation, due to key collisions).
    • O(1) for each search (plus same epsilon).

    Implementation:

    my %lookup = map { $_ => 1 } @list;

    my %lookup2;
    @lookup2{@list} = ();   # Alternative: hash slice with undef values

    sub find { return exists $lookup{$_[0]}; }


    Comparison:

    • First, the same logic applies to comparing short circuited search with hash lookups as with BST vs short-circuited search. E.g., you should ALMOST always use hashmaps over linear search, with the exception of the same two edge cases (data set is such that average list scan becomes O(1) instead of O(N) and the ratio of # of searches to the data set size makes the aggregate cost of searches less than O(N) needed for hash creation).

    • Second, hashmaps ON AVERAGE are obviously much faster than BSTs or binary list search. The only possible edge case here is that you somehow stumble into a data set that manages to overload the buckets and turn that extra "epsilon" cost into a large enough weight that it starts under-performing O(log N).

      I strongly doubt that it is even remotely possible, but again, don't know enough about Perl's implementation of hashmaps to prove that it would never happen for even the worst data set.
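Whichever pair of strategies you end up weighing, the core Benchmark module settles it empirically. A sketch comparing a short circuited scan against a hash lookup (list size, probe choice, and iteration count are arbitrary; your numbers will depend entirely on your data):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Random 4-digit hex keys, like the 14'hXXXX values in the question.
my @list   = map { sprintf '%04x', int rand 0xffff } 1 .. 2000;
my %lookup = map { $_ => 1 } @list;
my $probe  = $list[-1];   # near-worst case for the linear scan

cmpthese(1000, {
    linear => sub {
        my $found = 0;
        for (@list) { $found = 1, last if $_ eq $probe }
    },
    hash   => sub { my $found = exists $lookup{$probe} },
});
```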

