Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


Negative lookahead regex on Elasticsearch

I'm trying to use a negative lookahead in an Elasticsearch query. The regex is:

(?!.*charge)(?!.*encode)(?!.*relate).*night.*

the text that I'm matching against is:

credited back on night stay, still having issues with construction. causing health issues due to a chemical being sprayed and causes eyes to irritated.

I haven't had any luck. Can someone give me a hand?

ES query:

  "query": {
    "filtered": {
      "query": {
        "bool": {
          "must_not": [
            {
              "regexp": {
                "message": {
                  "value": "(?!.*charge)(?!.*encode)(?!.*relate).*night.*",
                  "flags_value": 65535
                }
              }
            }
          ]
        }
      },
      "filter": {
        "match": {
          "resNb": {
            "query": "462031152161",
            "type": "boolean"
          }
        }
      }
    }
  }


1 Reply


Solution

You can solve the issue with either of these two patterns:

"value": "~(charge|encode|relate)night~(charge|encode|relate)",

or

.*night.*&~(.*(charge|encode|relate).*)

The following setting is optional, since it is on by default:

"flags" : "ALL"
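
As a rough sketch of how these pieces fit together, the snippet below builds a regexp clause around the second pattern and prints it as JSON. The helper name build_regexp_clause is hypothetical, the field name message is taken from the question, and the resulting clause would still have to be placed inside a complete query body.

```python
import json

# Intersection pattern from the answer: contains "night" AND does not
# contain any of the excluded words.
PATTERN = ".*night.*&~(.*(charge|encode|relate).*)"


def build_regexp_clause(field, pattern, flags="ALL"):
    """Return a regexp clause as a plain dict (hypothetical helper)."""
    return {"regexp": {field: {"value": pattern, "flags": flags}}}


# "message" is the field used in the question's query.
clause = build_regexp_clause("message", PATTERN)
print(json.dumps(clause, indent=2))
```

Building the clause as a dict and serializing it avoids hand-escaping the pattern inside a JSON string.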

How does it work?

In common NFA regular expressions, you usually use negative lookarounds, written (?!...) or (?<!...), to restrict a more generic pattern. Elasticsearch regular expressions do not support lookarounds, so you need to use its optional operators instead.

The ~ (tilde) is the complement operator, which negates the atom right after it. An atom is either a single symbol or a parenthesized group of subpatterns or alternatives.

Note that all ES patterns are anchored at the start and end of the string by default, so you never need the ^ and $ anchors common in Perl-like, .NET, and other NFA regex flavors.
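
The implicit anchoring can be pictured with Python's re module, used here only as an analogy: re.fullmatch behaves like an anchored pattern, while re.search behaves like an unanchored one. The shortened sample text is an assumption made for illustration.

```python
import re

text = "credited back on night stay"

# Unanchored search finds "night" anywhere in the text.
found_anywhere = re.search(r"night", text) is not None

# fullmatch requires the pattern to cover the whole string, which mirrors
# the implicit start/end anchoring described above.
covers_whole = re.fullmatch(r"night", text) is not None

# Wrapping the word in .* restores the "contains" behaviour.
covers_with_wildcards = re.fullmatch(r".*night.*", text) is not None

print(found_anywhere, covers_whole, covers_with_wildcards)
```

This is why both solutions spell out what may surround night instead of relying on a partial match.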

Thus,

  • ~(charge|encode|relate) - matches any text from the start of the string that is not exactly charge, encode, or relate
  • night - matches the word night
  • ~(charge|encode|relate) - matches any text up to the end of the string that is not exactly one of the 3 substrings.

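In an NFA regex like Perl, the closest counterpart is a tempered greedy token. It is stricter than the complement: it rejects the words anywhere before or after night, whereas ~(...) rejects only text that is exactly one of them, so prefer the second pattern if the words may appear anywhere in the text.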

/^(?:(?!charge|encode|relate).)*night(?:(?!charge|encode|relate).)*$/
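
A small sketch of the tempered greedy token in Python's re module, which supports lookaheads. This only exercises the Perl-style pattern above, not the Elasticsearch syntax; the second test string is made up for illustration.

```python
import re

# Tempered greedy token: each "." is consumed only if none of the excluded
# words starts at that position.
TEMPERED = re.compile(
    r"^(?:(?!charge|encode|relate).)*night(?:(?!charge|encode|relate).)*$"
)

# Text from the question: contains "night" and none of the excluded words.
sample = (
    "credited back on night stay, still having issues with construction. "
    "causing health issues due to a chemical being sprayed and causes "
    "eyes to irritated."
)

# Hypothetical counter-example containing an excluded word.
blocked = "no charge for the night stay"

sample_matches = TEMPERED.search(sample) is not None
blocked_matches = TEMPERED.search(blocked) is not None

print(sample_matches, blocked_matches)
```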

The second pattern is trickier. Common NFA regexes usually do not jump from location to location when matching, so lookaheads anchored at the start of the text are commonly used. Here, the INTERSECTION operator (&) lets us combine two patterns, both of which must match the whole string.

  • .*night.* - matches the whole line with night in it (as . matches any symbol but a newline; otherwise, use (.|\n)*)
  • & - and (both sides must match)
  • ~(.*(charge|encode|relate).*) - matches a line that does not contain charge, encode, or relate as a substring.

An NFA Perl-like regex would look like this:

/^(?!.*(charge|encode|relate)).*night.*$/
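
To illustrate the idea, here is a sketch that emulates the intersection in Python as two whole-string checks and compares it with the lookahead regex above. It assumes that A&~B means "the whole string matches A and the whole string does not match B"; the function names and test strings are hypothetical.

```python
import re

CONTAINS_NIGHT = re.compile(r".*night.*")
CONTAINS_EXCLUDED = re.compile(r".*(charge|encode|relate).*")
LOOKAHEAD = re.compile(r"^(?!.*(charge|encode|relate)).*night.*$")


def matches_intersection(text):
    """Emulate A&~B: text fully matches A and does not fully match B."""
    return (
        CONTAINS_NIGHT.fullmatch(text) is not None
        and CONTAINS_EXCLUDED.fullmatch(text) is None
    )


def matches_lookahead(text):
    """Apply the Perl-like lookahead pattern to single-line text."""
    return LOOKAHEAD.search(text) is not None


for text in ["night stay", "charge at night", "night related fees", "a quiet day"]:
    print(text, matches_intersection(text), matches_lookahead(text))
```

On single-line input the two functions should agree, which is the sense in which the intersection pattern stands in for the lookahead.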
