Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

java - A bug in a regex in JDK 8?

I have this working reference Perl script; its regex was copied from a Java snippet that isn't giving the expected results:

my $regex = '^[AT]-([A-Z0-9]{4})-([A-Z0-9]{4})(?:-([A-Z0-9]{4}))*-([A-F0-9]{8}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{12})$';
if ("A-PROD-COMP-LOGL-00000000-0000-8033-0000-000200354F0A" =~ /$regex/)
{
    print "Matches 1=$1 2=$2 3=$3 4=$4\n";
}

This correctly outputs:

Matches 1=PROD 2=COMP 3=LOGL 4=00000000-0000-8033-0000-000200354F0A

Now the equivalent Java snippet:

private static final String NON_SYSTEM_TYPE_REGEX = "^[AT]-([A-Z0-9]{4})-([A-Z0-9]{4})(?:-([A-Z0-9]{4}))*-([A-F0-9]{8}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{12})$";
private static final Pattern NON_SYSTEM_TYPE_PATTERN = Pattern.compile(MutableUniqueIdentity.NON_SYSTEM_TYPE_REGEX);
    ...

final Matcher match = MutableUniqueIdentity.NON_SYSTEM_TYPE_PATTERN.matcher(uniqueIdentity);

The uniqueIdentity input is further back in the stack trace (in a unit test) and is this value:

final String id5CompactString = "A-PROD-COMP-LOGL-00000000-0000-8033-0000-000200354F0A";

NOTE: The regex and uniqueIdentity values were copied into the Perl program from a debug session to check whether a different language produces a different result (which it did).

ADDITIONAL NOTE: The non-capturing group is there to make the third element in the string optional, so the regex has to deal with both of these:

   A-PROD-COMP-LOGL-00000000-0000-8033-0000-000200354F0A
   A-PROD-COMP-00000000-0000-8033-0000-000200354F0A

My unit test fails in Java - the third match group, which should be LOGL, is in fact 0000.

A debugger screenshot taken right after the regex match line above confirmed this.

It showed that the pattern matches, and that the input parameter (text) and the regex are the same as in the Perl script, but the result is different!
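For anyone who wants to reproduce this outside the debugger, here is a minimal stand-alone sketch. The class name GroupRepro and the helper groups are hypothetical and not part of my code base; the pattern and input are the ones shown above, with the pattern string split across two lines only for readability. It prints the running Java version next to the captured groups, since the value of group 3 is the part in question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone reproduction; class and method names are illustrative.
class GroupRepro {
    // Same pattern as NON_SYSTEM_TYPE_REGEX above, split only for readability.
    static final Pattern PATTERN = Pattern.compile(
            "^[AT]-([A-Z0-9]{4})-([A-Z0-9]{4})(?:-([A-Z0-9]{4}))*"
            + "-([A-F0-9]{8}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{12})$");

    // Returns groups 1..4, or null when the input does not match at all.
    static String[] groups(String input) {
        Matcher m = PATTERN.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2), m.group(3), m.group(4)};
    }

    public static void main(String[] args) {
        String[] g = groups("A-PROD-COMP-LOGL-00000000-0000-8033-0000-000200354F0A");
        // The value of group 3 is what differs from Perl in my JDK 8 run.
        System.out.println("java.version=" + System.getProperty("java.version"));
        System.out.println(g == null ? "no match" : String.join(" ", g));
    }
}
```

Groups 1, 2 and 4 are not in dispute; only the third value on the second output line needs comparing against the Perl output.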

So my question is: Why does match.group(3) have a value of 0000 (when it should have the value LOGL), and how does that relate back to the regex and the string it is applied to?

In Perl it yields the correct result - LOGL.

Additional info: I have perused this page that highlights the differences between Perl and Java regex engines, and there doesn't appear to be anything applicable.



1 Reply

Replace your regex with the following regex:

^[AT]-([A-Z0-9]{4})-([A-Z0-9]{4})-(?:([A-Z0-9]{4}))*-([A-F0-9]{8}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{12})$
This has been moved out----------^

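I have moved the hyphen (-) out of the non-capturing group. In the original pattern, the * first consumes -LOGL and then -0000 (the start of 00000000) before it has to back off one repetition; in your JDK 8 run, group 3 still held 0000 after that step, whereas Perl reported LOGL. With the hyphen outside, the repeated group matches LOGL, the next repetition fails immediately at the following -, and group 3 is never overwritten. Note that this version requires a hyphen on both sides of the optional element, so the shorter form A-PROD-COMP-00000000-0000-8033-0000-000200354F0A does not match it.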

Demo:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        final String NON_SYSTEM_TYPE_REGEX = "^[AT]-([A-Z0-9]{4})-([A-Z0-9]{4})-(?:([A-Z0-9]{4}))*-([A-F0-9]{8}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{12})$";
        final Pattern NON_SYSTEM_TYPE_PATTERN = Pattern.compile(NON_SYSTEM_TYPE_REGEX);
        String uniqueIdentity = "A-PROD-COMP-LOGL-00000000-0000-8033-0000-000200354F0A";
        final Matcher match = NON_SYSTEM_TYPE_PATTERN.matcher(uniqueIdentity);

        if (match.find()) {
            System.out.printf("Matches 1=%s 2=%s 3=%s 4=%s%n", match.group(1), match.group(2), match.group(3),
                    match.group(4));
        }
    }
}

Output:

Matches 1=PROD 2=COMP 3=LOGL 4=00000000-0000-8033-0000-000200354F0A

Check the demo at regex101 as well.
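If the shorter form without a third element must match too, one option is to keep the hyphen inside the non-capturing group and use ? (zero or one) instead of *. The sketch below is only an illustration under that assumption: it treats the third element as appearing at most once, and the class name OptionalElementDemo, the helper groups and the constant names are hypothetical. It runs both patterns against both sample inputs from the question.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch; class, method and constant names are hypothetical.
class OptionalElementDemo {
    private static final String UUID_PART =
            "([A-F0-9]{8}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{12})";

    // Pattern from this answer: hyphen moved out of the repeated group.
    static final Pattern MOVED_HYPHEN = Pattern.compile(
            "^[AT]-([A-Z0-9]{4})-([A-Z0-9]{4})-(?:([A-Z0-9]{4}))*-" + UUID_PART + "$");

    // Assumed alternative: original grouping, but '?' (zero or one) instead of '*'.
    static final Pattern OPTIONAL = Pattern.compile(
            "^[AT]-([A-Z0-9]{4})-([A-Z0-9]{4})(?:-([A-Z0-9]{4}))?-" + UUID_PART + "$");

    // Returns groups 1..4, or null if the input does not match at all.
    // A group that did not take part in the match is reported as null.
    static String[] groups(Pattern p, String input) {
        Matcher m = p.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2), m.group(3), m.group(4)};
    }

    public static void main(String[] args) {
        String longForm = "A-PROD-COMP-LOGL-00000000-0000-8033-0000-000200354F0A";
        String shortForm = "A-PROD-COMP-00000000-0000-8033-0000-000200354F0A";
        for (String s : new String[] {longForm, shortForm}) {
            System.out.println(s);
            System.out.println("  moved hyphen: " + Arrays.toString(groups(MOVED_HYPHEN, s)));
            System.out.println("  optional (?): " + Arrays.toString(groups(OPTIONAL, s)));
        }
    }
}
```

With the ? variant, group 3 is LOGL for the longer input and null for the shorter one, so the calling code needs a null check before using it.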

