unicode - ANTLR4: Using non-ASCII characters in token rules

Question

posted Oct 24, 2021 in Technique by 深蓝 (71.8m points)

On page 74 of the ANTLR4 book it says that any Unicode character can be used in a grammar simply by specifying its codepoint in this manner:

'\uxxxx'

where xxxx is the hexadecimal value for the Unicode codepoint.

So I used that technique in a token rule for an ID token:

grammar ID;

id : ID EOF ;

ID : ('a' .. 'z' | 'A' .. 'Z' | '\u0100' .. '\u017E')+ ;
WS : [ \t\r\n]+ -> skip ;

When I tried to parse this input:

Gŭnter

ANTLR throws an error, saying that it does not recognize ŭ. (The ŭ character is hex 016D, so it is within the range specified.)

What am I doing wrong please?

1 Reply

深蓝 · Answer 1 · 2021-10-23T18:21:36+0000

ANTLR is ready to accept 16-bit characters but, by default, many locales read characters in as single bytes (8 bits), so a multi-byte character in the file never reaches the lexer as one character. You need to specify the appropriate encoding when you read from the file using the Java libraries. If you are using the TestRig, perhaps through the grun alias or script, pass the argument -encoding utf-8 (or whichever encoding your file uses). If you look at the source code of that class, you will see the following mechanism:

InputStream is = new FileInputStream(inputFile);
Reader r = new InputStreamReader(is, encoding); // e.g., euc-jp or utf-8
ANTLRInputStream input = new ANTLRInputStream(r);
XLexer lexer = new XLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
...
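
To see why the encoding passed to the Reader matters, here is a minimal sketch that uses only the JDK, not ANTLR. The class EncodingDemo and the helpers decode and matchesIdRanges are hypothetical names made up for this illustration; matchesIdRanges is a hand-written check that mirrors the character ranges of the ID rule in the question, not the lexer ANTLR generates. It decodes the same UTF-8 bytes once as UTF-8 and once as ISO-8859-1, which stands in for a single-byte default encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {

    // Decode raw bytes through an InputStreamReader, the same JDK mechanism
    // the snippet above uses before handing the Reader to ANTLR.
    static String decode(byte[] bytes, Charset charset) {
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hand-written stand-in for the ID rule's ranges:
    // 'a'..'z' | 'A'..'Z' | '\u0100'..'\u017E', one or more times.
    static boolean matchesIdRanges(String s) {
        if (s.isEmpty()) {
            return false;
        }
        for (char ch : s.toCharArray()) {
            boolean ok = (ch >= 'a' && ch <= 'z')
                    || (ch >= 'A' && ch <= 'Z')
                    || (ch >= '\u0100' && ch <= '\u017E');
            if (!ok) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The input from the question, as it would be stored in a UTF-8 file.
        byte[] utf8Bytes = "G\u016Dnter".getBytes(StandardCharsets.UTF_8);

        String right = decode(utf8Bytes, StandardCharsets.UTF_8);
        String wrong = decode(utf8Bytes, StandardCharsets.ISO_8859_1);

        System.out.println(right.length() + " " + matchesIdRanges(right));
        System.out.println(wrong.length() + " " + matchesIdRanges(wrong));
    }
}
```

Running it prints "6 true" and then "7 false". Decoded as UTF-8, the two bytes 0xC5 0xAD become the single character U+016D, which is inside the range. Decoded one byte at a time, they become U+00C5 and U+00AD, and neither is covered by the rule, which is consistent with the error described in the question.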
