Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

regex - How do you use the Java word boundary with apostrophes?

I am trying to delete all the occurrences of a word in a list, but I am having trouble when there are apostrophes in the words.

String phrase="bob has a bike and bob's bike is red";
String word="bob";
phrase=phrase.replaceAll("\\b"+word+"\\b","");
System.out.println(phrase);

output:
has a bike and 's bike is red

What I want is
has a bike and bob's bike is red

I have a limited understanding of regex, so I'm guessing there is a solution, but I do not know enough to write a regex that handles apostrophes. I would also like it to work with dashes, so that in the phrase "the new mail is e-mail" only the first occurrence of "mail" is replaced.


1 Reply


It all depends on what you understand to be a "word". Perhaps you'd better define what you understand to be a word delimiter (for example blanks, commas, ...) and write something like

phrase=phrase.replaceAll("([ \\s,.;])" + Pattern.quote(word)+ "([ \\s,.;])","$1$2");

But you'll additionally have to check for occurrences at the start and the end of the string. For example:

  String phrase="bob has a bike bob, bob and boba bob's bike is red and \"bob\" stuff.";
  String word="bob";
  phrase=phrase.replaceAll("([\\s,.;])" + Pattern.quote(word) + "([\\s,.;])","$1$2");
  System.out.println(phrase);

prints this

bob has a bike ,  and boba bob's bike is red and "bob" stuff.
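One possible way to cover the start and end of the string as well is to swap the two delimiter groups for lookarounds, which check the neighbouring characters without consuming them. This is only a sketch: the class name WordRemover, the method removeWord, and the decision to treat apostrophes and hyphens as part of a word are assumptions made for illustration, not something taken from the answer above.

```java
import java.util.regex.Pattern;

final class WordRemover {
    // Assumption: a "word" may contain ASCII letters, digits, underscore (\w),
    // apostrophes and hyphens, so "bob's" and "e-mail" each count as one word.
    private static final String WORD_CHARS = "[\\w'-]";

    // Removes every occurrence of word that is not glued to another word character.
    static String removeWord(String phrase, String word) {
        String regex = "(?<!" + WORD_CHARS + ")"   // no word character right before
                + Pattern.quote(word)               // the word itself, taken literally
                + "(?!" + WORD_CHARS + ")";         // no word character right after
        return phrase.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        System.out.println(removeWord("bob has a bike and bob's bike is red", "bob"));
        System.out.println(removeWord("the new mail is e-mail", "mail"));
    }
}
```

With this sketch the leading "bob" is removed and "bob's" is kept, and only the first "mail" is removed. The surrounding whitespace is left in place, so the first result starts with a space. Unlike the explicit delimiter list, anything outside the assumed word characters (a double quote, for instance) counts as a delimiter here.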

Update: If you insist on using \b, and considering that its notion of a "word boundary" understands Unicode, you can also do this dirty trick: replace all occurrences of ' with some Unicode letter that you are sure will not appear in your text, and afterwards do the reverse replacement. Example:

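  String phrase="bob has a bike bob, bob and boba bob's bike is red and \"bob\" stuff.";
  String word="bob";
  // \u0105 and \u0106 are arbitrary placeholder letters, assumed absent from the text
  phrase= phrase.replace("'","\u0105").replace('"','\u0106');
  // relies on \b treating the placeholder letters as word characters
  phrase=phrase.replaceAll("\\b" + Pattern.quote(word) + "\\b","");
  phrase= phrase.replace('\u0106','"').replace("\u0105","'");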
  System.out.println(phrase);

UPDATE: To summarize some comments below: one would expect \w and \b to share the same notion of what a "word character" is, as they do in almost every regular-expression dialect. Well, in Java they do not: \w considers only ASCII, while \b considers Unicode. It's an ugly inconsistency, I agree.

Update 2: Since Java 7 (as pointed out in the comments), the UNICODE_CHARACTER_CLASS flag lets you request consistent, Unicode-aware behaviour.
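
As a rough sketch of what that flag changes, the snippet below matches \w+ against a word containing a non-ASCII letter, once without flags and once with Pattern.UNICODE_CHARACTER_CLASS. The class name UnicodeFlagDemo and the helper firstWord are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnicodeFlagDemo {
    // Returns the first \w+ match in text, compiled with the given flags (0 for none).
    static String firstWord(String text, int flags) {
        Matcher m = Pattern.compile("\\w+", flags).matcher(text);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        // The fourth character is U+00E9, a Latin small letter e with acute accent.
        String text = "caf\u00e9 au lait";

        // Without the flag, \w is ASCII-only, so the match stops before the accented letter.
        System.out.println(firstWord(text, 0)); // caf

        // With the flag, \w uses the Unicode definition and the whole word matches.
        System.out.println(firstWord(text, Pattern.UNICODE_CHARACTER_CLASS));
    }
}
```

The same flag can be switched on inside the pattern itself with the embedded expression (?U), which is handy when the regex is passed to String.replaceAll and no Pattern object is compiled explicitly.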

