Look at the "any letter" Unicode character class, \p{L}, or at the Pattern.UNICODE_CHARACTER_CLASS flag for Java's Pattern.compile method.
The second option is Java-only, so it may not interest you, but it is worth mentioning.
import java.util.regex.Pattern;

/**
 * @author Luc
 */
public class Test {
    /**
     * @param args
     */
    public static void main(final String[] args) {
        test("Bonjour");
        test("????????");
        test("世界人权宣言 ");
    }

    private static void test(final String text) {
        // \p{L} matches any Unicode letter, with no flag needed.
        showMatch(Pattern.compile("\\b\\p{L}+\\b"), text);
        // With UNICODE_CHARACTER_CLASS, \w also covers non-ASCII word characters.
        showMatch(Pattern.compile("\\b\\w+\\b", Pattern.UNICODE_CHARACTER_CLASS), text);
    }

    private static void showMatch(final Pattern pattern, final String text) {
        System.out.println("With pattern \"" + pattern + "\": " + text + " " + pattern.matcher(text).find());
    }
}
Results:
With pattern "\w+": Bonjour true
With pattern "\p{L}+": Bonjour true
With pattern "\w+": ???????? true
With pattern "\p{L}+": ???????? true
With pattern "\w+": 世界人权宣言 true
With pattern "\p{L}+": 世界人权宣言 true
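To see why the flag matters, here is a minimal sketch comparing \w with and without Pattern.UNICODE_CHARACTER_CLASS on non-ASCII text. The class name UnicodeWordDemo and the helper finds are made up for illustration, and the sample string is written with Unicode escapes (the first two characters of the Chinese sample) so that the source file encoding does not affect the result.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
public class UnicodeWordDemo {

    // Returns true if the pattern finds at least one match in the text.
    static boolean finds(final Pattern pattern, final String text) {
        return pattern.matcher(text).find();
    }

    public static void main(final String[] args) {
        // Two CJK ideographs (U+4E16, U+754C), written as escapes to stay ASCII-only.
        final String text = "\u4e16\u754c";

        // By default, \w only covers [a-zA-Z_0-9].
        final Pattern asciiWord = Pattern.compile("\\w+");
        // With the flag, \w follows the Unicode notion of a word character.
        final Pattern unicodeWord = Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);
        // (?U) is the embedded form of the same flag.
        final Pattern embeddedFlag = Pattern.compile("(?U)\\w+");
        // \p{L} matches any Unicode letter regardless of flags.
        final Pattern anyLetter = Pattern.compile("\\p{L}+");

        System.out.println("default \\w+ : " + finds(asciiWord, text));
        System.out.println("flagged \\w+ : " + finds(unicodeWord, text));
        System.out.println("(?U)\\w+     : " + finds(embeddedFlag, text));
        System.out.println("\\p{L}+      : " + finds(anyLetter, text));
    }
}
```

Only the first pattern should fail to match here. Note that \w with the flag also accepts digits and underscores, so \p{L} is the stricter choice if you want letters only.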