Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

php - RegEx: \w - "_" + "-" in UTF-8

I need a regular expression that matches UTF-8 letters and digits and the dash sign (-), but doesn't match underscores (_). I tried these silly attempts without success:

  • ([\w-^_])+
  • ([\w^_]-?)+
  • (\w[^_]-?)+

The \w is shorthand for [A-Za-z0-9_], but it also matches UTF-8 chars if I have the u modifier set.

Can anyone help me out with this one?

1 Reply

Try this:

(?:[\w-](?<!_))+

It does a simple match on anything that \w accepts (or a dash) and then uses a zero-width negative lookbehind to ensure that the character that was just matched is not an underscore.
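A minimal sketch of how the lookbehind version behaves in PHP, with the backslash in \w written out. The ^ and $ anchors, the u modifier, the function name matchesNoUnderscore, and the sample strings are assumptions added for illustration; they are not part of the answer. The "\u{...}" escapes need PHP 7.0 or later.

```php
<?php
// Hypothetical helper: true if $s consists only of word characters and
// dashes, with underscores rejected by the lookbehind.
function matchesNoUnderscore(string $s): bool
{
    // [\w-] consumes one word character or a dash; (?<!_) then looks back
    // at the character just consumed and fails if it is an underscore.
    return preg_match('/^(?:[\w-](?<!_))+$/u', $s) === 1;
}

// Illustrative inputs; "\u{e9}" is e-acute, "\u{f6}" is o-umlaut.
$samples = ['foo-bar', 'foo_bar', "h\u{e9}llo-w\u{f6}rld", 'abc123'];

foreach ($samples as $s) {
    printf("%s => %s\n", $s, matchesNoUnderscore($s) ? 'match' : 'no match');
}
```

Only 'foo_bar' should be reported as 'no match' here, because the lookbehind rejects the underscore after [\w-] has consumed it.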

Otherwise you could pick this one:

(?:[^_\W]|-)+

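which is a more set-based approach (note the uppercase \W): [^_\W] matches any character that is neither an underscore nor a non-word character, i.e. any word character except the underscore.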
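As a rough illustration of the set-based version, the sketch below pulls every run of allowed characters out of a string. The function name extractTokens, the u modifier, and the sample inputs are assumptions for the example, not something the answer specifies.

```php
<?php
// Hypothetical helper: returns every maximal run of "word characters
// except underscore, or dashes" found in $text.
function extractTokens(string $text): array
{
    // [^_\W] excludes the underscore and everything that is not a word
    // character, so only letters and digits remain; the alternative
    // |- adds the dash back in.
    preg_match_all('/(?:[^_\W]|-)+/u', $text, $m);
    return $m[0];
}

print_r(extractTokens('snake_case and kebab-case'));
// yields: snake, case, and, kebab-case
```

The underscore splits 'snake_case' into two tokens, while the dash keeps 'kebab-case' together.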

OK, I had a lot of fun with Unicode in PHP's flavor of PCRE :D Peekaboo says there is a simple solution available:

[\p{L}\p{N}\-]+

\p{L} matches anything in Unicode that qualifies as a Letter (note: not a word character, thus no underscores), while \p{N} matches anything that looks like a number (including Roman numerals and more exotic things).
\- is just an escaped dash. Although not strictly necessary here (the dash comes last in the class), I tend to make it a point to escape dashes in character classes. Note that there are dozens of different dashes in Unicode, which gives rise to the following version:

[\p{L}\p{N}\p{Pd}]+

Here "Pd" is the Unicode dash punctuation category, which includes, but is not limited to, our minus-dash-thingy (the ASCII hyphen-minus). (Note: again, no underscore here.)
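To make the difference between the two property-based patterns concrete, here is a small sketch that checks the same inputs against both. The function names isAsciiDashWord and isAnyDashWord, the ^ and $ anchors, and the sample strings are hypothetical and only serve the comparison; the u modifier is required for \p{...} to operate on UTF-8 input, and the "\u{...}" escapes need PHP 7.0 or later.

```php
<?php
// Letters, numbers, and the ASCII hyphen-minus only.
function isAsciiDashWord(string $s): bool
{
    return preg_match('/^[\p{L}\p{N}\-]+$/u', $s) === 1;
}

// Letters, numbers, and any character in the dash punctuation category.
function isAnyDashWord(string $s): bool
{
    return preg_match('/^[\p{L}\p{N}\p{Pd}]+$/u', $s) === 1;
}

$samples = [
    'foo-bar',           // ASCII hyphen-minus
    "2019\u{2013}2020",  // U+2013 EN DASH, which is in Pd
    "\u{2167}-x",        // U+2167 ROMAN NUMERAL EIGHT, which is in N
    'foo_bar',           // underscore is connector punctuation, not Pd
];

foreach ($samples as $s) {
    printf(
        "%s: ascii-dash=%s any-dash=%s\n",
        $s,
        isAsciiDashWord($s) ? 'yes' : 'no',
        isAnyDashWord($s) ? 'yes' : 'no'
    );
}
```

The en dash sample is the only one where the two patterns disagree; the underscore is rejected by both.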

