Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

regex - How to parse string into words and punctuation marks using javascript

I have a string test="hello how are you all doing, I hope that it's good! and fine. Looking forward to see you."

I am trying to parse the string into words and punctuation marks using JavaScript. I am able to separate the words, but the punctuation marks disappear when I use the regex

var result = test.match(/(\w|')+/g);

So my expected output is

hello
how 
are 
you
all
doing
,
I
hope
that
it's
good
!
and 
fine
.
Looking
forward
to
see
you

1 Reply


Simple approach

This first approach works if your definition of a "word" matches JavaScript's. A more customizable approach is below.

Try test.split(/\s*\b\s*/). It splits on word boundaries (\b) and eats whitespace.

"hello how are you all doing, I hope that it's good! and fine. Looking forward to see you."
    .split(/\s*\b\s*/);
// Returns:
["hello",
"how",
"are",
"you",
"all",
"doing",
",",
"I",
"hope",
"that",
"it",
"'",
"s",
"good",
"!",
"and",
"fine",
".",
"Looking",
"forward",
"to",
"see",
"you",
"."]

How it works.

var test = "This is. A test?"; // Test string.

// First consider splitting on word boundaries (\b).
test.split(/\b/); //=> ["This"," ","is",". ","A"," ","test","?"]
// This almost works but there is some unwanted whitespace.

// So we change the split regex to gobble the whitespace using \s*
test.split(/\s*\b\s*/); //=> ["This","is",".","A","test","?"]
// Now the whitespace is included in the separator
// and not included in the result.
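
As a minimal sketch of the split approach above, it can be wrapped in a small function. The name splitOnBoundaries is hypothetical and not part of the original answer. The sketch assumes you want to trim the input first, because leading whitespace would otherwise produce an empty first element and trailing whitespace would stay attached to the last token.

```javascript
// Hypothetical helper (not from the original answer): split on word
// boundaries and swallow the whitespace around each boundary.
function splitOnBoundaries(text) {
  // Trim first so leading/trailing whitespace cannot end up in the result.
  const trimmed = text.trim();
  if (trimmed === "") return [];
  // Drop any empty strings defensively.
  return trimmed.split(/\s*\b\s*/).filter((token) => token !== "");
}

console.log(splitOnBoundaries("  This is. A test?  "));
//=> ["This","is",".","A","test","?"]
```

Note that this still uses JavaScript's notion of a word, so "it's" comes back as "it", "'", "s".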

More involved solution.

If you want words like "isn't" and "one-thousand" to be treated as single words, even though JavaScript's word boundary splits them apart, you will need to create your own definition of a word.

test.match(/[\w-']+|[^\w\s]+/g) //=> ["This","is",".","A","test","?"]

How it works

This matches words and punctuation separately using an alternation. The first half of the regex, [\w-']+, matches whatever you consider to be a word, and the second half, [^\w\s]+, matches whatever you consider punctuation. In this example I just used whatever isn't a word character or whitespace. I also put a + on the end so that multi-character punctuation (such as ?!, which is properly written ‽) is treated as a single token; if you don't want that, remove the +.
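
The alternation described above can be sketched as a small tokenizer. The names tokenize, GROUPED and SINGLE are hypothetical, chosen only for illustration. The sketch assumes a word is any run of word characters, apostrophes and hyphens. The hyphen is written last in the character class so that it is unambiguously a literal.

```javascript
// Hypothetical names; both patterns follow the "word | punctuation" idea.
const GROUPED = /[\w'-]+|[^\w\s]+/g; // "?!" stays one token
const SINGLE = /[\w'-]+|[^\w\s]/g;   // "?!" becomes "?" and "!"

function tokenize(text, pattern = GROUPED) {
  // match() returns null when nothing matches, so fall back to [].
  return text.match(pattern) || [];
}

const sample = "Really?! It's one-thousand.";
console.log(tokenize(sample));
//=> ["Really","?!","It's","one-thousand","."]
console.log(tokenize(sample, SINGLE));
//=> ["Really","?","!","It's","one-thousand","."]
```

Applied to the string from the question, this keeps "it's" as one token and returns each of ",", "!" and "." as its own token.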

