Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


regex - How do you split a javascript string by spaces and punctuation?

I have some random string, for example: "Hello, my name is john.". I want that string split into an array like this: Hello, ,, , my, name, is, john, .,. I tried str.split(/[^\w\s]|_/g), but it does not seem to work. Any ideas?


1 Reply


To split a string on any run of non-word characters (i.e. anything other than A-Z, a-z, 0-9, and underscore):

var words=str.split(/\W+/);  // assumes str neither begins nor ends with a non-word character
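As a rough illustration of the one-liner above, here is a minimal sketch run against the sentence from the question. The variable names (sample, wordsOnly, withSeparators) are made up for this example. It also shows one possible way to keep the punctuation and spaces, which is what the question asks for: when the separator pattern contains a capture group, split() includes the captured text in the result.

```javascript
// Sample input taken from the question text.
var sample = 'Hello, my name is john.';

// Splitting on runs of non-word characters drops spaces and punctuation.
// Because the string ends with '.', the last element is an empty string.
var wordsOnly = sample.split(/\W+/);

// With a capture group, each single non-word character is kept as its
// own element. Adjacent separators leave empty strings, so filter them out.
var withSeparators = sample.split(/(\W)/).filter(function (token) {
  return token !== '';
});

console.log(wordsOnly);      // [ 'Hello', 'my', 'name', 'is', 'john', '' ]
console.log(withSeparators); // [ 'Hello', ',', ' ', 'my', ' ', 'name', ' ', 'is', ' ', 'john', '.' ]
```

Note that the second form keeps every space as a token; drop them in the filter callback if only words and punctuation are wanted.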

Or, assuming your target language is English, you can extract all semantically useful values from a string (i.e. "tokenizing" a string) using:

var str='Here\'s a (good, bad, indifferent, ...) '+
        'example sentence to be used in this test '+
        'of English language "token-extraction".',

    // backslashes are doubled because these are string literals
    // that get passed to the RegExp constructor
    punct='\\['+ '\\!'+ '"'+ '\\#'+ '\\$'+     // since javascript does not
          '\\%'+ '\\&'+ '\\\''+ '\\('+ '\\)'+  // support POSIX character
          '\\*'+ '\\+'+ '\\,'+ '\\\\'+ '\\-'+  // classes, we'll need our
          '\\.'+ '\\/'+ '\\:'+ '\\;'+ '\\<'+   // own version of [:punct:]
          '\\='+ '\\>'+ '\\?'+ '\\@'+ '\\['+
          '\\]'+ '\\^'+ '\\_'+ '\\`'+ '\\{'+
          '\\|'+ '\\}'+ '\\~'+ '\\]',

    re=new RegExp(     // tokenizer
       '\\s*'+            // discard possible leading whitespace
       '('+               // start capture group
         '\\.{3}'+            // ellipsis (must appear before punct)
       '|'+               // alternator
         '\\w+\\-\\w+'+       // hyphenated words (must appear before punct)
       '|'+               // alternator
         '\\w+\'(?:\\w+)?'+   // compound words (must appear before punct)
       '|'+               // alternator
         '\\w+'+              // other words
       '|'+               // alternator
         '['+punct+']'+        // punct
       ')'                // end capture group
     );

// grep(ary[,filt]) - filters an array
//   note: could use jQuery.grep() instead
// @param {Array}    ary    array of members to filter
// @param {Function} filt   function to test truthiness of member,
//   if omitted, "function(member){ if(member) return member; }" is assumed
// @returns {Array}  all members of ary where result of filter is truthy
function grep(ary,filt) {
  var result=[];
  for(var i=0,len=ary.length;i<len;i++) {   // start at index 0, stop before len
    var member=ary[i]||'';
    if((typeof filt === 'function') ? filt(member) : member) {  // typeof yields lowercase 'function'
      result.push(member);
    }
  }
  return result;
}

var tokens=grep( str.split(re) );   // note: filter function omitted 
                                     //       since all we need to test 
                                     //       for is truthiness

which produces:


tokens=[ 
  'Here\'s',
  'a',
  '(',
  'good',
  ',',
  'bad',
  ',',
  'indifferent',
  ',',
  '...',
  ')',
  'example',
  'sentence',
  'to',
  'be',
  'used',
  'in',
  'this',
  'test',
  'of',
  'English',
  'language',
  '"',
  'token-extraction',
  '"',
  '.'
]
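A possible shortcut, not the method used in the answer above: since only the captured tokens are wanted, String.prototype.match with a global regex literal can return them directly, with no empty strings to filter out. The sketch below is an assumption-laden variant. The names text, tokenPattern and matched are invented here, and the punctuation class is simplified to [^\w\s], which is broader than the hand-built [:punct:] list because it also matches any non-ASCII symbol.

```javascript
// Same sample sentence as in the answer.
var text = 'Here\'s a (good, bad, indifferent, ...) ' +
           'example sentence to be used in this test ' +
           'of English language "token-extraction".';

// Alternatives are tried left to right, so the specific patterns
// (ellipsis, hyphenated words, words with an apostrophe) come first.
// The final alternative is a simplified stand-in for the punct class.
var tokenPattern = /\.{3}|\w+-\w+|\w+'(?:\w+)?|\w+|[^\w\s]/g;

// match() with the g flag returns all matches, or null if there are none.
var matched = text.match(tokenPattern) || [];

console.log(matched.length);       // 26
console.log(matched.slice(0, 4));  // [ "Here's", 'a', '(', 'good' ]
```

For this sentence the result is the same 26 tokens listed above. Whitespace never matches any alternative, so it is skipped without needing the leading \s* or the grep() pass.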

EDIT

Also available as a GitHub Gist

