Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

0 votes
531 views
in Technique[技术] by (71.8m points)

regex - JavaScript Regular Expression Replace Multiple Letters

I have the following code to replace a string of DNA with its complement where A <=> T and C <=> G. I do this with very basic knowledge of regular expressions. How can I refactor the following using regular expressions to capture the letter and replace it with its complement.

function DNA(strand) {
    return strand.replace(/A|T|C|G/g, x => {
        return (x=="A") ? "T" : (x=="T") ? "A" : (x=="C") ? "G" : "C";
    });
}

Fight the dragon too long and you become the dragon yourself; gaze into the abyss too long and the abyss gazes back…

1 Reply

0 votes
by (71.8m points)

This is rather inelegant, IMO, but it is a one-step (two-step?) replacement algorithm that uses JavaScript's regex capabilities. If you're interested, I can explain what the heck it's doing.

function DNA(strand) {
    return strand
        .concat("||TACG")
        .replace(/A(?=.*?(T))|T(?=.*?(A))|C(?=.*?(G))|G(?=.*?(C))|\|\|....$/gi, "$1$2$3$4");
}

See this fiddle (now updated a bit for testability) to play around with it.

This might seem like a simple example for which to build a regex, but it's not really (if you want it to all be in the regex, that is). It would be far more efficient to use a simple mapping table (hashtable), split the characters, remap/translate them, and join them together (as @Jared Smith did), since the regex engine is not very efficient. If this is solely for personal interest and learning regex, then please feel free to ask for any required explanation.
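For comparison, the mapping-table approach mentioned above can be sketched roughly as follows. The names COMPLEMENT and complementByTable are made up for illustration, and passing unknown characters through unchanged is an assumption, not something the question specifies.

```javascript
// Lookup table: each base maps to its complement.
const COMPLEMENT = { A: "T", T: "A", C: "G", G: "C" };

// Hypothetical helper: split into characters, translate each, join again.
function complementByTable(strand) {
    return strand
        .split("")                               // one array element per character
        .map(base => COMPLEMENT[base] || base)   // assumption: unknown characters pass through
        .join("");
}

console.log(complementByTable("ATCG")); // "TAGC"
```

The same table also works with a single character-class regex and a callback, e.g. strand.replace(/[ATCG]/g, b => COMPLEMENT[b]), which is probably the closest refactor of the code in the question.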

Edit for jwco:

As I stated, this is rather inelegant (or at least inefficient) for a production-level solution, but perhaps rather elegant as an art piece(?). It uses only JavaScript regex (RegExp) capabilities, so no "regular expression conditions" or "look-behind", and if JavaScript supported "free-spacing", you could actually use the regex as shown below.

This is a relatively common way of breaking down components of a regex to explain what each part is matching, looking for and capturing:

  A         #  Match an A, literally
  (?=       #  Look ahead, and
    .*?     #    Match any number of any character lazily (as necessary)
    (T)     #    Match and capture a T, literally (into group #1)
  )         #  End look-ahead
|           #-OR-
  T         #  Match a T, literally
  (?=       #  Look ahead, and
    .*?     #    Match any number of any character lazily (as necessary)
    (A)     #    Match and capture an A, literally (into group #2)
  )         #  End look-ahead
|           #-OR-
  C         #  Match a C, literally
  (?=       #  Look ahead, and
    .*?     #    Match any number of any character lazily (as necessary)
    (G)     #    Match and capture a G, literally (into group #3)
  )         #  End look-ahead
|           #-OR-
  G         #  Match a G, literally
  (?=       #  Look ahead, and
    .*?     #    Match any number of any character lazily (as necessary)
    (C)     #    Match and capture a C, literally (into group #4)
  )         #  End look-ahead
|           #-OR-
  \|\|....$ #  Match two literal pipes (|), then any four characters, then the end of the string

Anything matched by this expression (which should be every part of the entire string) will be replaced by the replacement expression $1$2$3$4. The "global" flag (the g in the /gi) will make it keep trying to match as long as there is more of the string to test.
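JavaScript has no free-spacing flag, but the commented layout above can be approximated by assembling the pattern from string pieces, one alternative per line. This is only a sketch; alternatives, complementPattern and DNAFromParts are names invented here.

```javascript
// Each alternative sits on its own line so it can carry a comment.
const alternatives = [
    "A(?=.*?(T))",   // A, then look ahead and capture a later T (group 1)
    "T(?=.*?(A))",   // T, then look ahead and capture a later A (group 2)
    "C(?=.*?(G))",   // C, then look ahead and capture a later G (group 3)
    "G(?=.*?(C))",   // G, then look ahead and capture a later C (group 4)
    "\\|\\|....$",   // cleanup: two literal pipes, four characters, end of string
];

// Join with the alternation operator; the flags match the original /gi.
const complementPattern = new RegExp(alternatives.join("|"), "gi");

function DNAFromParts(strand) {
    return strand.concat("||TACG").replace(complementPattern, "$1$2$3$4");
}

console.log(DNAFromParts("ATCG")); // "TAGC"
```

Note that the pipes have to be escaped twice in a string literal ("\\|") so that the regex engine receives \| and treats them as literal characters rather than as alternation.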

The expression is made up of five possible options (one for each possible letter switch and then a "cleanup" match). The first four options are identical except for the particular letters matched. Each of these matches and consumes a particular desired letter, then "looks ahead" in the string to find its "translation" or "complement", captures it without consuming anything else, then completes as a successful alternative, thus satisfying the expression as a whole.

Since only one of the capturing groups (1-4) could have matched for any successfully tested letter, only one of the group references ($1, etc. in $1$2$3$4) could possibly contain a captured value; a group that did not take part in the match is substituted as an empty string. In the case of the fifth option (\|\|....$), there is no capture, so none of the capture groups contain a value with which to replace the match.
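To see which group takes part in each match, a replacement callback can record the captures instead of using "$1$2$3$4". This is an illustrative sketch; traceGroups is a hypothetical helper, and it uses the pattern with the cleanup pipes escaped.

```javascript
const pattern = /A(?=.*?(T))|T(?=.*?(A))|C(?=.*?(G))|G(?=.*?(C))|\|\|....$/gi;

// For every match, record the matched text and the number (1-4) of the
// group that captured something, or 0 when no group took part.
function traceGroups(strand) {
    const trace = [];
    strand.concat("||TACG").replace(pattern, (match, g1, g2, g3, g4) => {
        const groups = [g1, g2, g3, g4];
        const filled = groups.findIndex(g => g !== undefined) + 1;
        trace.push({ match, group: filled });
        return groups.join("");   // undefined joins as "", like "$1$2$3$4"
    });
    return trace;
}

console.log(traceGroups("AC"));
// [ { match: "A", group: 1 }, { match: "C", group: 3 }, { match: "||TACG", group: 0 } ]
```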

Before being fed into the regex engine, the string ||TACG is appended to the source, kind of like a telomere (sorta). This provides a replacement source in case the source string does not contain the appropriate "complement" letter at a later position (or at all). The last option in the regex effectively removes this extraneous information by matching it and replacing it with nothing.
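A quick way to see why the appended ||TACG matters is to run the same four alternatives without it. In this sketch (withoutSuffix is a made-up name), any base that has no complement later in the string simply fails to match and is left as it was.

```javascript
const basesOnly = /A(?=.*?(T))|T(?=.*?(A))|C(?=.*?(G))|G(?=.*?(C))/gi;

// Hypothetical helper: same replacement, but no "||TACG" appended.
function withoutSuffix(strand) {
    return strand.replace(basesOnly, "$1$2$3$4");
}

console.log(withoutSuffix("AAAA")); // "AAAA": no T anywhere, so nothing matches
console.log(withoutSuffix("ATCG")); // "TTGG": T and G have no complement after them
```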

This could be done for any set of replacements, but it gets less and less efficient as more replacements are added. Maintaining such a regex would also, as indicated by a certain commenter's (I hope jovial) threat, be... ummm... a challenge. Enjoy!
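As a rough illustration of the "any set of replacements" idea, the pattern, the appended suffix and the replacement string can all be generated from a list of pairs. This is a sketch under assumptions: buildLookaheadReplacer is a hypothetical name, every entry is a single character that is not a regex metacharacter, the match is case-sensitive, and the input has no line breaks.

```javascript
// pairs: array of [from, to] single characters (assumed regex-safe).
function buildLookaheadReplacer(pairs) {
    // The suffix holds one copy of every target so a look-ahead can always find it.
    const suffix = pairs.map(([, to]) => to).join("");
    // One alternative per pair: match `from`, capture a later `to`.
    const alternatives = pairs.map(([from, to]) => `${from}(?=.*?(${to}))`);
    // Cleanup alternative: two literal pipes plus the suffix, at the end.
    alternatives.push(`\\|\\|.{${suffix.length}}$`);
    const pattern = new RegExp(alternatives.join("|"), "g");
    // "$1$2...$n": only the group that took part contributes a character.
    const replacement = pairs.map((_, i) => "$" + (i + 1)).join("");
    return text => text.concat("||", suffix).replace(pattern, replacement);
}

const dna = buildLookaheadReplacer([["A", "T"], ["T", "A"], ["C", "G"], ["G", "C"]]);
console.log(dna("ATCG")); // "TAGC"
```

Every additional pair adds another alternative and another look-ahead scan per character, which is where the loss of efficiency comes from.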

