Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
527 views
in Technique[技术] by (71.8m points)

regex - Ruby 1.9: Regular Expressions with unknown input encoding

Is there an accepted way to deal with regular expressions in Ruby 1.9 for which the encoding of the input is unknown? Let's say my input happens to be UTF-16 encoded:

x  = "foo<p>bar</p>baz"
y  = x.encode('UTF-16LE')
re = /<p>(.*)</p>/

x.match(re) 
=> #<MatchData "<p>bar</p>" 1:"bar">

y.match(re)
Encoding::CompatibilityError: incompatible encoding regexp match (US-ASCII regexp with UTF-16LE string)

My current approach is to use UTF-8 internally and re-encode (a copy of) the input if necessary:

if y.methods.include?(:encode)  # Ruby 1.8 compatibility
  if y.encoding.name != 'UTF-8'
    y = y.encode('UTF-8')
  end
end

y.match(/<p>(.*)</p>/u)
=> #<MatchData "<p>bar</p>" 1:"bar">

However, this feels a little awkward to me, and I wanted to ask if there's a better way to do it.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

As far as I am aware, there is no better method to use. However, might I suggest a slight alteration?

Rather than changing the encoding of the input, why not change the encoding of the regex? Translating one regex string every time you meet a new encoding is a lot less work than translating hundreds or thousands of lines of input to match the encoding of your regex.

# Utility function to make transcoding the regex simpler.
def get_regex(pattern, encoding='ASCII', options=0)
  Regexp.new(pattern.encode(encoding),options)
end



  # Inside code looping through lines of input.
  # The variables 'regex' and 'line_encoding' should be initialized previously, to
  # persist across loops.
  if line.methods.include?(:encoding)  # Ruby 1.8 compatibility
    if line.encoding != last_encoding
      regex = get_regex('<p>(.*)</p>',line.encoding,16) # //u = 00010000 option bit set = 16
      last_encoding = line.encoding
    end
  end
  line.match(regex)

In the pathological case (where the input encoding changes every line) this would be just as slow, since you're re-encoding the regex every single time through the loop. But in 99.9% of situations where the encoding is constant for an entire file of hundreds or thousands of lines, this will result in a vast reduction in re-encoding.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...