As far as I am aware, there is no better method to use. However, might I suggest a slight alteration?
Rather than changing the encoding of the input, why not change the encoding of the regex? Translating one regex string every time you meet a new encoding is a lot less work than translating hundreds or thousands of lines of input to match the encoding of your regex.
# Utility function to make transcoding the regex simpler.
def get_regex(pattern, encoding='ASCII', options=0)
Regexp.new(pattern.encode(encoding),options)
end
# Inside code looping through lines of input.
# The variables 'regex' and 'line_encoding' should be initialized previously, to
# persist across loops.
if line.methods.include?(:encoding) # Ruby 1.8 compatibility
if line.encoding != last_encoding
regex = get_regex('<p>(.*)</p>',line.encoding,16) # //u = 00010000 option bit set = 16
last_encoding = line.encoding
end
end
line.match(regex)
In the pathological case (where the input encoding changes every line) this would be just as slow, since you're re-encoding the regex every single time through the loop. But in 99.9% of situations where the encoding is constant for an entire file of hundreds or thousands of lines, this will result in a vast reduction in re-encoding.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…