Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

regex - Why do Perl string operations on Unicode characters add garbage to the string?

Perl:

$string =~ s/[áàâã]/a/gi; #This line always prepends an "a"
$string =~ s/[éèêë]/e/gi;
$string =~ s/[úùûü]/u/gi;

These substitutions should convert "été" into "ete". Instead, they convert it to "aetae". In other words, an "a" is prepended to every matched element. Even "à" is converted to "aa".

If I change the first line to this

$string =~ s/(á|à|â|ã)/a/gi;

it works, but... Now it prepends an e to every matched element (like "eetee").

Even though I found a suitable solution, why does it behave that way?

Edit 1:

I added "use utf8;", but it did not change the behavior (although it broke my output in JavaScript/AJAX).

Edit 2:

The stream originates from an Ajax request, performed by jQuery. The site it originates from is set to UTF-8.

I am using Perl v5.10 (perl -v returns "This is perl, v5.10.0 built for i586-linux-thread-multi").


1 Reply


The problem is very likely down to not having

use utf8;

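(or its equivalent for whatever encoding your source file uses) in your program. Without it, Perl reads the accented literals in each character class as individual bytes, so the substitutions work bytewise rather than characterwise. In UTF-8 each of these letters is two bytes, the first of which is 0xC3. The first substitution therefore turns every 0xC3 into "a", and the second turns the leftover second byte of "é" into "e", which is how "été" becomes "aetae".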

#!/usr/local/bin/perl
use warnings;
use strict;
use utf8;                   # the source file itself is saved as UTF-8
binmode STDOUT, ":utf8";    # encode output as UTF-8
my $string = "été";

$string =~ s/[áàâã]/a/gi;   # no longer prepends an "a"
$string =~ s/[éèêë]/e/gi;
$string =~ s/[úùûü]/u/gi;

print "$string\n";

prints

ete
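
To see why byte-level matching gives exactly the "aetae" from the question, here is a small sketch that runs the first two substitutions once on raw UTF-8 bytes and once on decoded characters. The names @rules and apply_rules are made up for this illustration. The letters are written as \x{..} escapes so the result does not depend on how the file is saved, and the /i flag is left out because it is not needed to show the effect. Which accented letters each class holds is partly an assumption about the question's pattern.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical rule table: [ accented letters, plain replacement ].
# Assumed class members: a-acute, a-grave, a-circumflex, a-tilde and
# e-acute, e-grave, e-circumflex, e-diaeresis.
my @rules = (
    [ "\x{E1}\x{E0}\x{E2}\x{E3}", 'a' ],
    [ "\x{E9}\x{E8}\x{EA}\x{EB}", 'e' ],
);

# Apply every rule to $text. When $as_bytes is true, the character class is
# first encoded to UTF-8, which imitates a script without "use utf8".
sub apply_rules {
    my ($text, $as_bytes) = @_;
    for my $rule (@rules) {
        my ($class, $plain) = @$rule;
        $class = encode('UTF-8', $class) if $as_bytes;
        $text =~ s/[$class]/$plain/g;
    }
    return $text;
}

my $word_chars = "\x{E9}t\x{E9}";                 # "ete" with accents, 3 characters
my $word_bytes = encode('UTF-8', $word_chars);    # 5 bytes: C3 A9 74 C3 A9

my $bytewise = apply_rules($word_bytes, 1);   # classes and text are raw bytes
my $charwise = apply_rules($word_chars, 0);   # classes and text are characters

print "bytewise: $bytewise\n";
print "charwise: $charwise\n";
```

Run as written, this should print "bytewise: aetae" and "charwise: ete". In the bytewise case the first class contains the shared lead byte 0xC3, so it matches half of every accented letter.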

If you are reading input from a file or from standard input, make sure the stream is decoded as UTF-8, or as whatever encoding the data actually uses. For STDIN use

binmode STDIN, ":utf8";

If you are reading from a file, use

open my $file, "<:utf8", "file_name" or die "Cannot open file_name: $!";

to get the encoding right. If the file is not in UTF-8, use :encoding(name) instead of :utf8, where name is the file's actual encoding.
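
As a rough illustration of the file case, the sketch below writes the UTF-8 bytes of "été" to a scratch file, reads them back through a decoding layer, and then applies one substitution. The temporary file and the variable names are invented for the example, and :encoding(UTF-8) is used as the stricter spelling of the utf8 layer.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file containing the raw UTF-8 bytes C3 A9 74 C3 A9 plus a newline.
# (Illustrative only; the file name is generated by File::Temp.)
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "\xC3\xA9t\xC3\xA9\n";
close $tmp_fh or die "Cannot close $tmp_name: $!";

# Reading through the layer decodes the bytes into characters.
open my $in, '<:encoding(UTF-8)', $tmp_name
    or die "Cannot open $tmp_name: $!";
my $line = <$in>;
close $in;
chomp $line;

my $length_in_chars = length $line;    # counts characters, not bytes

# Same idea as the second substitution in the question, written with escapes.
(my $plain = $line) =~ s/[\x{E9}\x{E8}\x{EA}\x{EB}]/e/g;

print "$length_in_chars characters, converted to $plain\n";
```

This should report 3 characters and the converted word "ete". When the data arrives in a scalar instead of a filehandle, as with a request body, decode('UTF-8', $bytes) from the Encode module performs the same decoding step.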

