unicode - Why does modern Perl avoid UTF-8 by default?


posted Oct 17, 2021 in Technique by 深蓝

I wonder why most modern solutions built using Perl don't enable UTF-8 by default.

I understand there are many legacy problems for core Perl scripts, where enabling it may break things. But, from my point of view, in the 21st century, big new projects (or projects with big ambitions) should make their software UTF-8 proof from scratch. Still, I don't see it happening. For example, Moose enables strict and warnings, but not Unicode. Modern::Perl reduces boilerplate too, but has no UTF-8 handling.

Why? Are there some reasons to avoid UTF-8 in modern Perl projects in the year 2011?

My comment to @tchrist got too long, so I'm adding it here.

It seems that I did not make myself clear. Let me try to add some things.

tchrist and I see the situation pretty similarly, but our conclusions are at opposite ends. I agree that the situation with Unicode is complicated, but this is why we (Perl users and coders) need some layer (or pragma) that makes UTF-8 handling as easy as it must be nowadays.

tchrist pointed out many aspects to cover, and I will read and think about them for days or even weeks. Still, this is not my point. tchrist tries to prove that there is no single way "to enable UTF-8". I do not have enough knowledge to argue with that, so I will stick to live examples.

I played around with Rakudo, and UTF-8 was just there when I needed it. I didn't have any problems; it just worked. Maybe there are some limitations somewhere deeper, but at the start, everything I tested worked as I expected.

Shouldn't that be a goal in modern Perl 5 too? Let me stress this: I'm not suggesting UTF-8 as the default character set for core Perl. I am suggesting the possibility to trigger it with a snap for those who develop new projects.

Another example, but with a more negative tone. Frameworks should make development easier. Some years ago, I tried web frameworks, but just threw them away because "enabling UTF-8" was so obscure. I could not find how and where to hook in Unicode support. It was so time-consuming that I found it easier to go the old way. Now I see that there was a bounty here to deal with the same problem in Mason 2: "How to make Mason2 UTF-8 clean?". So it is a pretty new framework, but using it with UTF-8 needs deep knowledge of its internals. It is like a big red sign: STOP, don't use me!

I really like Perl. But dealing with Unicode is painful. I still find myself running into walls. In a way, tchrist is right and answers my question: new projects don't attract UTF-8 because it is too complicated in Perl 5.


1 Reply

深蓝 · Answer 1 · 2021-10-16T21:14:02+0000

Simplest ℞: 7 Discrete Recommendations

Set your PERL_UNICODE environment variable to AS. This makes all Perl scripts decode @ARGV as UTF-8 strings, and sets the encoding of all three of stdin, stdout, and stderr to UTF-8. Both of these are global effects, not lexical ones.
At the top of your source file (program, module, library, doohickey), prominently assert that you are running perl version 5.12 or better via:
```
use v5.12;  # minimal for unicode string feature
use v5.14;  # optimal for unicode string feature
```
Enable warnings, since the previous declaration only enables strictures and features, not warnings. I also suggest promoting Unicode warnings into exceptions, so use both these lines, not just one of them. Note however that under v5.14, the utf8 warning class comprises three other subwarnings which can all be separately enabled: nonchar, surrogate, and non_unicode. These you may wish to exert greater control over.
```
use warnings;
use warnings qw( FATAL utf8 );
```
Declare that this source unit is encoded as UTF-8. Although once upon a time this pragma did other things, it now serves this one singular purpose alone and no other:
```
use utf8;
```
Declare that anything that opens a filehandle within this lexical scope, but not elsewhere, is to assume that the stream is encoded in UTF-8 unless you tell it otherwise. That way you do not affect other modules' or other programs' code.
```
use open qw( :encoding(UTF-8) :std );
```
Enable named characters via \N{CHARNAME}.
```
use charnames qw( :full :short );
```
If you have a DATA handle, you must explicitly set its encoding. If you want this to be UTF-8, then say:
```
binmode(DATA, ":encoding(UTF-8)");
```
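
To make the effect of these declarations concrete, here is a minimal sketch that contrasts an undecoded byte string with a decoded character string. It assumes only the core Encode module; the sample text and the variable names are illustrative and do not come from the answer.

```perl
use v5.14;
use utf8;                          # this source file is UTF-8
use warnings;
use warnings qw( FATAL utf8 );
use Encode qw( encode decode );

# Illustrative sample: a character string with three non-ASCII letters.
my $chars = "crème brûlée";

# What a file or socket would hand you if nothing decoded it.
my $bytes = encode("UTF-8", $chars);

my $char_count = length $chars;    # counts characters
my $byte_count = length $bytes;    # counts bytes

# Decoding at the boundary restores the character string.
my $round_trip = decode("UTF-8", $bytes);

binmode(STDOUT, ":encoding(UTF-8)");
say "characters: $char_count, bytes: $byte_count";   # characters: 12, bytes: 15
say "round trip ok" if $round_trip eq $chars;
```

The point of the `use open` and `binmode` declarations is that this decode and encode step happens at the I/O layer, so the rest of the program only ever sees character strings.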

There is of course no end of other matters with which you may eventually find yourself concerned, but these will suffice to approximate the stated goal to “make everything just work with UTF-8”, albeit for a somewhat weakened sense of those terms.

One other pragma, although it is not Unicode related, is:

      use autodie;

It is strongly recommended.

Go Thou and Do Likewise

Boilerplate for Unicode-Aware Code

My own boilerplate these days tends to look like this:

```
use 5.014;

use utf8;
use strict;
use autodie;
use warnings;
use warnings    qw< FATAL  utf8     >;
use open        qw< :std  :utf8     >;
use charnames   qw< :full >;
use feature     qw< unicode_strings >;

use File::Basename      qw< basename >;
use Carp                qw< carp croak confess cluck >;
use Encode              qw< encode decode >;
use Unicode::Normalize  qw< NFD NFC >;

END { close STDOUT }

# decode the arguments only if any of them contains a non-ASCII character
if (grep /\P{ASCII}/ => @ARGV) {
   @ARGV = map { decode("UTF-8", $_) } @ARGV;
}

$0 = basename($0);  # shorter messages
$| = 1;

binmode(DATA, ":utf8");

# give a full stack dump on any untrapped exceptions
local $SIG{__DIE__} = sub {
    confess "Uncaught exception: @_" unless $^S;
};

# now promote run-time warnings into stack-dumped
#   exceptions *unless* we're in a try block, in
#   which case just cluck the stack dump instead
local $SIG{__WARN__} = sub {
    if ($^S) { cluck   "Trapped warning: @_" }
    else     { confess "Deadly warning: @_"  }
};

while (<>)  {
    chomp;
    $_ = NFD($_);
    ...
} continue {
    say NFC($_);
}

__END__
```
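
The NFD-on-input, NFC-on-output pattern in the loop above matters because the same visible text can be spelled with different code points. The following is a minimal sketch using the core Unicode::Normalize module; the variable names are illustrative.

```perl
use v5.14;
use warnings;
use Unicode::Normalize qw( NFD NFC );

# Two spellings of "é": one precomposed code point, versus
# "e" followed by U+0301 COMBINING ACUTE ACCENT.
my $precomposed = "\x{E9}";
my $decomposed  = "e\x{301}";

# Compared code point by code point, the two strings differ.
my $raw_equal  = $precomposed eq $decomposed ? 1 : 0;

# After normalizing both to the same form, they compare equal.
my $nfc_equal  = NFC($precomposed) eq NFC($decomposed) ? 1 : 0;

# NFD splits the precomposed character into base letter plus accent.
my $nfd_length = length NFD($precomposed);

say "raw equal: $raw_equal, NFC equal: $nfc_equal, NFD length: $nfd_length";
# raw equal: 0, NFC equal: 1, NFD length: 2
```

Normalizing everything to one form on the way in means that later comparisons and pattern matches do not depend on which spelling the input happened to use.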

No Magic Bullet

Saying that “Perl should [somehow!] enable Unicode by default” doesn’t even start to begin to think about getting around to saying enough to be even marginally useful in some sort of rare and isolated case. Unicode is much much more than just a larger character repertoire; it’s also how those characters all interact in many, many ways.

Even the simple-minded minimal measures that (some) people seem to think they want are guaranteed to miserably break millions of lines of code, code that has no chance to “upgrade” to your spiffy new Brave New World modernity.

It is way way way more complicated than people pretend. I’ve thought about this a huge, whole lot over the past few years. I would love to be shown that I am wrong. But I don’t think I am. Unicode is fundamentally more complex than the model that you would like to impose on it, and there is complexity here that you can never sweep under the carpet. If you try, you’ll break either your own code or somebody else’s. At some point, you simply have to break down and learn what Unicode is about. You cannot pretend it is something it is not.

Perl goes out of its way to make Unicode easy, far more than anything else I’ve ever used. If you think this is bad, try something else for a while. Then come back to Perl: either you will have returned to a better world, or else you will bring knowledge of the same with you so that we can make use of your new knowledge to make Perl better at these things.

Ideas for a Unicode-Aware Perl Laundry List

At a minimum, here are some things that would appear to be required for Perl to “enable Unicode by default”, as you put it:

All Perl source code should be in UTF-8 by default. You can get that with use utf8 or export PERL5OPT=-Mutf8.
The Perl DATA handle should be UTF-8. You will have to do this on a per-package basis, as in binmode(DATA, ":encoding(UTF-8)").
Program arguments to Perl scripts should be understood to be UTF-8 by default. export PERL_UNICODE=A, or perl -CA, or export PERL5OPT=-CA.
The standard input, output, and error streams should default to UTF-8. export PERL_UNICODE=S for all of them, or I, O, and/or E for just some of them. This is like perl -CS.
Any other handles opened by Perl should be considered UTF-8 unless declared otherwise; export PERL_UNICODE=D or with i and o for particular ones of these; export PERL5OPT=-CD would work. That makes -CSAD for all of them.
Cover both bases plus all the streams you open with export PERL5OPT=-Mopen=:utf8,:std. See uniquote.
You don’t want to miss UTF-8 encoding errors. Try export PERL5OPT=-Mwarnings=FATAL,utf8. And make sure your input streams are always binmoded to :encoding(UTF-8), not just to :utf8.
Code points between 128–255 should be understood by Perl to be the corresponding Unicode code points, not just unpropertied binary values. use feature "unicode_strings" or export PERL5OPT=-Mfeature=unicode_strings. That will make uc("\xDF") eq "SS" and "\xE9" =~ /\w/. A simple export PERL5OPT=-Mv5.12 or better will also get that.
Named Unicode characters are not by default enabled, so add export PERL5OPT=-Mcharnames=:full,:short,latin,greek or some such. See uninames and tcgrep.
You almost always need access to the functions from the standard Unicode::Normalize module for various types of decompositions. export PERL5OPT=-MUnicode::Normalize=NFD,NFKD,NFC,NFKC, and then always run incoming stuff through NFD and outbound stuff through NFC. There’s no I/O layer for these yet that I’m aware of, but see nfc, nfd, nfkd, and nfkc.
String comparisons in Perl using eq, ne, lc, cmp, sort, etc. are always wrong. So instead of @a = sort @b, you need @a = Unicode::Collate->new->sort(@b). Might as well add that to your export PERL5OPT=-MUnicode::Collate. You can cache the key for binary comparisons.
Perl built-ins like printf and write do the wrong thing with Unicode data. You need to use the Unicode::GCString module for the former, and both that and also the Unicode::LineBreak module as well for the latter. See uwc and unifmt.
If you want them to count as integers, then you are going to have to run your \d+ captures through the Unicode::UCD::num function because Perl’s built-in atoi(3) isn’t currently clever enough.
You are going to have filesystem issues. Some filesystems silently enforce a conversion to NFC; others silently enforce a conversion to NFD. And others do something else still. Some even ignore the matter altogether, which leads to even greater problems. So you have to do your own NFC/NFD handling to keep sane.
All your Perl code involving a-z or A-Z and such MUST BE CHANGED, including m//, s///, and tr///. It should stand out as a screaming red flag that your code is broken. But it is not clear how it must change. Getting the right properties, and understanding their casefolds, is harder than you might think. I use unichars and uniprops every single day.
Code that uses \p{Lu} is almost as wrong as code that uses [A-Za-z]. You need to use \p{Upper} instead, and know the reason why. Yes, \p{Lowercase} and \p{Lower} are different from \p{Ll} and \p{Lowercase_Letter}.
Code that uses [a-zA-Z] is even worse. And it can’t use \pL or \p{Letter}; it needs to use \p{Alphabetic}. Not all alphabetics are letters, you know!
If you are looking for Perl variables with /[\$\@\%]\w+/, then you have a pro
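
Several of the items above can be checked directly. The following is a minimal sketch using only core modules (Unicode::Collate and Unicode::UCD); the sample words, the sample characters, and all variable names are illustrative assumptions, not taken from the answer.

```perl
use v5.14;                     # v5.12 or better turns on unicode_strings
use warnings;
use Unicode::Collate;
use Unicode::UCD qw( num );

# 1. With unicode_strings, code points 128-255 get Unicode semantics.
my $sharp_s_upper   = uc("\xDF");               # U+00DF uppercases to "SS"
my $e_acute_is_word = "\xE9" =~ /\w/ ? 1 : 0;   # U+00E9 counts as a word character

# 2. Code point order versus collation order.
my @words        = ("zebra", "Zulu", "\x{E9}clair", "apple", "eclipse");
my @by_codepoint = sort @words;
my @by_collation = Unicode::Collate->new->sort(@words);

# 3. \d+ matches digits from any script; num() yields their numeric value.
my $arabic_indic = "\x{664}\x{662}";            # ARABIC-INDIC DIGIT FOUR, DIGIT TWO
my ($digits)     = $arabic_indic =~ /(\d+)/;
my $value        = num($digits);

# 4. U+2160 ROMAN NUMERAL ONE is category Nl: it is uppercase and
#    alphabetic by property, yet it is neither Lu nor a Letter.
my $roman            = "\x{2160}";
my $upper_not_lu     = ($roman =~ /\p{Upper}/      && $roman !~ /\p{Lu}/) ? 1 : 0;
my $alpha_not_letter = ($roman =~ /\p{Alphabetic}/ && $roman !~ /\pL/)    ? 1 : 0;

binmode(STDOUT, ":encoding(UTF-8)");
say "uc(U+00DF) = $sharp_s_upper, U+00E9 matches \\w: $e_acute_is_word";
say "code point order: @by_codepoint";   # Zulu apple eclipse zebra éclair
say "collation order:  @by_collation";   # apple éclair eclipse zebra Zulu
say "numeric value: $value";             # 42
say "Upper but not Lu: $upper_not_lu, Alphabetic but not Letter: $alpha_not_letter";
```

The default collator is used here for brevity; real code would normally construct one Unicode::Collate object, possibly with locale tailoring, and reuse it.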
