Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

What's the deal with Python 3.4, Unicode, different languages and Windows?

Happy examples:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

czech = u'Leoš Janáček'.encode("utf-8")
print(czech)

pl = u'Zdzisław Beksiński'.encode("utf-8")
print(pl)

jp = u'リング 山村 貞子'.encode("utf-8")
print(jp)

chinese = u'五行'.encode("utf-8")
print(chinese)

MIR = u'Машина для Инженерных Расчётов'.encode("utf-8")
print(MIR)

pt = u'Minha Língua Portuguesa: çáà'.encode("utf-8")
print(pt)

Unhappy output:

b'Leo\xc5\xa1 Jan\xc3\xa1\xc4\x8dek'
b'Zdzis\xc5\x82aw Beksi\xc5\x84ski'
b'\xe3\x83\xaa\xe3\x83\xb3\xe3\x82\xb0 \xe5\xb1\xb1\xe6\x9d\x91 \xe8\xb2\x9e\xe5\xad\x90'
b'\xe4\xba\x94\xe8\xa1\x8c'
b'\xd0\x9c\xd0\xb0\xd1\x88\xd0\xb8\xd0\xbd\xd0\xb0 \xd0\xb4\xd0\xbb\xd1\x8f \xd0\x98\xd0\xbd\xd0\xb6\xd0\xb5\xd0\xbd\xd0\xb5\xd1\x80\xd0\xbd\xd1\x8b\xd1\x85 \xd0\xa0\xd0\xb0\xd1\x81\xd1\x87\xd1\x91\xd1\x82\xd0\xbe\xd0\xb2'
b'Minha L\xc3\xadngua Portuguesa: \xc3\xa7\xc3\xa1\xc3\xa0'

And if I print them like this:

jp = u'リング 山村 貞子'
print(jp)

I get:

Traceback (most recent call last):
  File "x.py", line 5, in <module>
    print(jp)
  File "C:\Python34\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2: character maps to <undefined>

I've also tried the following, taken from another question (along with other alternatives that involve sys.stdout.encoding):

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from __future__ import print_function
import sys

def safeprint(s):
    try:
        print(s)
    except UnicodeEncodeError:
        if sys.version_info >= (3,):
            print(s.encode('utf8').decode(sys.stdout.encoding))
        else:
            print(s.encode('utf8'))

jp = u'リング 山村 貞子'
safeprint(jp)

And things get even more cryptic:

πa?πa│πé? σ??μ¥? Φ▓?σ?é

And the docs were not very helpful.

So, what's the deal with Python 3.4, Unicode, different languages and Windows? Almost all the examples I could find deal with Python 2.x.

Is there a general and cross-platform way of printing ANY Unicode character from any language in a decent and non-nasty way in Python 3.4?

EDIT:

I've tried typing at the terminal:

chcp 65001

to change the code page, as proposed elsewhere and in the comments, but it did not work (including the attempt with sys.stdout.encoding).


1 Reply


Update: Since Python 3.6, the code example that prints Unicode strings directly should just work now (even without py -mrun).


Python can print text in multiple languages in the Windows console, whatever chcp says:

T:\> py -m pip install win-unicode-console
T:\> py -m run your_script.py

where your_script.py prints Unicode directly e.g.:

#!/usr/bin/env python3
print('š áč')      # cz
print('ł ń')       # pl
print('リング')     # jp
print('五行')      # cn
print('ш я жх ё') # ru
print('í çáà')    # pt

All you need is to configure a font in your Windows console that can display the desired characters.
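
As background for why the script above prints str objects directly instead of the result of .encode(), here is a minimal sketch. It assumes a console code page of cp850, as in the question's traceback. The in-memory streams (strict, lenient) are stand-ins for a console and are purely illustrative.

```python
import io

text = 'リング'

# .encode() returns bytes, and print() shows the repr of a bytes object.
# That is where the \x escapes in the question's output come from.
encoded = text.encode('utf-8')
bytes_repr = repr(encoded)
print(bytes_repr)

# Stand-in for a stdout that uses the cp850 code page (an assumption).
strict = io.TextIOWrapper(io.BytesIO(), encoding='cp850')
try:
    strict.write(text)
    strict.flush()
    failed = False
except UnicodeEncodeError:
    # cp850 has no mapping for Japanese characters.
    failed = True
print(failed)

# With errors='backslashreplace', unencodable characters are escaped
# instead of raising, which at least keeps the script running.
lenient_buffer = io.BytesIO()
lenient = io.TextIOWrapper(lenient_buffer, encoding='cp850',
                           errors='backslashreplace')
lenient.write(text)
lenient.flush()
escaped = lenient_buffer.getvalue()
print(escaped)
```

The escaped form is not readable text, so it is only a fallback; the approaches in this answer are what let the actual characters reach the screen.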

You could also run your Python script via IDLE without installing non-stdlib modules:

T:\> py -m idlelib -r your_script.py

To write to a file/pipe, use PYTHONIOENCODING=utf-8 as @Mark Tolonen suggested:

T:\> set PYTHONIOENCODING=utf-8
T:\> py your_script.py >output-utf8.txt
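
If setting an environment variable is inconvenient, a script can also pick the encoding itself when it writes to a file. This is a minimal sketch of that alternative, not something the answer above proposes: write_lines is a hypothetical helper, and the temporary directory is only there to keep the example self-contained.

```python
import os
import tempfile


def write_lines(path, lines):
    """Write each string on its own line, always encoded as UTF-8."""
    # An explicit encoding means the result does not depend on the
    # console code page or on the locale's default encoding.
    with open(path, 'w', encoding='utf-8', newline='\n') as f:
        for line in lines:
            f.write(line + '\n')


samples = ['リング', '五行', 'ш я жх ё']
out_path = os.path.join(tempfile.mkdtemp(), 'output-utf8.txt')
write_lines(out_path, samples)

# Read the file back as raw bytes to check what was actually stored.
with open(out_path, 'rb') as f:
    raw = f.read()
print(len(raw))
```

Unlike PYTHONIOENCODING, this only affects the file the script opens itself; output redirected from print() with `>` still uses the encoding Python chose for stdout.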

Only the last solution supports non-BMP characters such as 😒 (U+1F612 UNAMUSED FACE). py -m run can write them, but the Windows console displays them as boxes even if the font supports the corresponding Unicode characters (though you can copy-paste the boxes into another program to get the characters).
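
To make "non-BMP" concrete, here is a small sketch of what sets U+1F612 apart from the characters in the earlier script. The variable names are illustrative. Whether the extra UTF-16 code unit is what trips up the console's rendering is an assumption, not something this answer establishes.

```python
import unicodedata

face = '\U0001F612'

# Code points above U+FFFF lie outside the Basic Multilingual Plane.
is_non_bmp = ord(face) > 0xFFFF

# In UTF-16 such a character needs two 16-bit code units (a surrogate
# pair), while BMP characters such as 'リ' need only one.
utf16_units = len(face.encode('utf-16-le')) // 2
bmp_units = len('リ'.encode('utf-16-le')) // 2

# In UTF-8 it takes four bytes.
utf8_bytes = face.encode('utf-8')

name = unicodedata.name(face)
print(hex(ord(face)), is_non_bmp, utf16_units, bmp_units, name)
```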

