Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
258 views
in Technique[技术] by (71.8m points)

python - How to control padding of Unicode string containing east Asia characters

I got three UTF-8 stings:

hello, world
hello, 世界
hello, 世rld

I only want the first 10 ascii-char-width so that the bracket in one column:

[hello, wor]
[hello, 世 ]
[hello, 世r]

In console:

width('世界')==width('worl')
width('世 ')==width('wor')  #a white space behind '世'

One chinese char is three bytes, but it only 2 ascii chars width when displayed in console:

>>> bytes("hello, 世界", encoding='utf-8')
b'hello, xe4xb8x96xe7x95x8c'

python's format() doesn't help when UTF-8 chars mixed in

>>> for s in ['[{0:<{1}.{1}}]'.format(s, 10) for s in ['hello, world', 'hello, 世界', 'hello, 世rld']]:
...    print(s)
...
[hello, wor]
[hello, 世界 ]
[hello, 世rl]

It's not pretty:

 -----------Songs-----------
|    1: 蝴蝶                  |
|    2: 心之城                 |
|    3: 支持你的爱人              |
|    4: 根生的种子               |
|    5: 鸽子歌(CUCURRUCUCU PALO|
|    6: 林地之间                |
|    7: 蓝光                  |
|    8: 在你眼里                |
|    9: 肖邦离别曲               |
|   10: 西行( 魔戒王者再临主题曲)(INTO |
| X 11: 深陷爱河                |
| X 12: 钟爱大地(THE MO RUN AIR |
| X 13: 时光流逝                |
| X 14: 卡农                  |
| X 15: 舒伯特小夜曲(SERENADE)    |
| X 16: 甜蜜的摇篮曲(Sweet Lullaby|
 ---------------------------

So, I wonder if there is a standard way to do the UTF-8 padding staff?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

When trying to line up ASCII text with Chinese in fixed-width font, there is a set of full width versions of the printable ASCII characters. Below I made a translation table of ASCII to full width version:

# coding: utf8

# full width versions (SPACE is non-contiguous with ! through ~)
SPACE = 'N{IDEOGRAPHIC SPACE}'
EXCLA = 'N{FULLWIDTH EXCLAMATION MARK}'
TILDE = 'N{FULLWIDTH TILDE}'

# strings of ASCII and full-width characters (same order)
west = ''.join(chr(i) for i in range(ord(' '),ord('~')))
east = SPACE + ''.join(chr(i) for i in range(ord(EXCLA),ord(TILDE)))

# build the translation table
full = str.maketrans(west,east)

data = '''
蝴蝶(A song)
心之城(Another song)
支持你的爱人(Yet another song)
根生的种子
鸽子歌(Cucurrucucu palo whatever)
林地之间
蓝光
在你眼里
肖邦离别曲
西行(魔戒王者再临主题曲)(Into something)
深陷爱河
钟爱大地
时光流逝
卡农
舒伯特小夜曲(SERENADE)
甜蜜的摇篮曲(Sweet Lullaby)
'''

# Replace the ASCII characters with full width, and create a song list.
data = data.translate(full).rstrip().split('
')

# translate each printable line.
print(' ----------Songs-----------'.translate(full))
for i,song in enumerate(data):
    line = '|{:4}: {:20.20}|'.format(i+1,song)
    print(line.translate(full))
print(' --------------------------'.translate(full))

Output

 ----------Songs-----------
|   1: 蝴蝶(A song)          |
|   2: 心之城(Another song)   |
|   3: 支持你的爱人(Yet another s|
|   4: 根生的种子               |
|   5: 鸽子歌(Cucurrucucu palo|
|   6: 林地之间                |
|   7: 蓝光                  |
|   8: 在你眼里                |
|   9: 肖邦离别曲               |
|  10: 西行(魔戒王者再临主题曲)(Into s|
|  11: 深陷爱河                |
|  12: 钟爱大地                |
|  13: 时光流逝                |
|  14: 卡农                  |
|  15: 舒伯特小夜曲(SERENADE)    |
|  16: 甜蜜的摇篮曲(Sweet Lullaby|
 --------------------------

It's not overly pretty, but it lines up.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...