Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Find the length of a sentence with English words and Chinese characters

The sentence may include non-English characters, e.g. Chinese:

你好,hello world

The expected length is 5 (2 Chinese characters, 2 English words, and 1 comma).


1 Reply

You can rely on the fact that most Chinese characters lie in the Unicode range 0x4e00 - 0x9fcc.

import re

s = '你好 hello, world'

# First count the 'normal' (ASCII) words and the punctuation marks.
# '[\x21-\x2f]' covers most ASCII punctuation; change it to ',' if you only need to match a comma.
# re.ASCII keeps \w from matching Chinese characters in Python 3, so they are not counted twice.
count = len(re.findall(r'\w+|[\x21-\x2f]', s, flags=re.ASCII))

# Then count every Chinese character individually.
for ch in s:
    # see https://stackoverflow.com/a/11415841/1248554 for additional ranges if needed
    if 0x4e00 <= ord(ch) <= 0x9fcc:
        count += 1

print(count)
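
The same idea can probably be folded into a single regular expression in Python 3, so that each match is one unit of length. This is only a sketch: `TOKEN_RE` and `mixed_length` are hypothetical names that do not appear in the answer above, and the character ranges are the same assumptions the answer makes (0x4e00 - 0x9fcc for Chinese, 0x21 - 0x2f for ASCII punctuation).

```python
import re

# Hypothetical helper, not part of the original answer.
# One alternation, tried left to right at each position:
#   [\u4e00-\u9fcc]  a single Chinese character (range taken from the answer)
#   [A-Za-z0-9_]+    a run of ASCII word characters, counted as one word
#   [\x21-\x2f]      one ASCII punctuation mark such as ',' or '!'
TOKEN_RE = re.compile(r'[\u4e00-\u9fcc]|[A-Za-z0-9_]+|[\x21-\x2f]')


def mixed_length(sentence):
    """Count Chinese characters, ASCII words and ASCII punctuation marks."""
    return len(TOKEN_RE.findall(sentence))


print(mixed_length('你好,hello world'))  # prints 5
```

Full-width punctuation such as ',' (U+FF0C) falls outside these ranges and is not counted; if the input uses it, the punctuation class would need to be extended.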
