Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Generate list from string with proper encoding (UTF-8)

I'm having a hard time generating a list from a string with proper UTF-8 encoding. I'm using Python (I'm just learning to program, so bear with my silly question and terrible coding).

The source file is a tweet feed (JSON format). After parsing it successfully and extracting the tweet message from everything else, I only get the text with the right encoding when I print it (as a string). If I try to put it back into a list, it goes back to the escaped \uXXXX form.

My code is:

import json

with open("file_name.txt") as tweets_file:
    tweets_list = [] 
    for a in tweets_file:
        b = json.loads(a)
        tweets_list.append(b)

    tweet = []
    for i in tweets_list:
        key = "text"
        if key in i:
            t = i["text"]
            tweet.append(t)

    for k in tweet:
        print k.encode("utf-8")

As an alternative, I tried to have the encoding at the beginning (when fetching the file):

import json
import codecs

tweets_file = codecs.open("file_name.txt", "r", "utf-8")
tweets_list = [] 
for a in tweets_file:
    b = json.loads(a)
    tweets_list.append(b)
tweets_file.close()

tweet = []
for i in tweets_list:
    key = "text"
    if key in i:
        t = i["text"]
        tweet.append(t)

for k in tweet:
    print k

My question is: how can I put the resulting k strings into a list, with each k string as an item?


1 Reply


You are getting confused by the Python string representation.

When you print a Python list (or any other standard Python container), the contents are shown in a special representation to make debugging easier; each value is shown as the result of calling the repr() function on that value. For string values, that means you see the string's literal representation, which is not the same thing as what you see when the string is printed directly.

Unicode and byte strings, when shown like that, are presented as string literals: quoted values that you can copy and paste straight back into Python code without having to worry about encoding. Anything that is not a printable ASCII character is shown in escaped form. Unicode code points beyond the Latin-1 range are shown as '\u....' escape sequences. Characters in the Latin-1 range use the '\x..' escape sequence. Many control characters are shown in their one-letter escape form, such as \n and \t.

The Python interactive prompt does the same thing; when you echo a value on the prompt without using print, the value is 'represented', shown in its repr() form:

>>> print u'\u2036Hello World!\u2033'
‶Hello World!″
>>> u'\u2036Hello World!\u2033'
u'\u2036Hello World!\u2033'
>>> [u'\u2036Hello World!\u2033', u'Another\nstring']
[u'\u2036Hello World!\u2033', u'Another\nstring']
>>> print _[1]
Another
string

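This is entirely normal behaviour. In other words, your code works and nothing is broken: the list already holds the correct Unicode strings, and only their displayed representation differs.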
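The same distinction can be checked without looking at the console at all. The following is only a minimal sketch and assumes Python 3, whereas the session above is Python 2; in Python 3, repr() leaves printable non-ASCII characters alone, so ascii() is used here to get escaping similar to what Python 2 displays. The variable names are illustrative.

```python
# Python 3 sketch (assumption: the thread itself uses Python 2 syntax).
items = ['\u2036Hello World!\u2033', 'Another\nstring']

# Converting a list to text calls repr() on each element,
# so the newline shows up as the two characters backslash + n.
as_list_text = str(items)

# ascii() is repr() with non-ASCII characters escaped,
# which is close to what Python 2 shows for unicode strings.
escaped = ascii(items[0])

# The stored strings are untouched: the second item really contains
# a newline, and the first really contains 14 characters.
first_line = items[1].splitlines()[0]
length = len(items[0])

print(escaped)          # escaped, literal-style form
print(repr(items[1]))   # representation: newline shown as \n
print(items[1])         # printed directly: newline takes effect
```

The escapes are a property of how the value is displayed, not of the value stored in the list.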

To come back to your code: if you want to extract just the 'text' key from the tweet JSON structures, filter while reading the file rather than looping twice:

import json

with open("file_name.txt") as tweets_file:
    tweets = [] 
    for line in tweets_file:
        data = json.loads(line)
        if 'text' in data:
            tweets.append(data['text'])
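As a rough, self-contained variant of the loop above, the sketch below wraps the filtering in a function and runs it on a few made-up JSON lines instead of a file. It assumes Python 3; extract_texts and sample_lines are hypothetical names and data used only for illustration. It also shows that encoding to UTF-8 is a separate step, needed only when bytes are required (for example, when writing to a binary file).

```python
import json

def extract_texts(lines):
    """Return the 'text' value of every JSON line that has one."""
    texts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        data = json.loads(line)
        if 'text' in data:
            texts.append(data['text'])
    return texts

# Hypothetical sample input standing in for the lines of file_name.txt.
# The \\u escapes are JSON escapes; json.loads turns them into real characters.
sample_lines = [
    '{"text": "caf\\u00e9 \\u2036quoted\\u2033", "id": 1}',
    '{"delete": {"status": {"id": 2}}}',
    '{"text": "plain ascii", "id": 3}',
]

tweets = extract_texts(sample_lines)

# Each item is already a proper Unicode string; encode only if bytes are needed.
encoded = [t.encode('utf-8') for t in tweets]

print(len(tweets))       # number of records that had a 'text' key
print(ascii(tweets[0]))  # escaped display form of the first string
```

With a real file, the same function could be called as extract_texts(tweets_file) inside the with block, since iterating over a file object yields its lines.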
