I have a file containing a string, and I am trying to read the file, split the
words on spaces, and save them into a list. Below is my code:
with open('result.txt', 'r') as f:
    data = f.read()
print 'What type is my data:'
print type(data)
for i in data:
    print "what is i:"
    print i
    print "what type is i"
    print type(i)
    print i.encode('utf-8')
Below are my error messages:
Someone please help!
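For reference, here is a minimal sketch of reading the file as Unicode text and splitting it on whitespace. It assumes result.txt is UTF-8 encoded, which the question does not state, and the helper name read_words is hypothetical. In Python 2, open(...).read() returns a byte string, so iterating over it yields single bytes, and calling .encode('utf-8') on non-ASCII bytes first triggers an implicit ASCII decode. That is one likely source of errors here, though the original error messages are not shown. The sketch is intended to run under both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import io

def read_words(path, encoding='utf-8'):
    """Read a text file as Unicode and split it on whitespace.

    The encoding is an assumption; change it if the file is not UTF-8.
    """
    # io.open decodes while reading, so `data` is Unicode text, not bytes.
    with io.open(path, 'r', encoding=encoding) as f:
        data = f.read()
    # split() with no argument splits on any run of whitespace.
    return data.split()
```

Calling read_words('result.txt') would return a list of Unicode strings. Note that Chinese text usually has no spaces between words, so splitting on whitespace alone will not separate Chinese words.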
Update:
I will describe what I am trying to do in more detail to give people more context. My goal is to:
1. Take a Chinese text and break it into sentences by detecting basic ending punctuation.
2. Take each sentence and use the tool jieba to group characters into meaningful words. For instance, the two Chinese characters 學 and 生 are grouped together to produce the token '學生' (meaning "student").
3. Save all the tokens from each sentence into a list, so the final list contains one inner list per sentence in the paragraph.
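The three steps above can be sketched compactly as follows. This is only an illustration under stated assumptions: the function names split_sentences and tokenize_sentences are hypothetical, a regular expression stands in for the character-by-character loop, and the default tokenizer is a placeholder that splits a sentence into single characters so the sketch runs without jieba.

```python
# -*- coding: utf-8 -*-
import re

# Ending punctuation, copied from the question's cutlist.
ENDERS = u'。!?'

def split_sentences(text, enders=ENDERS):
    """Step 1: return sentences, each keeping its ending mark.

    Trailing text with no ending mark is dropped, as in the original loop.
    """
    marks = re.escape(enders)
    # Zero or more non-ending characters followed by one ending mark.
    pattern = u'[^' + marks + u']*[' + marks + u']'
    return re.findall(pattern, text)

def tokenize_sentences(text, tokenizer=list):
    """Steps 2 and 3: build one list of tokens per sentence.

    `tokenizer` is any callable mapping a sentence to an iterable of tokens.
    The default, `list`, is only a placeholder that splits into characters.
    """
    return [list(tokenizer(s)) for s in split_sentences(text)]
```

With jieba installed, passing tokenizer=jieba.cut should apply the same segmentation call that the code below uses.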
# coding: utf-8
#encoding=utf-8
import jieba
cutlist = "。!?".decode('utf-8')
test = "【明報專訊】「吉野家」and Peter from US因被誤傳採用日本福島米而要報警澄清,並自爆用內地黑龍江米,日本料理食材來源惹關注。本報以顧客身分向6間日式食店查詢白米產地,其中出售逾200元日式豬扒飯套餐的「勝博殿日式炸豬排」也選用中國大連米,誤以為該店用日本米的食客稱「要諗吓會否再幫襯」,亦有食客稱「好食就得」;壽司店「板長」店員稱採用香港米,公關其後澄清來源地是澳洲,即與平價壽司店「爭鮮」一樣。有飲食界人士稱,雖然日本米較貴、品質較佳,但內地米品質亦有保證。"
# FindToken checks whether the character is an ending punctuation mark.
def FindToken(cutlist, char):
    if char in cutlist:
        return True
    else:
        return False

def cut(cutlist, test):
    '''
    Check each item in a list of characters. If the item is not ending
    punctuation, save it to a temporary list called line. When ending
    punctuation is encountered, save the complete sentence collected in
    line into the list l. Each sentence is then tokenized, and its tokens
    are saved as one inner list of final.
    '''
    l = []
    line = []
    final = []
    for i in test:
        if i == ' ':
            line.append(i)
        elif FindToken(cutlist, i):
            line.append(i)
            l.append(''.join(line))
            line = []
        else:
            line.append(i)
    temp = []
    # This part iterates over each complete sentence and groups characters according to context.
    for i in l:
        # jieba.cut breaks a sentence into characters and groups them into phrases.
        process = list(jieba.cut(i, cut_all=False))
        # Put all the tokenized phrases of a sentence into a list. Each sentence
        # belongs to one list.
        for j in process:
            temp.append(j.encode('utf-8'))
            #temp.append(j)
        print temp
        final.append(temp)
        temp = []
    return final
cut(list(cutlist),list(test.decode('utf-8')))
Here is my problem: when I print my final list, I get output like the following:
[u'\u3010', u'\u660e\u5831', u'\u5c08\u8a0a', u'\u3011', u'\u300c', u'\u5409\u91ce\u5bb6', u'\u300d', u'and', u' ', u'Peter', u' ', u'from', u' ', u'US', u'\u56e0', u'\u88ab', u'\u8aa4\u50b3', u'\u63a1\u7528', u'\u65e5\u672c', u'\u798f\u5cf6', u'\u7c73', u'\u800c', u'\u8981', u'\u5831\u8b66', u'\u6f84\u6e05', u'\uff0c', u'\u4e26', u'\u81ea\u7206', u'\u7528\u5167', u'\u5730', u'\u9ed1\u9f8d', u'\u6c5f\u7c73', u'\uff0c', u'\u65e5\u672c\u6599\u7406', u'\u98df\u6750', u'\u4f86\u6e90', u'\u60f9', u'\u95dc\u6ce8', u'\u3002']
How can I turn a list of unicode strings into normal strings?
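One point that may help frame the question: in Python 2, printing a list shows the repr() of each element, and the repr of a unicode string escapes non-ASCII characters. The escapes are therefore likely only a display effect, and the strings themselves hold the intended characters. A minimal sketch, assuming the goal is readable display or UTF-8 byte strings; the helper names tokens_to_text and tokens_to_utf8 are hypothetical.

```python
# -*- coding: utf-8 -*-

# First few tokens from the output above, written with escapes.
tokens = [u'\u3010', u'\u660e\u5831', u'\u5c08\u8a0a', u'\u3011']

def tokens_to_text(tokens, sep=u' '):
    """Join Unicode tokens into one Unicode string for display."""
    return sep.join(tokens)

def tokens_to_utf8(tokens):
    """Encode each token to UTF-8 bytes (a plain `str` in Python 2)."""
    return [t.encode('utf-8') for t in tokens]
```

Printing tokens_to_text(tokens), or printing the tokens one at a time, should show the Chinese characters on a terminal that can encode them, whereas printing the list itself shows the escaped form in Python 2.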