Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
441 views
in Technique[技术] by (71.8m points)

python - 'utf-8' codec can't decode byte 0xa0 in position 4276: invalid start byte

I try to read and print the following file: txt.tsv (https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/2017q3_notes.zip)

According to the SEC the data set is provided in a single encoding, as follows:

Tab Delimited Value (.txt): utf-8, tab-delimited, - terminated lines, with the first line containing the field names in lowercase.

My current code:

import csv

with open('txt.tsv') as tsvfile:
    reader = csv.DictReader(tsvfile, dialect='excel-tab')
    for row in reader:
        print(row)

All attempts ended with the following error message:

'utf-8' codec can't decode byte 0xa0 in position 4276: invalid start byte

I am a bit lost. Can anyone help me? Many thanks in advance.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Encoding in the file is 'windows-1252'. Use:

open('txt.tsv', encoding='windows-1252')

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...