Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
845 views
in Technique[技术] by (71.8m points)

python - How to read bytes object from csv?

I have used tweepy to store the text of tweets in a csv file using Python csv.writer(), but I had to encode the text in utf-8 before storing, otherwise tweepy throws a weird error.

Now, the text data is stored like this:

"b'Lorem Ipsumxc2xa0Assignment '"

I tried to decode this using this code (there is more data in other columns, text is in 3rd column):

with open('data.csv','rt',encoding='utf-8') as f:
    reader = csv.reader(f,delimiter=',')
    for row in reader:
        print(row[3])

But, it doesn't decode the text. I cannot use .decode('utf-8') as the csv reader reads data as strings i.e. type(row[3]) is 'str' and I can't seem to convert it into bytes, the data gets encoded once more!

How can I decode the text data?

Edit: Here's a sample line from the csv file:

67783591545656656999,3415844,1450443669.0,b'Virginia School District Closes After Backlash Over Arabic Assignment: The Augusta County school district inxe2x80xa6  | @abcde',52,18

Note: If the solution is in the encoding process, please note that I cannot afford to download the entire data again.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

If your input file really contains strings with Python syntax b prefixes on them, one way to workaround it (even though it's not really a valid format for csv data to contain) would be to use Python's ast.literal_eval function @Ryan mentioned although I would use it in a slightly different way, as shown below.

This will provide a safe way to parse strings in the file which are prefixed with a b indicating they are byte-strings. The rest will be passed through unchanged.

import ast
import csv


def _parse_bytes(field):
    """ Convert string represented in Python byte-string literal b'' syntax into
        a decoded character string - otherwise return it unchanged.
    """
    result = field
    try:
        result = ast.literal_eval(field)
    finally:
        return result.decode() if isinstance(result, bytes) else field


def my_csv_reader(filename, /, **kwargs):
    with open(filename, 'rt', newline='') as file:
        for row in csv.reader(file, **kwargs):
            yield [_parse_bytes(field) for field in row]


reader = my_csv_reader('bytes_data.csv', delimiter=',')
for row in reader:
    print(row)

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...