Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

csv - Loading UTF-8 file in Python 3 using numpy.genfromtxt

I have a CSV file that I downloaded from the WHO site (http://apps.who.int/gho/data/view.main.52160 , Downloads, "multipurpose table in CSV format"). I am trying to load the file into a numpy array. Here's my code:

import numpy
#U75 - unicode string of max. length 75
world_alcohol = numpy.genfromtxt("xmart.csv", dtype="U75", skip_header=2, delimiter=",")
print(world_alcohol)

And I get

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128).

I guess that numpy has a problem reading the string "Côte d'Ivoire". The file is properly encoded UTF-8 (according to my text editor). I am using Python 3.4.3 and numpy 1.9.2.

What am I doing wrong? How can I read the file into numpy?

1 Reply

Note that this answer dates from 2015. Since then (as of NumPy 1.14), genfromtxt has gained an encoding parameter.
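With that parameter, the call from the question can ask genfromtxt to decode the file itself. The following is a minimal sketch, assuming NumPy 1.14 or later; the file contents are an invented stand-in for illustration, not the actual WHO download:

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for xmart.csv: two header lines, then data rows.
sample = (
    "Some title line\n"
    "Country,Year,Value\n"
    "Côte d'Ivoire,2010,6.0\n"
    "France,2010,12.2\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "xmart.csv")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(sample)

    # encoding="utf-8" tells genfromtxt how to decode the file,
    # so the non-ASCII bytes no longer go through the ascii codec.
    world_alcohol = np.genfromtxt(path, dtype="U75", skip_header=2,
                                  delimiter=",", encoding="utf-8")

print(world_alcohol)
```

The rest of this answer describes workarounds for older NumPy versions that lack the parameter.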


In Python3 I can do:

In [224]: txt = "Côte d'Ivoire"
In [225]: x = np.zeros((2,),dtype='U20')
In [226]: x[0] = txt
In [227]: x
Out[227]:
array(["Côte d'Ivoire", ''], dtype='<U20')

This means I could probably open a UTF-8 file in regular text mode (not bytes mode), call readlines, and assign the lines to elements of an array like x.
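That idea can be sketched as follows. The file name and its two-line content are hypothetical, chosen only to make the example self-contained:

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "countries.txt")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("Côte d'Ivoire\nFrance\n")

    # Text mode with an explicit encoding yields str lines, not bytes.
    with open(path, encoding="utf-8") as fh:
        lines = [line.rstrip("\n") for line in fh.readlines()]

# Assign each decoded line to an element of a fixed-width unicode array.
x = np.zeros((len(lines),), dtype="U20")
for i, line in enumerate(lines):
    x[i] = line

print(x)
```

For a real CSV each line would still need to be split on the delimiter, which is the part genfromtxt normally handles.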

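But genfromtxt (in this NumPy version) insists on operating with byte strings and decodes them as ASCII, which is a 7-bit encoding and cannot represent the multi-byte sequences that UTF-8 uses for non-ASCII characters. So I need to apply decode at some point to get a U array.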
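The difference between the two codecs can be seen without NumPy at all. A small sketch using only the string from the question (the variable names are illustrative):

```python
txt = "Côte d'Ivoire"
raw = txt.encode("utf-8")  # the ô becomes two bytes, 0xc3 0xb4

try:
    raw.decode("ascii")    # ASCII only covers byte values 0..127
    ascii_ok = True
except UnicodeDecodeError as err:
    ascii_ok = False
    print(err)

decoded = raw.decode("utf-8")  # the matching codec round-trips cleanly
print(decoded)
```

The 0xc3 byte reported in the question's traceback is the first byte of such a two-byte sequence.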

I can load it into an 'S' (byte string) array with genfromtxt:

In [258]: txt="Côte d'Ivoire"
In [259]: a=np.genfromtxt([txt.encode()],delimiter=',',dtype='S20')
In [260]: a
Out[260]:
array(b"C\xc3\xb4te d'Ivoire", dtype='|S20')

and apply decode to individual elements:

In [261]: print(a.item().decode())
Côte d'Ivoire

Or use np.char.decode to apply it to each element of an array:

In [263]: np.char.decode(a)
Out[263]:
array("Côte d'Ivoire", dtype='<U13')
In [264]: print(_)
Côte d'Ivoire
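The same call works on arrays with more than one element. In this sketch the 'S' array is built by hand rather than loaded with genfromtxt, and the second entry is added purely for illustration:

```python
import numpy as np

# A byte-string array like the one genfromtxt produces with dtype='S20'.
a = np.array(["Côte d'Ivoire".encode("utf-8"), b"France"], dtype="S20")

# Decode every element in one call; the result is a unicode ('U') array.
u = np.char.decode(a, "utf-8")

print(u)
print(u.dtype)
```

Passing the encoding explicitly avoids depending on the interpreter's default.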

genfromtxt lets me specify converters:

In [297]: np.genfromtxt([txt.encode()],delimiter=',',dtype='U20',
    converters={0:lambda x: x.decode()})
Out[297]:
array("Côte d'Ivoire", dtype='<U20')

If the CSV has a mix of strings and numbers, the converters approach will be easier to use than np.char.decode. Just specify a converter for each string column.
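On NumPy 1.14 or later, one alternative to per-column converters for mixed data is a structured dtype combined with the encoding parameter. This is a sketch of that variant, not the converters technique itself; the field names and sample rows are assumptions made up for the example:

```python
import numpy as np

# Hypothetical rows mixing a text column with numeric columns.
lines = ["Côte d'Ivoire,2010,6.0", "France,2010,12.2"]

# One field per column: unicode text, integer year, float value.
dt = [("country", "U20"), ("year", "i4"), ("value", "f8")]

data = np.genfromtxt(lines, delimiter=",", dtype=dt, encoding="utf-8")

print(data["country"])
print(data["value"].mean())
```

Each column is then available by field name with its own type, so the string column needs no separate decoding step.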

(See my earlier edits for Python 2 attempts.)

