csv - Loading UTF-8 file in Python 3 using numpy.genfromtxt

posted Oct 17, 2021 in Technique by 深蓝 (71.8m points)

I have a CSV file that I downloaded from the WHO site (http://apps.who.int/gho/data/view.main.52160, Downloads, "multipurpose table in CSV format"). I am trying to load the file into a numpy array. Here's my code:

import numpy
#U75 - unicode string of max. length 75
world_alcohol = numpy.genfromtxt("xmart.csv", dtype="U75", skip_header=2, delimiter=",")
print(world_alcohol)

And I get

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128).

I guess that numpy has a problem reading the string "Côte d'Ivoire". The file is properly encoded UTF-8 (according to my text editor). I am using Python 3.4.3 and numpy 1.9.2.

What am I doing wrong? How can I read the file into numpy?


1 Reply

深蓝 · Answer 1 · 2021-10-16T23:50:06+0000

Note that this answer originally dates from 2015. Since then, genfromtxt has gained an encoding parameter.
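As a rough sketch of what that looks like today, assuming a NumPy version recent enough to accept the encoding parameter (1.14 or later, as far as I know): the file name sample.csv and its contents below are made up for illustration and merely stand in for xmart.csv.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for the WHO file: one header line, two data rows.
sample = "Country,Year\nCôte d'Ivoire,2010\nFrance,2010\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # encoding="utf-8" tells genfromtxt how to decode the file,
    # so a unicode dtype can be requested directly.
    world_alcohol = np.genfromtxt(
        path, dtype="U75", delimiter=",", skip_header=1, encoding="utf-8"
    )

print(world_alcohol)
```

With the real file, the same call would presumably keep skip_header=2 as in the question.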

In Python3 I can do:

In [224]: txt = "Côte d'Ivoire"
In [225]: x = np.zeros((2,),dtype='U20')
In [226]: x[0] = txt
In [227]: x
Out[227]: 
array(["Côte d'Ivoire", ''],   dtype='<U20')

Which means I probably could open a 'UTF-8' file (regular, not byte mode), and readlines, and assign them to elements of an array like x.
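A minimal sketch of that idea, bypassing genfromtxt entirely. The io.StringIO object is only a stand-in for open("xmart.csv", encoding="utf-8"), and the two rows are invented sample data; csv.reader is my choice for splitting the fields, not something the question used.

```python
import csv
import io

import numpy as np

# Stand-in for: open("xmart.csv", encoding="utf-8", newline="")
f = io.StringIO("Côte d'Ivoire,2010\nFrance,2010\n")

# csv.reader yields each line as a list of already-decoded str fields.
rows = list(csv.reader(f))

# All rows have the same number of fields, so this becomes a 2-D 'U' array.
arr = np.array(rows, dtype="U75")
print(arr)
```

This only works as-is when every row has the same number of columns and everything can stay a string.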

But genfromtxt insists on operating with byte strings (ASCII), which can't represent the larger UTF-8 set (7 bits vs. 8). So I need to apply decode at some point to get a U array.
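To see where the 0xc3 in the error message comes from, here is a small plain-Python illustration (no numpy involved); the variable names are just for this sketch.

```python
text = "Côte d'Ivoire"

# UTF-8 encodes 'ô' as the two bytes 0xc3 0xb4, both above the 7-bit ASCII range.
raw = text.encode("utf-8")

try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    # Same kind of failure as in the question: 0xc3 is not valid ASCII.
    ascii_ok = False

# Decoding with the right codec round-trips cleanly.
decoded = raw.decode("utf-8")
print(raw, ascii_ok, decoded)
```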

I can load it into a 'S' array with genfromtxt:

In [258]: txt="Côte d'Ivoire"
In [259]: a=np.genfromtxt([txt.encode()],delimiter=',',dtype='S20')
In [260]: a
Out[260]: 
array(b"C\xc3\xb4te d'Ivoire",  dtype='|S20')

and apply decode to individual elements:

In [261]: print(a.item().decode())
Côte d'Ivoire

Or use np.char.decode to apply it to each element of an array:

In [263]: np.char.decode(a)
Out[263]: 
array("Côte d'Ivoire", dtype='<U13')
In [264]: print(_)
Côte d'Ivoire

genfromtxt lets me specify converters:

In [297]: np.genfromtxt([txt.encode()],delimiter=',',dtype='U20',
    converters={0:lambda x: x.decode()})
Out[297]: 
array("Côte d'Ivoire", dtype='<U20')

If the CSV has a mix of strings and numbers, the converters approach will be easier to use than np.char.decode. Just specify a converter for each string column.
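For the mixed strings-and-numbers case on a current NumPy, a structured dtype combined with the encoding parameter can stand in for the per-column converters. This is a different technique from the converters call above, sketched under the assumption that encoding is available; the field names and the data rows are invented for illustration.

```python
import io

import numpy as np

# Invented sample rows: a string column followed by an int and a float column.
sample = io.StringIO(
    "Côte d'Ivoire,2010,6.0\n"
    "France,2010,12.2\n"
)

# Hypothetical field names; one (name, type) pair per CSV column.
mixed_dtype = [("country", "U75"), ("year", "i4"), ("litres", "f8")]

data = np.genfromtxt(sample, delimiter=",", dtype=mixed_dtype, encoding="utf-8")

print(data["country"])
print(data["year"], data["litres"])
```

The result is a 1-D structured array, so columns are accessed by field name rather than by index.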

(See my earlier edits for Python2 tries).
