Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
186 views
in Technique[技术] by (71.8m points)

python - Reading unicode elements into numpy array

Consider a text file called "new.txt" containing the following elements:

μm
?r
?λ

In Python 2.7, I can read the file by typing:

>>> import codecs
>>> f = codecs.open('new.txt', encoding='utf-8')
>>> lines = [line.strip() for line in f2.readlines()]
>>> lines
[u'u03bcm', u'u2202r', u'u2206u03bb']
>>> print lines[0]
μm

So far so good. I can easily convert this list to a numpy array via:

>>> import numpy as np
>>> arr = np.array(lines)
>>> arr
array([u'u03bcm', u'u2202r', u'u2206u03bb'], 
      dtype='<U2')

The issue is, I can't read this file directly via numpy's loadtxt function:

>>> np.loadtxt('new.txt', dtype=np.unicode_)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/site-packages/numpy/lib/npyio.py", line 805, in loadtxt
    X = np.array(X, dtype)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xce in position 0: ordinal not in range(128)

What is the correct way to read this file into numpy directly?

Thanks.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

In memory, unicode strings are represented as UCS-2 or UCS-4, depending on how your Python interpreter was compiled. Your file is encoded in UTF-8, so you need to recode it before you can map it to the NumPy array. loadtxt() can't do the recoding for you -- after all NumPy is mainly targeted at numerical arrays.

Assuming every line has the same number of characters, you could also use the more efficient variant

s = codecs.open("new.txt", encoding="utf-8").read()
arr = numpy.frombuffer(s, dtype="<U3")

This will include the newline characters in the strings. To not include them, use

arr = numpy.frombuffer(s.replace("
", ""), dtype="<U2")

Edit: If the lines of your file have different lengths and you would like to avoid the intermediate list, you can use

arr = numpy.fromiter(codecs.open("new.txt", encoding="utf-8"), dtype="<U2")

I'm not sure if this will internally create some temporary list, though.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

57.0k users

...