A1
is loaded as an array of bytestrings. Python 3 uses unicode strings by default, so it displays bytestrings with a b'' prefix. That's normal with print
; I'm a little surprised that it also does so during the file write.
In any case, this seems to do the trick:
A2=np.array([x.decode() for x in A1])
np.savetxt("Test.txt", A2, fmt='%s', delimiter=',')
A2
will have a dtype like dtype='<U100'
.
My test array is:
array([b'one.com', b'two.url', b'three.four'], dtype='|S10')
loaded from a simple text file:
one.com
two.url
three.four
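Putting the pieces together, a minimal round trip with that test array looks like this ('Test.txt' is just a placeholder name):

```python
import numpy as np

# the bytestring array as loaded (dtype '|S10')
A1 = np.array([b'one.com', b'two.url', b'three.four'], dtype='|S10')

# decode each element to a unicode string; dtype becomes '<U10'
A2 = np.array([x.decode() for x in A1])

np.savetxt('Test.txt', A2, fmt='%s', delimiter=',')

with open('Test.txt') as f:
    print(f.read())   # three lines, no b'' prefixes
```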
.decode
is a string method. [x.decode() for x in A1]
works for a simple 1d array of bytestrings. If A1
is 2d, the iteration has to be done over all elements, not just the rows. And if A1
is a structured array, it has to be applied to the strings within each element.
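For the 2d case, iterating over every element can be sketched like this (B is a made-up example array):

```python
import numpy as np

# hypothetical 2d bytestring array
B = np.array([[b'a.com', b'b.net'],
              [b'c.org', b'd.io']], dtype='|S10')

# decode every element, not just the rows, preserving the shape
U = np.array([[x.decode() for x in row] for row in B])
print(U.dtype)   # <U5
```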
Another possibility is to use a converter during load, so you get an array of (unicode) strings:
In [508]: A1=np.loadtxt('urls.txt', dtype='U',
converters={0:lambda x:x.decode()})
In [509]: A1
Out[509]:
array(['one.com', 'two.url', 'three.four'], dtype='<U10')
In [510]: np.savetxt('test0.txt',A1,fmt='%s')
In [511]: cat test0.txt
one.com
two.url
three.four
The lib that contains loadtxt
has a couple of converter functions, asbytes
, asbytes_nested
, and asstr
. So converters
could also be: converters={0:np.lib.npyio.asstr}
.
genfromtxt
handles this without converters
:
A1=np.genfromtxt('urls.txt', dtype='U')
# array(['one.com', 'two.url', 'three.four'], dtype='<U10')
To understand why savetxt
saves unicode strings as we want, but prepends the b
for bytestrings, we have to look at its code.
np.savetxt
(running on py3) is essentially:
fh = open(fname, 'wb')
X = np.atleast_2d(X).T
# make a 'fmt' that matches the columns of X (with delimiters)
for row in X:
    fh.write(asbytes(format % tuple(row) + newline))
Looking at two sample strings (str and bytestr):
In [617]: asbytes('%s'%tuple(['one.two']))
Out[617]: b'one.two'
In [618]: asbytes('%s'%tuple([b'one.two']))
Out[618]: b"b'one.two'"
Writing to a 'wb' file removes that outer layer of b''
(it is only the display form of the bytes object), leaving the inner one for the bytestring. It also explains why plain py3 unicode strings are written to the file as 'latin1'.
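asbytes itself is tiny; a rough, hypothetical re-implementation (not numpy's actual source) makes the double wrapping easy to see:

```python
def asbytes(s):
    """Sketch of numpy's asbytes helper: pass bytes through unchanged,
    otherwise encode the str form as latin1."""
    if isinstance(s, bytes):
        return s
    return str(s).encode('latin1')

print(asbytes('%s' % ('one.two',)))    # b'one.two'
print(asbytes('%s' % (b'one.two',)))   # b"b'one.two'"
```

Formatting a bytestring with '%s' bakes its repr, b'' and all, into the str before it is ever encoded; that is where the stray b comes from.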
You could write a bytestrings array directly, without savetxt
. For example:
A0 = np.array([b'one.com', b'two.url', b'three.four'], dtype='|S10')
with open('test0.txt','wb') as f:
    for x in A0:
        f.write(x + b'\n')
cat test0.txt
one.com
two.url
three.four
Unicode strings can also be written directly, producing the same file:
A1 = np.array(['one.com', 'two.url', 'three.four'], dtype='<U10')
with open('test1.txt','w') as f:
    for x in A1:
        f.write(x + '\n')
The default encoding for such a file is UTF-8 on most systems (it is locale dependent), the same as used by 'one.com'.encode()
. The effect is the same as what savetxt
does:
with open('test1.txt','wb') as f:
    for x in A1:
        f.write(x.encode() + b'\n')
np.char
has .encode
and .decode
methods, which appear to operate iteratively on the elements of an array.
Thus
np.char.decode(A0)  # convert |S10 to <U10, like [x.decode() for x in A0]
np.char.encode(A1)  # convert <U10 to |S10
This works with multidimensional arrays:
np.savetxt('testm.txt',np.char.decode(A_bytes[:,None][:,[0,0]]),
fmt='%s',delimiter=', ')
With a structured array, np.char.decode
has to be applied individually to each of the char fields.
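That field-by-field decode can be sketched like this (the array, field names, and dtypes here are made up for illustration):

```python
import numpy as np

# hypothetical structured array with a bytestring field and an int field
A = np.array([(b'one.com', 1), (b'two.url', 2)],
             dtype=[('url', 'S10'), ('n', 'i4')])

# build a parallel dtype with the char field as unicode,
# then decode just that field
A_u = np.empty(A.shape, dtype=[('url', 'U10'), ('n', 'i4')])
A_u['url'] = np.char.decode(A['url'])
A_u['n'] = A['n']
print(A_u)
```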