I have a pandas dataframe with a mix of datatypes (dtypes) that I wish to convert to a numpy structured array (or record array, which is basically the same thing in this case). For purely numeric dataframes, this is easy to do with the to_records() method. I also need the dtypes of the pandas columns to be converted to strings rather than objects, so that I can use the numpy method tofile(), which will output numbers and strings to a binary file but will not output objects.
In a nutshell, I need to convert pandas columns with dtype=object to numpy structured arrays of string or unicode dtype.
Here's an example, with code that would be sufficient if all columns had a numerical (float or int) dtype.
import pandas as pd

df = pd.DataFrame({'f_num': [1., 2., 3.], 'i_num': [1, 2, 3],
                   'char': ['a', 'bb', 'ccc'], 'mixed': ['a', 'bb', 1]})

struct_arr = df.to_records(index=False)
print('struct_arr', struct_arr.dtype, '\n')

# struct_arr (numpy.record, [('f_num', '<f8'), ('i_num', '<i8'),
#             ('char', 'O'), ('mixed', 'O')])
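For context, this is roughly what goes wrong if I try to write the object-dtype array directly. A minimal sketch, using a throwaway file name; the exact exception type and message depend on the numpy version:

import os
import tempfile

# writing the object-dtype record array directly fails, because
# tofile() cannot serialize arbitrary Python objects to raw binary
tmp_path = os.path.join(tempfile.gettempdir(), 'struct_arr.bin')  # example path
try:
    struct_arr.tofile(tmp_path)
except Exception as exc:
    print('tofile failed:', exc)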
But because I want to end up with string dtypes, I need to add this additional and somewhat involved code:
lst = []
for col in struct_arr.dtype.names:  # this was the only iterator I
                                    # could find for the column labels
    dt = struct_arr[col].dtype
    if dt == 'O':  # 'O' means 'object'
        # it appears an explicit string length is required,
        # so I calculate it with pandas' len & max methods
        dt = 'U' + str(df[col].astype(str).str.len().max())
    lst.append((col, dt))
struct_arr = struct_arr.astype(lst)
print('struct_arr', struct_arr.dtype)

# struct_arr (numpy.record, [('f_num', '<f8'), ('i_num', '<i8'),
#             ('char', '<U3'), ('mixed', '<U2')])
See also: How to change the dtype of certain columns of a numpy recarray?
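With the string dtypes in place, the array round-trips through tofile() / fromfile() as expected. A quick check (the file name here is just an example):

import numpy as np

# write the converted structured array as raw binary, then read it
# back with the same dtype; the values should round-trip intact
struct_arr.tofile('struct_arr.bin')
back = np.fromfile('struct_arr.bin', dtype=struct_arr.dtype)
print(back['char'])   # expected: ['a' 'bb' 'ccc']
print(back['mixed'])  # expected: ['a' 'bb' '1']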
This seems to work, as the character and mixed dtypes are now <U3 and <U2 rather than 'O' or 'object'. I'm just checking whether there is a simpler or more elegant approach. But since pandas does not have a native string type as numpy does, maybe there is not?