In a nutshell: you should generally use structured arrays rather than recarrays, because structured arrays are faster. The only advantage of recarrays is that they let you write `arr.x` instead of `arr['x']`, which can be a convenient shortcut but is also error prone if your column names conflict with NumPy methods or attributes.
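To illustrate the name-conflict pitfall, here's a small sketch (the field name `mean` is a hypothetical choice made to collide with the `ndarray.mean()` method): attribute lookup on a recarray checks real ndarray attributes before falling back to field names, so the method shadows the column.

```python
import numpy as np

# A recarray whose field is deliberately named 'mean',
# colliding with the ndarray.mean() method.
rec = np.rec.fromarrays([np.arange(3.0)], names='mean')

# Attribute access resolves to the bound method, not the column:
print(callable(rec.mean))  # True -- the method shadows the field

# Bracket access always returns the field itself:
print(rec['mean'])         # [0. 1. 2.]
```

In other words, bracket-style access is the only unambiguous way to reach a field, which is part of why many people skip recarrays entirely.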
See this excerpt from @jakevdp's book for a more detailed explanation. In particular, he notes that simply accessing columns of structured arrays can be around 20x to 30x faster than accessing columns of recarrays. However, his example uses a very small dataframe with just 4 rows and doesn't perform any standard operations.
For simple operations on larger dataframes, the difference is likely to be much smaller, although structured arrays are still faster. For example, here are a structured array and a record array, each with 10,000 rows (code to create the arrays from a dataframe borrowed from @jpp's answer here).
import numpy as np
import pandas as pd

n = 10_000
df = pd.DataFrame({ 'x': np.random.randn(n) })
df['y'] = df.x.astype(int)

# Record array via pandas
rec_array = df.to_records(index=False)

# Structured array built from the same dataframe
s = df.dtypes
struct_array = np.array([tuple(x) for x in df.values], dtype=list(zip(s.index, s)))
If we do a standard operation such as multiplying a column by 2, it's about 50% faster for the structured array:
%timeit struct_array['x'] * 2
9.18 μs ± 88.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit rec_array.x * 2
14.2 μs ± 314 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
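Note that much of the recarray's overhead comes from the attribute-lookup machinery, not from the underlying data layout. Bracket-style access works on recarrays too, and in my experience it tends to land closer to structured-array speed (a sketch; exact timings will vary by machine and NumPy version):

```python
import numpy as np
import pandas as pd

n = 10_000
df = pd.DataFrame({ 'x': np.random.randn(n) })
rec_array = df.to_records(index=False)

# Both access styles return the same data; bracket access skips
# the custom __getattribute__ lookup that rec_array.x goes through.
doubled_attr = rec_array.x * 2
doubled_item = rec_array['x'] * 2
```

So if you're handed a recarray and speed matters, using `rec_array['x']` rather than `rec_array.x` in hot loops is a cheap win.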