The operation you are performing can be expressed as an application of np.einsum.
It is an inner product between each pair of columns of the 0/1 non-null indicator array:
import numpy as np
import pandas as pd
df = pd.read_table('data', sep=r'\s+')  # split columns on runs of whitespace
print(df)
#   Al01 BBR60 CA07 NL219
# 0   MP   NaN   MP    MP
# 1  NaN   NaN  NaN   NaN
# 2   NP   NaN   NP    NP
# 3  NaN    NP  NaN   NaN
# 4  PB1   NaN  NaN   PB1
# 5  NaN   NaN   NP    NP
# 6   NP   NaN  NaN   NaN
arr = (~df.isnull()).values.astype('int')
print(arr)
# [[1 0 1 1]
#  [0 0 0 0]
#  [1 0 1 1]
#  [0 1 0 0]
#  [1 0 0 1]
#  [0 0 1 1]
#  [1 0 0 0]]
result = pd.DataFrame(np.einsum('ij,ik', arr, arr),
                      columns=df.columns, index=df.columns)
print(result)
yields
       Al01  BBR60  CA07  NL219
Al01      4      0     2      3
BBR60     0      1     0      0
CA07      2      0     3      3
NL219     3      0     3      4
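To make the subscripts less opaque, here is a small sketch that hard-codes the indicator array printed above and checks that the implicit form 'ij,ik' is the same contraction as the explicit 'ij,ik->jk' and as a plain matrix product. The variable names explicit, implicit and matmul are only for illustration:

```python
import numpy as np

# Indicator matrix copied from the printed `arr` above (1 = non-null).
arr = np.array([[1, 0, 1, 1],
                [0, 0, 0, 0],
                [1, 0, 1, 1],
                [0, 1, 0, 0],
                [1, 0, 0, 1],
                [0, 0, 1, 1],
                [1, 0, 0, 0]])

# Explicit output subscripts: sum over the row index i,
# keep the two column indices j and k.
explicit = np.einsum('ij,ik->jk', arr, arr)

# Implicit form used in the answer: indices that appear only once
# (j and k) become the output axes, in alphabetical order.
implicit = np.einsum('ij,ik', arr, arr)

# The same contraction written as a matrix product.
matmul = arr.T @ arr

print(np.array_equal(explicit, implicit) and np.array_equal(explicit, matmul))
```

Entry (j, k) of the result is the number of rows in which columns j and k are both non-null, so the diagonal holds the per-column non-null counts.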
Usually, when a calculation boils down to a numeric operation that does not depend on the index labels, it is faster to do it with NumPy than with pandas. That appears to be the case here:
In [130]: %timeit df2 = df.applymap(lambda x: int(not pd.isnull(x))); df2.T.dot(df2)
1000 loops, best of 3: 1.12 ms per loop
In [132]: %timeit arr = (~df.isnull()).values.astype('int'); pd.DataFrame(np.einsum('ij,ik', arr, arr), columns=df.columns, index=df.columns)
10000 loops, best of 3: 132 μs per loop
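If you want to repeat the comparison outside IPython, something along these lines should work. It is only a sketch: the frame is rebuilt by hand from the printed data, the helper names with_pandas and with_numpy are made up for this example, and the pandas variant uses notnull().astype(int) rather than the applymap call timed above. Timings depend on the machine and library versions, so no figures are claimed.

```python
import timeit

import numpy as np
import pandas as pd

# Rebuild the example frame directly (None marks a missing value).
df = pd.DataFrame({
    'Al01':  ['MP', None, 'NP', None, 'PB1', None, 'NP'],
    'BBR60': [None, None, None, 'NP', None, None, None],
    'CA07':  ['MP', None, 'NP', None, None, 'NP', None],
    'NL219': ['MP', None, 'NP', None, 'PB1', 'NP', None],
})


def with_pandas(frame):
    # Stay in pandas: build the 0/1 indicator frame, then use DataFrame.dot.
    ind = frame.notnull().astype(int)
    return ind.T.dot(ind)


def with_numpy(frame):
    # Drop to a NumPy array for the contraction, then re-attach the labels.
    arr = frame.notnull().to_numpy().astype(int)
    return pd.DataFrame(np.einsum('ij,ik->jk', arr, arr),
                        columns=frame.columns, index=frame.columns)


# Both approaches should produce the same counts.
same = (with_pandas(df).to_numpy() == with_numpy(df).to_numpy()).all()
print('results match:', same)

# Average time per call over a fixed number of repetitions.
for fn in (with_pandas, with_numpy):
    seconds = timeit.timeit(lambda: fn(df), number=200) / 200
    print(f'{fn.__name__}: {seconds * 1e6:.1f} µs per call')
```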