If you have a list of your dataframes dfs
:
dfs = [df1, df2, df3, ... , dfn]
you can join them using panda's concat
function which as far as I can tell is faster than chaining merge. concat
only joins dataframes based on an index (not a column) but with a little pre-processing you can simulate a merge
operation.
First replace the index of each of your dataframes in dfs
with the column you want to merge on. Lets say you want to merge on column "A"
:
dfs = [df.set_index("A", drop=True) for df in dfs]
Note that this will overwrite the previous indices (merge would do this anyway) so you might want to save these indices somewhere (if you are going to need them later for some reason).
Now we can use concat which will essentially merge on the index (which is actually your column!!)
merged = pd.concat(dfs, axis=1, keys=range(len(dfs)), join='outer', copy=False)
The join=
argument can either be 'inner'
or 'outer'
(default). The copy=
argument keeps concat
from making unnecessary copies of your dataframes.
You can then either leave "A"
as the index or you can make it back into a column by doing:
merged.reset_index(drop=False, inplace=True)
The keys=
argument is optional and assigns a key value to each dataframe (in this case I gave it a range of integers but you could give them other labels if you want). This allows you to access columns from the original dataframes. So if you wanted to get the columns that correspond to the 20th dataframe in dfs
you can call:
merged[20]
Without the keys=
argument it can get confusing which rows are from which dataframes, especially if they have the same column names.
I'm still not entirely sure if concat
runs in linear time but it is definitely faster than chaining merge
:
using ipython's %timeit on lists of randomly generated dataframes (lists of 10, 100 and 1000 dataframes):
def merge_with_concat(dfs, col):
dfs = [df.set_index(col, drop=True) for df in dfs]
merged = pd.concat(dfs, axis=1, keys=range(len(dfs)), join='outer', copy=False)
return merged
dfs10 = [pd.util.testing.makeDataFrame() for i in range(10)]
dfs100 = [pd.util.testing.makeDataFrame() for i in range(100)]
dfs1000 = [pd.util.testing.makeDataFrame() for i in range(1000)]
%timeit reduce(lambda df1, df2: df1.merge(df2, on="A", how='outer'), dfs10)
10 loops, best of 3: 45.8 ms per loop
%timeit merge_with_concat(dfs10,"A")
100 loops, best of 3: 11.7 ms per loop
%timeit merge_with_concat(dfs100,"A")
10 loops, best of 3: 139 ms per loop
%timeit reduce(lambda df1, df2: df1.merge(df2, on="A", how='outer'), dfs100)
1 loop, best of 3: 1.55 s per loop
%timeit merge_with_concat(dfs1000,"A")
1 loop, best of 3: 9.67 s per loop
%timeit reduce(lambda df1, df2: df1.merge(df2, on="A", how='outer'), dfs1000)
# I killed it after about 5 minutes so the other one is definitely faster