Given two dataframes as below:
>>> import pandas as pd
>>> df_a = pd.DataFrame([{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}])
>>> df_b = pd.DataFrame([{"c": 2, "d": 7}, {"c": 3, "d": 8}])
>>> df_a
a b
0 1 4
1 2 5
2 3 6
>>> df_b
c d
0 2 7
1 3 8
we would like to produce a SQL-style join of both dataframes using a non-simplistic criteria, let's say "df_b.c > df_a.a". From what I can tell, while merge()
is certainly part of the solution, I can't use it directly since it doesn't accept arbitrary expressions for "ON" criteria (unless I'm missing something?).
In SQL, the results look like this:
# inner join
sqlite> select * from df_a join df_b on c > a;
1|4|2|7
1|4|3|8
2|5|3|8
# outer join
sqlite> select * from df_a left outer join df_b on c > a;
1|4|2|7
1|4|3|8
2|5|3|8
3|6||
my current approach for inner join is to produce a cartesian product
of df_a and df_b, by adding a column of "1"s to both, then using
merge() on the "1"s column, then applying the "c > a" criteria.
>>> import numpy as np
>>> df_a['ones'] = np.ones(3)
>>> df_b['ones'] = np.ones(2)
>>> cartesian = pd.merge(df_a, df_b, left_on='ones', right_on='ones')
>>> cartesian
a b ones c d
0 1 4 1 2 7
1 1 4 1 3 8
2 2 5 1 2 7
3 2 5 1 3 8
4 3 6 1 2 7
5 3 6 1 3 8
>>> cartesian[cartesian.c > cartesian.a]
a b ones c d
0 1 4 1 2 7
1 1 4 1 3 8
3 2 5 1 3 8
for outer join, I'm not sure of the best way to go, so far
I've been playing with getting the inner join, then applying the negation
of the criteria to get all the other rows, then trying to edit that
"negation" set onto the original, but it doesn't really work.
Edit. HYRY answered the specific question here but I needed something more generic and more within the Pandas API, as my join criterion could be anything, not just that one comparison. For outerjoin, first I'm adding an extra index to the "left" side that will maintain itself after I do the inner join:
df_a['_left_index'] = df_a.index
then we do the cartesian and get the inner join:
cartesian = pd.merge(df_a, df_b, left_on='ones', right_on='ones')
innerjoin = cartesian[cartesian.c > cartesian.a]
then I get the additional index ids in "df_a" that we'll need, and get the rows from "df_a":
remaining_left_ids = set(df_a['_left_index']).
difference(innerjoin['_left_index'])
remaining = df_a.ix[remaining_left_ids]
then we use a straight concat(), which replaces missing columns with "NaN" for left (I thought it wasn't doing this earlier but I guess it does):
outerjoin = pd.concat([innerjoin, remaining]).reset_index()
HYRY's idea to do the cartesian on just those cols that we need to compare on is basically the right answer, though in my specific case it might be a little tricky to implement (generalized and all).
questions:
How would you produce a "join" of df_1 and df_2 on "c > a"? Would
you do the same "cartesian product, filter" approach or is there some better
way?
How would you produce the "left outer join" of same?
See Question&Answers more detail:
os