Using python, I want to merge on multiple variables; A, B, C, but when realization a-b-c in one dataset is missing, use the finer combination that the observation has (like b-c).
Example:
Suppose I have a dataset (df1) containing person's characteristics (gender, married, city). And another dataset (df2) that I have the median income of a person according to their gender, city, married (created with a groupby).
Then I want to input that median income into the first dataset (df1) matching in as many characterisics as possible. That is if individual has characteristics gender-city-married that has median income, use that value. If the individual has characteristics that there is only city-married median income, to use that value.
Something like that
df1 = pd.DataFrame({'Male':['0', '0', '1','1'],'Married':['0', '1', '0','1'], 'City': ['NY', 'NY', 'NY', 'NY']})
Male Married City
0 0 NY
0 1 NY
1 0 NY
1 1 NY
df2 = pd.DataFrame({'Male':['0', '0', '1'],'Married':['0', '1', '1'], 'City': ['NY', 'NY','NY'], 'income':['300','400', '500']})
Male Married City income
0 0 NY 300
0 1 NY 400
1 1 NY 500
'''
and the desired outcome:
'''
desired_df1:
Male Married City income
0 0 NY 300
0 1 NY 400
1 0 NY 300
1 1 NY 400
I was thinking to do a 1st merge by=['male','married','city']
, and then fill missing values from a 2nd merge by=['married','city']
. But I think there should be a more systematic and simpler way. Any suggestions?
Thanks and sorry if formulation is not correct or it is duplicate (I look deeply and didn't find anything).
question from:
https://stackoverflow.com/questions/65830363/merging-on-pandas-reduce-the-set-of-merging-variables-when-match-is-not-possibl