Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
158 views
in Technique[技术] by (71.8m points)

python - Pandas: how to merge two dataframes on a column by keeping the information of the first one?

I have two dataframes df1 and df2. df1 contains the information of the age of people, while df2 contains the information of the sex of people. Not all the people are in df1 nor in df2

df1
     Name   Age 
0     Tom    34
1     Sara   18
2     Eva    44
3     Jack   27
4     Laura  30

df2
     Name      Sex 
0     Tom       M
1     Paul      M
2     Eva       F
3     Jack      M
4     Michelle  F

I want to have the information of the sex of the people in df1 and setting NaN if I do not have this information in df2. I tried to do df1 = pd.merge(df1, df2, on = 'Name', how = 'outer') but I keep the information of some people in df2 that I don't want.

df1
     Name   Age     Sex
0     Tom    34      M
1     Sara   18     NaN
2     Eva    44      F
3     Jack   27      M
4     Laura  30     NaN
Question&Answers:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Sample:

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 
                    'Age': [34, 18, 44, 27, 30]})

#print (df1)
df3 = df1.copy()

df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'], 
                    'Sex': ['M', 'M', 'F', 'M', 'F']})
#print (df2)

Use map by Series created by set_index:

df1['Sex'] = df1['Name'].map(df2.set_index('Name')['Sex'])
print (df1)
    Name  Age  Sex
0    Tom   34    M
1   Sara   18  NaN
2    Eva   44    F
3   Jack   27    M
4  Laura   30  NaN

Alternative solution with merge with left join:

df = df3.merge(df2[['Name','Sex']], on='Name', how='left')
print (df)
    Name  Age  Sex
0    Tom   34    M
1   Sara   18  NaN
2    Eva   44    F
3   Jack   27    M
4  Laura   30  NaN

If need map by multiple columns (e.g. Year and Code) need merge with left join:

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 
                    'Year':[2000,2003,2003,2004,2007],
                    'Code':[1,2,3,4,4],
                    'Age': [34, 18, 44, 27, 30]})

print (df1)
    Name  Year  Code  Age
0    Tom  2000     1   34
1   Sara  2003     2   18
2    Eva  2003     3   44
3   Jack  2004     4   27
4  Laura  2007     4   30

df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'], 
                    'Sex': ['M', 'M', 'F', 'M', 'F'],
                    'Year':[2001,2003,2003,2004,2007],
                    'Code':[1,2,3,5,3],
                    'Val':[21,34,23,44,67]})
print (df2)
       Name Sex  Year  Code  Val
0       Tom   M  2001     1   21
1      Paul   M  2003     2   34
2       Eva   F  2003     3   23
3      Jack   M  2004     5   44
4  Michelle   F  2007     3   67
#merge by all columns
df = df1.merge(df2, on=['Year','Code'], how='left')
print (df)
  Name_x  Year  Code  Age Name_y  Sex   Val
0    Tom  2000     1   34    NaN  NaN   NaN
1   Sara  2003     2   18   Paul    M  34.0
2    Eva  2003     3   44    Eva    F  23.0
3   Jack  2004     4   27    NaN  NaN   NaN
4  Laura  2007     4   30    NaN  NaN   NaN

#specified columns - columns for join (Year, Code) need always + appended columns (Val)
df = df1.merge(df2[['Year','Code', 'Val']], on=['Year','Code'], how='left')
print (df)
    Name  Year  Code  Age   Val
0    Tom  2000     1   34   NaN
1   Sara  2003     2   18  34.0
2    Eva  2003     3   44  23.0
3   Jack  2004     4   27   NaN
4  Laura  2007     4   30   NaN

If get error with map it means duplicates by columns of join, here Name:

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 
                    'Age': [34, 18, 44, 27, 30]})

print (df1)
    Name  Age
0    Tom   34
1   Sara   18
2    Eva   44
3   Jack   27
4  Laura   30

df3, df4 = df1.copy(), df1.copy()

df2 = pd.DataFrame({'Name': ['Tom', 'Tom', 'Eva', 'Jack', 'Michelle'], 
                    'Val': [1,2,3,4,5]})
print (df2)
       Name  Val
0       Tom    1 <-duplicated name Tom
1       Tom    2 <-duplicated name Tom
2       Eva    3
3      Jack    4
4  Michelle    5

s = df2.set_index('Name')['Val']
df1['New'] = df1['Name'].map(s)
print (df1)

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

Solutions are removed duplicates by DataFrame.drop_duplicates, or use map by dict for last dupe match:

#default keep first value
s = df2.drop_duplicates('Name').set_index('Name')['Val']
print (s)
Name
Tom         1
Eva         3
Jack        4
Michelle    5
Name: Val, dtype: int64

df1['New'] = df1['Name'].map(s)
print (df1)
    Name  Age  New
0    Tom   34  1.0
1   Sara   18  NaN
2    Eva   44  3.0
3   Jack   27  4.0
4  Laura   30  NaN
#add parameter for keep last value 
s = df2.drop_duplicates('Name', keep='last').set_index('Name')['Val']
print (s)
Name
Tom         2
Eva         3
Jack        4
Michelle    5
Name: Val, dtype: int64

df3['New'] = df3['Name'].map(s)
print (df3)
    Name  Age  New
0    Tom   34  2.0
1   Sara   18  NaN
2    Eva   44  3.0
3   Jack   27  4.0
4  Laura   30  NaN
#map by dictionary
d = dict(zip(df2['Name'], df2['Val']))
print (d)
{'Tom': 2, 'Eva': 3, 'Jack': 4, 'Michelle': 5}

df4['New'] = df4['Name'].map(d)
print (df4)
    Name  Age  New
0    Tom   34  2.0
1   Sara   18  NaN
2    Eva   44  3.0
3   Jack   27  4.0
4  Laura   30  NaN

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...