I have two dataframes (x and y) where the IDs are student_name, father_name and mother_name. Because of typographical errors ("n" instead of "m", random white spaces, etc.), about 60% of the values do not match, though I can eyeball the data and see that they should. Is there a way to reduce the level of non-matches so that manual editing becomes at least feasible? The dataframes have about 700K observations.
R would be best. I know a little bit of Python and some basic Unix tools. P.S. I read up on agrep(), but I don't understand how it can work on actual datasets, especially when the match is over more than one variable.
Update (data for posted bounty):
Here are two example data frames, sites_a and sites_b. They could be matched on the numeric columns lat and lon as well as on the sitename column. It would be useful to know how this could be done on a) just lat + lon, b) sitename, or c) both.
You can source the file test_sites.R, which is posted as a gist.
Ideally the answer would end with
merge(sites_a, sites_b, by = **magic**)