I have been using fuzzyjoin to join two data frames, but the join runs into memory issues and fails with "cannot allocate memory of…". So I am trying to do the same join with data.table. A sample of the data is below.
df1 looks like:
ID f_date ACCNUM flmNUM start_date end_date
1 50341 2002-03-08 0001104659-02-000656 2571187 2002-09-07 2003-08-30
2 1067983 2009-11-25 0001047469-09-010426 91207220 2010-05-27 2011-05-19
3 804753 2004-05-14 0001193125-04-088404 4805453 2004-11-13 2005-11-05
4 1090727 2013-05-22 0000712515-13-000022 13865105 2013-11-21 2014-11-13
5 1467858 2010-02-26 0001193125-10-043035 10640035 2010-08-28 2011-08-20
6 858877 2019-01-31 0001166691-19-000005 19556540 2019-08-02 2020-07-24
7 2488 2016-02-24 0001193125-16-476010 161452982 2016-08-25 2017-08-17
8 1478242 2004-03-12 0001193125-04-039482 4664082 2004-09-11 2005-09-03
9 1467858 2017-02-16 0001555280-17-000044 17618235 2017-08-18 2018-08-10
10 14693 2015-10-28 0001193125-15-356351 151180619 2016-04-28 2017-04-20
df2 looks like:
ID date fyear at lt
1 50341 1998-12-31 1998 104382 94973
2 50341 1999-12-31 1999 190692 175385
3 50341 2000-12-31 2000 179519 163347
4 50341 2001-12-31 2001 203638 186030
5 50341 2002-12-31 2002 190453 173620
6 50341 2003-12-31 2003 200235 181955
I will focus on ID = 50341. If df2$date falls within the period between df1$start_date and df1$end_date, the two rows should be joined. Here df2$date = 2002-12-31 lies between the df1 start date 2002-09-07 and end date 2003-08-30, so this row should be joined.
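To make the condition concrete, here is a minimal base R sketch that checks it for ID 50341 only, using values copied from the samples above. The names dates_50341 and in_window are made up for illustration, and the comparison assumes the same boundaries as the join code further down (strictly after the start, on or before the end).

```r
# df2 dates for ID 50341, copied from the sample above
# (dates_50341 is a hypothetical helper vector, not part of the original data)
dates_50341 <- as.Date(c("1998-12-31", "1999-12-31", "2000-12-31",
                         "2001-12-31", "2002-12-31", "2003-12-31"))

# Window taken from the single df1 row with ID 50341
start_date <- as.Date("2002-09-07")
end_date   <- as.Date("2003-08-30")

# Same comparison the join uses: strictly after start, on or before end
in_window <- dates_50341 > start_date & dates_50341 <= end_date

# Dates that satisfy the window condition
dates_50341[in_window]
```

Only one of the six dates should fall inside the window, which is the row expected in the joined result.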
I run the following code and get the corresponding output:
df1$f_date <- as.Date(df1$f_date)
df2$date <- as.Date(df2$date)
df1$start_date <- df1$f_date + 183
df1$end_date <- df1$f_date + 540
library(fuzzyjoin)
final_data <- fuzzy_left_join(
  df1, df2,
  by = c(
    "ID" = "ID",
    "start_date" = "date",
    "end_date" = "date"
  ),
  match_fun = list(`==`, `<`, `>=`)
)
final_data
Output:
ID.x f_date ACCNUM flmNUM start_date end_date ID.y date fyear at lt
1 50341 2002-03-08 0001104659-02-000656 2571187 2002-09-07 2003-08-30 50341 2002-12-31 2002 190453.000 173620.000
2 1067983 2009-11-25 0001047469-09-010426 91207220 2010-05-27 2011-05-19 1067983 2010-12-31 2010 372229.000 209295.000
3 804753 2004-05-14 0001193125-04-088404 4805453 2004-11-13 2005-11-05 804753 2004-12-31 2004 982.265 383.614
4 1090727 2013-05-22 0000712515-13-000022 13865105 2013-11-21 2014-11-13 1090727 2013-12-31 2013 36212.000 29724.000
5 1467858 2010-02-26 0001193125-10-043035 10640035 2010-08-28 2011-08-20 1467858 2010-12-31 2010 138898.000 101739.000
6 858877 2019-01-31 0001166691-19-000005 19556540 2019-08-02 2020-07-24 NA <NA> NA NA NA
7 2488 2016-02-24 0001193125-16-476010 161452982 2016-08-25 2017-08-17 2488 2016-12-31 2016 3321.000 2905.000
8 1478242 2004-03-12 0001193125-04-039482 4664082 2004-09-11 2005-09-03 NA <NA> NA NA NA
9 1467858 2017-02-16 0001555280-17-000044 17618235 2017-08-18 2018-08-10 1467858 2017-12-31 2017 212482.000 176282.000
10 14693 2015-10-28 0001193125-15-356351 151180619 2016-04-28 2017-04-20 14693 2016-04-30 2015 4183.000 2621.000
Here we can see that ID = 50341 is joined correctly.
When I try the data.table way, I get this output:
Code:
dt_final_data <- setDT(df2)[df1, on = .(ID, date > start_date, date <= end_date)]
Output:
ID date fyear at lt date.1 f_date ACCNUM flmNUM
1: 50341 2002-09-07 2002 190453.000 173620.000 2003-08-30 2002-03-08 0001104659-02-000656 2571187
2: 1067983 2010-05-27 2010 372229.000 209295.000 2011-05-19 2009-11-25 0001047469-09-010426 91207220
3: 804753 2004-11-13 2004 982.265 383.614 2005-11-05 2004-05-14 0001193125-04-088404 4805453
4: 1090727 2013-11-21 2013 36212.000 29724.000 2014-11-13 2013-05-22 0000712515-13-000022 13865105
5: 1467858 2010-08-28 2010 138898.000 101739.000 2011-08-20 2010-02-26 0001193125-10-043035 10640035
6: 858877 2019-08-02 NA NA NA 2020-07-24 2019-01-31 0001166691-19-000005 19556540
7: 2488 2016-08-25 2016 3321.000 2905.000 2017-08-17 2016-02-24 0001193125-16-476010 161452982
8: 1478242 2004-09-11 NA NA NA 2005-09-03 2004-03-12 0001193125-04-039482 4664082
9: 1467858 2017-08-18 2017 212482.000 176282.000 2018-08-10 2017-02-16 0001555280-17-000044 17618235
10: 14693 2016-04-28 2015 4183.000 2621.000 2017-04-20 2015-10-28 0001193125-15-356351 151180619
dt_final_data
Here start_date from df1 has become date, and end_date from df1 has become date.1. As a result, my original date column from df2 has disappeared, and it is one of the most important columns for checking whether the merge worked as it should.
Two questions:
How can I keep all the date columns, as in the fuzzyjoin example? The way data.table has renamed the columns makes it a little confusing when I am checking the join.
Is the code/logic correct? I have looked at this joined data a number of times and it "appears" correct.
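For the first question, one possible approach is sketched below. It assumes the data.table convention that, inside j of a join, the x. prefix refers to columns of the table being joined into and the i. prefix to columns of the table passed as i. The small tables dt1 and dt2 are hypothetical stand-ins built from a few of the sample values above, not the full data, and the column selection in j is only one way to lay out the result.

```r
library(data.table)

# Tiny stand-in for df1: one ID with a match, one without (values from the sample)
dt1 <- data.table(
  ID         = c(50341L, 858877L),
  f_date     = as.Date(c("2002-03-08", "2019-01-31")),
  start_date = as.Date(c("2002-09-07", "2019-08-02")),
  end_date   = as.Date(c("2003-08-30", "2020-07-24"))
)

# Tiny stand-in for df2: three yearly rows for ID 50341
dt2 <- data.table(
  ID    = 50341L,
  date  = as.Date(c("2001-12-31", "2002-12-31", "2003-12-31")),
  fyear = 2001:2003,
  at    = c(203638, 190453, 200235),
  lt    = c(186030, 173620, 181955)
)

# Same non-equi join as in the question, but j names every output column.
# x.date is dt2's own date; the i.* columns are taken from dt1.
res <- dt2[dt1,
  on = .(ID, date > start_date, date <= end_date),
  .(ID = i.ID, f_date,
    start_date = i.start_date, end_date = i.end_date,
    date = x.date, fyear, at, lt)
]
res
```

With explicit names in j, start_date, end_date and the original df2 date should all appear under their own names, and the unmatched ID keeps NA in the df2 columns, which makes the result easier to compare against the fuzzyjoin output. Note also that setDT(df2) in the original code converts df2 to a data.table by reference.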
Data1:
df1 <-
structure(list(ID = c(50341L, 1067983L, 804753L, 1090727L, 1467858L,
858877L, 2488L, 1478242L, 1467858L, 14693L), f_date = structure(c(11754,
14573, 12552, 15847, 14666, 17927, 16855, 12489, 17213, 16736
), class = "Date"), ACCNUM = c("0001104659-02-000656", "0001047469-09-010426",
"0001193125-04-088404", "0000712515-13-000022", "0001193125-10-043035",
"0001166691-19-000005", "0001193125-16-476010", "0001193125-04-039482",
"0001555280-17-000044", "0001193125-15-356351"), flmNUM = c(2571187L,
91207220L, 4805453L, 13865105L, 10640035L, 19556540L, 161452982L,
4664082L, 17618235L, 151180619L),
start_date = structure(c(11937, 14756, 12735, 16030, 14849, 18110, 17038,
12672, 17396, 16919), class = "Date"),
end_date = structure(c(12294, 15113, 13092, 16387, 15206, 18467, 17395, 13029,
17753, 17276), class = "Date")
), row.names = c(NA, -10L), class = "data.frame")
Data2:
df2 <-
structure(list(ID = c(2488L, 2488L, 2488L, 2488L, 2488L, 2488L,
2488L, 2488L, 2488L, 2488L, 2488L, 2488L, 2488L, 2488L, 2488L,
2488L, 2488L, 2488L, 2488L, 2488L, 2488L, 1067983L, 1067983L,
1067983L, 1067983L, 1067983L, 1067983L, 1067983L, 1067983L, 1067983L,
1067983L, 1067983L, 1067983L, 1067983L, 1067983L, 1067983L, 1067983L,
1067983L, 1067983L, 1067983L, 1067983L, 1067983L, 14693L, 14693L,
14693L, 14693L, 14693L, 14693L, 14693L, 14693L, 14693L, 14693L,
14693L, 14693L, 14693L, 14693L, 14693L, 14693L, 14693L, 14693L,
14693L, 14693L, 14693L, 50341L, 50341L, 50341L, 50341L, 50341L,
50341L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L,
1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L,
1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L,
1467858L, 1090727L, 1090727L, 1090727L, 1090727L, 1090727L, 1090727L,
1090727L, 1090727L, 1090727L, 1090727L, 1090727L, 1090727L, 1090727L,
1090727L, 1090727L, 1090727L, 1090727L, 1090727L, 1090727L, 1090727L,
1090727L, 804753L, 804753L, 804753L, 804753L, 804753L, 804753L,
804753L, 804753L, 804753L, 804753L, 804753L, 804753L, 804753L,
804753L, 804753L, 804753L, 804753L, 804753L, 804753L, 804753L,
804753L, 1478242L, 1478242L, 1478242L, 1478242L, 1478242L, 1478242L,
1478242L, 1478242L, 1478242L, 1478242L, 858877L, 858877L, 858877L,
858877L, 858877L, 858877L, 858877L, 858877L, 858877L, 858877L,
858877L, 858877L, 858877L, 858877L, 858877L, 858877L, 858877L,
858877L, 858877L, 858877L, 858877L), date = structure(c(10591,
10956, 11322, 11687, 12052, 12417, 12783, 13148, 13513, 13878,
14244, 14609, 14974, 15339, 15705, 16070, 16435, 16800, 17166,
17531, 17896, 10591, 10956, 11322, 11687, 12052, 12417, 12783,
13148, 13513, 13878, 14244, 14609, 14974, 15339, 15705, 16070,
16435, 16800, 17166, 17531, 17896, 10346, 10711, 11077, 11442,
11807, 12172, 12538, 12903, 13268, 13633, 13999, 14364, 14729,
15094, 15460, 15825, 16190, 16555, 16921, 17286, 17651, 10591,
10956, 11322, 11687, 12052, 12417, 10591, 10956, 11322, 11687,
12052, 12417, 12783, 13148, 13513, 13878, 14244, 14609, 14974,
15339, 15705, 16070, 16435, 16800, 17166, 17531, 17896, 10591,
10956, 11322, 11687, 12052, 12417, 12783, 13148, 13513, 13878,
14244, 14609, 14974, 15339, 15705, 16070, 16435, 16800, 17166,
17531, 17896, 10591, 10956, 11322, 11687, 12052, 12417, 12783,
13148, 13513, 13878, 14244, 14609, 14974, 15339, 15705, 16070,
16435, 16800, 17166, 17531, 17896, 14609, 14974, 15339, 15705,
16070, 16435, 16800, 17166, 17531, 17896, 10438, 10803, 11169,
11534, 11899, 12264, 12630, 12995, 13360, 13725, 14091, 14456,
14821, 15186, 15552, 15917, 16282, 16647, 17013, 17378, 17743
), class = "Date"), fyear = c(1998L, 1999L, 2000L, 2001L, 2002L,
2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L,
2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 2018L, 1998L, 1999L,
2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L,
2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L,
2018L, 1997L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L,
2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L,
2014L, 2015L, 2016L, 2017L, 1998L, 1999L, 2000L, 2001L, 2002L,
2003L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L,
2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L,
2015L, 2016L, 2017L, 2018L, 1998L, 1999L, 2000L, 2001L, 2002L,
2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L,
2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 2018L, 1998L, 1999L,
2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L,
2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L,
2018L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L,
2017L, 2018L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L,
2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L,
2014L, 2015L, 2016L, 2017L, 2018L), at = c(4252.968, 4377.698,
5767.735, 5647.242, 5619.181, 7094.345, 7844.21, 7287.779, 13147,
11550, 7675, 9078, 4964, 4954, 4000, 4337, 3767, 3109, 3321,
3540, 4556, 122237, 131416, 135792, 162752, 169544, 180559, 188874,
198325, 248437, 273160, 267399, 297119, 372229, 392647, 427452,
484931, 526186, 552257, 620854, 702095, 707794, 1494, 1735, 1802,
1939, 2016, 2264, 2376, 2624, 2728, 3551, 3405, 3475, 3383, 3712,
3477, 3626, 4103, 4193, 4183, 4625, 4976, 104382, 190692, 179519,
203638, 190453, 200235, 257389, 274730, 303100, 323969, 370782,
448507, 479921, 476078, 186192, 148883, 91047, 136295, 138898,
144603, 149422, 166344, 177677, 194520, 221690, 212482, 227339,
17067, 23043, 21662, 24636, 26357, 28909, 33026, 35222, 33210,
39042, 31879, 31883, 33597, 34701, 38863, 36212, 35471, 38311,
40377, 45403, 50016, 436.485, 660.891, 616.411, 712.302, 779.279,
859.34, 982.265, 1303.629, 1491.39, 1689.956, 1880.988, 2148.567,
2422.79, 3000.358, 3704.468, 4098.364, 4530.565, 5561.984, 5629.963,
6469.