This thread is a continuation of my earlier thread Join then mutate using data.table without intermediate table.
In that thread, I am using look-up table to change revenue and quantity and then dividing the result by .N
so that when I aggregate the products, I don't see inflated values.
As per recommendation from the expert on that thread, I don't want to count on all the four variables used for join i.e. PO_ID
, SO_ID
, F_Year
, Product_ID
but only SO_ID
, F_Year
, Product_ID
.
Question: how can I do this using data.table
?
Here are my data and code:
Here are my data and solution using dplyr
Input
DFI = structure(list(PO_ID = c("P1234", "P1234", "P1234", "P1234",
"P1234", "P1234", "P2345", "P2345", "P3456", "P4567"), SO_ID = c("S1",
"S1", "S2", "S2", "S2", "S2", "S3", "S3", "S7", "S10"), F_Year = c(2012,
2012, 2013, 2013, 2013, 2013, 2011, 2011, 2014, 2015), Product_ID = c("385X",
"385X", "450X", "450X", "450X", "900X", "3700", "3700", "A11U",
"2700"), Revenue = c(1, 2, 3, 34, 34, 6, 7, 88, 9, 100), Quantity = c(1,
2, 3, 8, 8, 6, 7, 8, 9, 40), Location1 = c("MA", "NY", "WA",
"NY", "WA", "NY", "IL", "IL", "MN", "CA")), .Names = c("PO_ID",
"SO_ID", "F_Year", "Product_ID", "Revenue", "Quantity", "Location1"
), row.names = c(NA, 10L), class = "data.frame")
Look Up Table
DF_Lookup = structure(list(PO_ID = c("P1234", "P1234", "P1234", "P2345",
"P2345", "P3456", "P4567"), SO_ID = c("S1", "S2", "S2", "S3",
"S4", "S7", "S10"), F_Year = c(2012, 2013, 2013, 2011, 2011,
2014, 2015), Product_ID = c("385X", "450X", "900X", "3700", "3700",
"A11U", "2700"), Revenue = c(50, 70, 35, 100, -50, 50, 100),
Quantity = c(3, 20, 20, 20, -10, 20, 40)), .Names = c("PO_ID",
"SO_ID", "F_Year", "Product_ID", "Revenue", "Quantity"), row.names = c(NA,
7L), class = "data.frame")
Here's my modified code using dplyr
:
DF_Generated <- DFI %>%
left_join(DF_Lookup,by = c("PO_ID", "SO_ID", "F_Year", "Product_ID")) %>%
dplyr::group_by(SO_ID, F_Year, Product_ID) %>%
dplyr::mutate(Count = n()) %>%
dplyr::ungroup()%>%
dplyr::mutate(Revenue = Revenue.y/Count, Quantity = Quantity.y/Count) %>%
dplyr::select(PO_ID:Product_ID,Location1,Revenue,Quantity)
Please note that input to group_by
has changed.
Expected output:
DF_Generated = structure(list(PO_ID = c("P1234", "P1234", "P1234", "P1234",
"P1234", "P1234", "P2345", "P2345", "P3456", "P4567"), SO_ID = c("S1",
"S1", "S2", "S2", "S2", "S2", "S3", "S3", "S7", "S10"), F_Year = c(2012,
2012, 2013, 2013, 2013, 2013, 2011, 2011, 2014, 2015), Product_ID = c("385X",
"385X", "450X", "450X", "450X", "900X", "3700", "3700", "A11U",
"2700"), Location1 = c("MA", "NY", "WA", "NY", "WA", "NY", "IL",
"IL", "MN", "CA"), Revenue = c(25, 25, 23.3333333333333, 23.3333333333333,
23.3333333333333, 35, 50, 50, 50, 100), Quantity = c(1.5, 1.5,
6.66666666666667, 6.66666666666667, 6.66666666666667, 20, 10,
10, 20, 40)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-10L), .Names = c("PO_ID", "SO_ID", "F_Year", "Product_ID", "Location1",
"Revenue", "Quantity"))
NOTE: Please note that I don't want to create intermediate variable because the actual data size is so large that this may not be feasible.
See Question&Answers more detail:
os