Picking up Alistaire's and Nettle's suggestions and transforming into a working solution
df1 %>%
left_join(lookup_df, by = "state_abbrev") %>%
mutate(state_name = coalesce(state_name.x, state_name.y)) %>%
select(-state_name.x, -state_name.y)
# A tibble: 10 x 3
state_abbrev value state_name
<chr> <int> <chr>
1 AL 671 Alabama
2 AK 501 Alaska
3 AZ 1030 Arizona
4 AR 694 Arkansas
5 CA 881 California
6 CO 821 Colorado
7 CT 742 Connecticut
8 DE 665 Delaware
9 FL 948 Florida
10 GA 790 Georgia
The OP has stated to prefer a "tidyverse" solution. However, update joins are already available with the data.table
package:
library(data.table)
setDT(df1)[setDT(lookup_df), on = "state_abbrev", state_name := i.state_name]
df1
state_abbrev state_name value
1: AL Alabama 1103
2: AK Alaska 1036
3: AZ Arizona 811
4: AR Arkansas 604
5: CA California 868
6: CO Colorado 1129
7: CT Connecticut 819
8: DE Delaware 1194
9: FL Florida 888
10: GA Georgia 501
Benchmark
library(bench)
bm <- press(
na_share = c(0.1, 0.5, 0.9),
n_row = length(state.abb) * 2 * c(1, 100, 10000),
{
n_na <- na_share * length(state.abb)
set.seed(1)
na_idx <- sample(length(state.abb), n_na)
tmp <- data.table(state_abbrev = state.abb, state_name = state.name)
lookup_df <-tmp[na_idx]
tmp[na_idx, state_name := NA]
df0 <- as_tibble(tmp[sample(length(state.abb), n_row, TRUE)])
mark(
dplyr = {
df1 <- copy(df0)
df1 <- df1 %>%
left_join(lookup_df, by = "state_abbrev") %>%
mutate(state_name = coalesce(state_name.x, state_name.y)) %>%
select(-state_name.x, -state_name.y)
df1
},
upd_join = {
df1 <- copy(df0)
setDT(df1)[setDT(lookup_df), on = "state_abbrev", state_name := i.state_name]
df1
}
)
}
)
ggplot2::autoplot(bm)
data.table
's upate join is always faster (note the log time scale).
As the update join modifies the data object, a fresh copy is used for each benchmark run.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…