I have a group of dataframes each with many columns. In each dataframe some of the columns are named as Test_1
, Test_2
, Test_3
etc., with the total number of test columns varying.
Some (but not all!) of the tests have associated results. Results are in their own columns, named as Test_1_Result
, Test_2_Result
etc. Result are always in order, with no skipping (so no Test_Result_4
without Test_Result_3
).
There may also be any number of other non test/result columns in the dataframe.
Here's a toy example:
df <- data.frame(
Name = c("A", "B", "C"),
Test_1 = c("Standard", "Standard", "Standard"),
Test_1_Result = c("Pass", "Fail", "Pass"),
Test_2 = c("Sepcial", "Special", "Special"),
Test_2_Result = c("Pass", "Fail", "Fail"),
Test_3 = c("Unknown", "Unknown", "Unknown"),
Test_4 = c(NA, NA, NA),
Col_1 = c(1, 2, 3),
Col_2 = c(1.2, 2.2, 3.2),
Other_Col = c(1.3, 2.3, 3.3)
)
What I want to do is remove the test columns that do not have a corresponding result column, ideally using dplyr::select
.
I've been able to identify the highest result column like so, but have gotten no further
library(stringr)
numbs <- str_extract(names(df), "\d_")[is.na(str_extract(names(df), "\d_")) == FALSE]
numbs <- as.numeric(str_remove(numbs, "_"))
paste0("Test_", max(numbs), "_Result")
[1] "Test_2_Result"
What I'd like is this (Test_3
and Test_4
removed).
df_target <- data.frame(
Name = c("A", "B", "C"),
Test_1 = c("Standard", "Standard", "Standard"),
Test_1_Result = c("Pass", "Fail", "Pass"),
Test_2 = c("Sepcial", "Special", "Special"),
Test_2_Result = c("Pass", "Fail", "Fail"),
Col_1 = c(1, 2, 3),
Col_2 = c(1.2, 2.2, 3.2),
Other_Col = c(1.3, 2.3, 3.3)
)
I understand that in this toy example it's simple to just use dplyr::select(-Test_3, -Test_4)
but the real dataframe will have lots of test columns to remove. I'd like to avoid entering them all manually.
question from:
https://stackoverflow.com/questions/65925797/select-dataframe-columns-based-on-number-in-column-name