Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
120 views
in Technique[技术] by (71.8m points)

r - Select dataframe columns based on number in column name

I have a group of dataframes each with many columns. In each dataframe some of the columns are named as Test_1, Test_2, Test_3 etc., with the total number of test columns varying.

Some (but not all!) of the tests have associated results. Results are in their own columns, named as Test_1_Result, Test_2_Result etc. Result are always in order, with no skipping (so no Test_Result_4 without Test_Result_3).

There may also be any number of other non test/result columns in the dataframe.

Here's a toy example:

df <- data.frame(
  Name = c("A", "B", "C"),
  Test_1 = c("Standard", "Standard", "Standard"),
  Test_1_Result = c("Pass", "Fail", "Pass"),
  Test_2 = c("Sepcial", "Special", "Special"),
  Test_2_Result = c("Pass", "Fail", "Fail"),
  Test_3 = c("Unknown", "Unknown", "Unknown"),
  Test_4 = c(NA, NA, NA),
  Col_1 = c(1, 2, 3),
  Col_2 = c(1.2, 2.2, 3.2),
  Other_Col = c(1.3, 2.3, 3.3)
)

What I want to do is remove the test columns that do not have a corresponding result column, ideally using dplyr::select.

I've been able to identify the highest result column like so, but have gotten no further

library(stringr)
numbs <- str_extract(names(df), "\d_")[is.na(str_extract(names(df), "\d_")) == FALSE]
numbs <- as.numeric(str_remove(numbs, "_"))
paste0("Test_", max(numbs), "_Result")
[1] "Test_2_Result"

What I'd like is this (Test_3 and Test_4 removed).

df_target <- data.frame(
  Name = c("A", "B", "C"),
  Test_1 = c("Standard", "Standard", "Standard"),
  Test_1_Result = c("Pass", "Fail", "Pass"),
  Test_2 = c("Sepcial", "Special", "Special"),
  Test_2_Result = c("Pass", "Fail", "Fail"),
  Col_1 = c(1, 2, 3),
  Col_2 = c(1.2, 2.2, 3.2),
  Other_Col = c(1.3, 2.3, 3.3)
)

I understand that in this toy example it's simple to just use dplyr::select(-Test_3, -Test_4) but the real dataframe will have lots of test columns to remove. I'd like to avoid entering them all manually.

question from:https://stackoverflow.com/questions/65925797/select-dataframe-columns-based-on-number-in-column-name

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

This is not elegant at all but would remove the desired columns. This assumes the consistent naming/structure of the Test_* and Test_*_Result columns.

library(stringr)
library(dplyr)

test_cols <- str_remove(str_subset(names(df), "^Test_"), "_Result")
keep_these <- test_cols[duplicated(test_cols)]
drop_these <- test_cols[!test_cols %in% keep_these]
df %>% 
  select(-all_of(drop_these))

  Name   Test_1 Test_1_Result  Test_2 Test_2_Result Col_1 Col_2 Other_Col
1    A Standard          Pass Sepcial          Pass     1   1.2       1.3
2    B Standard          Fail Special          Fail     2   2.2       2.3
3    C Standard          Pass Special          Fail     3   3.2       3.3

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...