Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


r - Filter each column of a data.frame based on a specific value

Consider the following data frame:

df <- data.frame(replicate(5,sample(1:10,10,rep=TRUE)))

#   X1 X2 X3 X4 X5
#1   7  9  8  4 10
#2   2  4  9  4  9
#3   2  7  8  8  6
#4   8  9  6  6  4
#5   5  2  1  4  6
#6   8  2  2  1  7
#7   3  8  6  1  6
#8   3  8  5  9  8
#9   6  2  3 10  7
#10  2  7  4  2  9

Using dplyr, how can I filter on each column (without explicitly naming them) for all values greater than or equal to 2?

Something that would mimic a hypothetical filter_each(funs(. >= 2))

Right now I'm doing:

df %>% filter(X1 >= 2, X2 >= 2, X3 >= 2, X4 >= 2, X5 >= 2)

Which is equivalent to:

df %>% filter(!rowSums(. < 2))

Note: if I wanted to filter only on the first 4 columns, I would do:

df %>% filter(X1 >= 2, X2 >= 2, X3 >= 2, X4 >= 2) 

or

df %>% filter(!rowSums(.[-5] < 2))
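
To see why the rowSums() form works, it may help to walk through it on a tiny data frame. The sketch below uses base R only; the data frame `small` and its values are made up for illustration and are not part of the original question.

```r
# Hypothetical 4-row data frame, chosen so that some rows contain values below 2
small <- data.frame(X1 = c(7, 1, 3, 5),
                    X2 = c(2, 4, 1, 6),
                    X3 = c(9, 8, 7, 1))

# Comparing a data frame with a scalar yields a logical matrix
below <- small < 2

# rowSums() counts the TRUE cells per row, i.e. how many columns fail the test
n_fail <- rowSums(below)

# `!` coerces the counts to logical: 0 becomes TRUE, anything else FALSE
keep <- !n_fail

# Only rows with no failing column survive
kept <- small[keep, , drop = FALSE]

# Dropping a column before the comparison restricts the test to the remaining ones
keep_12 <- !rowSums(small[-3] < 2)
```

Here only the first row passes on all three columns, while the fourth row also passes once X3 is left out of the test.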

Is there a more efficient alternative?

Edit: sub question

How can I specify a column name and mimic a hypothetical filter_each(funs(. >= 2), -X5)?

Benchmark sub question

Since I have to run this on a large dataset, I benchmarked the suggestions.

library(dplyr)
library(microbenchmark)

df <- data.frame(replicate(5, sample(1:10, 10e6, rep = TRUE)))

# Each expression filters on every column except X5
mbm <- microbenchmark(
  Marat = df %>% filter(!rowSums(.[, !colnames(.) %in% "X5", drop = FALSE] < 2)),
  Richard = filter_(df, .dots = lapply(names(df)[names(df) != "X5"], function(x, y) { call(">=", as.name(x), y) }, 2)),
  Docendo = df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L))),
  times = 50
)

Here are the results:

#Unit: milliseconds
#    expr       min        lq      mean    median       uq      max neval
#   Marat 1209.1235 1320.3233 1358.7994 1362.0590 1390.342 1448.458    50
# Richard 1151.7691 1196.3060 1222.9900 1216.3936 1256.191 1266.669    50
# Docendo  874.0247  933.1399  983.5435  985.3697 1026.901 1053.407    50




1 Reply


Here's an idea that makes it fairly simple to choose the names. You can set up a list of calls to send to the .dots argument of filter_(). First, define a function that creates an unevaluated call.

Call <- function(x, value, fun = ">=") call(fun, as.name(x), value)

Now we use filter_(), passing a list of calls into the .dots argument using lapply(), choosing any name and value you want.

nm <- names(df) != "X5"
filter_(df, .dots = lapply(names(df)[nm], Call, 2L))
#   X1 X2 X3 X4 X5
# 1  6  5  7  3  1
# 2  8 10  3  6  5
# 3  5  7 10  2  5
# 4  3  4  2  9  9
# 5  8  3  5  6  2
# 6  9  3  4 10  9
# 7  2  9  7  9  8

You can inspect the unevaluated calls created by Call(), for example for X4 and X5, with:

lapply(names(df)[4:5], Call, 2L)
# [[1]]
# X4 >= 2L
#
# [[2]]
# X5 >= 2L
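
To check what such calls select without going through filter_(), they can also be evaluated against the data frame by hand. The following is a minimal base R sketch, not part of the original answer: Call() is the helper defined above, repeated so the block runs on its own, and the data frame `small` is made up for illustration.

```r
# Helper from the answer, repeated so this block is self-contained
Call <- function(x, value, fun = ">=") call(fun, as.name(x), value)

# Hypothetical data for illustration
small <- data.frame(X1 = c(7L, 1L, 3L),
                    X4 = c(4L, 6L, 1L),
                    X5 = c(1L, 9L, 8L))

# One unevaluated call per column to test (everything except X5)
calls <- lapply(setdiff(names(small), "X5"), Call, 2L)

# Evaluate each call with the data frame's columns in scope,
# then combine the resulting logical vectors with AND
masks <- lapply(calls, eval, envir = small)
keep  <- Reduce(`&`, masks)

small[keep, ]
```

Only the first row is kept: its X5 value is below 2, but X5 is not among the tested columns.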

So by adjusting the names passed as the X argument of lapply(), you can filter on any set of columns.
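
Note that filter_() is deprecated in current dplyr releases. As a hedged alternative, the same filter can be sketched with if_all(), assuming dplyr 1.0.4 or later is installed; the data frame `small` is again made up for illustration.

```r
library(dplyr)

# Hypothetical data for illustration
small <- data.frame(X1 = c(7, 1, 3),
                    X4 = c(4, 6, 1),
                    X5 = c(1, 9, 8))

# Keep rows where every column except X5 is >= 2
res <- small %>% filter(if_all(-X5, ~ .x >= 2))
```

The column selection inside if_all() uses the same tidyselect syntax as select(), so -X5 plays the role of the hypothetical filter_each(funs(. >= 2), -X5) from the question.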

