Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
519 views
in Technique[技术] by (71.8m points)

r - Quickly remove zero variance variables from a data.frame

I have a large data.frame that was generated by a process outside my control, which may or may not contain variables with zero variance (i.e. all the observations are the same). I would like to build a predictive model based on this data, and obviously these variables are of no use.

Here's the function I'm currently using to remove such variables from the data.frame. It's currently based on apply, and I was wondering if there are any obvious ways to speed this function up, so that it works quickly on very large datasets, with a large number (400 or 500) of variables?

set.seed(1)
dat <- data.frame(
    A=factor(rep("X",10),levels=c('X','Y')),
    B=round(runif(10)*10),
    C=rep(10,10),
    D=c(rep(10,9),1),
    E=factor(rep("A",10)),
    F=factor(rep(c("I","J"),5)),
    G=c(rep(10,9),NA)
)
zeroVar <- function(data, useNA = 'ifany') {
    out <- apply(data, 2, function(x) {length(table(x, useNA = useNA))})
    which(out==1)
}

And here's the result of the process:

> dat
   A B  C  D E F  G
1  X 3 10 10 A I 10
2  X 4 10 10 A J 10
3  X 6 10 10 A I 10
4  X 9 10 10 A J 10
5  X 2 10 10 A I 10
6  X 9 10 10 A J 10
7  X 9 10 10 A I 10
8  X 7 10 10 A J 10
9  X 6 10 10 A I 10
10 X 1 10  1 A J NA

> dat[,-zeroVar(dat)]
   B  D F  G
1  3 10 I 10
2  4 10 J 10
3  6 10 I 10
4  9 10 J 10
5  2 10 I 10
6  9 10 J 10
7  9 10 I 10
8  7 10 J 10
9  6 10 I 10
10 1  1 J NA

> dat[,-zeroVar(dat, useNA = 'no')]
   B  D F
1  3 10 I
2  4 10 J
3  6 10 I
4  9 10 J
5  2 10 I
6  9 10 J
7  9 10 I
8  7 10 J
9  6 10 I
10 1  1 J
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

You may also want to look into the nearZeroVar() function in the caret package.

If you have one event out of 1000, it might be a good idea to discard these data (but this depends on the model). nearZeroVar() can do that.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...