Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


R - Implementation of a skyline query or efficient frontier

I know there must be an easy answer to this but somehow I can't seem to find it...

I have a data frame with two numeric columns. I would like to remove every row for which at least one other row in the data frame has larger values in both columns.

So if I have

  Col1 Col2
1    2    3
2    4    7
3    5    6

I would like to remove the first row, because the second row is larger in both columns, and keep only rows 2 and 3.

Thanks a lot!



1 Reply


That problem is called a "skyline query" by database administrators (they may have other algorithms) and an "efficient frontier" by economists. Plotting the data can make it clear what we are looking for.

n <- 40
d <- data.frame(
  x = rnorm(n),
  y = rnorm(n)
)
# We want the "extreme" points in the following plot
par(mar=c(1,1,1,1))
plot(d, axes=FALSE, xlab="", ylab="")
for(i in 1:n) {
  polygon( c(-10,d$x[i],d$x[i],-10), c(-10,-10,d$y[i],d$y[i]), 
  col=rgb(.9,.9,.9,.2))
}

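The algorithm is as follows: sort the points by the first coordinate in decreasing order, then walk through them and keep a point only if its second coordinate is greater than that of the last retained point. Because the retained points have increasing second coordinates, comparing against the last one is enough.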

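# Sort by x, largest first; the first row can never be dominated
d <- d[ order(d$x, decreasing=TRUE), ]
result <- d[1,]
for(i in seq_len(nrow(d))[-1] ) {
  # Keep the point only if its y exceeds that of the last retained point
  if( d$y[i] > result$y[nrow(result)] ) {
    result <- rbind(result, d[i,])  # inefficient: copies result on every append
  } 
}
# Highlight the retained points on the plot
points(result, cex=3, pch=15)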

[Figure: Skyline]
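
If the repeated rbind is a concern, the same sort-and-scan idea can probably be written without a loop. The following is only a sketch: `skyline` and its `xcol`/`ycol` arguments are hypothetical names, not part of the original answer. It assumes both columns are numeric with no missing values. Like the loop, it drops a row whose second coordinate merely ties with an earlier row, which differs slightly from the strict "both values bigger" rule in the question.

```r
# Hypothetical helper (not from the original answer): keeps the rows of `df`
# that survive the sort-and-scan rule described above.
skyline <- function(df, xcol = 1, ycol = 2) {
  # Sort by the first column, largest first
  df <- df[order(df[[xcol]], decreasing = TRUE), , drop = FALSE]
  y <- df[[ycol]]
  # Largest y among the rows before each row (-Inf for the first row)
  best_before <- c(-Inf, cummax(y)[-length(y)])
  # A row is kept only if its y beats everything that came before it
  df[y > best_before, , drop = FALSE]
}

# The example from the question
d0 <- data.frame(Col1 = c(2, 4, 5), Col2 = c(3, 7, 6))
skyline(d0)  # keeps rows 3 and 2, drops row 1
```

The running maximum from `cummax` plays the role of "the last retained point" in the loop, since the retained points always have increasing second coordinates.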

