r - Remove duplicate rows of a matrix or dataframe


posted Oct 17, 2021 in Technique by 深蓝 (71.8m points)


I want to find out which rows of a matrix or data frame are duplicates and remove them.

Here, two rows count as duplicates if they have the same values in columns 1 and 2, regardless of the order of those two values.

For example, for the following matrix:

Col1   Col2     database 
 A       B       IntAct
 A       B       Bind
 B       A       BioGrid

I want to keep only one of these rows:

Col1   Col2     database 
 A       B       IntAct


1 Reply

深蓝 · Answer 1 · 2021-10-17T03:09:04+0000

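Here is another option using pmin/pmax. Taking the element-wise minimum and maximum of the two columns puts each pair in a fixed order, so A-B and B-A produce the same key:

library(data.table)
# cbind() builds a two-column key, so duplicated() compares whole rows.
# Without cbind(), the pmax() result would be passed as the `incomparables` argument.
setDT(df1)[!duplicated(cbind(pmin(Col1, Col2), pmax(Col1, Col2)))]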
#   Col1 Col2 database
#1:    A    B   IntAct
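
The answer does not show how df1 was built, and the question also asks how to see which rows are duplicates. The following is a minimal base R sketch of both steps. It assumes the example data is stored in a data frame called df1 with character (not factor) columns; the names key, is_dup and result are only illustrative.

```r
# Example data from the question (character columns are assumed;
# pmin()/pmax() do not work on unordered factors).
df1 <- data.frame(Col1 = c("A", "A", "B"),
                  Col2 = c("B", "B", "A"),
                  database = c("IntAct", "Bind", "BioGrid"),
                  stringsAsFactors = FALSE)

# Order-independent key: smaller value first, larger value second.
key <- cbind(pmin(df1$Col1, df1$Col2), pmax(df1$Col1, df1$Col2))

# Flag every row whose unordered pair occurs more than once.
# Scanning from both ends marks the first occurrence as well.
is_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)

# Keep only the first occurrence of each unordered pair.
result <- df1[!duplicated(key), ]
```

Here is_dup answers "which rows are duplicates", while result keeps the first row of each group, so which database value survives depends on the row order.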

Benchmarking with bigger data:

# dummy data
set.seed(123)
df <- data.frame(Col1 = sample(c("A", "B", "C"), 1000, replace = TRUE),
                 Col2 = sample(c("A", "B", "C"), 1000, replace = TRUE),
                 database = sample(c("IntAct", "Bind", "BioGrid"), 1000,
                                   replace = TRUE), stringsAsFactors = FALSE)
# benchmark
microbenchmark::microbenchmark(
  t = df[ !duplicated(t(apply(df[, 1:2], 1, sort))), ] ,
  paste = df[ !duplicated(apply(df[, 1:2], 1,
                                function(i)paste(sort(i), collapse = ","))), ],
  pmin = df[ !duplicated(cbind(pmin(df[, 1], df[, 2]), pmax(df[, 1], df[, 2]))), ],
  times = 1000)

# Unit: milliseconds
#   expr      min        lq      mean    median        uq       max neval cld
#      t 33.49008 36.337253 38.374825 37.420015 39.610627 153.89251  1000   b
#  paste 33.24177 36.102055 38.079015 37.330498 39.465803 151.43734  1000   b
#   pmin  2.59116  2.790864  3.034999  2.910316  3.137389  11.99905  1000  a
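
In this run the pmin version has a median of about 2.9 ms against roughly 37 ms for the other two. A likely reason is that apply() calls sort() once per row, while pmin() and pmax() work on whole columns at once. A speed comparison only means something if the three expressions select the same rows, so here is a small sketch that checks this on the same dummy data. The variable names are illustrative, and the check assumes character columns as in the benchmark.

```r
# Same dummy data as in the benchmark (database column omitted,
# since only the first two columns form the key).
set.seed(123)
df <- data.frame(Col1 = sample(c("A", "B", "C"), 1000, replace = TRUE),
                 Col2 = sample(c("A", "B", "C"), 1000, replace = TRUE),
                 stringsAsFactors = FALSE)

# Duplicate flags from each of the three benchmarked approaches.
dup_t     <- duplicated(t(apply(df[, 1:2], 1, sort)))
dup_paste <- duplicated(apply(df[, 1:2], 1,
                              function(i) paste(sort(i), collapse = ",")))
dup_pmin  <- duplicated(cbind(pmin(df[, 1], df[, 2]),
                              pmax(df[, 1], df[, 2])))

# TRUE if all three approaches flag exactly the same rows.
same <- all(dup_t == dup_paste) && all(dup_paste == dup_pmin)
```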
