Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
518 views
in Technique[技术] by (71.8m points)

r - What is the fastest way to get a vector of sorted unique values from a data.table?

The answer to this question (Unique sorted rows single column from R data.table) suggested three different ways to get a vector of sorted unique values from a data.table:

# 1
sort(salesdt[, unique(company)])
#2 
sort(unique(salesdt$company))
#3
salesdt[order(company), unique(company)]

Another answer suggested other sort options than lexicographical order:

salesdt[, .N, by = company][order(-N), company]
salesdt[, sum(sales), by = company][order(-V1), company]

The data.table was created by

library(data.table)
company <- c("A", "S", "W", "L", "T", "T", "W", "A", "T", "W")
item <- c("Thingy", "Thingy", "Widget", "Thingy", "Grommit", 
          "Thingy", "Grommit", "Thingy", "Widget", "Thingy")
sales <- c(120, 140, 160, 180, 200, 120, 140, 160, 180, 200)
salesdt <- data.table(company,item,sales) 

As always, if different options are available to choose from I started to wonder what the best solution would be, in particular if the data.table would be much larger. I have searched a bit on SO but haven't found a particular answer so far.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

For benchmarking, a larger data.table is created with 1.000.000 rows:

n <- 1e6
set.seed(1234) # to reproduce the data
salesdt <- data.table(company = sample(company, n, TRUE), 
                      item = sample(item, n, TRUE), 
                      sales = sample(sales, n, TRUE))

For the sake of completeness also the variants

# 4
unique(sort(salesdt$company))
# 5
unique(salesdt[,sort(company)])

will be benchmarked although it seems to be obvious that sorting unique values should be faster than the other way around.

In addition, two other sort options from this answer are included:

# 6
salesdt[, .N, by = company][order(-N), company]
# 7
salesdt[, sum(sales), by = company][order(-V1), company]

Edit: Following from Frank's comment, I've included his suggestion:

# 8
salesdt[,logical(1), keyby = company]$company

Benchmarking, no key set

Benchmarking is done with help of the microbenchmark package:

timings <- microbenchmark::microbenchmark(
  sort(salesdt[, unique(company)]),
  sort(unique(salesdt$company)),
  salesdt[order(company), unique(company)],
  unique(sort(salesdt$company)),
  unique(salesdt[,sort(company)]),
  salesdt[, .N, by = company][order(-N), company],
  salesdt[, sum(sales), by = company][order(-V1), company],
  salesdt[,logical(1), keyby = company]$company
)

The timings are displayed with

ggplot2::autoplot(timings)

Please, note the reverse order in the chart (#1 at bottom, #8 at top).

enter image description here

As expected, variants #4 and #5 (unique after sort) are pretty slow. Edit: #8 is the fastest which confirms Frank's comment.

A bit of surprise to me was variant #3. Despite data.table's fast radix sort it is less efficient than #1 and #2. It seems to sort first and then to extract the unique values.

Benchmarking, data.table keyed by company

Motivated by this observation I repeated the benchmark with the data.table keyed by company.

setkeyv(salesdt, "company")

The timings show (please not the change in scale of the time axis) that #4 and #5 have been accelerated dramatically by keying. They are even faster than #3. Note that timings for variant #8 are included in the next section.

enter image description here

Benchmarking, keyed with a bit of tuning

Variant #3 still includes order(company) which isn't necessary if already keyed by company. So, I removed the unnecessary calls to order and sort from #3 and #5:

timings <- microbenchmark::microbenchmark(
  sort(salesdt[, unique(company)]),
  sort(unique(salesdt$company)),
  salesdt[, unique(company)],
  unique(salesdt$company),
  unique(salesdt[, company]),
  salesdt[, .N, by = company][order(-N), company],
  salesdt[, sum(sales), by = company][order(-V1), company],
  salesdt[,logical(1), keyby = company]$company
)

The timings now show variants #1 to #4 on the same level. Edit: Again, #8 (Frank's solution) is the fastests.

enter image description here

Caveat: The benchmarking is based on the original data which only includes 5 different letters as company names. It is likely that the result will look differently with a larger number of distinct company names. The results have been obtained with data.table v.1.9.7.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...