Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
255 views
in Technique[技术] by (71.8m points)

r - Calculating a distance matrix by dtw

I have two matrices of normalized read counts for control and treatment in a time series day1 to day26. I want to calculate distance matrix by Dynamic Time Wrapping afterward use that for clustering but seems too complicated. I did so; who can help for more clarification please? Thanks a lot

> head(control[,1:4])
               MAST2     WWC2  PHYHIPL   R3HDM2
Control_D1  6.591024 5.695156 3.388652 5.756384
Control_D1 8.043454 5.365221 6.859768 6.936970
Control_D3 7.731590 4.868267 6.919972 6.931073
Control_D4 8.129948 5.105528 6.627016 7.090268
Control_D5 7.690863 4.729501 6.824746 6.904610
Control_D6 8.101723 5.334501 6.868990 7.115883
> 

> head(lead[,1:4])
              MAST2     WWC2  PHYHIPL   R3HDM2
Lead30_D1  6.418423 5.610699 3.734425 5.778046
Lead30_D2 7.918360 4.295191 6.559294 6.780952
Lead30_D3 7.807142 4.294722 6.599187 6.716040
Lead30_D4 7.856720 4.432136 6.572337 6.848483
Lead30_D5 7.827311 4.204738 6.607107 6.784094
Lead30_D6 7.848760 4.458451 6.581216 6.943003
>
> dim(control)
[1]   26 2603
> dim(lead)
[1]   26 2603
library(dtw)

for (i in control) { 
  for (j in lead) { 
    result[i,j] <- dtw( dist(control[,,i],lead[,,j]), distance.only=T )$normalizedDistance 
  }
}

Says that

Error in lead[, , j] : incorrect number of dimensions
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

There have already been questions similar to yours, but the answers haven't been too detailed. Here's a breakdown of what you need to know, in the specific case of R.

Calculating cross-distance matrices

The proxy package is made specifically for the calculation of cross-distance matrices. You should check its vignette to know which measures are already implemented by it. An example of its use:

set.seed(1L)
sample_data <- matrix(rnorm(50L), nrow = 5L, ncol = 10L)

suppressPackageStartupMessages(library(proxy))
distance_matrix <- proxy::dist(sample_data, method = "euclidean", 
                               upper = TRUE, diag = TRUE)
print(distance_matrix)
#>          1        2        3        4        5
#> 1 0.000000 2.636027 3.834764 5.943374 3.704322
#> 2 2.636027 0.000000 2.587398 4.515470 2.310364
#> 3 3.834764 2.587398 0.000000 4.008678 3.899561
#> 4 5.943374 4.515470 4.008678 0.000000 5.059321
#> 5 3.704322 2.310364 3.899561 5.059321 0.000000

Note: in the context of time series, proxy treats each row in a matrix as a series, which can be confirmed by the fact that sample_data above is a 5x10 matrix and the resulting cross-distance matrix is 5x5.

Using the DTW distance

The dtw package implements many variations of DTW, and it also leverages proxy. You could calculate a DTW distance matrix with:

suppressPackageStartupMessages(library(dtw))
dtw_distmat <- proxy::dist(sample_data, method = "dtw", 
                           upper = TRUE, diag = TRUE)

print(distance_matrix)
#>          1        2        3        4        5
#> 1 0.000000 2.636027 3.834764 5.943374 3.704322
#> 2 2.636027 0.000000 2.587398 4.515470 2.310364
#> 3 3.834764 2.587398 0.000000 4.008678 3.899561
#> 4 5.943374 4.515470 4.008678 0.000000 5.059321
#> 5 3.704322 2.310364 3.899561 5.059321 0.000000

Using custom distances

One nice thing about proxy is that it gives you the option to register custom functions. You seem to be interested in the normalized version of DTW, so you could do something like this:

ndtw <- function(x, y = NULL, ...) {
    dtw::dtw(x, y, ..., distance.only = TRUE)$normalizedDistance
}

pr_DB$set_entry(
  FUN = ndtw,
  names = "ndtw",
  loop = TRUE,
  distance = TRUE
)

ndtw_distmat <- proxy::dist(sample_data, method = "ndtw",
                            upper = TRUE, diag = TRUE)
print(ndtw_distmat)
#>           1         2         3         4         5
#> 1 0.0000000 0.4046622 0.5075772 0.6789465 0.5290478
#> 2 0.4046622 0.0000000 0.3630849 0.4866252 0.3612722
#> 3 0.5075772 0.3630849 0.0000000 0.5678698 0.3303344
#> 4 0.6789465 0.4866252 0.5678698 0.0000000 0.5078112
#> 5 0.5290478 0.3612722 0.3303344 0.5078112 0.0000000

See the documentation of pr_DB for more information.

Other DTW implementations

The dtwclust package (which I made) implements a basic but faster version of DTW which can use multi-threading and also leverages proxy:

suppressPackageStartupMessages(library(dtwclust))
dtw_basic_distmat <- proxy::dist(sample_data, method = "dtw_basic", normalize = TRUE)
print(dtw_basic_distmat)
#>      [,1]      [,2]      [,3]      [,4]      [,5]     
#> [1,] 0.0000000 0.4046622 0.5075772 0.6789465 0.5290478
#> [2,] 0.4046622 0.0000000 0.3630849 0.4866252 0.3612722
#> [3,] 0.5075772 0.3630849 0.0000000 0.5678698 0.3303344
#> [4,] 0.6789465 0.4866252 0.5678698 0.0000000 0.5078112
#> [5,] 0.5290478 0.3612722 0.3303344 0.5078112 0.0000000

The dtw_basic implementation only supports two step patterns and one window type, but it is considerably faster:

suppressPackageStartupMessages(library(microbenchmark))
microbenchmark(
  proxy::dist(sample_data, method = "dtw", window.type = "sakoechiba", window.size = 5L),
  proxy::dist(sample_data, method = "dtw_basic", window.size = 5L)
)

Unit: microseconds
                                                                                        expr      min       lq     mean
 proxy::dist(sample_data, method = "dtw", window.type = "sakoechiba",      window.size = 5L) 5279.124 5621.742 6070.069
                            proxy::dist(sample_data, method = "dtw_basic", window.size = 5L)  657.966  710.418  776.474
   median       uq       max neval cld
 5802.354 6348.199 10411.000   100   b
  752.282  814.037  1161.626   100  a

Another multi-threaded implementation is included in the parallelDist package, although I haven't personally tested it.

Multivariate or multi-dimensional time series

A single multivariate series is commonly a matrix where time spans the rows and the multiple variables span the columns. DTW also works for them:

mv_series1 <- matrix(rnorm(15L), nrow = 5L, ncol = 3L)
mv_series2 <- matrix(rnorm(15L), nrow = 5L, ncol = 3L)
print(dtw_distance <- dtw_basic(mv_series1, mv_series2))
#> [1] 22.80421

The nice thing about proxy is that it can calculate distances between objects contained in lists too, so you can put several multivariate series in lists of matrices:

mv_series <- lapply(1L:5L, function(dummy) {
  matrix(rnorm(15L), nrow = 5L, ncol = 3L)
})

mv_distmat_dtwclust <- proxy::dist(mv_series, method = "dtw_basic")
print(mv_distmat_dtwclust)
#>      [,1]     [,2]     [,3]     [,4]     [,5]    
#> [1,]  0.00000 27.43599 32.14207 36.42211 31.19279
#> [2,] 27.43599  0.00000 20.88470 23.88436 29.73219
#> [3,] 32.14207 20.88470  0.00000 22.14376 29.99899
#> [4,] 36.42211 23.88436 22.14376  0.00000 28.81111
#> [5,] 31.19279 29.73219 29.99899 28.81111  0.00000

Your case

Regardless of what you choose, you can probably use proxy to get your result, but since you haven't provided your whole data, I can't give you a more specific example. I presume that dtwclust::dtw_basic(control[, 1:4], lead[, 1:4], normalize = TRUE) would give you the distance between one pair of series, assuming you're treating each one as a multivariate series with 4 variables.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...