Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

r - Why does data.table update names(DT) by reference, even if I assign to another variable?

I've stored the names of a data.table as a vector:

library(data.table)
set.seed(42)
DT <- data.table(x = runif(100), y = runif(100))
names1 <- names(DT)

As far as I can tell, it's a plain vanilla character vector:

str(names1)
# chr [1:2] "x" "y"

class(names1)
# [1] "character"

dput(names1)
# c("x", "y")

However, this is no ordinary character vector. It's a magic character vector! When I add a new column to my data.table, this vector gets updated!

DT[ , z := runif(100)]
names1
# [1] "x" "y" "z"

I know this has something to do with how := updates by reference, but it still seems like magic to me, as I expect <- to make a copy of the data.table's names.

I can fix this by wrapping the names in c():

library(data.table)
set.seed(42)
DT <- data.table(x = runif(100), y = runif(100))

names1 <- names(DT)
names2 <- c(names(DT))
all.equal(names1, names2)
# [1] TRUE

DT[ , z := runif(100)]
names1
# [1] "x" "y" "z"

names2
# [1] "x" "y"

My question is 2-fold:

  1. Why doesn't names1 <- names(DT) create a copy of the data.table's names? In other instances, we are explicitly warned that <- creates copies, both of data.tables and data.frames.
  2. What's the difference between names1 <- names(DT) and names2 <- c(names(DT))?


1 Reply


Update: This is now added in the documentation for ?copy in version 1.9.3. From NEWS:

  1. Moved ?copy to its own help page, and documented that dt_names <- copy(names(DT)) is necessary for dt_names to be not modified by reference as a result of updating DT by reference (ex: adding a new column by reference). Closes #512. Thanks to Zach for this SO question and user1971988 for this SO question.

Part of your first question leaves me unsure what you mean about the <- operator (at least in the context of data.table), especially this part: "In other instances, we are explicitly warned that <- creates copies, both of data.tables and data.frames."

So, before answering your actual question, I'll briefly address it here. For a data.table, a plain <- assignment is not sufficient to copy it. For example:

DT <- data.table(x = 1:5, y = 6:10)
# assign DT to DT2
DT2 <- DT # assigned by reference, no copy taken
DT2[, z := 11:15]
# DT now has the z column as well

If you want to create a copy, you have to request it explicitly with copy():

DT2 <- copy(DT) # copied content to DT2
DT2[, z := 11:15] # only DT2 is affected
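
One way to check this yourself is to compare memory addresses with data.table's address() function. The following is only a small sketch; the name DT3 is mine and is used purely for illustration:

```r
library(data.table)

DT  <- data.table(x = 1:5, y = 6:10)
DT2 <- DT        # plain assignment: both names refer to the same object
DT3 <- copy(DT)  # explicit copy: a separate object (DT3 is an illustrative name)

same_as_DT2 <- identical(address(DT), address(DT2))  # TRUE, no copy was taken
same_as_DT3 <- identical(address(DT), address(DT3))  # FALSE, copy() made a new object

DT2[, z := 11:15]    # add a column by reference through DT2
"z" %in% names(DT)   # TRUE: DT sees the new column
"z" %in% names(DT3)  # FALSE: the copy is unaffected
```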

From CauchyDistributedRV, I understand that what you mean is an assignment of the form names(dt) <- ., which results in the warning. I'll leave the above as it is.


Now, to answer your first question: It seems that names1 <- names(DT) also behaves similarly. I hadn't thought/known about this until now. The .Internal(inspect(.)) command is very useful here:

.Internal(inspect(names1))
# @7fc86a851480 16 STRSXP g0c7 [MARK,NAM(2)] (len=2, tl=100)
#   @7fc86a069f68 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "x"
#   @7fc86a0f96d8 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "y"

.Internal(inspect(names(DT)))
# @7fc86a851480 16 STRSXP g0c7 [MARK,NAM(2)] (len=2, tl=100)
#   @7fc86a069f68 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "x"
#   @7fc86a0f96d8 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "y"

Here you can see that both point to the same memory location, @7fc86a851480. Even the truelength of names1 is 100 (data.table over-allocates this by default; see ?alloc.col).

truelength(names1)
# [1] 100

So basically, the assignment names1 <- names(DT) seems to happen by reference. That is, names1 points to the same location as DT's column names.
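
If you want a snapshot of the names that stays put, the NEWS entry quoted at the top suggests copy(names(DT)). A minimal sketch of that (the variable name names_copy is just illustrative):

```r
library(data.table)

set.seed(42)
DT <- data.table(x = runif(5), y = runif(5))

# Take an explicit copy of the names before modifying DT by reference
names_copy <- copy(names(DT))

DT[, z := runif(5)]  # adds a column by reference

names(DT)   # "x" "y" "z"
names_copy  # "x" "y", the copy does not follow DT
```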

To answer your second question: c(.) seems to always create a copy. Because a concatenation can change the contents of the vector, c(.) allocates a new vector immediately, without checking whether the result actually differs from its input.
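
As a rough check (a sketch only, reusing the names1/names2 naming from the question), address() from data.table shows that the result of c(.) lives at a different memory location, even though the contents are equal:

```r
library(data.table)

set.seed(42)
DT <- data.table(x = runif(5), y = runif(5))

names1 <- names(DT)     # plain assignment of the names
names2 <- c(names(DT))  # c() builds a new character vector

same_contents <- identical(names1, names2)                  # TRUE
same_address  <- identical(address(names1), address(names2)) # FALSE, names2 is a separate vector
```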

