Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


r - assigning by reference into loaded package datasets

I am in the process of creating a package that uses a data.table as a dataset and has a couple of functions which assign by reference using :=.

I have built a simple package to demonstrate my problem:

library(devtools)
install_github('mnel/foo')

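It contains two functions:

# Adds column `a` by reference to the data.table passed in.
foo <- function(x) {
  x[, a := 1]
}
# Builds the call x[, a := 1] with x replaced by the caller's expression,
# then evaluates that call in the caller's frame.
fooCall <- function(x) {
  eval(substitute(x[, a := 1]), parent.frame(1))
}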

and a dataset DT (not lazy-loaded), created using:

DT <- data.table(b = 1:5)
save(DT, file = 'data/DT.rda')

When I install this package, my understanding is that foo(DT) should assign by reference within DT.

 library(foo)
 data(DT)
 foo(DT)
   b a
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 1

# However this has not assigned by reference within `DT`

DT
   b
1: 1
2: 2
3: 3
4: 4
5: 5

If I use the more correct form and assign the result back:

tracemem(DT)
DT <- foo(DT)
# This works without copying
DT
   b a
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 1
untracemem(DT)

If I use eval and substitute within the function:

fooCall(DT)
   b a
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 1
# it does assign by reference 
DT
   b a
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 1

My questions:

  1. Should I stick with DT <- foo(DT), or use the eval/substitute route?
  2. Is there something I'm not understanding about how data() loads datasets, even when they are not lazy-loaded?


1 Reply


This has nothing to do with datasets or locking -- you can reproduce it simply using:

DT<-unserialize(serialize(data.table(b = 1:5),NULL))
foo(DT)
DT

I suspect it is because data.table has to re-create the external pointer (extptr) inside the object on the first access to DT. Inside foo() that happens on a copy, so the modification cannot be shared with the original in the global environment.


[From Matthew] Exactly.

DT<-unserialize(serialize(data.table(b = 1:3),NULL))
DT
   b
1: 1
2: 2
3: 3
DT[,newcol:=42]
DT                 # OK: DT is rebound to a new shallow copy (when := is called directly)
   b newcol
1: 1     42
2: 2     42
3: 3     42

DT<-unserialize(serialize(data.table(b = 1:3),NULL))
foo(DT)
   b a
1: 1 1
2: 2 1
3: 3 1
DT                 # but not OK when called via the function foo()
   b
1: 1
2: 2
3: 3
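
A possible reading of why fooCall() from the question behaves differently: substitute() replaces x with the expression the caller wrote, so the call evaluated in the caller's frame is literally DT[, a := 1], which is the direct case shown earlier. The sketch below uses base R only; show_call is a hypothetical helper for illustration, not part of the package.

```r
# Hypothetical helper: returns the unevaluated call that fooCall() would run.
show_call <- function(x) {
  substitute(x[, a := 1])
}

# DT does not need to exist here; substitute() never evaluates x.
built <- show_call(DT)
print(built)

# The constructed call is the same as one typed at top level.
same_as_direct <- identical(built, quote(DT[, a := 1]))
```

Since the call refers to DT by name in the caller's frame, data.table can rebind that name, which it cannot do for the local x inside foo().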


DT<-unserialize(serialize(data.table(b = 1:3),NULL))
alloc.col(DT)      # alloc.col is needed first
   b
1: 1
2: 2
3: 3
foo(DT)
   b a
1: 1 1
2: 2 1
3: 3 1
DT                 # now it's ok
   b a
1: 1 1
2: 2 1
3: 3 1
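
For the package case in the question, the same fix can be sketched with load(), which is what data() uses for .rda files. This is only an illustration under assumptions: the temporary file stands in for the package's data/DT.rda, and foo is redefined here so the example is self-contained.

```r
library(data.table)

foo <- function(x) {
  x[, a := 1]
}

# Stand-in for the package's data/DT.rda (a temporary file, for illustration).
rda_path <- tempfile(fileext = ".rda")
DT <- data.table(b = 1:5)
save(DT, file = rda_path)
rm(DT)

# load() recreates DT from its serialized form, much as data(DT) would.
load(rda_path)

# Restore over-allocation once, right after loading.
alloc.col(DT)

# Now the assignment inside foo() is visible in the caller's DT.
invisible(foo(DT))
print(names(DT))
```

Calling alloc.col(DT) once after data(DT) keeps foo() unchanged, at the cost of asking users of the package to remember that extra step.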

Or don't pass DT into the function at all; just refer to it directly. Use data.table like a database: a few fixed-name tables in .GlobalEnv.

DT <- unserialize(serialize(data.table(b = 1:5),NULL))
foo <- function() {
   DT[, newcol := 7]
}
foo()
   b newcol
1: 1      7
2: 2      7
3: 3      7
4: 4      7
5: 5      7
DT              # Unserialized data.table now over-allocated and updated ok.
   b newcol
1: 1      7
2: 2      7
3: 3      7
4: 4      7
5: 5      7
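
The "over-allocated" remark in the comment above can be checked with truelength(), which data.table exports alongside alloc.col. A rough sketch, assuming truelength() reports the number of column pointer slots currently allocated; add_newcol is a hypothetical name for the zero-argument function used above.

```r
library(data.table)

DT <- unserialize(serialize(data.table(b = 1:5), NULL))
slots_before <- truelength(DT)

# Refers to DT by name instead of taking it as an argument.
add_newcol <- function() {
  DT[, newcol := 7]
}
invisible(add_newcol())

slots_after <- truelength(DT)
print(c(columns = length(DT), before = slots_before, after = slots_after))
```

If the table has been over-allocated, the "after" value should exceed the number of columns, leaving spare slots for later := calls.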
