Options for caching / memoization / hashing in R

Question

Welcome To Ask or Share your Answers For Others

Options for caching / memoization / hashing in R

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

Options for caching / memoization / hashing in R

I am trying to find a simple way to use something like Perl's hash functions in R (essentially caching), as I intended to do both Perl-style hashing and write my own memoisation of calculations. However, others have beaten me to the punch and have packages for memoisation. The more I dig, the more I find, e.g.memoise and R.cache, but differences aren't readily clear. In addition, it's not clear how else one can get Perl-style hashes (or Python-style dictionaries) and write one's own memoization, other than to use the hash package, which doesn't seem to underpin the two memoization packages.

Since I can find no information on CRAN or elsewhere to distinguish between the options, perhaps this should be a community wiki question on SO: What are the options for memoization and caching in R, and what are their differences?

As a basis for comparison, here is a list of the options I've found. Also, it seems to me that all depend on hashing, so I'll note the hashing options as well. Key/value storage is somewhat related, but opens a huge can of worms regarding DB systems (e.g. BerkeleyDB, Redis, MemcacheDB and scores of others).

It looks like the options are:

Hashing

digest - provides hashing for arbitrary R objects.

Memoization

memoise - a very simple tool for memoization of functions.
R.cache - offers more functionality for memoization, though it seems some of the functions lack examples.

Caching

hash - Provides caching functionality akin to Perl's hashes and Python dictionaries.

Key/value storage

These are basic options for external storage of R objects.

Checkpointing

cacher - this seems to be more akin to checkpointing.
CodeDepends - An OmegaHat project that underpins cacher and provides some useful functionality.
DMTCP (not an R package) - appears to support checkpointing in a bunch of languages, and a developer recently sought assistance testing DMTCP checkpointing in R.

Other

Base R supports: named vectors and lists, row and column names of data frames, and names of items in environments. It seems to me that using a list is a bit of a kludge. (There's also pairlist, but it is deprecated.)
The data.table package supports rapid lookups of elements in a data table.

Use case

Although I'm mostly interested in knowing the options, I have two basic use cases that arise:

Caching: Simple counting of strings. [Note: This isn't for NLP, but general use, so NLP libraries are overkill; tables are inadequate because I prefer not to wait until the entire set of strings are loaded into memory. Perl-style hashes are at the right level of utility.]
Memoization of monstrous calculations.

These really arise because I'm digging in to the profiling of some slooooow code and I'd really like to just count simple strings and see if I can speed up some calculations via memoization. Being able to hash the input values, even if I don't memoize, would let me see if memoization can help.

Note 1: The CRAN Task View on Reproducible Research lists a couple of the packages (cacher and R.cache), but there is no elaboration on usage options.

Note 2: To aid others looking for related code, here a few notes on some of the authors or packages. Some of the authors use SO. :)

Dirk Eddelbuettel: digest - a lot of other packages depend on this.
Roger Peng: cacher, filehash, stashR - these address different problems in different ways; see Roger's site for more packages.
Christopher Brown: hash - Seems to be a useful package, but the links to ODG are down, unfortunately.
Henrik Bengtsson: R.cache & Hadley Wickham: memoise -- it's not yet clear when to prefer one package over the other.

Note 3: Some people use memoise/memoisation others use memoize/memoization. Just a note if you're searching around. Henrik uses "z" and Hadley uses "s".

Question&Answers:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-16T22:33:21+0000

I did not have luck with memoise because it gave too deep recursive problem to some function of a packaged I tried with. With R.cache I had better luck. Following is more annotated code I adapted from R.cache documentation. The code shows different options to do caching.

# Workaround to avoid question when loading R.cache library
dir.create(path="~/.Rcache", showWarnings=F) 
library("R.cache")
setCacheRootPath(path="./.Rcache") # Create .Rcache at current working dir
# In case we need the cache path, but not used in this example.
cache.root = getCacheRootPath() 
simulate <- function(mean, sd) {
    # 1. Try to load cached data, if already generated
    key <- list(mean, sd)
    data <- loadCache(key)
    if (!is.null(data)) {
        cat("Loaded cached data
")
        return(data);
    }
    # 2. If not available, generate it.
    cat("Generating data from scratch...")
    data <- rnorm(1000, mean=mean, sd=sd)
    Sys.sleep(1) # Emulate slow algorithm
    cat("ok
")
    saveCache(data, key=key, comment="simulate()")
    data;
}
data <- simulate(2.3, 3.0)
data <- simulate(2.3, 3.5)
a = 2.3
b = 3.0
data <- simulate(a, b) # Will load cached data, params are checked by value
# Clean up
file.remove(findCache(key=list(2.3,3.0)))
file.remove(findCache(key=list(2.3,3.5)))

simulate2 <- function(mean, sd) {
    data <- rnorm(1000, mean=mean, sd=sd)
    Sys.sleep(1) # Emulate slow algorithm
    cat("Done generating data from scratch
")
    data;
}
# Easy step to memoize a function
# aslo possible to resassign function name.
This would work with any functions from external packages. 
mzs <- addMemoization(simulate2)

data <- mzs(2.3, 3.0)
data <- mzs(2.3, 3.5)
data <- mzs(2.3, 3.0) # Will load cached data
# aslo possible to resassign function name.
# but different memoizations of the same 
# function will return the same cache result
# if input params are the same
simulate2 <- addMemoization(simulate2)
data <- simulate2(2.3, 3.0)

# If the expression being evaluated depends on
# "input" objects, then these must be be specified
# explicitly as "key" objects.
for (ii in 1:2) {
    for (kk in 1:3) {
        cat(sprintf("Iteration #%d:
", kk))
        res <- evalWithMemoization({
            cat("Evaluating expression...")
            a <- kk
            Sys.sleep(1)
            cat("done
")
            a
        }, key=list(kk=kk))
        # expressions inside 'res' are skipped on the repeated run
        print(res)
        # Sanity checks
        stopifnot(a == kk)
        # Clean up
        rm(a)
    } # for (kk ...)
} # for (ii ...)

Categories

Options for caching / memoization / hashing in R

Options for caching / memoization / hashing in R

Hashing

Memoization

Caching

Key/value storage

Checkpointing

Other

Use case

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags