plyr - How to expand a large dataframe in R

Question

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

I have a dataframe

df <- data.frame(
  id = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4), 
  date = c("1985-06-19", "1985-06-19", "1985-06-19", "1985-08-01", 
           "1985-08-01", "1990-06-19", "1990-06-19", "1990-06-19", 
           "1990-06-19", "2000-05-12"), 
  spp = c("a", "b", "c", "c", "d", "b", "c", "d", "a", "b"),
  y = rpois(10, 5))

   id       date spp y
1   1 1985-06-19   a 6
2   1 1985-06-19   b 3
3   1 1985-06-19   c 7
4   2 1985-08-01   c 7
5   2 1985-08-01   d 6
6   3 1990-06-19   b 5
7   3 1990-06-19   c 4
8   3 1990-06-19   d 4
9   3 1990-06-19   a 6
10  4 2000-05-12   b 6

I want to expand it so that every combination of id and spp is present, with y = 0 for every combination that is not currently in the dataframe. The dataframe currently has about 100,000 rows and 15 columns. When expanded it would have about 300,000 rows (there are 17 unique values of spp in my actual dataset).

For every value of id the date is the same (e.g. when id = 2, date always = 1985-08-01). In my real dataset all the columns except spp and y can be specified by the id.

I want to end up with something like:

   id       date spp y
   1 1985-06-19   a 6
   1 1985-06-19   b 3
   1 1985-06-19   c 7
   1 1985-06-19   d 0*
   2 1985-08-01   a 0*
   2 1985-08-01   b 0*
   2 1985-08-01   c 7
   2 1985-08-01   d 6
   3 1990-06-19   b 5
   3 1990-06-19   c 4
   3 1990-06-19   d 4
   3 1990-06-19   a 6
   4 2000-05-12   a 0*
   4 2000-05-12   b 6
   4 2000-05-12   c 0*
   4 2000-05-12   d 0*

* indicates added rows

I will likely have to do this again with potentially much larger data frames, so a quick, efficient (in time and memory) way to do this would be appreciated, but any solution would satisfy me. I figure there should be ways to use the dplyr, data.table, or reshape packages, but I'm not very familiar with any of them. Would it be easiest to expand just the columns id, spp, and y, and then use left_join() or merge() to recombine date (and all the other variables in my real dataframe) based on id?

1 Reply

深蓝 · Answer 1 · 2021-10-17T03:08:52+0000

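expand.grid is a useful function here: build every id/spp combination, then merge the original data onto that grid.

# Every id/spp combination, joined to the observed rows (missing ones get NA)
mergedData <- merge(
    expand.grid(id = unique(df$id), spp = unique(df$spp)),
    df, by = c("id", "spp"), all = TRUE)

# Combinations that were absent from df get y = 0
mergedData$y[is.na(mergedData$y)] <- 0

# date is determined by id, so look it up from the first matching row of df
# (this does not depend on date being a factor or on the row order)
mergedData$date <- df$date[match(mergedData$id, df$id)]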
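Since the real dataframe has many columns that are determined by id, here is a rough sketch of how the same idea might carry all of them along. It assumes, as stated in the question, that every column except spp and y is constant within an id; the names id_cols, id_info, grid and expanded are just illustrative, and y is hard-coded here so the result is reproducible.

```r
# Toy data from the question, with y fixed instead of drawn from rpois()
df <- data.frame(
  id   = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4),
  date = c("1985-06-19", "1985-06-19", "1985-06-19", "1985-08-01",
           "1985-08-01", "1990-06-19", "1990-06-19", "1990-06-19",
           "1990-06-19", "2000-05-12"),
  spp  = c("a", "b", "c", "c", "d", "b", "c", "d", "a", "b"),
  y    = c(6, 3, 7, 7, 6, 5, 4, 4, 6, 6),
  stringsAsFactors = FALSE
)

# Columns assumed to be fully determined by id (everything except spp and y)
id_cols <- setdiff(names(df), c("spp", "y"))

# One row per id holding those columns
id_info <- unique(df[, id_cols, drop = FALSE])

# Every id/spp combination
grid <- expand.grid(id = unique(df$id), spp = unique(df$spp),
                    stringsAsFactors = FALSE)

# Attach the observed y values; unobserved combinations come back as NA
expanded <- merge(grid, df[, c("id", "spp", "y")],
                  by = c("id", "spp"), all.x = TRUE)
expanded$y[is.na(expanded$y)] <- 0

# Re-attach the id-level columns and restore the original column order
expanded <- merge(id_info, expanded, by = "id")
expanded <- expanded[order(expanded$id, expanded$spp), c(id_cols, "spp", "y")]
```

Only the three columns id, spp and y go through the expansion step, so the wide id-level columns are copied once at the end rather than being carried through the NA-filling step.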

Since you're not actually doing anything to subsets of the data, I don't think plyr will help; there may be more efficient ways with data.table.
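For what it's worth, a data.table version might look something like the sketch below. It assumes the data.table package is installed, uses CJ() to build the id/spp cross join, and again assumes date is constant within an id. I have not benchmarked it against the merge() approach, and the names dt, id_info, grid and expanded are only illustrative.

```r
library(data.table)

# Toy data from the question, with y fixed instead of drawn from rpois()
dt <- data.table(
  id   = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4),
  date = c("1985-06-19", "1985-06-19", "1985-06-19", "1985-08-01",
           "1985-08-01", "1990-06-19", "1990-06-19", "1990-06-19",
           "1990-06-19", "2000-05-12"),
  spp  = c("a", "b", "c", "c", "d", "b", "c", "d", "a", "b"),
  y    = c(6, 3, 7, 7, 6, 5, 4, 4, 6, 6)
)

# One row per id with the id-level columns (only date in this toy example)
id_info <- unique(dt[, .(id, date)])

# Cross join of all ids with all species
grid <- CJ(id = unique(dt$id), spp = unique(dt$spp))

# Join the observed y values onto the grid; unobserved combinations get NA
expanded <- dt[, .(id, spp, y)][grid, on = .(id, spp)]

# Replace the NAs by reference, without copying the table
expanded[is.na(y), y := 0]

# Re-attach the id-level columns
expanded <- id_info[expanded, on = "id"]
```

The `:=` update modifies the table in place, which is one reason data.table may use less memory than the base R version on large inputs.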
