Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

r - Divide all rows by a reference row, by group

Here is a sample table I'm working with:

n = c(rep("A",3),rep("B",3),rep("C",3))
m = c("X", "Y", "Z", "X", "Y", "Z", "X", "Y", "Z")
s = 1:9 
b = 5:13
c = 20:28
d = c(rep("abc", 9))
df = data.frame(d, n, m, s, b, c) 
df

Below is what the table looks like:

d   n   m   s   b   c
abc A   X   1   5   20
abc A   Y   2   6   21
abc A   Z   3   7   22
abc B   X   4   8   23
abc B   Y   5   9   24
abc B   Z   6   10  25
abc C   X   7   11  26
abc C   Y   8   12  27
abc C   Z   9   13  28

I'll refer to each row by concatenating its column n and m values (e.g. the AX row, the CZ row). I would like to divide each of the A rows by the AY row, each of the B rows by the BY row, and each of the C rows by the CY row (the base may not always be Y; sometimes it is X or Z). I essentially want to rebase the data (columns s, b, and c) by group (where column n is the group), using X, Y, or Z (column m) as the base.

I need columns d, n, and m to remain untouched. If possible, I'd like to do this by referencing X, Y, or Z in the code directly to denote which row will be the base, rather than by [1], [2], or [3] (as they may not always be in the same order, and it's more intuitive to the user). I'm new to R and using dplyr but I haven't been able to figure out a good way of doing this.

Thanks for your help.

1 Reply

Using data.table.

library(data.table)

setDT(df)

divselect <- "Y"

set(df, j = "s", value = as.numeric(df[["s"]]))
set(df, j = "b", value = as.numeric(df[["b"]]))
set(df, j = "c", value = as.numeric(df[["c"]]))

The set commands are there to avoid an error. The columns are currently integer, but the division will produce doubles. If the columns in your real data are already double, this step is not necessary.
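
To see the type change in isolation, here is a minimal base R sketch. The vector s mirrors column s from the question; nothing here touches the data.table itself.

```r
# 1:9 creates an integer vector, like column s in the question.
s <- 1:9
class(s)         # "integer"

# Dividing by the second element (the "Y" position in group A)
# produces a double vector, not an integer one.
ratio <- s / s[2]
class(ratio)     # "numeric"
```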

The value of divselect determines which row of each group (matched on column m) is used as the base. You can change it to X or Z as needed.

df[, `:=`(s = s/s[m == divselect],
          b = b/b[m == divselect],
          c = c/c[m == divselect]),
   by = n]

Result:

#      d n m     s         b         c
# 1: abc A X 0.500 0.8333333 0.9523810
# 2: abc A Y 1.000 1.0000000 1.0000000
# 3: abc A Z 1.500 1.1666667 1.0476190
# 4: abc B X 0.800 0.8888889 0.9583333
# 5: abc B Y 1.000 1.0000000 1.0000000
# 6: abc B Z 1.200 1.1111111 1.0416667
# 7: abc C X 0.875 0.9166667 0.9629630
# 8: abc C Y 1.000 1.0000000 1.0000000
# 9: abc C Z 1.125 1.0833333 1.0370370
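
The division above assumes every group has exactly one row where m equals divselect. If that might not hold in real data, a check along these lines could be run first. This is only a sketch: the small table is rebuilt here so the block runs on its own, and base_counts and n_base are names made up for illustration.

```r
library(data.table)

# Rebuild a reduced version of the sample table (assumption: same
# group and label layout as in the question).
df <- data.table(
  n = rep(c("A", "B", "C"), each = 3),
  m = rep(c("X", "Y", "Z"), times = 3),
  s = as.numeric(1:9)
)
divselect <- "Y"

# Count how many rows in each group match the base label.
base_counts <- df[, .(n_base = sum(m == divselect)), by = n]

# Stop early if any group has no base row or more than one.
stopifnot(all(base_counts$n_base == 1L))
```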

Followup

I have one question: is there a way to generalize the columns that get rebased? I'd like this code to be able to handle additional numeric columns (more than 3 without calling each out specifically). i.e. Can I define the division to happen to all columns except d, n, and m?

Yes, you can do this by using lapply either inside or outside the data.table.

setDT(df)

divselect <- "Y"

funcnumeric <- function(x) {
  set(df, j = x, value = as.numeric(df[[x]]))
  NULL
}

modcols <- names(df)[!(names(df) %in% c("d", "n", "m"))]

a <- lapply(modcols, funcnumeric)

This replaces the three set commands in the first answer. Instead of naming each column, we use lapply to apply the function to every column that is not d, n, or m. Note that the function returns NULL to avoid messy return output; since this is data.table, the conversion is all done in place.
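
As an alternative sketch (not the approach used in this answer), the same conversion can be written as a single ungrouped := over .SD. The table is rebuilt here so the block is self-contained.

```r
library(data.table)

# Sample table from the question; s, b, and c start out as integer.
df <- data.table(
  d = "abc",
  n = rep(c("A", "B", "C"), each = 3),
  m = rep(c("X", "Y", "Z"), times = 3),
  s = 1:9, b = 5:13, c = 20:28
)

# Every column except the identifier columns.
modcols <- setdiff(names(df), c("d", "n", "m"))

# Convert all of them to double in one ungrouped assignment.
df[, (modcols) := lapply(.SD, as.numeric), .SDcols = modcols]

# Inspect the resulting column classes.
sapply(df, class)
```

Because this assignment is not grouped, replacing each column with a vector of a different type is allowed.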

funcdiv <- function(x, pos) {
  x/x[pos]
}

df[ , (modcols) := lapply(.SD, 
                          funcdiv, 
                          pos = which(m == divselect)), 
    by = n, 
    .SDcols = modcols]

This is done slightly differently than before. Here we create a simple function that divides a vector by that vector's value at the position given by the pos parameter. We apply it to each column in .SD and pass pos as the position where the m column equals divselect, in this case Y. Since we specify by = n, both the vector and the pos argument passed to funcdiv are evaluated separately for each value of n. The .SDcols parameter specifies the columns we want to lapply this function over, which is the set of columns stored in modcols. We assign the results back to those same columns in place.
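
To make the per-group step concrete, here is what funcdiv does for group A alone, using plain vectors. The names m_A and s_A are made up for this illustration and stand in for the values that by = n would supply.

```r
funcdiv <- function(x, pos) {
  x / x[pos]
}

# Values of columns m and s for group A only.
m_A <- c("X", "Y", "Z")
s_A <- c(1, 2, 3)

# Position of the base row within the group.
pos <- which(m_A == "Y")    # 2

# Every value divided by the value at that position.
funcdiv(s_A, pos)           # 0.5 1.0 1.5
```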

Result:

#      d n m     s         b         c
# 1: abc A X 0.500 0.8333333 0.9523810
# 2: abc A Y 1.000 1.0000000 1.0000000
# 3: abc A Z 1.500 1.1666667 1.0476190
# 4: abc B X 0.800 0.8888889 0.9583333
# 5: abc B Y 1.000 1.0000000 1.0000000
# 6: abc B Z 1.200 1.1111111 1.0416667
# 7: abc C X 0.875 0.9166667 0.9629630
# 8: abc C Y 1.000 1.0000000 1.0000000
# 9: abc C Z 1.125 1.0833333 1.0370370 
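
Since the question mentions dplyr, here is a rough equivalent as a sketch, not part of the data.table answer. It assumes dplyr 1.0.0 or later (for across), and it assumes that every numeric column should be rebased. The names base_label and rebased are made up for illustration.

```r
library(dplyr)

# Sample table from the question.
df <- data.frame(
  d = "abc",
  n = rep(c("A", "B", "C"), each = 3),
  m = rep(c("X", "Y", "Z"), times = 3),
  s = 1:9, b = 5:13, c = 20:28
)

base_label <- "Y"

rebased <- df %>%
  group_by(n) %>%
  # Within each group, divide every numeric column by its value
  # on the row where m matches the base label.
  mutate(across(where(is.numeric), ~ .x / .x[m == base_label])) %>%
  ungroup()

rebased
```

If any of the identifier columns were numeric, where(is.numeric) would pick them up too, so the selection would need to name the columns to exclude instead.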
