Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
900 views
in Technique[技术] by (71.8m points)

group by - Counting correct mean in long format only once per id in R

I'm having a difficult time understanding how to calculate a mean when the data is in long format (i.e: each observation has its own row).

For example, I have joined a database of surgeries and of transfusions (receiving blood products) based on social id and date of surgery. To this I add a column called "transfused" which is a binary switch, if the row contains any transfusion (i.e. blood, plasma, platelets = 1, then transfused = 1)

Each surgery and each transfusion is it's own row in the data, which causes there to be many rows per id, making mean calculations incorrect.

For example, if we have a simple example with a dataset containing just 2 actual surgeries, where one of them receive 50 transfusions, and in one of them no transfusions; In the combined dataset, I'll have 51 full rows (due to R recycling).

In reality 50% of the surgeries required a transfusion, but in the above example the mean-calc will show that 50/51 surgeries were transfused.

Where am I going wrong? I realize R is doing exactly what I'm telling it to do, but I can't figure out how I should proceed so that the "transfused" flag is only set or counted once per unique id and date in the end-calculation

library(tidyverse)

surgeries <- tibble(
  id     = 1:10,
  operation  = c("App", "App", "App", "App", "App", "App", "App", "App", "App", "App"),
  date   = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02", "2020-01-03", "2020-01-03", "2020-01-04", "2020-01-04", "2020-01-05", "2020-01-05")
)

transfusions <- tibble(
  id     = c(1,1,1,2,2,3,4,8,8,8),
  type  = c("Blood", "Blood","Plasma","Blood","Platelets","Blood","Blood","Plasma","Blood","Blood"),
  date   = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02", "2020-01-03", "2020-01-03", "2020-01-04", "2020-01-04", "2020-01-05", "2020-01-05")
)

combined <- surgeries %>% 
  left_join(transfusions, by = c("id", "date"))

combined <- combined %>% 
  mutate(
    transfused = if_else((type == "Blood" | type == "Plasma" | type == "Platelets"), 1, 0, missing=0)
  )

aggregate(combined, by=list(Operation = combined$operation), mean)

The desired outcome for the mean should be 2/10 in the above example, but becomes 3/11 because of the format of each row being an observation

question from:https://stackoverflow.com/questions/66046512/counting-correct-mean-in-long-format-only-once-per-id-in-r

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Patient ID = 1 had one surgery and two blood transfusions, so in your final combined there is a duplicate row. Since you are only counting whether a patient had any transfusion (yes or no) and not the number, remove the duplicate before you do the calculation:

combined %>%
  distinct() %>%
  group_by(operation) %>%
  summarize(mean_transfusions = mean(transfused))

`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 1 x 2
  operation mean_transfusion
  <chr>                <dbl>
1 App                    0.2

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...