I'm having a difficult time understanding how to calculate a mean when the data is in long format (i.e: each observation has its own row).
For example, I have joined a database of surgeries and of transfusions (receiving blood products) based on social id and date of surgery. To this I add a column called "transfused" which is a binary switch, if the row contains any transfusion (i.e. blood, plasma, platelets = 1, then transfused = 1)
Each surgery and each transfusion is it's own row in the data, which causes there to be many rows per id, making mean calculations incorrect.
For example, if we have a simple example with a dataset containing just 2 actual surgeries, where one of them receive 50 transfusions, and in one of them no transfusions; In the combined dataset, I'll have 51 full rows (due to R recycling).
In reality 50% of the surgeries required a transfusion, but in the above example the mean-calc will show that 50/51 surgeries were transfused.
Where am I going wrong?
I realize R is doing exactly what I'm telling it to do, but I can't figure out how I should proceed so that the "transfused" flag is only set or counted once per unique id and date in the end-calculation
library(tidyverse)
surgeries <- tibble(
id = 1:10,
operation = c("App", "App", "App", "App", "App", "App", "App", "App", "App", "App"),
date = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02", "2020-01-03", "2020-01-03", "2020-01-04", "2020-01-04", "2020-01-05", "2020-01-05")
)
transfusions <- tibble(
id = c(1,1,1,2,2,3,4,8,8,8),
type = c("Blood", "Blood","Plasma","Blood","Platelets","Blood","Blood","Plasma","Blood","Blood"),
date = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02", "2020-01-03", "2020-01-03", "2020-01-04", "2020-01-04", "2020-01-05", "2020-01-05")
)
combined <- surgeries %>%
left_join(transfusions, by = c("id", "date"))
combined <- combined %>%
mutate(
transfused = if_else((type == "Blood" | type == "Plasma" | type == "Platelets"), 1, 0, missing=0)
)
aggregate(combined, by=list(Operation = combined$operation), mean)
The desired outcome for the mean should be 2/10 in the above example, but becomes 3/11 because of the format of each row being an observation
question from:
https://stackoverflow.com/questions/66046512/counting-correct-mean-in-long-format-only-once-per-id-in-r