Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


na - R split-apply-combine problems

I'm new to R and have a large data.frame (906 rows). I want to split the data.frame by row according to the first column (so that entries associated with the same name stay together) and then apply multiple descriptive statistics (mean, standard deviation, standard error/variance, 25% and 75% confidence intervals, min, max, and median) to the rest of the columns. The number of rows associated with each species is not the same, so the splits are uneven/unbalanced. There are lots of NAs scattered in the "par" columns (every row has at least 1 entry for the columns), but I just want to ignore/skip over the NAs, not delete/omit the row. Here's a picture of my initial data.frame (the column names are not the actual column names I'm using).

I want my final output to show: a column for the name, a column for the descriptive stat, and a column of the results of the descriptive statistic (one column for each par). I've included a picture of what I want the table output to look like, if it's possible (the values in the par columns aren't actually the calculated stats; I just put random stuff in to fill the frame). Everything I've tried so far hasn't worked. Again, I'm fairly new to R and not really sure what I'm doing. Please help.

question from:https://stackoverflow.com/questions/65944066/r-split-apply-combine-problems


1 Reply


Often you can find suitable data for your reproducible example by looking at what comes with R (data() will show a list of data sets and brief descriptions). For example, the iris data set is similar to yours except that the species name is the last column:

data(iris)
iris <- iris[, c(5, 1:4)]
iris.splt <- split(iris[, 2:5], iris[, 1])
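As a quick check on the split, a sketch like the following lists the group names and the number of rows in each piece. The second half drops a few rows first (the name iris.sub is made up for this illustration) to show that split() does not need the groups to be the same size, which matches the unbalanced case in the question.

```r
data(iris)
iris <- iris[, c(5, 1:4)]                    # species column first
iris.splt <- split(iris[, 2:5], iris[, 1])   # one data frame per species

names(iris.splt)                             # group names come from column 1
sapply(iris.splt, nrow)                      # rows in each group

# Hypothetical unbalanced case: drop the first 10 rows, then split again
iris.sub <- iris[-(1:10), ]
sizes <- sapply(split(iris.sub[, 2:5], iris.sub[, 1]), nrow)
sizes
```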

Now we have loaded the data, moved the last column to the first position, and split the dataset by species into 3 data frames that are stored in a single list called iris.splt. The species name is the name of each part of the list and only the data are stored in the data frame for that list part. Now you need to write a function that computes the statistics you need. Here is an example based on the picture you provided, but you will probably need to change it:

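stats <- function(x) {
    # Quantiles (0%, 25%, 50%, 75%, 100%) as a one-column matrix;
    # 0% is the minimum, 50% the median, and 100% the maximum
    quant <- as.matrix(quantile(x, na.rm = TRUE))
    # na.rm = TRUE skips NAs instead of returning NA for the whole column
    rbind(quant,
          mean = mean(x, na.rm = TRUE),
          sd   = sd(x, na.rm = TRUE),
          var  = var(x, na.rm = TRUE))
}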
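The question also asks for the standard error and mentions scattered NAs. One possible sketch is below; stats_se is a hypothetical helper name, not part of the answer's code, and it assumes the standard error of the mean is wanted, computed as sd / sqrt(n) with n counting only the non-missing values. The input vector is made up.

```r
# Hypothetical helper: count, mean, sd and standard error, ignoring NAs
stats_se <- function(x) {
    n <- sum(!is.na(x))          # number of non-missing values
    s <- sd(x, na.rm = TRUE)     # sd computed without the NAs
    c(n = n, mean = mean(x, na.rm = TRUE), sd = s, se = s / sqrt(n))
}

x <- c(2, 4, NA, 6, NA, 8)       # made-up values with two NAs
res <- stats_se(x)
res                              # n is 4 and mean is 5; the NAs are skipped
```

Rows with an NA in one column are still used for the other columns, because each column is summarised on its own.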

This computes the statistics for a single column. We need to run the function on each column of each part of the list using the lapply function twice and then a third time to combine the columns back together:

iris.stats <- lapply(iris.splt, function(x) lapply(x, stats))
iris.dfs <- lapply(iris.stats, data.frame)
iris.dfs
# $setosa
#      Sepal.Length Sepal.Width Petal.Length Petal.Width
# 0%         4.3000      2.3000      1.00000     0.10000
# 25%        4.8000      3.2000      1.40000     0.20000
# 50%        5.0000      3.4000      1.50000     0.20000
# 75%        5.2000      3.6750      1.57500     0.30000
# 100%       5.8000      4.4000      1.90000     0.60000
# mean       5.0060      3.4280      1.46200     0.24600
# sd         0.3525      0.3791      0.17366     0.10539
# var        0.1242      0.1437      0.03016     0.01111
# 
# $versicolor
#      Sepal.Length Sepal.Width Petal.Length Petal.Width
# 0%         4.9000     2.00000       3.0000     1.00000
# 25%        5.6000     2.52500       4.0000     1.20000
# 50%        5.9000     2.80000       4.3500     1.30000
# 75%        6.3000     3.00000       4.6000     1.50000
# 100%       7.0000     3.40000       5.1000     1.80000
# mean       5.9360     2.77000       4.2600     1.32600
# sd         0.5162     0.31380       0.4699     0.19775
# var        0.2664     0.09847       0.2208     0.03911
# 
# $virginica
#      Sepal.Length Sepal.Width Petal.Length Petal.Width
# 0%         4.9000      2.2000       4.5000     1.40000
# 25%        6.2250      2.8000       5.1000     1.80000
# 50%        6.5000      3.0000       5.5500     2.00000
# 75%        6.9000      3.1750       5.8750     2.30000
# 100%       7.9000      3.8000       6.9000     2.50000
# mean       6.5880      2.9740       5.5520     2.02600
# sd         0.6359      0.3225       0.5519     0.27465
# var        0.4043      0.1040       0.3046     0.07543

You will have to decide how you want to use this list or if you want to combine it back into a single data frame, but this should get you started.
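If a single table is wanted, with one column for the name, one for the statistic, and one per measured column as described in the question, one possible approach is sketched here. The names combined, name, and stat are chosen for this illustration only, and the block repeats the earlier steps so that it runs on its own.

```r
data(iris)
iris <- iris[, c(5, 1:4)]
iris.splt <- split(iris[, 2:5], iris[, 1])

stats <- function(x) {
    quant <- as.matrix(quantile(x, na.rm = TRUE))
    rbind(quant,
          mean = mean(x, na.rm = TRUE),
          sd   = sd(x, na.rm = TRUE),
          var  = var(x, na.rm = TRUE))
}

iris.stats <- lapply(iris.splt, function(x) lapply(x, stats))
iris.dfs <- lapply(iris.stats, data.frame)

# Turn the row names into a "stat" column, add the group name,
# then stack the per-group tables on top of each other
combined <- do.call(rbind, lapply(names(iris.dfs), function(sp) {
    data.frame(name = sp,
               stat = rownames(iris.dfs[[sp]]),
               iris.dfs[[sp]],
               row.names = NULL)
}))
rownames(combined) <- NULL       # plain 1..n row numbers
head(combined, 3)
```

The result has one row per combination of group and statistic, so it can be filtered with ordinary subsetting, for example combined[combined$stat == "mean", ].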

