Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
180 views
in Technique[技术] by (71.8m points)

performance - Count number of observations by group matlab

I have a matlab dataset that looks like this:

year  value
1995    90000
1995    53000
1995    80000
1995    60000
1995    37000
1995    42000
1995    13102
1996    35000
1996    50000
1996    32000
1996    47000
1997    36000
1997    90000
1997    NaN
1997    90000
1997    51500
1997    81000
1998    71000
(...)
2020    68000

These are two separate columns of data.

Now I want to count the number of non-NaN observations in column value between 2010 and 2020 per year i.e. the output should look like:

year count
2010 20
2011 31
(...)
2020 9

If any count is zero, it should show up as zero.

I know I can do it with a very simple loop (example below). But this is very inefficient for a large dataset. I was looking into accumarray, but could not figure out how to do it.

    N = 300;

%Generate years vector
years = round(1996 + (2020-1996) .* (rand(N,1)));
years = sort(years);

% Generate values vector
values = rand(N,1); 
NaN_position = rand(N,1)>.9; %Now put some random NaNs
values(NaN_position) = NaN;

count = 1;
for y=min(years):max(years)
    
    indicator = years == y;
    count_vals(count,1) = sum(not(isnan(values(indicator))));
    count = count + 1;
end
question from:https://stackoverflow.com/questions/65833873/count-number-of-observations-by-group-matlab

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Let the data be defined as:

years = [1995 1995 1995 1995 1995 1995 1995 1996 1996 1996 1996 1997 1997 1997 1997 1997 1997 1998 2020].';
values = [90000 53000 80000 60000 37000 42000 13102 35000 50000 32000 47000 36000 90000 NaN 90000 51500 81000 71000 68000].';
year_min = 1996;
year_max = 1998;

Then:

result_year = year_min:year_max;
result_count = histcounts(years(~isnan(values)), [result_year year_max+.5]);

The term year_max+.5 is needed in the second input of histcounts because, as per the documentation, the last bin includes the right edge.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...