Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


r - Counting the number of changes of a categorical variable during repeated measurements within a category

I'm working with a dataset about migration across the country with the following columns:

i   birth   gender  race    region  urban   wage    year  educ
1   58      2        3      1       1       4620    1979   12
1   58      2        3      1       1       4620    1980   12
1   58      2        3      2       1       4620    1981   12
1   58      2        3      2       1       4700    1982   12

.....

i   birth   gender  race    region  urban   wage    year  educ
45   65      2        3      3       1      NA       1979   10
45   65      2        3      3       1      NA       1980   10
45   65      2        3      4       2      11500    1981   10
45   65      2        3      1       1      11500    1982   10

i = individual id. The survey follows a large group of people for 25 years and records changes in 'region' (a categorical variable, 1-4), 'urban' (dummy), 'wage' and 'educ'.

How do I count the total number of times 'region' or 'urban' has changed (e.g. from region 1 to region 3, or from urban 0 to 1) within each subject during the 25-year observation period? I also have some NAs in the data, which should be ignored.

A simplified version of expected output:

i  changes in region
1   1
...
45  2

i  changes in urban
1   0
...
45  2

I would then like to sum up the number of changes for region and urban.

I came across these answers: Count number of changes in categorical variables during repeated measurements and Identify change in categorical data across datapoints in R but I still don't get it.

Here's a part of the data for i=4.

i   birth gender    race    region  urban   wage    year    educ
4   62      2        3        1      1       NA     1979    9
4   62      2        3        NA     NA      NA     1980    9
4   62      2        3        4      1       0      1981    9
4   62      2        3        4      1       1086   1982    9
4   62      2        3        1      1       70     1983    9
4   62      2        3        1      1       0      1984    9
4   62      2        3        1      1       0      1985    9
4   62      2        3        1      1       7000   1986    9
4   62      2        3        1      1      17500   1987    9
4   62      2        3        1      1      21320   1988    9
4   62      2        3        1      1      21760   1989    9
4   62      2        3        1      1         0    1990    9
4   62      2        3        1      1         0    1991    9
4   62      2        3        1      1      30500   1992    9
4   62      2        3        1      1      33000   1993    9
4   62      2        3       NA     NA        NA    1994    9
4   62      2        3        4      1      35000   1996    9

Here, output should be:

i change_reg   change_urban
4  3            0

1 Reply


Here is something I hope will get you closer to what you need.

First, group by i. Then create a column that holds a 1 for each change in region, by comparing the current value of region with the previous value (using lag). Note that when the previous value is NA (as it is for the first row of a given i), the row is counted as no change.

The same approach is used for urban. Finally, summarize by totaling the changes for each i. The per-row flags (reg_change and urban_change) are kept as separate columns, so you can run the pipeline up to the mutate step to check that you are getting the results you want.
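To inspect the per-row flags before they are totaled, a minimal sketch along these lines may help. It assumes a small hypothetical data frame `df_small` holding only the i, year, region and urban columns for i = 1 and i = 45 from the question, and stops after the mutate step.

```r
library(dplyr)

# Hypothetical subset of the question's data (i = 1 and i = 45 only)
df_small <- data.frame(
  i      = c(1L, 1L, 1L, 1L, 45L, 45L, 45L, 45L),
  year   = rep(1979:1982, 2),
  region = c(1, 1, 2, 2, 3, 3, 4, 1),
  urban  = c(1, 1, 1, 1, 1, 1, 2, 1)
)

# Same flags as in the answer, but without the final summarize(),
# so every row keeps its own 0/1 change indicator
flags <- df_small %>%
  group_by(i) %>%
  mutate(reg_change   = ifelse(region == lag(region) | is.na(lag(region)), 0, 1),
         urban_change = ifelse(urban == lag(urban) | is.na(lag(urban)), 0, 1)) %>%
  ungroup()

print(flags)
```

Each row with a 1 marks a year in which the value differs from the previous year for the same person, which makes it easy to spot-check a few individuals by eye.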

Edit: If you wish to remove rows that have NA for region or urban, you can add drop_na (from tidyr) first.
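Why dropping the NA rows matters can be seen on a single vector. The sketch below uses a hypothetical vector `x` shaped like the start of the region column for i = 4; it is an illustration of the comparison logic, not part of the original answer.

```r
library(dplyr)

# Hypothetical region values with a gap, similar to i = 4 in the question
x <- c(1, NA, 4, 4, 1)

# With the NA left in: comparing against NA yields NA, so the flag for the
# NA row is NA and the 1 -> 4 move across the gap is never flagged
prev_raw <- lag(x)
flag_raw <- ifelse(x == prev_raw | is.na(prev_raw), 0, 1)
sum(flag_raw)   # NA, because one flag is NA

# With the NA removed first: the 1 -> 4 and 4 -> 1 moves are both counted
x_clean    <- x[!is.na(x)]
prev_clean <- lag(x_clean)
flag_clean <- ifelse(x_clean == prev_clean | is.na(prev_clean), 0, 1)
sum(flag_clean) # 2
```

On a data frame, drop_na(region, urban) plays the role of the `x[!is.na(x)]` step here.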

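library(dplyr)
library(tidyr)  # provides drop_na()

df_tot <- df %>%
  # drop rows where region or urban is missing, so NAs are ignored
  drop_na(region, urban) %>%
  group_by(i) %>%
  # flag a change (1) when the value differs from the previous row for the same i;
  # the first row of each i has no previous value and counts as no change (0)
  mutate(reg_change = ifelse(region == lag(region) | is.na(lag(region)), 0, 1),
         urban_change = ifelse(urban == lag(urban) | is.na(lag(urban)), 0, 1)) %>%
  # total the flags per individual
  summarize(tot_region = sum(reg_change),
            tot_urban = sum(urban_change))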

# A tibble: 3 x 3
      i tot_region tot_urban
  <int>      <dbl>     <dbl>
1     1          1         0
2     4          3         0
3    45          2         2
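For anyone who wants to reproduce these totals, here is a self-contained sketch. It rebuilds only the i, year, region and urban columns from the rows shown in the question (the other columns are left out as they are not needed), and adds an arrange(i, year) step, which is an assumption on my part so that lag() sees each person's rows in time order.

```r
library(dplyr)
library(tidyr)

# Rows for i = 1, 4 and 45 as listed in the question (subset of columns)
df <- data.frame(
  i      = c(rep(1L, 4), rep(4L, 17), rep(45L, 4)),
  year   = c(1979:1982, 1979:1994, 1996, 1979:1982),
  region = c(1, 1, 2, 2,
             1, NA, 4, 4, rep(1, 11), NA, 4,
             3, 3, 4, 1),
  urban  = c(1, 1, 1, 1,
             1, NA, 1, 1, rep(1, 11), NA, 1,
             1, 1, 2, 1)
)

df_tot <- df %>%
  drop_na(region, urban) %>%   # ignore rows with missing region/urban
  arrange(i, year) %>%         # assumption: enforce time order within each i
  group_by(i) %>%
  mutate(reg_change   = ifelse(region == lag(region) | is.na(lag(region)), 0, 1),
         urban_change = ifelse(urban == lag(urban) | is.na(lag(urban)), 0, 1)) %>%
  summarize(tot_region = sum(reg_change),
            tot_urban  = sum(urban_change))

print(df_tot)
```

With this input the totals should match the table above: 1 and 0 for i = 1, 3 and 0 for i = 4, and 2 and 2 for i = 45.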

Edit: Afterwards, to get a grand total for both tot_region and tot_urban columns, you can use colSums. (Store your earlier result as df_tot as above.)

colSums(df_tot[-1])

tot_region  tot_urban 
         6          2 
