Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
84 views
in Technique[技术] by (71.8m points)

r - Categorizing and grouping data

I am having difficulty finding the right code for this question: Fields of expertise are a mess - how can I cluster them into summarized fields:

  • Business = Finance, HR, Sales, Marketing, Business development
  • Engineering = Robotics, Engineering, IT support, DevOps, Data Science

The expected result is:

col1(Inputs)                                         col2(Result)
Finance, HR, Robotics, Engineering, Sales            Business, Engineering
DevOps, Data Science, Marketing, IT Support          Business, Engineering
UI/UX, HR, Graphic Design                            Other, Business
Business development                                 Business

In other words, the result should not be like this:

col1(Inputs)                                   col2(Result)
Finance, HR, Robotics, Engineering, Sales      Biz, Biz, Engineer, Engineer, Biz
DevOps, Data Science, Marketing, IT support    Engineer, Engineer, Biz, Engineer
UI/UX, HR, Graphic Design                      Other, Biz, Other

Thank you very much for the answer. I'm thinking about having 2 FOR loops, but then I am stuck in defining the variables i,j.

question from:https://stackoverflow.com/questions/66047284/categorizing-and-grouping-data

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

The journey.

Data,

dat <- read.table(header = TRUE, sep = "|", text = "
Inputs                                      | Result
Finance, HR, Robotics, Engineering, Sales   | Business, Engineering
DevOps, Data Science, Marketing, IT Support | Business, Engineering
UI/UX, HR, Graphic Design                   | Other, Business
Business development                        | Business")
dat[] <- lapply(dat, trimws)
dat
#                                        Inputs                Result
# 1   Finance, HR, Robotics, Engineering, Sales Business, Engineering
# 2 DevOps, Data Science, Marketing, IT Support Business, Engineering
# 3                   UI/UX, HR, Graphic Design       Other, Business
# 4                        Business development              Business

BTW: I would generally recommend against the structure of comma-separated values within a frame column: any time you want to look at individual values, you need to parse/separate the fields, do something, and then recombine them. While this is not extremely difficult, it is inefficient, and there can always be problems.

Now, the translation from the left to the right:

fields <- list(
  Business = c("Finance", "HR", "Sales", "Marketing", "Business development"),
  Engineering = c("Robotics", "Engineering", "IT Support", "DevOps", "Data Science")
)
inv_fields <- setNames(rep(names(fields), lengths(fields)), unlist(fields))
inv_fields
#              Finance                   HR                Sales            Marketing Business development 
#           "Business"           "Business"           "Business"           "Business"           "Business" 
#             Robotics          Engineering           IT support               DevOps         Data Science 
#        "Engineering"        "Engineering"        "Engineering"        "Engineering"        "Engineering" 

I first defined fields in that way because I think it's easier to see and maintain. However, for the trick further down (indexing on it), I need the name of each element to be the left (career) and the value to be the right (field). We don't use fields itself.

First attempt, I'll split the strings and then run the indexing:

lapply(strsplit(dat$Inputs, ","), function(z) inv_fields[trimws(z)])
# [[1]]
#       Finance            HR      Robotics   Engineering         Sales 
#    "Business"    "Business" "Engineering" "Engineering"    "Business" 
# [[2]]
#        DevOps  Data Science     Marketing          <NA> 
# "Engineering" "Engineering"    "Business"            NA 
# [[3]]
#       <NA>         HR       <NA> 
#         NA "Business"         NA 
# [[4]]
# Business development 
#           "Business" 

Unfortunately, we lose IT Support, because your comment had it spelled as IT support (lower-case "s"). While this one thing is an easy fix, I feel it's possible (likely, even) that there will be case differences that you want to ignore. So we'll deal with the lower-case version of everything (during the lookup):

# redefine, using lower-case
inv_fields_lc <- setNames(rep(names(fields), lengths(fields)), tolower(unlist(fields)))
lapply(strsplit(dat$Inputs, ","), function(z) inv_fields_lc[tolower(trimws(z))])
# [[1]]
#       finance            hr      robotics   engineering         sales 
#    "Business"    "Business" "Engineering" "Engineering"    "Business" 
# [[2]]
#        devops  data science     marketing    it support 
# "Engineering" "Engineering"    "Business" "Engineering" 
# [[3]]
#       <NA>         hr       <NA> 
#         NA "Business"         NA 
# [[4]]
# business development 
#           "Business" 

Now we need to address the missing values and replace them with "Other", and reduce duplicate entries:

lapply(strsplit(dat$Inputs, ","), function(z) {
  out <- inv_fields_lc[tolower(trimws(z))]
  out[is.na(out)] <- "Other"
  unique(out)
})
# [[1]]
# [1] "Business"    "Engineering"
# [[2]]
# [1] "Engineering" "Business"   
# [[3]]
# [1] "Other"    "Business"
# [[4]]
# [1] "Business"

We can then combine (paste) these and add them as a column to the original frame. (Again, I don't like comma-separated strings like this in frames, it might be more useful to use list-columns, but it depends heavily on the rest of your processing.)

dat$Result2 <- sapply(strsplit(dat$Inputs, ","), function(z) {
  out <- inv_fields_lc[tolower(trimws(z))]
  out[is.na(out)] <- "Other"
  paste(sort(unique(out)), collapse = ", ")
})
dat
#                                        Inputs                Result               Result2
# 1   Finance, HR, Robotics, Engineering, Sales Business, Engineering Business, Engineering
# 2 DevOps, Data Science, Marketing, IT Support Business, Engineering Business, Engineering
# 3                   UI/UX, HR, Graphic Design       Other, Business       Business, Other
# 4                        Business development              Business              Business

(I added a sort too, not sure if that's desired.)


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...