The journey.
Data,
dat <- read.table(header = TRUE, sep = "|", text = "
Inputs | Result
Finance, HR, Robotics, Engineering, Sales | Business, Engineering
DevOps, Data Science, Marketing, IT Support | Business, Engineering
UI/UX, HR, Graphic Design | Other, Business
Business development | Business")
dat[] <- lapply(dat, trimws)
dat
# Inputs Result
# 1 Finance, HR, Robotics, Engineering, Sales Business, Engineering
# 2 DevOps, Data Science, Marketing, IT Support Business, Engineering
# 3 UI/UX, HR, Graphic Design Other, Business
# 4 Business development Business
BTW: I would generally recommend against the structure of comma-separated values within a frame column: any time you want to look at individual values, you need to parse/separate the fields, do something, and then recombine them. While this is not extremely difficult, it is inefficient, and there can always be problems.
Now, the translation from the left to the right:
fields <- list(
Business = c("Finance", "HR", "Sales", "Marketing", "Business development"),
Engineering = c("Robotics", "Engineering", "IT Support", "DevOps", "Data Science")
)
inv_fields <- setNames(rep(names(fields), lengths(fields)), unlist(fields))
inv_fields
# Finance HR Sales Marketing Business development
# "Business" "Business" "Business" "Business" "Business"
# Robotics Engineering IT support DevOps Data Science
# "Engineering" "Engineering" "Engineering" "Engineering" "Engineering"
I first defined fields
in that way because I think it's easier to see and maintain. However, for the trick further down (indexing on it), I need the name of each element to be the left (career) and the value to be the right (field). We don't use fields
itself.
First attempt, I'll split the strings and then run the indexing:
lapply(strsplit(dat$Inputs, ","), function(z) inv_fields[trimws(z)])
# [[1]]
# Finance HR Robotics Engineering Sales
# "Business" "Business" "Engineering" "Engineering" "Business"
# [[2]]
# DevOps Data Science Marketing <NA>
# "Engineering" "Engineering" "Business" NA
# [[3]]
# <NA> HR <NA>
# NA "Business" NA
# [[4]]
# Business development
# "Business"
Unfortunately, we lose IT Support
, because your comment had it spelled as IT support
(lower-case "s"). While this one thing is an easy fix, I feel it's possible (likely, even) that there will be case differences that you want to ignore. So we'll deal with the lower-case version of everything (during the lookup):
# redefine, using lower-case
inv_fields_lc <- setNames(rep(names(fields), lengths(fields)), tolower(unlist(fields)))
lapply(strsplit(dat$Inputs, ","), function(z) inv_fields_lc[tolower(trimws(z))])
# [[1]]
# finance hr robotics engineering sales
# "Business" "Business" "Engineering" "Engineering" "Business"
# [[2]]
# devops data science marketing it support
# "Engineering" "Engineering" "Business" "Engineering"
# [[3]]
# <NA> hr <NA>
# NA "Business" NA
# [[4]]
# business development
# "Business"
Now we need to address the missing values and replace them with "Other"
, and reduce duplicate entries:
lapply(strsplit(dat$Inputs, ","), function(z) {
out <- inv_fields_lc[tolower(trimws(z))]
out[is.na(out)] <- "Other"
unique(out)
})
# [[1]]
# [1] "Business" "Engineering"
# [[2]]
# [1] "Engineering" "Business"
# [[3]]
# [1] "Other" "Business"
# [[4]]
# [1] "Business"
We can then combine (paste
) these and add them as a column to the original frame. (Again, I don't like comma-separated strings like this in frames, it might be more useful to use list-columns, but it depends heavily on the rest of your processing.)
dat$Result2 <- sapply(strsplit(dat$Inputs, ","), function(z) {
out <- inv_fields_lc[tolower(trimws(z))]
out[is.na(out)] <- "Other"
paste(sort(unique(out)), collapse = ", ")
})
dat
# Inputs Result Result2
# 1 Finance, HR, Robotics, Engineering, Sales Business, Engineering Business, Engineering
# 2 DevOps, Data Science, Marketing, IT Support Business, Engineering Business, Engineering
# 3 UI/UX, HR, Graphic Design Other, Business Business, Other
# 4 Business development Business Business
(I added a sort
too, not sure if that's desired.)