Convert string into binary vector in R

I'm trying to cluster a set of journals by their descriptors. So far I've been using string distances, but I'm thinking of turning the descriptors into a binary vector instead, to avoid issues like "Catalysis" matching "Analysis", or long strings matching on (undesired) partial matches.

To implement this idea, I've split every descriptor that a journal may have into a set of 266 strings (isolated_cat), in alphabetical order.

dput(head(isolated_cat))
c("Accounting", "AcousticsUltrasonics", "AdvancedSpecializedNursing", 
"AerospaceEngineering", "Aging", "AgriculturalBiologicalSciences"
)

For each journal in my data frame, I have a column with a set of descriptors, e.g.

journals_STEM$Categories4dist[1]
[1] "Biomaterials ElectronicOpticalMagneticMaterials Energy MaterialsChemistry SurfacesCoatingsFilms"

The output I'm expecting is a vector of length 266, with a 0 or 1 for each category in isolated_cat, indicating whether the journal's descriptors include that category or not (afterwards I was thinking of testing PCA and different clustering methods to separate the journals into groups).

First, I tried

as.numeric(isolated_cat %in% aux$Categories4dist[i])

which obviously (I noticed later) only works for journals described by a single category. I've been trying different combinations of grep, but I haven't had any luck. Is there a straightforward way of achieving this? The only solutions I have found so far are way too convoluted and I think I'm missing something obvious.



Solution 1:[1]

Here's a base R option with lapply and grepl -

# Add one 0/1 column per category; the unary + converts TRUE/FALSE to 1/0
journals_STEM[isolated_cat] <- lapply(isolated_cat, function(x) 
            +(grepl(x, journals_STEM$Categories4dist, ignore.case = TRUE)))

The above also matches substrings, meaning "at" would match "cat". If you need an exact match, use word boundaries (\\b).

journals_STEM[isolated_cat] <- lapply(paste0('\\b', isolated_cat, '\\b'), 
      function(x) +(grepl(x, journals_STEM$Categories4dist, ignore.case = TRUE)))
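
For illustration, here is a minimal self-contained sketch of the difference between the two calls above. The toy data frame and category list are made up for the demo (only the column name Categories4dist comes from the question), and sapply is used instead of assigning back into the data frame so that the two results can be compared side by side.

```r
# Toy data (hypothetical, modelled on the question's column)
journals_STEM <- data.frame(
  Categories4dist = c("Biomaterials Energy MaterialsChemistry",
                      "Chemistry Aging"),
  stringsAsFactors = FALSE
)
isolated_cat <- c("Aging", "Biomaterials", "Chemistry",
                  "Energy", "MaterialsChemistry")

# Substring match: "Chemistry" also fires for "MaterialsChemistry"
loose <- sapply(isolated_cat, function(x)
  +(grepl(x, journals_STEM$Categories4dist, ignore.case = TRUE)))

# Whole-word match: \\b anchors the pattern at word boundaries
strict <- sapply(paste0("\\b", isolated_cat, "\\b"), function(x)
  +(grepl(x, journals_STEM$Categories4dist, ignore.case = TRUE)))
colnames(strict) <- isolated_cat  # drop the \\b from the column names

loose[1, "Chemistry"]   # 1: matched inside "MaterialsChemistry"
strict[1, "Chemistry"]  # 0: no standalone "Chemistry" in journal 1
```

Both results are matrices with one row per journal and one column per category, so they differ only where a category name is contained in a longer descriptor.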

Solution 2:[2]

Something like:

library(stringr)

isolatedcat <- c("Accounting", "AcousticsUltrasonics", "AdvancedSpecializedNursing", "AerospaceEngineering", "Aging", "AgriculturalBiologicalSciences", 'Biomaterials')


Categories4dist <- str_split('Biomaterials ElectronicOpticalMagneticMaterials Energy MaterialsChemistry SurfacesCoatingsFilms', ' ', simplify = TRUE)

as.data.frame(sapply(isolatedcat, function(x) as.numeric(str_detect(x, Categories4dist))))

which gives:

  Accounting AcousticsUltrasonics AdvancedSpecializedNursing
1          0                    0                          0
2          0                    0                          0
3          0                    0                          0
4          0                    0                          0
5          0                    0                          0
  AerospaceEngineering Aging AgriculturalBiologicalSciences Biomaterials
1                    0     0                              0            1
2                    0     0                              0            0
3                    0     0                              0            0
4                    0     0                              0            0
5                    0     0                              0            0
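
The output above is for a single journal, with one row per descriptor. As a rough sketch of how this could be collapsed into one 0/1 row per journal, one option is to split each string with base R's strsplit and test exact membership with %in%. This is a swapped-in technique, not part of the answer above, and it assumes descriptors are always separated by single spaces. The toy data and the name binary_mat are hypothetical.

```r
# Toy data (hypothetical, modelled on the question's column)
journals_STEM <- data.frame(
  Categories4dist = c("Biomaterials Energy MaterialsChemistry",
                      "Chemistry Aging"),
  stringsAsFactors = FALSE
)
isolated_cat <- c("Aging", "Biomaterials", "Chemistry",
                  "Energy", "MaterialsChemistry")

# Split each journal's descriptor string into its individual tokens
tokens <- strsplit(journals_STEM$Categories4dist, " ", fixed = TRUE)

# For each journal, flag which categories appear among its tokens.
# vapply returns categories x journals, so transpose to journals x categories.
binary_mat <- t(vapply(
  tokens,
  function(tk) as.integer(isolated_cat %in% tk),
  integer(length(isolated_cat))
))
colnames(binary_mat) <- isolated_cat

binary_mat
#      Aging Biomaterials Chemistry Energy MaterialsChemistry
# [1,]     0            1         0      1                  1
# [2,]     1            0         1      0                  0
```

Because %in% compares whole tokens, "Chemistry" is not flagged for a journal that only lists "MaterialsChemistry". The resulting matrix could then be passed to something like dist(binary_mat, method = "binary") or prcomp(binary_mat).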

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Ronak Shah
Solution 2 deschen