'German keyword search - look for every possible combination
I'm working on a project where I define some nouns like Haus, Boot, Kampf, ...
and what to detect every version (singular/plurar) and every combination of these words in sentences. For example, the algorithm should return true if a sentences does contain one of : Häuser, Hausboot, Häuserkampf, Kampfboot, Hausbau, Bootsanleger, ...
.
Are you familiar with an algorithm that can do such a thing (preferable in R)? Of course I could implement this manually, but I'm pretty sure that something should already exist.
Thanks!
Solution 1:[1]
you can use stringr
library and the grepl
function as it is done in this example:
> # Toy example text
> text1 <- c(" This is an example where Hausbau appears twice (Hausbau)")
> text2 <- c(" Here it does not appear the name")
> # Load library
> library(stringr)
> # Does it appear "Hausbau"?
> grepl("Hausbau", text1)
[1] TRUE
> grepl("Hausbau", text2)
[1] FALSE
> # Number of "Hausbau" in the text
> str_count(text1, "Hausbau")
[1] 2
Solution 2:[2]
check <- c("Der Häuser", "Das Hausboot ist", "Häuserkampf", "Kampfboot im Wasser", "NotMe", "Hausbau", "Bootsanleger", "Schauspiel")
base <- c("Haus", "Boot", "Kampf")
unlist(lapply(str_to_lower(stringi::stri_trans_general(check, "Latin-ASCII")), function(x) any(str_detect(x, str_to_lower(base)) == T)))
# [1] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
Breaking it down
Note the comment of Roland, you will match false TRUE values in words like "Schauspiel"
You need to get rid of the special characters, you can use stri_trans_general
to translate them to Latin-ASCII
You need to convert your strings to lowercase (i.e. match Boot in Kampfboot)
Then apply over the strings to test and check if they are in the base list, if any of those values is true. You got a match.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | R18 |
Solution 2 |