R: Gibberish detection using a Markov model (how to adjust for the text length)
First, I create a table with the frequencies of two-letter combinations from a German text file:
# Load a German-language text file
library(tidyverse)  # for read_tsv(), tibble(), and the dplyr/tidyr verbs used below
text <- read_tsv("http://www.reduts.net/deutsch.txt")
colnames(text) <- c("id", "text")
# Collapse all rows into one long string
text <- paste(text$text, collapse = " ")
# Calculate all two-letter combinations (character bigrams)
tokens_char <- function(str, window = 2) {
  # Drop non-word characters and lowercase the input
  str <- stringi::stri_replace_all_regex(str, "\\W", "")
  str <- tolower(str)
  win <- window - 1
  # Start position of every window of `window` characters
  len1 <- seq_len(nchar(str) - win)
  stringi::stri_sub(str, from = len1, to = len1 + win)
}
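For illustration, this splits a word into its overlapping two-letter windows:
> tokens_char("test")
[1] "te" "es" "st"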
This creates a lookup table containing two columns: 1. the two-letter combination and 2. its frequency, normalized within all combinations that share the same first letter:
lookuptable <- tibble(token = tokens_char(text, window = 2)) %>%
  count(token, sort = TRUE) %>%
  # Keep a copy of the token, because separate() removes the original column
  mutate(token2 = token) %>%
  separate(token, into = c("first", "second"), sep = 1) %>%
  # Normalize the counts within each first letter
  group_by(first) %>%
  mutate(total = sum(n),
         freq = n / total) %>%
  ungroup() %>%
  mutate(token = token2,
         token2 = NULL,
         first = NULL,
         second = NULL,
         total = NULL) %>%
  select(token, freq)
> lookuptable
# A tibble: 1,522 x 2
token freq
<chr> <dbl>
1 en 0.233
2 er 0.225
3 ch 0.861
4 de 0.446
5 ei 0.127
6 te 0.302
7 nd 0.186
8 in 0.228
9 ie 0.209
10 ge 0.494
# ... with 1,512 more rows
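Because freq is normalized per first letter, it is effectively a transition probability P(second letter | first letter), which is why a pair like "ch" can reach 0.861. As a quick check against the lookuptable built above, the values for each first letter should sum to 1:
# Sanity check: conditional frequencies should sum to 1 per first letter
lookuptable %>%
  mutate(first = substr(token, 1, 1)) %>%
  group_by(first) %>%
  summarise(total_freq = sum(freq))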
Then, I have a function that calculates the product of the frequencies of all two-letter combinations for a given text. For example, for the word "test" I look up the probabilities of "te", "es", and "st". These probabilities are then multiplied: P("te") * P("es") * P("st"):
lookup_text <- function(text = "") {
  df <- data.frame(token = tokens_char(text, window = 2)) %>%
    left_join(lookuptable, by = "token")
  # Return the product of all probabilities
  return(prod(df$freq))
}
Now I can easily check how likely it is that a given text is gibberish or real.
> lookup_text("test")
[1] 0.004262462
There is just one major drawback: obviously, the value I get depends heavily on the length of the string I want to check. So my question is: how can I fix this?
In this thread (Is there any way to detect strings like putjbtghguhjjjanika?), someone writes: "Then normalize by the length of the query." But how can this be done? Thanks for your help!
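(For concreteness, one common reading of "normalize by the length of the query" is to take the geometric mean of the transition probabilities, i.e. the n-th root of the product, equivalently the average log-probability, which makes scores comparable across string lengths. A minimal sketch of that idea, reusing tokens_char() and lookuptable from above; the floor_prob fallback for combinations missing from the table is an assumption, not part of the original code:)
# Length-normalized score via the geometric mean: exp(mean(log(p)))
lookup_text_geom <- function(text = "", floor_prob = 1e-6) {
  df <- data.frame(token = tokens_char(text, window = 2)) %>%
    left_join(lookuptable, by = "token") %>%
    # Assumed fallback probability for combinations not present in the table
    mutate(freq = ifelse(is.na(freq), floor_prob, freq))
  exp(mean(log(df$freq)))
}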
Solution 1:
I think you are basically looking for the mean instead of the product:
lookup_text <- function(text = "") {
  df <- data.frame(token = tokens_char(text, window = 2)) %>%
    left_join(lookuptable, by = "token")
  return(mean(df$freq))
}
lookup_text(text = "Test")
#> [1] 0.1889944
lookup_text(text = "Tests")
#> [1] 0.1582075
lookup_text(text = "Testkandidaten")
#> [1] 0.1540773
lookup_text(text = "Quer")
#> [1] 0.397371
I included the last example because I noticed some unusual combinations score very high values in your lookup table, for example, "qu". But that is not part of the question.
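You can inspect that entry directly (using the lookuptable from the question); the value is high because freq is conditioned on the first letter, and in German "q" is almost always followed by "u":
# Look up the stored transition probability for "qu"
dplyr::filter(lookuptable, token == "qu")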
One problem you need to address is what happens if a letter combination is not in your lookup table. Currently, this results in NA. I would argue the probability should be 0 in this case:
lookup_text <- function(text = "") {
  df <- data.frame(token = tokens_char(text, window = 2)) %>%
    left_join(lookuptable, by = "token")
  return(mean(df$freq, na.rm = TRUE) * !any(is.na(df$freq)))
}
lookup_text(text = "qjbstz")
#> [1] 0
Created on 2022-04-22 by the reprex package (v2.0.1)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 | JBGruber