Tokenization of Compound Words not Working in Quanteda

I'm trying to create a dataframe of specific keywords-in-context using the kwic() function, but I'm running into a problem when tokenizing the underlying dataset.

This is the subset of the dataset I'm using as a reproducible example:

test_cluster <- speeches_subset %>%
  filter(grepl('Schwester Agnes',
                speechContent,
                ignore.case = TRUE))

test_corpus <- corpus(test_cluster,
                      docid_field = "id",
                      text_field = "speechContent")

Here, test_cluster contains six observations of 12 variables, that is, six rows in which the column speechContent contains the compound word "Schwester Agnes". corpus() then turns this data into a quanteda corpus object, test_corpus.

When I run the following code, I expect two things. First, the contents of speechContent should be tokenized, with tokens_compound() keeping the compound word "Schwester Agnes" together as a single token. Second, kwic() should return a dataframe with six rows, with the keyword variable containing the compound word "Schwester Agnes". Instead, kwic() returns an empty dataframe with 0 observations of 7 variables. I think I'm making a mistake with tokens_compound(), but I'm not sure. Any help would be greatly appreciated!

test_tokens <- tokens(test_corpus, 
                      remove_punct = TRUE,
                      remove_numbers = TRUE) %>%
  tokens_compound(pattern = phrase("Schwester Agnes"))

test_kwic <- kwic(test_tokens,
                  pattern = "Schwester Agnes",
                  window = 5)

EDIT: I realize that the examples above are not easily reproducible, so please refer to the reprex below:

speech = c("This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word stack overflow. However there are so many more words that I am not interested in assessing the sentiment of", "This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. stack overflow.", "this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech.")

data <- data.frame(id=1:3, 
                   speechContent = speech)

test_corpus <- corpus(data,
                      docid_field = "id",
                      text_field = "speechContent")

test_tokens <- tokens(test_corpus, 
                      remove_punct = TRUE,
                      remove_numbers = TRUE) %>%
  tokens_compound(pattern = c("stack", "overflow"))

test_kwic <- kwic(test_tokens,
                  pattern = "stack overflow",
                  window = 5)


Solution 1:[1]

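You need to wrap the pattern in phrase(), i.e. phrase("stack overflow"), and set concatenator = " " in tokens_compound(). Without phrase(), c("stack", "overflow") is treated as two separate single-token patterns, so nothing gets compounded. And with the default concatenator "_", the compound token becomes "stack_overflow", which the kwic() pattern "stack overflow" does not match.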

require(quanteda)
#> Package version: 3.2.1
#> Unicode version: 13.0
#> ICU version: 69.1

speech <- c("This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word stack overflow. However there are so many more words that I am not interested in assessing the sentiment of", 
           "This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. stack overflow.", 
           "this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech.")

data <- data.frame(id = 1:3, 
                   speechContent = speech)

test_corpus <- corpus(data,
                      docid_field = "id",
                      text_field = "speechContent")

test_tokens <- tokens(test_corpus, 
                      remove_punct = TRUE,
                      remove_numbers = TRUE) %>%
  tokens_compound(pattern = phrase("stack overflow"), concatenator = " ")

test_kwic <- kwic(test_tokens,
                  pattern = "stack overflow",
                  window = 5)
test_kwic
#> Keyword-in-context with 2 matches.
#>  [1, 29] for example is the word | stack overflow | However there are so many
#>  [2, 24]     but at the very end | stack overflow |

Created on 2022-05-06 by the reprex package (v2.0.1)
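
If you would rather not change the concatenator, two alternatives should also work. This is only a sketch on shortened example texts (the texts and the object names toks, kwic_phrase, kwic_comp and kwic_miss are made up for illustration, not taken from the question): either skip compounding and pass phrase() to kwic() directly, or keep the default "_" concatenator and search for the joined token.

```r
library(quanteda)

# Shortened, made-up texts in the spirit of the reprex above
speech <- c("One relevant word is stack overflow. However there are more words.",
            "It still includes the word of interest at the very end. stack overflow.",
            "this speech does not include the word of interest.")

toks <- tokens(corpus(speech), remove_punct = TRUE)

# Option A: no compounding; phrase() tells kwic() to match the
# two-token sequence "stack" followed by "overflow"
kwic_phrase <- kwic(toks, pattern = phrase("stack overflow"), window = 5)

# Option B: compound with the default concatenator "_",
# then search for the joined token "stack_overflow"
toks_comp <- tokens_compound(toks, pattern = phrase("stack overflow"))
kwic_comp <- kwic(toks_comp, pattern = "stack_overflow", window = 5)

# For comparison, the situation from the question: the compound token is
# "stack_overflow", but the pattern looks for a single token "stack overflow"
kwic_miss <- kwic(toks_comp, pattern = "stack overflow", window = 5)

nrow(kwic_phrase)
nrow(kwic_comp)
nrow(kwic_miss)
```

Option A leaves the tokens object untouched, which is handy if you only need the keyword-in-context view. Option B is the better fit if later steps (a dfm, for instance) should also treat the compound as one feature.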

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
