LDA returning numbers instead of words from Term Document Matrix
I am trying to use the LDA function to evaluate a corpus of text in R. However, when I do so, it seems to use the row names of the observations rather than the actual words in the corpus. I can't find anything else about this online so I imagine I must be doing something very basic incorrectly.
library(tm)
library(SnowballC)
library(tidytext)
library(stringr)
library(tidyr)
library(topicmodels)
library(dplyr)
#read in data
data <- read.csv('CSV_format_data.csv',sep=',')
#Create corpus/DTM
interviews <- as.matrix(data[,2])
ints.corpus <- Corpus(VectorSource(interviews))
ints.dtm <- TermDocumentMatrix(ints.corpus)
chapters_lda <- LDA(ints.dtm, k = 4, control = list(seed = 5421685))
chapters_lda_td <- tidy(chapters_lda,matrix="beta")
chapters_lda_td
head(ints.dtm$dimnames$Terms)
The 'chapters_lda_td' command outputs
# A tibble: 4,084 x 3
topic term beta
<int> <chr> <dbl>
1 1 1 0.000555
2 2 1 0.00399
3 3 1 0.000614
4 4 1 0.000699
5 1 2 0.0000195
6 2 2 0.000708
7 3 2 0.000731
8 4 2 0.00000155
9 1 3 0.000974
10 2 3 0.0000363
# ... with 4,074 more rows
Note that the "term" column contains numbers where there should be words. The number of rows equals the number of documents times the number of topics, when it should be the number of terms times the number of topics. I ran 'head(ints.dtm$dimnames$Terms)' to check that the DTM actually contains words, which it does. The result is:
[1] "aaye" "able" "adjust" "admission" "after" "age"
The data file itself is a pretty standard two-column CSV file with an ID and a block of text, and hasn't given me any problem while doing other text-mining stuff with it and the tm package. Any help would be appreciated, thank you!
Solution 1:[1]
I figured it out! It is because I am using the command
ints.dtm <- TermDocumentMatrix(ints.corpus)
rather than
ints.dtm <- DocumentTermMatrix(ints.corpus)
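The two classes are transposes of each other, so their dimnames come in the opposite order. LDA expects documents in rows and terms in columns; given a TermDocumentMatrix, it treats the documents as the terms, which is why the "term" column shows document numbers and the row count is documents times topics.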
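For anyone who wants to check this on their own data, here is a minimal sketch of the corrected pipeline. The four toy documents and the variable names (docs, corpus, dtm, tdm, beta) are made up for illustration and stand in for the interview text; the seed is the one from the question. It also shows that an existing TermDocumentMatrix can be transposed with t() rather than rebuilt.

```r
library(tm)
library(topicmodels)
library(tidytext)

# Hypothetical toy documents standing in for the interview text
docs <- c("the cat sat on the mat",
          "the dog chased the cat",
          "stocks fell as markets closed",
          "markets rallied and stocks rose")

corpus <- Corpus(VectorSource(docs))

# Documents in rows, terms in columns: the layout LDA() expects
dtm <- DocumentTermMatrix(corpus)

# If you already have a TermDocumentMatrix, transposing it gives
# a matrix with the same orientation as the DocumentTermMatrix
tdm <- TermDocumentMatrix(corpus)
dtm_from_tdm <- t(tdm)

lda <- LDA(dtm, k = 2, control = list(seed = 5421685))
beta <- tidy(lda, matrix = "beta")

# One row per topic-term pair; the term column should now hold words
head(beta)
```

A quick sanity check is that nrow(beta) equals the number of topics times ncol(dtm), that is, topics times terms rather than topics times documents.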
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | NickCHK |