Problems with TermDocumentMatrix function in R

I'm trying to create a TermDocumentMatrix using the tm package, but I have run into difficulties.

The input:

trainDF<-as.matrix(list("I'm going home", "trying to fix this", "when I go home"))

Goal: create a TDM from the input (not all control parameters are listed below):

control <- list(
    weight= weightTfIdf, 
    removeNumbers=TRUE, 
    removeStopwords=TRUE, 
    removePunctuation=TRUE,    
    stemWords=TRUE, 
    maxWordLength=maxWordL,
    bounds=list(local=c(minDocFreq, maxDocFreq))
)

tdm<- TermDocumentMatrix(Corpus(DataframeSource(trainDF)),control = control)

The warning I get:

Warning message:
In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'

And the tdm object is empty. Any ideas?



Solution 1:[1]

The error suggests something is wrong with your choice of minimum and maximum document frequency in the bounds. For example, the following works:

control=list(weighting = weightTfIdf,
             removeNumbers=TRUE, 
             removeStopwords=TRUE, 
             removePunctuation=TRUE, 
             bounds=list(local=c(1,3)))
tdm<- TermDocumentMatrix(Corpus(DataframeSource(trainDF)), control=control)

Note that in recent versions of tm, you specify a weighting with weighting = weightTfIdf rather than weight = weightTfIdf. Similarly, use stemming = TRUE in the control list to stem words. I'm not sure that maxWordLength is currently an option (the termFreq documentation lists wordLengths for restricting term length). tm silently ignores invalid options in the control list, so you won't know that something is wrong until you go back and inspect the matrix.
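To make the option names above concrete, here is a minimal sketch of a control list that uses only names documented for termFreq and TermDocumentMatrix. It rests on a few assumptions: VectorSource on a plain character vector is swapped in for the DataframeSource(trainDF) call from the question, the bounds and wordLengths values are illustrative, and stemming is left out because stemming = TRUE also needs the SnowballC package to be installed.

```r
library(tm)

# Assumption: a character vector read through VectorSource stands in for
# the DataframeSource(trainDF) call used in the question.
texts <- c("I'm going home", "trying to fix this", "when I go home")
corpus <- VCorpus(VectorSource(texts))

control <- list(
  weighting = weightTfIdf,          # `weighting`, not `weight`
  removeNumbers = TRUE,
  removePunctuation = TRUE,
  stopwords = TRUE,                 # documented termFreq option for stop word removal
  wordLengths = c(3, Inf),          # illustrative: keep terms of 3+ characters
  bounds = list(local = c(1, Inf))  # illustrative: keep terms occurring at least once per document
)

tdm <- TermDocumentMatrix(corpus, control = control)
inspect(tdm)
```

Because unknown names in the control list are not reported, inspecting the result (for example with inspect(tdm) or Terms(tdm)) is the practical way to confirm that each option took effect.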

Solution 2:[2]

The code below only works for me if I use VCorpus instead of Corpus. With Corpus, I get the "custom functions are ignored" message when I include weighting = weightTfIdf; it works fine if I remove that option.

Here is working code with TF-IDF weighting:

control=list(weighting = weightTfIdf,
             removeNumbers=TRUE, 
             removeStopwords=TRUE, 
             removePunctuation=TRUE, 
             bounds=list(local=c(1,3)))
tdm<- TermDocumentMatrix(VCorpus(DataframeSource(trainDF)), control=control)
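If you are unsure which kind of corpus you are passing in, a quick check of the object's class can help. This is only a sketch: the class returned by the generic Corpus() constructor depends on the installed tm version and on the source, so no particular result is assumed for it here, and VectorSource is again used in place of DataframeSource(trainDF).

```r
library(tm)

texts <- c("I'm going home", "trying to fix this", "when I go home")

# Generic constructor: the resulting class may differ between tm versions.
generic <- Corpus(VectorSource(texts))

# Explicit constructor: always builds a volatile corpus (VCorpus).
explicit <- VCorpus(VectorSource(texts))

print(class(generic))
print(class(explicit))
```

If the two classes differ on your installation, that would be consistent with TermDocumentMatrix behaving differently for the two objects.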

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Ryan Walker
Solution 2 Katya