Getting word count of doc/docx files in R
I have a stream of doc/docx documents that I need to get the word count of.
The procedure so far is to manually open the document and write down the word count offered by MS Word itself, and I am trying to automate it using R.
This is what I tried:
library(textreadr)
library(stringr)

# read the .docx as a character vector (one element per paragraph)
myDocx = read_docx(myDocxFile)
# collapse to a single string and count whitespace runs to approximate the word count
docText = str_c(myDocx, collapse = " ")
wordCount = str_count(docText, "\\s+") + 1
Unfortunately, wordCount is not what MS Word suggests.
For example, I noticed that MS Word counts the numbers in numbered lists, whereas textreadr does not even import them.
Is there a workaround? I don't mind trying something in Python, too, although I'm less experienced there.
Any help would be greatly appreciated.
Solution 1:[1]
I tried reading the docx files with a different library (officer) and, even though it doesn't agree with MS Word 100% of the time, it does significantly better.
Another small fix is to copy MS Word's strategy for what counts as a word and what doesn't: the naive method of counting all spaces can be improved by also ignoring a standalone "En Dash" (U+2013) character.
Here is my improved function:
library(data.table)
library(stringr)

getDocxWordCount = function(docxFile) {
  docxObject = officer::read_docx(docxFile)
  # one row per paragraph, trimmed, dropping empty and single-character entries
  myFixedText = as.data.table(officer::docx_summary(docxObject))[nchar(str_trim(text)) > 1, str_trim(text)]
  # a separator is a run of whitespace, optionally wrapping a lone En Dash (U+2013)
  wordBd = sapply(as.list(myFixedText), function(z) 1 + str_count(z, "\\s+([\u{2013}]\\s+)?"))
  return(sum(wordBd))
}
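Example usage (the file name here is just a placeholder):
getDocxWordCount("report.docx")
# compare the result against the count in MS Word's status bar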
This still has a weakness that prevents 100% accuracy: the officer library doesn't read list separators (such as bullets or hyphens), but MS Word counts those as words. So for any list, this function currently returns X words fewer, where X is the number of list items. I haven't experimented much with the attributes of the docxObject, but if it somehow holds the number of list items, then a definite improvement can be made (see the sketch below).
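As a rough sketch of that idea: as far as I can tell, officer::docx_summary() returns a num_id column that is non-missing for paragraphs belonging to a bulleted or numbered list (treat this as an assumption and check it against your own documents), so counting those rows approximates the number of list items and can be added back:
# sketch only: assumes docx_summary() exposes a num_id column for list paragraphs
countListItems = function(docxFile) {
  docSummary = officer::docx_summary(officer::read_docx(docxFile))
  # each paragraph with a non-missing num_id is one bulleted/numbered item
  sum(!is.na(docSummary$num_id))
}

# add one "word" per list item to compensate for the unread separators
getDocxWordCount("report.docx") + countListItems("report.docx")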
Solution 2:[2]
This can also be done using the tidytext package in R.
library(textreadr)
library(tidytext)
library(dplyr)
#read in word file without password protection
x <- read_docx(myDocxFile)
#convert string to dataframe
text_df <- tibble(line = 1:length(x), text = x)
#tokenize dataframe to isolate separate words
words_df <- text_df %>%
unnest_tokens(word,text)
#calculate number of words in passage
word_count <- nrow(words_df)
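For convenience, the same steps can be wrapped into a small helper (a sketch, reusing the libraries loaded above; note that unnest_tokens strips punctuation by default, so the result may still differ slightly from MS Word's count):
count_words_tidytext <- function(docx_path) {
  # read paragraphs, put them in a tibble, tokenize into words, count rows
  tibble(text = read_docx(docx_path)) %>%
    unnest_tokens(word, text) %>%
    nrow()
}

word_count <- count_words_tidytext(myDocxFile)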
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 | GerasimosPanagiotakopoulos
Solution 2 | Ant