Compare two large text documents and calculate some sort of percentage similarity between them
I am looking for a free tool (program, website or R package) that can compare two text documents and calculate some sort of percentage similarity between them. The text documents are in .pdf, but if the solution works with plain text, .pdf or .docx files I'd be happy with that as well. There are loads of similar questions on StackOverflow, but I couldn't find any that specifically addressed my case: comparing only two documents and producing some sort of statistic that tells me something about the similarity of the texts (preferably a percentage). I realise that this statistic will depend on what algorithm/technique is used to calculate the similarity between the documents. This is fine. I just want some way to quantify how similar the two documents are.
Background
I recently handed in my doctoral thesis. Before doing so, I paid a proofreader to go over the text. They returned a document in which nearly every single sentence was changed. While some suggestions were improvements, there were simply too many suggestions and edits for me to go through. In the end, I only accepted a tiny minority of them. I was a bit sore after wasting the time and money on it. It really felt like the proofreader had just rewritten my thesis. It made me wonder how similar the two outputs really would be: my finished thesis and the thesis with all of their suggestions and edits accepted.
I tried some websites and programs, but nothing quite did what I wanted. Draftable, Word and Adobe Acrobat Pro all returned a list of changes/differences, but gave me no way to turn it into a ratio so that I would know how similar the documents were. Copyleaks' free trial didn't allow me to enter documents as large as my thesis. DiffPDF seems to produce a percentage similarity for each page, but I need it for the total document.
The closest I got was text-sim. It returned a percentage similarity of 97.6%. This might very well be the most "correct" answer, but I feel like the two documents are more dissimilar than that. I might be wrong, of course. If at least one other program could confirm this percentage, I would be happy.
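For a rough second figure, a list of differences can be turned into a ratio by counting how much of the text matches. Below is a minimal sketch in Python using the standard library's `difflib.SequenceMatcher`. It assumes both documents have already been exported to plain text, and the function name `similarity_percent` is just an illustrative choice. It is only one of many possible similarity measures, so its number need not agree with what text-sim or any other tool reports.

```python
from difflib import SequenceMatcher


def similarity_percent(text_a: str, text_b: str) -> float:
    """Return a word-level similarity between two texts, as a percentage."""
    # Compare sequences of words rather than characters, so that
    # differences in whitespace and line wrapping do not count as changes.
    words_a = text_a.split()
    words_b = text_b.split()
    # autojunk=False turns off difflib's heuristic that ignores very
    # frequent items in long sequences; common words should count here.
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    # ratio() is 2 * M / T, where M is the number of matching words and
    # T is the total number of words in both texts.
    return 100.0 * matcher.ratio()


if __name__ == "__main__":
    original = "The results were consistent with the earlier findings."
    edited = "The results were in line with the earlier findings."
    print(f"{similarity_percent(original, edited):.1f}% similar")
```

On a thesis-length text, a word-by-word comparison like this can be slow. Splitting the text into sentences or paragraphs and comparing those lists instead would be faster, at the cost of counting a sentence with a single changed word as entirely different.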
Solution 1:[1]
If anyone comes across this post later, I thought I'd note how I finally solved this:
I was able to try out a trial version of DiffPDF. The app does actually calculate a percentage score across the whole document, not just for each page as I had thought.
Note that you have to make sure that the two documents are structurally similar: DiffPDF makes comparisons page by page, so if the pages are out of sync they'll be marked as different. I made each section (including all its subsections) start on a new page, and made sure that each section started on the same page number in both documents. I also removed the reference list and table of contents, since they'd be the same anyway.
In the end, my finished version of the document and the version in which I accepted all suggested edits got a score of 40%. 0% means two identical documents, so the score measures how different the documents are, and 40% indicates that they were more similar than they were different. Still, not that similar.
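To illustrate why page alignment matters for any page-by-page comparison, here is a toy sketch in Python. I don't know how DiffPDF computes its score internally, so this is not its algorithm. The function `percent_pages_different` and the sample pages are made up for illustration, and each page is assumed to be available as a plain-text string.

```python
from itertools import zip_longest


def percent_pages_different(pages_a, pages_b):
    """Percentage of page positions whose text differs between two documents."""
    # Pair page 1 with page 1, page 2 with page 2, and so on. A page that
    # exists in only one document is paired with None and counts as different.
    pairs = list(zip_longest(pages_a, pages_b))
    if not pairs:
        return 0.0
    different = sum(1 for a, b in pairs if a != b)
    return 100.0 * different / len(pairs)


if __name__ == "__main__":
    doc_a = ["intro", "methods", "results"]
    # Same text, but one extra page pushes everything after it out of sync.
    doc_b = ["intro", "new page", "methods", "results"]
    print(percent_pages_different(doc_a, doc_b))         # 75.0
    # Padding doc_a with a blank page realigns the later sections.
    doc_a_padded = ["intro", "", "methods", "results"]
    print(percent_pages_different(doc_a_padded, doc_b))  # 25.0
```

In the toy example a single extra page makes three of the four page positions differ, even though most of the text is unchanged. Forcing each section to start on the same page number in both documents keeps a shift like that from spreading beyond the section where it occurs.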
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Blue badger |