Calculating Similarity Between Pairs of Documents in R [closed]

How can I calculate the cosine similarity between pairs of text documents in R?

Specifically, I have the plots (i.e., descriptions) of movie sequels and their original films, and I want to see how similar the plot of each sequel is to that of its original film.



Solution 1:[1]

As a baseline, I would use a bag-of-words approach, first unweighted and then with tf-idf weighting. Once you have your vectors, calculate the cosine similarity. Here is a scikit-learn implementation taken from this answer.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from scipy import spatial
import pandas as pd

clf = CountVectorizer(ngram_range=(1, 1))
# Fit the vocabulary on both columns so both sides share one vector space.
clf.fit(pd.concat([df.originalplot, df.sequelplot]))
# Use .toarray() so each row is a 1-D vector; scipy's cosine() rejects
# the 2-D (1, n) rows that .todense() produces.
originalplot = clf.transform(df.originalplot).toarray()
sequelplot = clf.transform(df.sequelplot).toarray()
similarities = [1 - spatial.distance.cosine(originalplot[x], sequelplot[x])
                for x in range(len(sequelplot))]
similarities
# use 'clf = TfidfVectorizer(ngram_range=(1, 1))' at the top for a tf-idf weighted score.
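To make the idea concrete, here is a minimal self-contained run of the same pipeline on a single hypothetical film pair (the plot strings are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy import spatial
import pandas as pd

# Hypothetical one-row example; plots invented for illustration.
df = pd.DataFrame({
    "originalplot": ["A young wizard discovers a hidden school of magic."],
    "sequelplot": ["The young wizard returns to the school of magic to face a new threat."],
})

clf = TfidfVectorizer(ngram_range=(1, 1))
# Fit on both columns so both sides share one vocabulary.
clf.fit(pd.concat([df.originalplot, df.sequelplot]))
original = clf.transform(df.originalplot).toarray()
sequel = clf.transform(df.sequelplot).toarray()

similarities = [1 - spatial.distance.cosine(original[i], sequel[i])
                for i in range(len(sequel))]
print(similarities)  # one score per film pair
```

Because the two plots share words ("young", "wizard", "school", "magic"), the score lands strictly between 0 (no overlap) and 1 (identical vectors).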

As a more advanced technique, you can use word embeddings to capture not just exact vocabulary matches but also semantically similar words. There are off-the-shelf word embeddings trained on large corpora; alternatively, you could train embeddings on your own corpus. Here is a sample off-the-shelf implementation in spaCy, again measuring the cosine similarity of the vectors:

import spacy

nlp = spacy.load("en_core_web_md")
df["original_spacy"] = df.originalplot.apply(nlp)
df["sequel_spacy"] = df.sequelplot.apply(nlp)
# Compare the parsed Doc objects, not the raw text column.
df["similarity"] = df.apply(lambda row: row.sequel_spacy.similarity(row.original_spacy), axis=1)
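Under the hood, `Doc.similarity` is just the cosine of the two documents' (averaged) word vectors. The same score can be computed directly with numpy; a minimal sketch with made-up vectors:

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the norms.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors for illustration; b is a scalar multiple of a.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
print(cosine(a, b))  # parallel vectors -> 1.0
```

This is why the score ignores vector magnitude: only the angle between the two document vectors matters.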

Note that all of the above code is a starting point (and could be optimized if you care about speed). You will likely want to refine it, adding or removing transformations (stop-word removal, stemming, lemmatization) as you explore your data. Check out this Paul Minogue blog post for a more in-depth explanation of these two approaches. If you want to use R, text2vec should have implementations of all the above concepts.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 tbrk