Similarity of documents function
I am trying to create matrices of the cosine and Euclidean distances between documents, but I'm not sure how to approach this. Any advice would be appreciated. Thanks.
The function takes the termdoc matrix as input and computes variables called "euclidean_distance_matrix" and "cosine_distance_matrix", which are matrices whose elements (i, j) store the Euclidean distance and the cosine distance between the i-th and j-th tweets. You should store the distance matrices in numpy arrays for easier implementation in subsequent tasks.
The code to start me off is below.
def compute_distance_matrices(termdoc):
Solution 1:[1]
I think this might give you an idea
import numpy as np

def compute_distance_matrices(termdoc):
    # initialising arrays: one row and one column per tweet
    # (this assumes each row of termdoc is the term vector of one tweet)
    euclidean_distance_matrix = np.zeros(shape=(termdoc.shape[0], termdoc.shape[0]))
    cosine_distance_matrix = np.zeros(shape=(termdoc.shape[0], termdoc.shape[0]))
    # looping over each tweet in the matrix
    for i, vector1 in enumerate(termdoc):
        # looping over each tweet again to pair it with tweet i
        for j, vector2 in enumerate(termdoc):
            # computing euclidean and cosine distances
            # (Euclidean_distance and cosine_distance are not defined here;
            # you need to write these two helpers yourself)
            euclidean_distance_matrix[i, j] = Euclidean_distance(vector1, vector2)
            cosine_distance_matrix[i, j] = cosine_distance(vector1, vector2)
    return euclidean_distance_matrix, cosine_distance_matrix
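The answer calls two helpers, Euclidean_distance and cosine_distance, without showing them. A minimal sketch of what they might look like is below. It assumes each vector is a dense 1-D numpy array (one row of termdoc), and the handling of all-zero vectors is my own assumption, not something the answer specifies.

```python
import numpy as np


def Euclidean_distance(vector1, vector2):
    """Straight-line (L2) distance between two 1-D vectors."""
    diff = np.asarray(vector1, dtype=float) - np.asarray(vector2, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2)))


def cosine_distance(vector1, vector2):
    """One minus the cosine similarity of two 1-D vectors."""
    v1 = np.asarray(vector1, dtype=float)
    v2 = np.asarray(vector2, dtype=float)
    norm_product = np.linalg.norm(v1) * np.linalg.norm(v2)
    if norm_product == 0.0:
        # Assumption: a tweet with no terms is treated as maximally distant,
        # which also avoids dividing by zero.
        return 1.0
    return float(1.0 - np.dot(v1, v2) / norm_product)
```

With these in place, the diagonal of both matrices should come out as zero (up to floating-point rounding for the cosine case), since every tweet is at distance zero from itself.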
Solution 2:[2]
You can approach this problem as a probabilistic one. You have to:
- Construct the frequency vector for each document
- Compute the cosine or euclidean distance between the documents
Frequency vector
You will have to compute the TF-IDF value for each word in the document and organize them all in a vector. In simple words, TF is the word frequency and IDF is used to balance out high-frequency words. TF-IDF represents the importance of a word to a document in a corpus.
This link might be useful: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
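To make the frequency vector concrete, here is a rough sketch in plain Python. It assumes one common TF-IDF variant (tf = count / document length, idf = log(N / document frequency)) and naive whitespace tokenization; the function name tfidf_vectors and the sample sentences are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with one TF-IDF vector per document.

    Assumed variant: tf = count / document length,
    idf = log(number of documents / document frequency).
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        length = len(tokens) or 1  # guard against empty documents
        vectors.append([
            (counts[term] / length) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors


# Hypothetical mini-corpus, only to show the shape of the result.
docs = ["the cat sat", "the dog sat", "the cat ran away"]
vocab, vectors = tfidf_vectors(docs)
```

In this toy corpus "the" occurs in every document, so its IDF is log(1) = 0 and it contributes nothing to any vector, which is the balancing effect described above.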
Cosine distance
Apply the cosine distance formula and interpret the result: lower values mean more similar documents.
This link might be useful: https://en.wikipedia.org/wiki/Cosine_similarity
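As a small worked example of reading the result, the sketch below takes cosine distance to be 1 minus the cosine similarity and applies it to three made-up frequency vectors. It assumes non-zero vectors of equal length.

```python
import math


def cosine_distance(a, b):
    """Cosine distance = 1 - (a . b) / (|a| * |b|), for non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical frequency vectors over the same three-word vocabulary.
doc_a = [2, 1, 0]
doc_b = [4, 2, 0]  # same word proportions as doc_a, just twice as long
doc_c = [0, 0, 3]  # shares no words with doc_a

print(cosine_distance(doc_a, doc_b))  # close to 0: very similar
print(cosine_distance(doc_a, doc_c))  # 1.0: nothing in common
```

Note that doc_a and doc_b come out as near-identical even though one is twice as long: cosine distance only looks at the direction of the vectors, not their length.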
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | SimonB |
| Solution 2 | Henrique |