Similarity of documents function

I am trying to create matrices of cosine and Euclidean distances between documents. I'm not sure how to approach this. Any advice would be appreciated. Thanks.

The function takes the termdoc matrix as input and computes two variables, "euclidean_distance_matrix" and "cosine_distance_matrix", which are matrices whose elements (i, j) store the Euclidean distance and the cosine distance between the i-th and j-th tweets. The distance matrices should be stored in NumPy arrays for easier use in subsequent tasks.

The code to start me off is below.

def compute_distance_matrices(termdoc):


Solution 1:[1]

I think this might give you an idea

import numpy as np

def compute_distance_matrices(termdoc):
    # assumes termdoc is a dense array with one row per tweet
    termdoc = np.asarray(termdoc, dtype=float)
    n_docs = termdoc.shape[0]

    # initialising arrays
    euclidean_distance_matrix = np.zeros(shape=(n_docs, n_docs))
    cosine_distance_matrix = np.zeros(shape=(n_docs, n_docs))

    # looping over each pair of tweets in the matrix
    for i, vector1 in enumerate(termdoc):
        for j, vector2 in enumerate(termdoc):
            # Euclidean distance: length of the difference vector
            euclidean_distance_matrix[i, j] = np.linalg.norm(vector1 - vector2)
            # cosine distance: 1 - cosine similarity (undefined for all-zero vectors)
            cosine_distance_matrix[i, j] = 1 - np.dot(vector1, vector2) / (
                np.linalg.norm(vector1) * np.linalg.norm(vector2))

    return euclidean_distance_matrix, cosine_distance_matrix
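
The nested loop above visits every pair of tweets, which can get slow for a large matrix. As a rough sketch only, the same two matrices could be computed with NumPy matrix operations. The function name compute_distance_matrices_vectorised is made up for illustration, and the code assumes termdoc is a dense array with one row per tweet. Treating an all-zero row as having cosine distance 1 to everything is also an assumption, not something the question specifies.

```python
import numpy as np

def compute_distance_matrices_vectorised(termdoc):
    # Assumption: dense input, one row per tweet, one column per term.
    X = np.asarray(termdoc, dtype=float)

    # Pairwise dot products between all rows.
    gram = X @ X.T
    sq_norms = np.sum(X ** 2, axis=1)

    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * a.b
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram
    # Clip tiny negative values caused by floating-point rounding.
    euclidean_distance_matrix = np.sqrt(np.maximum(sq_dists, 0.0))

    # Cosine similarity = a.b / (||a|| * ||b||)
    norms = np.sqrt(sq_norms)
    denom = np.outer(norms, norms)
    # Assumption: where a row is all zeros, similarity is set to 0.
    cosine_similarity = np.divide(
        gram, denom, out=np.zeros_like(gram), where=denom > 0
    )
    cosine_distance_matrix = 1.0 - cosine_similarity

    return euclidean_distance_matrix, cosine_distance_matrix
```

Both returned values are square NumPy arrays with one row and one column per tweet, so they can be used wherever the loop version's output is expected.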

Solution 2:[2]

You can approach this problem as a probabilistic one. You have to:

  1. Construct the frequency vector for each document
  2. Compute the cosine or euclidean distance between the documents

Frequency vector

You will have to compute the TF-IDF weight for each word in the document and arrange the weights in a vector. In simple terms, TF is the frequency of a word within a document, and IDF down-weights words that occur frequently across the whole corpus. TF-IDF represents the importance of a word to a document within a corpus.

This link might be useful: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
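
To make the idea concrete, here is a minimal sketch of one common TF-IDF variant: TF is the term count divided by the document length, and IDF is log(N / df). Other weightings exist, so this is not the only definition. The function name tfidf_matrix and the whitespace tokenisation are assumptions made for illustration.

```python
import math
from collections import Counter

import numpy as np

def tfidf_matrix(documents):
    # Assumption: lower-casing and splitting on whitespace is good enough.
    tokenised = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for doc in tokenised for term in doc})
    index = {term: k for k, term in enumerate(vocabulary)}
    n_docs = len(tokenised)

    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in tokenised for term in set(doc))

    matrix = np.zeros((n_docs, len(vocabulary)))
    for i, doc in enumerate(tokenised):
        for term, count in Counter(doc).items():
            tf = count / len(doc)               # frequency within the document
            idf = math.log(n_docs / df[term])   # rarity across the corpus
            matrix[i, index[term]] = tf * idf
    return matrix, vocabulary

docs = ["the cat sat", "the dog sat", "the bird flew"]
weights, vocab = tfidf_matrix(docs)
# "the" occurs in every document, so its IDF (and its weight) is 0.
print(vocab)
print(weights.round(3))
```

Each row of the result is the frequency vector of one document, which is the shape of input the distance step expects.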

Cosine distance

Apply the formula and interpret the result: a lower cosine distance means more similar documents.

This link might be useful: https://en.wikipedia.org/wiki/Cosine_similarity
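
As a small illustration of how to read the values, the sketch below takes cosine distance to be 1 minus cosine similarity and applies it to three made-up vectors. The helper name cosine_distance and the example vectors are assumptions, and the function does not handle all-zero vectors.

```python
import numpy as np

def cosine_distance(a, b):
    # 1 - (a . b) / (||a|| * ||b||); not defined if either vector is all zeros.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

doc_a = [2, 1, 0]
doc_b = [4, 2, 0]   # same direction as doc_a, just longer
doc_c = [0, 0, 3]   # shares no terms with doc_a

print(cosine_distance(doc_a, doc_b))  # approximately 0: very similar
print(cosine_distance(doc_a, doc_c))  # 1.0: nothing in common
```

Because doc_b points in the same direction as doc_a, its distance is about 0 even though it is twice as long. The Euclidean distance between the same two vectors would not be 0, which is one reason cosine distance is often preferred for documents of different lengths.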

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 SimonB
Solution 2 Henrique