Cosine Similarity between 2 Number Lists

Understanding Cosine Similarity: A Practical Guide for Number Lists

Explore cosine similarity, a fundamental metric for measuring the similarity between two non-zero vectors. Learn its mathematical foundation and practical implementation in Python for comparing number lists.

Cosine similarity is a widely used metric in data science, machine learning, and natural language processing for quantifying how similar two non-zero vectors are. Unlike Euclidean distance, which measures how far apart two vectors are, cosine similarity depends only on the angle between them. A smaller angle indicates higher similarity: an angle of 0 degrees (cosine of 1) means the vectors point in the same direction, an angle of 90 degrees (cosine of 0) means they are orthogonal and share no similarity, and an angle of 180 degrees (cosine of -1) means they point in opposite directions. This article covers the concept, its mathematical formulation, and a clear Python implementation for comparing number lists.
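To make the contrast with Euclidean distance concrete, the following minimal sketch compares a vector with a scaled copy of itself. The helper names euclidean_distance and cosine are chosen here for illustration only.

```python
import math

def euclidean_distance(u, v):
    # Straight-line distance between the two points.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    # Dot product divided by the product of the two magnitudes.
    dot = sum(a * b for a, b in zip(u, v))
    mag_u = math.sqrt(sum(a * a for a in u))
    mag_v = math.sqrt(sum(b * b for b in v))
    return dot / (mag_u * mag_v)

v = [1, 2]
scaled = [10, 20]  # same direction as v, ten times longer

dist = euclidean_distance(v, scaled)
cos = cosine(v, scaled)
print(f"Euclidean distance: {dist:.4f}")
print(f"Cosine similarity:  {cos:.4f}")
```

The distance grows with the scale factor, while the cosine stays at 1 (up to floating-point rounding) because the direction has not changed.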

The Mathematical Foundation of Cosine Similarity

At its core, cosine similarity is derived from the dot product of two vectors and their magnitudes. For two vectors, A and B, the cosine similarity is calculated as:

similarity = cos(θ) = (A ⋅ B) / (||A|| * ||B||)

Figure: Cosine Similarity Formula. Vectors A and B are drawn from the origin of a 2D Cartesian plane, with the angle θ between them labeled.

Where:

  • A ⋅ B represents the dot product of vectors A and B. For two lists of numbers, A = [a1, a2, ..., an] and B = [b1, b2, ..., bn], the dot product is a1*b1 + a2*b2 + ... + an*bn.
  • ||A|| denotes the Euclidean magnitude (or L2 norm) of vector A, calculated as sqrt(a1^2 + a2^2 + ... + an^2).
  • ||B|| denotes the Euclidean magnitude of vector B, calculated similarly.
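As a worked example of these three quantities, the sketch below evaluates the formula step by step for two small lists. The values A = [1, 2, 3] and B = [4, 5, 6] are arbitrary and chosen only to keep the arithmetic easy to follow.

```python
import math

A = [1, 2, 3]
B = [4, 5, 6]

dot = sum(a * b for a, b in zip(A, B))      # 1*4 + 2*5 + 3*6 = 32
norm_a = math.sqrt(sum(a * a for a in A))   # sqrt(1 + 4 + 9) = sqrt(14)
norm_b = math.sqrt(sum(b * b for b in B))   # sqrt(16 + 25 + 36) = sqrt(77)

similarity = dot / (norm_a * norm_b)        # 32 / sqrt(14 * 77)
print(f"{similarity:.4f}")                  # 0.9746
```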

The result ranges from -1 to 1:

  • 1: Indicates that the vectors are identical in direction.
  • 0: Indicates that the vectors are orthogonal (no similarity).
  • -1: Indicates that the vectors are diametrically opposed (completely dissimilar).
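Because the score is the cosine of an angle, it can be converted back to degrees when an angle is easier to reason about. The helper below is a small sketch (the name angle_in_degrees is an assumption, not part of any library). It clamps the input first, since a similarity computed in floating point can land marginally outside the range -1 to 1, which would make math.acos raise an error.

```python
import math

def angle_in_degrees(similarity):
    # Clamp to [-1, 1] so rounding error cannot push the value
    # outside the domain of acos.
    clamped = max(-1.0, min(1.0, similarity))
    return math.degrees(math.acos(clamped))

for value in (1.0, 0.0, -1.0):
    print(f"similarity {value:+.1f} -> {angle_in_degrees(value):.1f} degrees")
```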

Implementing Cosine Similarity in Python

Python provides straightforward ways to implement cosine similarity, either from scratch or by leveraging libraries like NumPy for optimized performance. We will demonstrate both approaches, focusing on clarity for lists of numbers.

import math

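def dot_product(vec1, vec2):
    """Sum of the element-wise products of two equal-length lists."""
    return sum(a * b for a, b in zip(vec1, vec2))

def magnitude(vec):
    """Euclidean (L2) norm of a list of numbers."""
    return math.sqrt(sum(a**2 for a in vec))

def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two equal-length number lists."""
    if not vec1 or not vec2:
        raise ValueError("Vectors cannot be empty")
    # zip() silently truncates to the shorter list, so check lengths explicitly.
    if len(vec1) != len(vec2):
        raise ValueError("Vectors must have the same dimension")

    dot = dot_product(vec1, vec2)
    mag1 = magnitude(vec1)
    mag2 = magnitude(vec2)

    # The cosine is undefined for a zero vector. This implementation
    # returns 0.0; raising an error is an equally valid choice.
    if mag1 == 0 or mag2 == 0:
        return 0.0

    return dot / (mag1 * mag2)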

# Example usage:
vector_a = [1, 1, 0, 1, 0, 1, 0, 0, 1]
vector_b = [0, 1, 1, 0, 1, 0, 1, 1, 1]

similarity = cosine_similarity(vector_a, vector_b)
print(f"Cosine Similarity (Manual): {similarity}")

vector_c = [3, 4]
vector_d = [6, 8]
similarity_parallel = cosine_similarity(vector_c, vector_d)
print(f"Cosine Similarity (Parallel): {similarity_parallel}")

vector_e = [1, 0]
vector_f = [0, 1]
similarity_orthogonal = cosine_similarity(vector_e, vector_f)
print(f"Cosine Similarity (Orthogonal): {similarity_orthogonal}")

vector_g = [1, 2]
vector_h = [-1, -2]
similarity_opposite = cosine_similarity(vector_g, vector_h)
print(f"Cosine Similarity (Opposite): {similarity_opposite}")

Manual Python implementation of cosine similarity
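Returning 0.0 for a zero-magnitude vector is a convention rather than a mathematical result, since a vector with no length has no direction. If a silent 0.0 could mask bad input, one alternative is to raise instead. The sketch below shows that variant; the name cosine_similarity_strict is hypothetical.

```python
import math

def cosine_similarity_strict(vec1, vec2):
    # Same computation as the manual version, but zero vectors are rejected.
    if len(vec1) != len(vec2):
        raise ValueError("Vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(vec1, vec2))
    mag1 = math.sqrt(sum(a * a for a in vec1))
    mag2 = math.sqrt(sum(b * b for b in vec2))
    if mag1 == 0 or mag2 == 0:
        # Also covers empty lists, whose magnitude is 0.
        raise ValueError("Cosine similarity is undefined for a zero vector")
    return dot / (mag1 * mag2)

try:
    cosine_similarity_strict([0, 0], [1, 2])
except ValueError as exc:
    print(f"Rejected: {exc}")
```

Which behavior is preferable depends on the caller: ranking code often tolerates a 0.0 score, while data validation usually benefits from the explicit error.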

Leveraging NumPy for Efficiency

For large datasets or performance-critical applications, NumPy provides highly optimized functions for vector operations, making it the preferred choice for numerical computations in Python.

import numpy as np

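def cosine_similarity_numpy(vec1, vec2):
    # Convert to float arrays so integer inputs cannot overflow.
    vec1 = np.asarray(vec1, dtype=float)
    vec2 = np.asarray(vec2, dtype=float)

    if vec1.size == 0 or vec2.size == 0:
        raise ValueError("Vectors cannot be empty")
    if vec1.shape != vec2.shape:
        raise ValueError("Vectors must have the same dimension")

    dot = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)  # Euclidean (L2) norm by default
    norm_vec2 = np.linalg.norm(vec2)

    # Same zero-vector convention as the manual implementation.
    if norm_vec1 == 0 or norm_vec2 == 0:
        return 0.0

    # Return a plain Python float rather than a NumPy scalar.
    return float(dot / (norm_vec1 * norm_vec2))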

# Example usage with NumPy:
vector_a_np = [1, 1, 0, 1, 0, 1, 0, 0, 1]
vector_b_np = [0, 1, 1, 0, 1, 0, 1, 1, 1]

similarity_np = cosine_similarity_numpy(vector_a_np, vector_b_np)
print(f"Cosine Similarity (NumPy): {similarity_np}")

vector_c_np = [3, 4]
vector_d_np = [6, 8]
similarity_parallel_np = cosine_similarity_numpy(vector_c_np, vector_d_np)
print(f"Cosine Similarity (NumPy, Parallel): {similarity_parallel_np}")

NumPy implementation of cosine similarity
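NumPy's advantage is most visible when many vectors are compared at once. As a rough sketch, assuming every row of a 2D array is one vector, normalizing each row to unit length and multiplying the result by its own transpose yields all pairwise similarities in a single step. The function name pairwise_cosine is made up for this example.

```python
import numpy as np

def pairwise_cosine(matrix):
    # Each row of `matrix` is treated as one vector.
    X = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    if np.any(norms == 0):
        raise ValueError("Rows must have non-zero magnitude")
    unit = X / norms      # scale every row to unit length
    return unit @ unit.T  # entry (i, j) is the cosine between rows i and j

rows = [[3, 4], [6, 8], [4, -3]]
sims = pairwise_cosine(rows)
print(np.round(sims, 4))
```

Here the first two rows are parallel and the third is orthogonal to both, so the matrix holds values close to 1 and 0 in the corresponding positions.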

Practical Applications

Cosine similarity is used across many domains:

  • Information Retrieval: Ranking documents by their relevance to a search query.
  • Recommender Systems: Suggesting similar items (e.g., movies, products) to users based on their preferences.
  • Text Analysis: Comparing the similarity of two documents or sentences based on word embeddings or TF-IDF vectors.
  • Image Processing: Comparing image features represented as vectors.

Its focus on direction rather than magnitude makes it particularly useful when vector length is not a significant factor, such as in text analysis where document length might vary but the topical content could be similar.
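The point about document length can be illustrated with a toy bag-of-words comparison. This is a deliberately simplified sketch: it splits on whitespace and uses raw word counts, whereas real pipelines typically use TF-IDF weights or embeddings as mentioned above. The function name bag_of_words_similarity and the sample texts are invented for the example.

```python
import math
from collections import Counter

def bag_of_words_similarity(text1, text2):
    # Count words in each text (naive whitespace tokenization).
    counts1 = Counter(text1.lower().split())
    counts2 = Counter(text2.lower().split())
    # Build both vectors over the shared vocabulary so positions line up.
    vocabulary = sorted(set(counts1) | set(counts2))
    vec1 = [counts1[word] for word in vocabulary]
    vec2 = [counts2[word] for word in vocabulary]
    dot = sum(a * b for a, b in zip(vec1, vec2))
    mag1 = math.sqrt(sum(a * a for a in vec1))
    mag2 = math.sqrt(sum(b * b for b in vec2))
    if mag1 == 0 or mag2 == 0:
        return 0.0
    return dot / (mag1 * mag2)

short_doc = "cats chase mice"
long_doc = "cats chase mice " * 3  # same content, three times as long
unrelated = "stocks fell sharply"

same_topic = bag_of_words_similarity(short_doc, long_doc)
different_topic = bag_of_words_similarity(short_doc, unrelated)
print(f"short vs long:      {same_topic:.4f}")
print(f"short vs unrelated: {different_topic:.4f}")
```

The longer document has larger counts but the same proportions, so its score against the short one is 1 up to rounding, while the texts with no shared words score 0.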

  1. Define your vectors: Prepare your numerical data as lists or NumPy arrays. Ensure they have the same number of dimensions.
  2. Choose your implementation: Decide whether to use a manual Python implementation for clarity or NumPy for performance with larger datasets.
  3. Handle edge cases: Implement checks for empty vectors or vectors with zero magnitude to avoid division by zero errors.
  4. Interpret the result: A value close to 1 indicates high similarity, 0 indicates no similarity, and -1 indicates complete opposition.