Cosine Similarity between 2 Number Lists
Understanding Cosine Similarity: A Practical Guide for Number Lists
Explore cosine similarity, a fundamental metric for measuring the similarity between two non-zero vectors. Learn its mathematical foundation and practical implementation in Python for comparing number lists.
Cosine similarity is a widely used metric in data science, machine learning, and natural language processing to quantify the similarity between two non-zero vectors. Unlike Euclidean distance, which measures the magnitude of the difference between vectors, cosine similarity focuses on the angle between them. A smaller angle indicates higher similarity, with an angle of 0 degrees (cosine of 1) meaning identical direction, and an angle of 90 degrees (cosine of 0) meaning no similarity (orthogonality). This article will delve into the concept, its mathematical formulation, and provide a clear Python implementation for comparing number lists.
The Mathematical Foundation of Cosine Similarity
At its core, cosine similarity is derived from the dot product of two vectors and their magnitudes. For two vectors, A and B, the cosine similarity is calculated as:
cosine_similarity(A, B) = (A ⋅ B) / (||A|| * ||B||)
Where:
- A ⋅ B represents the dot product of vectors A and B. For two lists of numbers, A = [a1, a2, ..., an] and B = [b1, b2, ..., bn], the dot product is a1*b1 + a2*b2 + ... + an*bn.
- ||A|| denotes the Euclidean magnitude (or L2 norm) of vector A, calculated as sqrt(a1^2 + a2^2 + ... + an^2).
- ||B|| denotes the Euclidean magnitude of vector B, calculated similarly.
The result ranges from -1 to 1:
- 1: Indicates that the vectors are identical in direction.
- 0: Indicates that the vectors are orthogonal (no similarity).
- -1: Indicates that the vectors are diametrically opposed (completely dissimilar).
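To make this concrete, consider A = [1, 2, 3] and B = [4, 5, 6]. The dot product is 1*4 + 2*5 + 3*6 = 32, the magnitudes are ||A|| = sqrt(1 + 4 + 9) ≈ 3.742 and ||B|| = sqrt(16 + 25 + 36) ≈ 8.775, so the cosine similarity is 32 / (3.742 * 8.775) ≈ 0.97, indicating that the two lists point in nearly the same direction.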
Implementing Cosine Similarity in Python
Python provides straightforward ways to implement cosine similarity, either from scratch or by leveraging libraries like NumPy for optimized performance. We will demonstrate both approaches, focusing on clarity for lists of numbers.
import math
def dot_product(vec1, vec2):
    # Sum of element-wise products: a1*b1 + a2*b2 + ... + an*bn
    return sum(a * b for a, b in zip(vec1, vec2))

def magnitude(vec):
    # Euclidean (L2) norm: sqrt(a1^2 + a2^2 + ... + an^2)
    return math.sqrt(sum(a**2 for a in vec))

def cosine_similarity(vec1, vec2):
    if not vec1 or not vec2:
        raise ValueError("Vectors cannot be empty")
    if len(vec1) != len(vec2):
        raise ValueError("Vectors must have the same dimension")
    dot = dot_product(vec1, vec2)
    mag1 = magnitude(vec1)
    mag2 = magnitude(vec2)
    if mag1 == 0 or mag2 == 0:
        return 0.0  # Or handle as an error, depending on desired behavior
    return dot / (mag1 * mag2)
# Example usage:
vector_a = [1, 1, 0, 1, 0, 1, 0, 0, 1]
vector_b = [0, 1, 1, 0, 1, 0, 1, 1, 1]
similarity = cosine_similarity(vector_a, vector_b)
print(f"Cosine Similarity (Manual): {similarity}")
vector_c = [3, 4]
vector_d = [6, 8]
similarity_parallel = cosine_similarity(vector_c, vector_d)
print(f"Cosine Similarity (Parallel): {similarity_parallel}")
vector_e = [1, 0]
vector_f = [0, 1]
similarity_orthogonal = cosine_similarity(vector_e, vector_f)
print(f"Cosine Similarity (Orthogonal): {similarity_orthogonal}")
vector_g = [1, 2]
vector_h = [-1, -2]
similarity_opposite = cosine_similarity(vector_g, vector_h)
print(f"Cosine Similarity (Opposite): {similarity_opposite}")
Manual Python implementation of cosine similarity
Returning 0.0 for zero-magnitude vectors is a common practice, implying no directional similarity.
Leveraging NumPy for Efficiency
For large datasets or performance-critical applications, NumPy provides highly optimized functions for vector operations, making it the preferred choice for numerical computations in Python.
import numpy as np
def cosine_similarity_numpy(vec1, vec2):
    # Convert inputs to NumPy arrays so plain Python lists are also accepted
    vec1 = np.array(vec1)
    vec2 = np.array(vec2)
    if vec1.size == 0 or vec2.size == 0:
        raise ValueError("Vectors cannot be empty")
    if vec1.shape != vec2.shape:
        raise ValueError("Vectors must have the same dimension")
    dot = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    if norm_vec1 == 0 or norm_vec2 == 0:
        return 0.0
    return dot / (norm_vec1 * norm_vec2)
# Example usage with NumPy:
vector_a_np = [1, 1, 0, 1, 0, 1, 0, 0, 1]
vector_b_np = [0, 1, 1, 0, 1, 0, 1, 1, 1]
similarity_np = cosine_similarity_numpy(vector_a_np, vector_b_np)
print(f"Cosine Similarity (NumPy): {similarity_np}")
vector_c_np = [3, 4]
vector_d_np = [6, 8]
similarity_parallel_np = cosine_similarity_numpy(vector_c_np, vector_d_np)
print(f"Cosine Similarity (NumPy, Parallel): {similarity_parallel_np}")
NumPy implementation of cosine similarity
Practical Applications
Cosine similarity is incredibly versatile and finds applications in various domains:
- Information Retrieval: Ranking documents by their relevance to a search query.
- Recommender Systems: Suggesting similar items (e.g., movies, products) to users based on their preferences.
- Text Analysis: Comparing the similarity of two documents or sentences based on word embeddings or TF-IDF vectors.
- Image Processing: Comparing image features represented as vectors.
Its focus on direction rather than magnitude makes it particularly useful when vector length is not a significant factor, such as in text analysis where document length might vary but the topical content could be similar.
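For batch comparisons, such as scoring one query vector against many document vectors, dedicated library helpers are usually preferred over a hand-rolled loop. As a minimal sketch, assuming scikit-learn is installed and that the rows below stand in for pre-computed feature vectors (e.g., TF-IDF weights, invented here for illustration), its pairwise helper returns a full similarity matrix in one call:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical feature vectors: one row per document, one column per term weight.
documents = np.array([
    [0.0, 1.2, 0.0, 0.8],
    [0.5, 1.0, 0.0, 0.9],
    [1.1, 0.0, 0.7, 0.0],
])

# Entry (i, j) of the result is the cosine similarity between documents i and j.
similarity_matrix = cosine_similarity(documents)
print(similarity_matrix)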
1. Define your vectors: Prepare your numerical data as lists or NumPy arrays, and ensure they have the same number of dimensions.
2. Choose your implementation: Decide whether to use a manual Python implementation for clarity or NumPy for performance with larger datasets.
3. Handle edge cases: Implement checks for empty vectors or vectors with zero magnitude to avoid division-by-zero errors.
4. Interpret the result: A value close to 1 indicates high similarity, 0 indicates no similarity, and -1 indicates complete opposition (a brief end-to-end sketch follows below).
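Putting the four steps together, here is a minimal end-to-end sketch; the rating vectors and variable names are hypothetical, invented purely for illustration:
import math

# Step 1: define two equal-length vectors (hypothetical user ratings).
ratings_user_a = [5, 3, 0, 4]
ratings_user_b = [4, 0, 0, 5]

# Step 2: a compact manual implementation (NumPy would work equally well).
dot = sum(a * b for a, b in zip(ratings_user_a, ratings_user_b))
mag_a = math.sqrt(sum(a ** 2 for a in ratings_user_a))
mag_b = math.sqrt(sum(b ** 2 for b in ratings_user_b))

# Step 3: guard against zero-magnitude vectors before dividing.
similarity = 0.0 if mag_a == 0 or mag_b == 0 else dot / (mag_a * mag_b)

# Step 4: interpret the result (closer to 1 means more similar preferences).
print(f"User similarity: {similarity:.3f}")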