Cosine similarity and Euclidean distance play an important role in machine learning, helping machines systematically 'understand' data. Euclidean distance measures the straight-line distance between two points, while cosine similarity measures the angle between two vectors, focusing on their direction rather than their magnitude. In other words, Euclidean distance is sensitive to the magnitude of the vectors, while cosine similarity ignores magnitude and looks only at direction. Cosine similarity is therefore helpful for comparing documents, text data, or other situations where the direction of the vectors matters more than their length, while Euclidean distance is useful for applications where the absolute distance between points is important, such as spatial or numerical data.
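To make that distinction concrete, here is a small illustrative sketch (the vectors are made up purely for demonstration): two vectors pointing in the same direction but with different magnitudes have a cosine similarity of 1, yet a large Euclidean distance.

import numpy as np

# Two vectors with the same direction but different magnitudes
a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])   # b = 10 * a

# Cosine similarity ignores magnitude, so the result is 1.0
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean distance is sensitive to magnitude, so the result is large
euclid = np.linalg.norm(a - b)

print(f"Cosine similarity: {cos_sim:.4f}")    # 1.0000
print(f"Euclidean distance: {euclid:.2f}")    # about 33.67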
Let's take a quick look at both:
Cosine Similarity
In document vectors, attributes represent either the presence or absence of a word. It is possible to construct a more informative vector that records the number of occurrences of each word in the document, instead of just 1 and 0. Document datasets are usually long vectors with thousands of variables or attributes. For simplicity, consider the example vectors X = (1, 2, 0, 0, 3, 4, 0) and Y = (5, 0, 0, 6, 7, 0, 0). The cosine similarity measure for two data points is given by:
Cosine similarity(X, Y) = (x ⋅ y) / (||x|| ||y||)

where x ⋅ y is the dot product of the x and y vectors. For this example,

x ⋅ y = (1)(5) + (2)(0) + (0)(0) + (0)(6) + (3)(7) + (4)(0) + (0)(0) = 26

and

||x|| = √(1² + 2² + 0² + 0² + 3² + 4² + 0²) = √30 ≈ 5.48
||y|| = √(5² + 0² + 0² + 6² + 7² + 0² + 0²) = √110 ≈ 10.49

so the cosine similarity is 26 / (5.48 × 10.49) ≈ 0.45.
Cosine similarity is one of the most widely used similarity measures, but the choice of the optimal measure ultimately depends on the structure of the data. The choice of distance or similarity measure can also be treated as a parameter: multiple models are created, each with a different measure, and the model whose measure best fits the data with the smallest generalization error indicates the appropriate proximity measure. A simple implementation using numpy:
import numpy as np

def cosine_similarity(a, b):
    # Compute the vector norms ||a|| and ||b||
    a_norm = np.linalg.norm(a)
    b_norm = np.linalg.norm(b)
    # Guard against division by zero for zero vectors
    if a_norm == 0 or b_norm == 0:
        return 0.0
    # Dot product divided by the product of the norms
    return np.dot(a, b) / (a_norm * b_norm)

vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 6])
similarity = cosine_similarity(vector1, vector2)
print(f"Cosine similarity: {similarity}")
What is Euclidean distance?
It's the straight-line distance between two points in space — like measuring with a ruler.
Here's a simple explanation of Euclidean distance in mathematical terms:
In 2D (two dimensions)

If you have two points:

- Point A: (x₁, y₁)
- Point B: (x₂, y₂)

The Euclidean distance between them is:

d(A, B) = √((x₂ − x₁)² + (y₂ − y₁)²)
This is based on the Pythagorean theorem — the legs of a right triangle are the differences in x and y, and the distance is the hypotenuse.
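For example, for the (arbitrarily chosen) points A = (1, 2) and B = (4, 6), the legs are 4 − 1 = 3 and 6 − 2 = 4, so the distance is √(3² + 4²) = √25 = 5.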
In 3D (three dimensions)

If the points are:

- Point A: (x₁, y₁, z₁)
- Point B: (x₂, y₂, z₂)

Then:

d(A, B) = √((x₂ − x₁)² + (y₂ − y₁)² + (z₂ − z₁)²)
General formula (for n dimensions):

If points A and B have coordinates:

- A = (a₁, a₂, ..., aₙ)
- B = (b₁, b₂, ..., bₙ)

Then:

d(A, B) = √((a₁ − b₁)² + (a₂ − b₂)² + ... + (aₙ − bₙ)²)
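Mirroring the cosine similarity snippet above, here is a minimal numpy sketch of the general n-dimensional formula (the helper name euclidean_distance is just illustrative):

import numpy as np

def euclidean_distance(a, b):
    # Square the coordinate-wise differences, sum them, and take the square root
    return np.sqrt(np.sum((a - b) ** 2))
    # Equivalently: np.linalg.norm(a - b)

point_a = np.array([1, 2, 3])
point_b = np.array([4, 5, 6])
distance = euclidean_distance(point_a, point_b)
print(f"Euclidean distance: {distance}")   # √27, about 5.196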