A write-up on the WWW '24 paper by H. Steck et al., "Is Cosine-Similarity of Embeddings Really About Similarity?", Netflix Inc. & Cornell University, 2024.
Acknowledgements: This post was written by Amarpreet Kaur and reviewed by Tullie Murrell.
The Promise and Peril of Cosine Similarity
Cosine similarity, which measures the cosine of the angle between two vectors, has found widespread use in applications from recommender systems to natural language processing. Its popularity rests on the intuition that normalizing away vector magnitudes isolates directional alignment, yielding a more meaningful measure of similarity than the raw, unnormalized dot product.
However, the research team, led by Harald Steck, Chaitanya Ekanadham, and Nathan Kallus, has uncovered a significant issue: in certain scenarios, cosine similarity can yield arbitrary results, potentially rendering the metric unreliable and opaque.
Unraveling the Mystery: A Deep Dive into Matrix Factorization
To understand the root of this problem, the researchers focused on linear Matrix Factorization (MF) models, which allow for closed-form solutions and theoretical analysis. These models are commonly used in recommender systems and other applications to learn low-dimensional embeddings of discrete entities.
The study examined two popular training objectives for MF models:

1. min_{A,B} ||X − XAB^T||²_F + λ||AB^T||²_F, which regularizes only the product AB^T
2. min_{A,B} ||X − XAB^T||²_F + λ(||XA||²_F + ||B||²_F), which regularizes each matrix individually

where X is the input data matrix, A and B are the learned embedding matrices, λ is a regularization parameter, and ||·||_F denotes the Frobenius norm.
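To make the two objectives concrete, here is a minimal NumPy sketch of both losses; the toy shapes and random data are assumptions for illustration, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k, lam = 50, 20, 5, 0.1          # toy sizes and regularization strength
X = rng.standard_normal((n, p))        # stand-in data matrix
A = rng.standard_normal((p, k))        # embedding matrices to be learned
B = rng.standard_normal((p, k))

def objective_1(X, A, B, lam):
    """||X - X A B^T||_F^2 + lam * ||A B^T||_F^2  (regularizes the product)."""
    M = A @ B.T
    return np.linalg.norm(X - X @ M, "fro") ** 2 + lam * np.linalg.norm(M, "fro") ** 2

def objective_2(X, A, B, lam):
    """||X - X A B^T||_F^2 + lam * (||X A||_F^2 + ||B||_F^2)  (regularizes each factor)."""
    return (np.linalg.norm(X - X @ A @ B.T, "fro") ** 2
            + lam * (np.linalg.norm(X @ A, "fro") ** 2
                     + np.linalg.norm(B, "fro") ** 2))
```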
The Culprit: Regularization and Degrees of Freedom
The researchers discovered that the first objective, which is equivalent to learning with denoising or dropout, introduces a critical degree of freedom in the learned embeddings. This freedom allows for arbitrary rescaling of the embedding dimensions without affecting the model's predictions.
Mathematically, if Â and B̂ are solutions to the first objective, then ÂD and B̂D^(-1) are also solutions for any invertible diagonal matrix D. This rescaling changes how the learned embeddings are normalized, which in turn changes the cosine similarities between them, even though the model's predictions X·Â·B̂^T remain identical.
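This invariance is easy to check numerically. The sketch below uses random matrices as stand-ins for a learned solution and verifies that rescaling by an arbitrary diagonal D leaves the predictions untouched while changing the item-item cosine similarities:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 50, 20, 5
X = rng.standard_normal((n, p))
A_hat = rng.standard_normal((p, k))    # stand-ins for a learned solution
B_hat = rng.standard_normal((p, k))

D = np.diag(rng.uniform(0.1, 10.0, size=k))   # arbitrary positive diagonal rescaling
A2, B2 = A_hat @ D, B_hat @ np.linalg.inv(D)

# The predictions X A B^T are exactly the same for both solutions ...
assert np.allclose(X @ A_hat @ B_hat.T, X @ A2 @ B2.T)

def cos_sim(E):
    """Pairwise cosine similarities between the rows of E."""
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    return E @ E.T

# ... but the item-item cosine similarities are not.
print(np.abs(cos_sim(B_hat) - cos_sim(B2)).max())   # substantially > 0
```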
Striking Examples of Arbitrary Results
The study presents some eye-opening examples of how this arbitrariness can manifest:
1. In a full-rank MF model, by choosing D appropriately, the item-item cosine similarities can be made to equal the identity matrix. This bizarre result suggests that each item is only similar to itself and completely dissimilar to all other items! (A numerical sketch of this example follows below.)
2. With a different choice of D, the user-user cosine similarities reduce to simply ΩA · X · X^T · ΩA, where X is the raw data matrix and ΩA is a diagonal matrix that merely normalizes the rows. This means the similarities are based solely on the raw data, without any benefit from the learned embeddings.
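The first example can be reproduced in a few lines. The sketch below assumes the standard ridge-regression closed form M = (X^T·X + λI)^(-1)·X^T·X for the full-rank, unconstrained version of the first objective (a simplification for illustration, not the paper's exact derivation), factors M into ÂB̂^T, and picks the diagonal rescaling that turns B̂ into an orthogonal matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 20, 0.1
X = rng.standard_normal((n, p))          # toy data matrix (an assumption)

# Full-rank minimizer of ||X - XM||_F^2 + lam*||M||_F^2 (ridge form):
# M = (X^T X + lam*I)^(-1) X^T X = V diag(s^2 / (s^2 + lam)) V^T
_, s, Vt = np.linalg.svd(X, full_matrices=False)
V, shrink = Vt.T, s**2 / (s**2 + lam)

# M factors as (V S^(1/2) D)(V S^(1/2) D^(-1))^T for any diagonal D.
# Choosing D = S^(1/2) puts all the scaling into A and leaves B = V:
A_hat = V @ np.diag(shrink)              # V S^(1/2) * S^(1/2)
B_hat = V                                # V S^(1/2) * S^(-1/2)
assert np.allclose(A_hat @ B_hat.T, V @ np.diag(shrink) @ Vt)

def cos_sim(E):
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    return E @ E.T

# The rows of the orthogonal matrix V are orthonormal, so the item-item
# cosine similarities collapse to the identity matrix:
print(np.abs(cos_sim(B_hat) - np.eye(p)).max())   # ~0
```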
The Second Objective: A Unique but Potentially Suboptimal Solution
The researchers found that the second objective, which regularizes each matrix individually, leads to a solution that is unique up to rotations. Since a rotation applied to all embeddings preserves norms and inner products, it leaves cosine similarities unchanged, so this objective avoids the arbitrariness issue. Even so, it is unclear whether the resulting cosine similarities are optimal for capturing semantic relationships.
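That rotational invariance is simple to verify with random stand-in embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
B_hat = rng.standard_normal((20, 5))            # stand-in item embeddings

# Build a random rotation (orthogonal matrix) via QR decomposition.
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))

def cos_sim(E):
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    return E @ E.T

# Rotating every embedding by the same Q leaves all cosine similarities intact.
assert np.allclose(cos_sim(B_hat), cos_sim(B_hat @ Q))
```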
Implications Beyond Linear Models
Although the study focused on linear MF models, the authors caution that similar issues may arise in more complex scenarios:
1. Deep learning models often employ a combination of different regularization techniques, which could have unintended effects on cosine similarities of the resulting embeddings.
2. The practice of applying cosine similarity to embeddings learned through dot product optimization may lead to opaque and potentially meaningless results.
Potential Solutions and Alternatives
The researchers suggest several approaches to address these issues:
1. Train models directly with respect to cosine similarity, possibly facilitated by techniques like layer normalization (see the sketch after this list).
2. Avoid working in the embedding space entirely. Instead, project the embeddings back to the original space before applying cosine similarity.
3. Apply normalization or reduce popularity bias before or during the learning process, rather than only normalizing after learning as done in cosine similarity.
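As an illustration of the first suggestion, here is a hypothetical PyTorch sketch of a matrix factorization model that L2-normalizes its embeddings in the forward pass, so training optimizes the same scale-free cosine geometry used for retrieval later (the class name, shapes, and hyperparameters are invented for this example, not taken from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineMF(nn.Module):
    """MF model scored with cosine similarity rather than a raw dot product."""

    def __init__(self, n_users: int, n_items: int, dim: int):
        super().__init__()
        self.user = nn.Embedding(n_users, dim)
        self.item = nn.Embedding(n_items, dim)

    def forward(self, u: torch.Tensor, i: torch.Tensor) -> torch.Tensor:
        # L2-normalize both embeddings, so the dot product below *is* the
        # cosine similarity and the magnitude degree of freedom never
        # enters the training loss.
        u_vec = F.normalize(self.user(u), dim=-1)
        i_vec = F.normalize(self.item(i), dim=-1)
        return (u_vec * i_vec).sum(dim=-1)

model = CosineMF(n_users=1000, n_items=500, dim=32)
scores = model(torch.tensor([0, 1]), torch.tensor([42, 7]))   # values in [-1, 1]
```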
Alternatives to Cosine Similarity for Semantic Analysis
Given the limitations of cosine similarity for semantic analysis in embedding models, several alternative approaches have been proposed:
- Euclidean distance: While less popular for text data due to sensitivity to vector magnitudes, it can be effective when embeddings are properly normalized.
- Dot product: The unnormalized dot product between embedded vectors has been found to outperform cosine similarity in some applications, particularly for dense passage retrieval and question answering tasks.
- Soft cosine similarity: This method incorporates a word-to-word similarity matrix into the inner product, so that related but non-identical terms still contribute to the score, potentially offering more nuanced comparisons.
- Semantic Textual Similarity (STS) prediction: Fine-tuned models trained specifically for semantic similarity tasks, such as STSScore, have shown promise in providing more robust and interpretable similarity measures.
- Normalized embeddings with cosine similarity: Applying normalization techniques like layer normalization before using cosine similarity can help mitigate some of its shortcomings.
When selecting an alternative, it's crucial to consider the specific requirements of the task, the nature of the data, and the model architecture being used. Empirical evaluation on domain-specific datasets is often necessary to determine the most suitable similarity measure for a given application.
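As a starting point for such an evaluation, the toy sketch below scores one query embedding against a set of random stand-in embeddings under three of the measures above and prints each measure's nearest neighbors; all data here is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.standard_normal((100, 32))     # stand-in learned embeddings
q = E[0]                               # use the first embedding as the query

dot = E @ q                                                   # unnormalized dot product
euclid = np.linalg.norm(E - q, axis=1)                        # Euclidean distance
cosine = dot / (np.linalg.norm(E, axis=1) * np.linalg.norm(q))

# Higher is better for dot/cosine; lower is better for Euclidean distance.
for name, scores, descending in [("dot", dot, True),
                                 ("euclidean", euclid, False),
                                 ("cosine", cosine, True)]:
    order = np.argsort(-scores if descending else scores)
    print(name, order[1:6])            # top-5 neighbors, excluding the query
```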
Rethinking AI Tools: Implications for Developers and the Machine Learning Community
Netflix’s study highlights the critical need for developers and the AI community to scrutinize widely accepted tools and techniques, particularly those underpinning recommendation systems, LLMs, and vector stores. By exposing how cosine similarity applied to learned embeddings can yield arbitrary and unreliable results, the research calls for more nuanced approaches: exploring alternative similarity measures and developing a deeper understanding of how regularization shapes the geometry of embedding spaces. Above all, the study urges the community to question assumptions, rely on critical analysis, and run task-specific evaluations when building robust, reliable AI systems for real-world challenges.