A typical evaluation workflow involves two main stages: offline and online evaluation. Offline evaluation uses historical data to measure how well the model predicts user interactions, using metrics like precision and recall. Once the model performs well offline, it moves to online evaluation, where it is tested in a live environment using A/B testing to track its impact on real user behavior and business objectives such as impressions, clicks, and conversions.
Offline Evaluation
Offline evaluation assesses a model's performance using historical data in a controlled environment to predict user interactions and measure various metrics without impacting live users. This method is essential for tuning and improving models before deploying them into a live environment.
Evaluation Metrics
During offline evaluation, algorithms are quantitatively assessed by how well they predict relevant user interactions on a hold-out set of data using metrics such as precision, recall, mean average precision (MAP), and normalized discounted cumulative gain (NDCG). Qualitative evaluation also plays a role, involving an examination of the descriptive analytics of the recommendations, such as the distribution and diversity of recommendations in the top-k results.
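To make these metrics concrete, here is a minimal sketch of precision@k and NDCG@k for a single user, assuming binary relevance (an item counts as relevant if it appears in the user's hold-out interactions). The function names and toy data are illustrative.

```python
import numpy as np

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that appear in the user's hold-out interactions."""
    top_k = recommended[:k]
    return len(set(top_k) & set(relevant)) / k

def ndcg_at_k(recommended, relevant, k):
    """NDCG@k with binary relevance: hits lower in the list are discounted logarithmically."""
    dcg = sum(
        1.0 / np.log2(rank + 2)  # rank is 0-indexed, hence +2
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / np.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: one user's ranked recommendations vs. their hold-out interactions.
recommended = ["a", "b", "c", "d", "e"]
relevant = {"b", "e", "f"}
print(precision_at_k(recommended, relevant, k=5))        # 0.4
print(round(ndcg_at_k(recommended, relevant, k=5), 3))
```

In practice, you would average these per-user scores across all users in the hold-out set.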
What’s a hold-out set?
A hold-out set is a subset of your data that you deliberately avoid training on so that you can test the model's performance on unseen data. For production machine-learning use cases like recommendation systems, it's important to split this hold-out set chronologically to test the model's performance on future data and avoid time-based data leakage.
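As a minimal sketch, a chronological split can be as simple as cutting the interaction log at a timestamp quantile, so that everything after the cutoff is treated as unseen "future" data. The column names below are assumptions about your interaction log (a sortable numeric or datetime timestamp column is assumed).

```python
import pandas as pd

def temporal_split(interactions: pd.DataFrame, holdout_fraction: float = 0.2):
    """Split an interaction log chronologically: the most recent holdout_fraction
    of events becomes the hold-out set; everything earlier is training data."""
    cutoff = interactions["timestamp"].quantile(1 - holdout_fraction)
    train = interactions[interactions["timestamp"] <= cutoff]
    holdout = interactions[interactions["timestamp"] > cutoff]
    return train, holdout
```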
Biases in Offline Evaluation Metrics
Offline evaluation metrics are a great way to get a sense of how well your model is performing; however, they can sometimes be misleading. The biggest problem is that predicting a hold-out set of interactions is not the same as predicting what your users will actually interact with. Offline evaluation is observational—we're evaluating how well we fit logged data—rather than interventional—evaluating how changing the recommendation algorithm leads to different outcomes (e.g. purchases). If the logged data is biased in any way, this can lead to misleading results. Here are some examples of bias commonly seen:
- Data Delivery Bias: Your interactions will be biased towards the historic delivery mechanism used to surface recommendations. For example, if you've only been showing users the most popular items for the last year, your interactions will have a significant bias towards popular items. In this case, the best algorithm on the hold-out set will typically be the same one you're already using to serve recommendations; however, this doesn't mean it's the best algorithm for your users.
- Cold-start bias: A special case of data delivery bias that is common enough to deserve its own point. Your interactions will be biased towards older or newer items; for example, new items may have fewer interactions, which means they're not weighted as highly within the hold-out set.
- Observational bias: Even in a perfect world with no data delivery biases, where all items were historically served completely randomly, the evaluation is still tied to an environment that isn't affected by the candidate recommendations themselves. Once the algorithm is deployed to production, the way users interact with items will change, and therefore so will the model's performance.
Mitigating Biases in Offline Evaluation
Several techniques have been developed to address the above biases:
- Counterfactual Evaluation: Counterfactual evaluation aims to estimate what would have happened if a different recommendation policy had been used. This involves modeling the biases and adjusting the evaluation metrics accordingly.
- Inverse Propensity Scoring (IPS): A method to adjust for exposure bias by weighting interactions based on their propensity scores, which estimate the probability that an item was seen by a user. The corrected metric provides an unbiased estimate of the true performance of the recommendation algorithm (see the sketch after this list).
- Debiasing Techniques: These techniques aim to correct for biases present in the data:
- Popularity Strata: Segregating test data into different strata based on item popularity ensures that evaluations do not disproportionately favor popular items.
- Equal Sampling: Ensuring an equal number of interactions for each item type, thereby reducing the impact of popularity bias.
- Unbiased Data Collection: Collecting random samples of user interactions can help create an unbiased dataset. However, this approach is resource-intensive and challenging to scale.
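To illustrate how IPS reweighting changes a metric, here is a minimal sketch of a self-normalized IPS hit-rate estimate. The propensity values and data shapes are assumptions; in practice, propensities come from (or are estimated from) your logging policy, and clipping is a common trick to keep the variance of the estimate manageable.

```python
import numpy as np

def ips_hit_rate(hits, propensities, clip=0.05):
    """Self-normalized IPS estimate of hit rate.

    hits:         1 if the held-out interaction was recovered in the top-k, else 0
    propensities: estimated probability that the logging policy exposed the item to the user
    clip:         lower bound on propensities to control variance
    """
    weights = 1.0 / np.clip(propensities, clip, 1.0)
    return float(np.sum(hits * weights) / np.sum(weights))

# Toy example: a hit on a rarely-exposed item (propensity 0.1) counts for more
# than a hit on a heavily-exposed item (propensity 0.9).
hits = np.array([1, 0, 1, 0])
propensities = np.array([0.9, 0.8, 0.1, 0.05])
print(round(ips_hit_rate(hits, propensities), 3))
```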
Addressing evaluation complexity
To address the complexities of offline evaluation, several techniques and methodologies are employed:
- Data Partitioning:
- Random Sampling: This involves randomly splitting the data into training and test sets. While simple, this method can ignore temporal dynamics in user interactions, which are crucial for evaluating recommendations.
- Temporal Sampling: Data is split based on time, with earlier interactions used for training and later interactions for testing. This approach helps simulate the real-world scenario where models predict future user behavior based on past interactions. It also avoids temporal data leakage, ensuring that the evaluation is realistic.
- Candidate Item Set Subsampling: Evaluated systems are often required to rank a subset of items rather than the entire catalog. This subset can be determined by various factors:
- Training Set Exclusion: Excluding items a user already interacted with in the training data, so the evaluation doesn't simply reward re-recommending items they've already seen.
- Popularity-based Sampling: Including a mix of popular and less popular items to ensure that the model's ability to recommend diverse items is tested.
- Dynamic Subsets: Using contextually relevant subsets, such as items that have recently become popular or those that align with current trends.
- Metrics Beyond Accuracy: Traditional accuracy metrics like precision and recall are necessary but not sufficient for a holistic evaluation of recommender systems. Other important metrics include:
- Novelty: Measures how new or unfamiliar the recommended items are to the user. High novelty can enhance user satisfaction by introducing users to new and interesting items.
- Diversity: Evaluates how varied the recommendations are. Diverse recommendations can cater to multiple user interests and prevent over-concentration on a narrow set of items (a minimal sketch of diversity and novelty metrics follows this list).
- Serendipity: Assesses the ability of the system to recommend items that are not only relevant but also pleasantly surprising.
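As a concrete example of beyond-accuracy metrics, here is a minimal sketch of two common variants: aggregate diversity measured as catalog coverage, and novelty measured as the average self-information of recommended items. Definitions vary across the literature, and the data shapes below are assumptions.

```python
import numpy as np

def catalog_coverage(recommendation_lists, catalog_size):
    """Aggregate diversity: fraction of the catalog appearing in at least one user's top-k list."""
    recommended = {item for rec_list in recommendation_lists for item in rec_list}
    return len(recommended) / catalog_size

def mean_novelty(recommendation_lists, interaction_counts, num_users):
    """Popularity-based novelty: average -log2(p(item)), where p(item) is the share of
    users who have interacted with the item. Rarer items score higher."""
    scores = [
        -np.log2(interaction_counts.get(item, 1) / num_users)
        for rec_list in recommendation_lists
        for item in rec_list
    ]
    return float(np.mean(scores))
```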
We recommend looking at the resources at the end of the article to understand common evaluation metrics in more detail.
Offline Metric Evaluation as a Compass
Considering all the issues with offline metric evaluation, how do we interpret the results?
We like to think of offline metric evaluation as a compass rather than a map. You can use it to understand characteristics of the model relative to baseline algorithms, but you shouldn't interpret the metrics too literally. For example, a precision of 10% doesn't mean that 10% of the items in a slate will be relevant in a live test; however, if it's 1% better than a trending baseline, that's a good sign the model is worth evaluating in an online setting. Note: even if it's 1% worse, it might still be worth evaluating online if its results are more diverse than the baseline's or you know the sampled data is severely biased towards the baseline.
User Drill Down Analysis
Within the offline evaluation stage, it's also critical to qualitatively evaluate the candidate model by closely examining a sample of recommendations from different users. For example, for a book recommendation model, you might find a user who has only interacted with romance books and confirm that the model is recommending mostly romance books to that user.
Evaluating the model in this way can help sanity check that everything is working as expected. If we see unexpected qualitative results despite seeing good quantitative results, it may mean the objective used to train/evaluate the model is incorrect.
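A drill-down doesn't need heavy tooling; something like the sketch below, which prints a few random users' recent history next to the model's top-k list, is often enough. The get_user_history and recommend helpers are hypothetical stand-ins for whatever your system exposes.

```python
import random

def drill_down(user_ids, get_user_history, recommend, sample_size=5, k=10, seed=42):
    """Print a side-by-side view of what a few random users did vs. what the model suggests."""
    rng = random.Random(seed)
    for user_id in rng.sample(list(user_ids), sample_size):
        print(f"User {user_id}")
        print("  recent history:", get_user_history(user_id)[-k:])
        print("  top-k recs:    ", recommend(user_id, k=k))
```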
The Problems With User Drill Down Analysis
The biggest issue with user drill down is the human biases that come in when evaluating the results. This typically happens in two ways: user-selection biases and product biases.
- User-selection biases: Say you're evaluating a recommendation model and, instead of a random user, you pick your own internal account. You know your interests best, so it might seem obvious to try yourself first. The problem is you are likely biased in ways related to being an employee at the company: you might have internal features that result in a different user experience than the average user, and your interactions may not reflect your true interests because you test the product constantly. Even choosing a random "power user" can be misleading, as these power users are often employees or have some other bias that makes them less useful to evaluate manually. We suggest choosing several random users when evaluating.
- Product biases: The other common human bias comes from preconceived product assumptions about what you think users are interested in, compared to what they're actually interested in. For example, assuming that a user's demographic is a good predictor of their interests when in fact it's not. Sometimes it's best not to be overly prescriptive about what you expect users to see and, as long as the results aren't badly wrong, let the online metrics speak for themselves.
Online Evaluation
Online evaluation occurs after you've deployed your model to production and are serving end-users with results from your algorithm. This is the gold standard of evaluation, as you can objectively track the impact of your model on your target business objectives (e.g. clicks, purchases) in an interventional way.
Online Evaluation Methods
- A/B Testing: Typically, when first deploying a new algorithm to production, you'll run an A/B test where you serve the new algorithm to a subset of users and compare the results to a control group served the old algorithm. This is the best way to understand the impact of the new algorithm on your business objectives relative to the old one, and it removes confounders that might otherwise distort the evaluation metrics (e.g. seasonality may affect purchase rates in ways unrelated to the recommendation algorithm).
- Multi-Armed Bandit Testing: Multi-armed bandit testing dynamically allocates traffic to different models based on their performance, balancing exploration (trying out different models) against exploitation (favoring the best-performing models). In practice, algorithms such as epsilon-greedy, UCB (Upper Confidence Bound), or Thompson Sampling decide which model to present to users at any given time. This lets the system efficiently identify and prioritize the most effective recommendation models, so the best models receive more exposure while weaker ones still get enough traffic to be fairly assessed (see the sketch after this list).
- Interleaving: Interleaving presents recommendations from different models to the same users by mixing items from each model within a single recommendation list. By tracking which items users interact with, you can determine which model's recommendations users prefer. Interleaving offers a direct comparison between models under identical conditions, and it's particularly useful for fine-tuning models and deciding which one to deploy broadly based on actual user behavior.
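To make the bandit idea concrete, here is a minimal Beta-Bernoulli Thompson Sampling sketch for deciding which candidate model serves each request, treating a click as a success. It's an illustrative skeleton under those assumptions, not a production traffic-allocation system.

```python
import random

class ThompsonSampler:
    """Beta-Bernoulli Thompson Sampling over a set of candidate recommendation models."""

    def __init__(self, model_names):
        # One Beta(successes + 1, failures + 1) posterior per model, starting uniform.
        self.successes = {name: 0 for name in model_names}
        self.failures = {name: 0 for name in model_names}

    def choose_model(self):
        # Sample a plausible click-through rate from each posterior and serve the best draw.
        samples = {
            name: random.betavariate(self.successes[name] + 1, self.failures[name] + 1)
            for name in self.successes
        }
        return max(samples, key=samples.get)

    def record_feedback(self, model_name, clicked):
        if clicked:
            self.successes[model_name] += 1
        else:
            self.failures[model_name] += 1
```

Over time, traffic concentrates on the model with the highest observed click-through rate, while the other models continue to receive occasional exploratory traffic.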
Pitfalls of Online Evaluation
The main problem with online evaluation is that it's time-consuming. It can take a while to set up correctly, particularly if you don't have a solid experimentation framework, and you have to wait for enough data to be collected to make a statistically significant decision (e.g. more than two weeks). Despite this, as an objective measure of uplift, it's nearly always worth it once you're confident the offline results are at least comparable with a baseline.
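As a rough illustration of what "statistically significant" means here, the sketch below applies a two-sided two-proportion z-test to conversion counts from the control and treatment arms. In practice you would lean on your experimentation platform's tooling, and the counts below are made up.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between control (a) and treatment (b)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Toy example: 2.0% vs. 2.3% conversion over 50,000 users per arm.
z, p = two_proportion_z_test(1000, 50_000, 1150, 50_000)
print(f"z = {z:.2f}, p = {p:.4f}")  # a small p-value suggests the uplift is unlikely to be noise
```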
That all being said, there can be several pitfalls during online evaluation that are worth mentioning:
- Looking only at one metric: If you only look at one metric, you may be optimizing for that metric at the expense of others. For example, if you're optimizing for click-through rate, you might end up recommending the same popular items to everyone, which might not be the best for your business in the long run. We recommend looking at a suite of metrics to understand the full picture.
- Looking at only the aggregate data: If you only look at the aggregate data, you might miss important sub-populations that are being affected by the algorithm in different ways. For example, if you're optimizing for purchases, you might miss that the algorithm is actually decreasing the number of purchases from your most loyal users. We recommend looking at the results of the A/B test across different user segments.
- Focusing on short-term signals: If you only look at short-term signals like clicks, you might miss the long-term impact of the algorithm, such as 30-day retention. Even if the algorithm increases clicks in the short term, it can be worthwhile to run a long-term holdout indefinitely, keeping a baseline algorithm served to a small subset of users (e.g. 5%).
Conclusion
We've discussed a typical evaluation workflow for recommendation models, notably the main stages of offline and online evaluation. By understanding the strengths and limitations of each evaluation stage and being mindful of potential biases and pitfalls, you can better assess the performance of your recommendation system and ensure it delivers value to your users.
If you want to dive deeper, take a look at some of the resources below where we explore the specifics of different evaluation metrics and methodologies:
- Evaluating Recommendation Systems -- Precision@k, Recall@k, and R-Precision
- Evaluating Recommendation Systems -- mAP, MMR, NDCG
- Evaluating recommendation systems (ROC, AUC, and Precision-Recall)
- Not your average RecSys metrics. Part 1: Serendipity
- Not your average RecSys metrics. Part 2: Novelty
- Counterfactual Evaluation for Recommendation Systems