Is Data Splitting Making or Breaking Your Recommender System?

How frustrating is it when a dessert you make at home simply refuses to taste as good at a friend's place? A good recipe should work even if you change kitchens. The same should be true of scientific experiments: running the same models on the same data should reproduce the expected results. However, in the fast-moving world of recommender systems research, things are less than ideal.

In this article we discuss the paper Exploring Data Splitting Strategies for the Evaluation of Recommendation Models. The authors focus on data splitting strategies, something usually overlooked or taken for granted when comparing ML models. They argue that recommender models claiming to be state-of-the-art are vulnerable to changes in the data splitting strategy. Similar arguments have been made in past research for sequential data such as time series and videos, and for machine learning algorithms in general.

The bottom line: with deep neural network models, new research claims the state of the art on the strength of only a fractional improvement over existing results. If that margin of improvement can vanish simply by changing how the data is split, there is no reliable way of knowing which recommender models truly work best.

The paper constructs its argument by attempting to prove two points:

  • Existing models use a variety of data splitting strategies.
  • The performance of state-of-the-art models, on the same dataset, is impacted when their data splitting method is changed.

Diversity in data splitting in recommendation models

Consider 17 approaches that were considered state-of-the-art at some point before the original paper came out. Some report results on more than one splitting method, and several use the same datasets. An overview of the features and usage of these methods is shown in the table below.

| Method | Salient feature(s) | Key limitation(s) | # models using the method |
| --- | --- | --- | --- |
| Leave One Last Item | Hold out the last interaction of each user as the test set. | Leaks item/user trends into the test set. | 6 |
| Leave One Last Basket/Session | Same as above, but hold out the entire cart/order of each user's last interaction. | Same leakage as above. | 3 |
| Temporal User Split | Hold out a fixed fraction of the latest interactions of each user as the test set. | Varying boundaries per user can leak trends into the test set. | 5 |
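
To make these splits concrete, here is a minimal sketch of the leave-one-last-item and temporal user splits from the table, plus the global temporal split discussed later. It assumes interactions live in a pandas DataFrame with user, item, and timestamp columns; the column names and toy data are illustrative, not taken from the paper.

```python
import pandas as pd

# Toy interaction log: one row per (user, item, timestamp); values are illustrative.
interactions = pd.DataFrame({
    "user":      [1,   1,   1,   2,   2,   3,   3,   3],
    "item":      ["a", "b", "c", "a", "d", "b", "c", "e"],
    "timestamp": [1,   5,   9,   2,   6,   3,   7,   10],
})

def leave_one_last_item(df):
    """Hold out each user's single latest interaction as the test set."""
    df = df.sort_values("timestamp")
    test = df.groupby("user").tail(1)
    return df.drop(test.index), test

def temporal_user_split(df, test_frac=0.2):
    """Hold out the latest fraction of each user's interactions (boundary varies per user)."""
    df = df.sort_values("timestamp")
    test = pd.concat(
        g.tail(max(1, int(len(g) * test_frac))) for _, g in df.groupby("user")
    )
    return df.drop(test.index), test

def global_temporal_split(df, quantile=0.8):
    """Hold out everything after one global time boundary shared by all users."""
    boundary = df["timestamp"].quantile(quantile)
    return df[df["timestamp"] <= boundary], df[df["timestamp"] > boundary]
```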

A common problem with the leave-one-last and temporal user splits is that, by design, they leak information from the training set into the test set. For example, the model may see items from a user's test set becoming popular among other users in the training set around the same points in time.

This is a huge red flag in any machine learning research because the test set no longer represents the unknown, "unseen" setting it is supposed to. Ideally, we want recommender models to capture global trends and use them to give recommendations that are also personalised to each user.
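
One rough way to see this leakage, continuing the toy sketch above (again illustrative, not the paper's methodology), is to count how many training interactions occur after the earliest test interaction: a per-user split typically leaves some, while a global temporal split leaves none by construction.

```python
# Reusing the DataFrame and split functions from the sketch above.
train_l, test_l = leave_one_last_item(interactions)
train_g, test_g = global_temporal_split(interactions)

def n_future_train_rows(train, test):
    """Count training interactions that occur after the earliest test interaction."""
    return int((train["timestamp"] > test["timestamp"].min()).sum())

print(n_future_train_rows(train_l, test_l))  # 1 for the toy data: training rows 'from the future'
print(n_future_train_rows(train_g, test_g))  # 0: a single global boundary rules this out
```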

This diversity and disparity across data splitting methods, each with its own issues, adds to the problem of reporting reliable results. Next, we look at how model performance is affected when only the splitting method changes and everything else stays the same.

Evaluating recommendation models with different data splits

Of the splitting methods described in the paper, leave one last item, leave one last basket, and global temporal split are good choices for this analysis. The user-based temporal split is, in effect, identical to leave one last item, since the chosen fractions end up placing the latest interactions in the test set. The user split, meanwhile, cannot be applied to most existing models because it requires a very different evaluation pipeline.

Datasets

Given the nature of these data splitting methods, grocery transaction datasets are a natural fit for this analysis. The two chosen datasets record, for each user, their items/interactions, baskets, and timestamps, which are exactly the features the splitting methods rely on. The two datasets, Tafeng and Dunnhumby, contain around 9,000 and 2,500 users, and approximately 7,800 and 23,000 items respectively.

You can refer to the Evaluation Methodology of the paper for more details on the datasets, data processing, metrics, and the 7 recommendation models chosen for evaluation.

Analysis

We give a brief overview of their analysis of the results, but it is worth looking more closely at the comparative rankings of the models under the different split strategies in Table 3 of the paper.

There are four scenarios tested: for each of the two datasets, the authors report results on two metrics, Normalized Discounted Cumulative Gain (NDCG) and Recall. Their argument holds overall, since in all four scenarios the model rankings change under at least one of the three data splitting strategies. Changes in the rankings show that, for the same dataset and metric, changing the splitting method affected the measured model performance.
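
For readers who want a refresher on what these two metrics compute, here is a minimal sketch of Recall@k and NDCG@k for a single user's ranked recommendation list with binary relevance. The helper functions are illustrative, not the paper's evaluation code.

```python
import math

def recall_at_k(ranked_items, relevant_items, k):
    """Fraction of the user's held-out relevant items that appear in the top-k."""
    hits = len(set(ranked_items[:k]) & set(relevant_items))
    return hits / len(relevant_items) if relevant_items else 0.0

def ndcg_at_k(ranked_items, relevant_items, k):
    """Discounted gain of hits in the top-k, normalised by the ideal ordering."""
    relevant = set(relevant_items)
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked_items[:k]) if item in relevant)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / idcg if idcg > 0 else 0.0

# Example: the model ranks five items; two of them are in the user's test set.
print(recall_at_k(["a", "b", "c", "d", "e"], ["c", "e"], k=3))  # 0.5
print(ndcg_at_k(["a", "b", "c", "d", "e"], ["c", "e"], k=3))    # ~0.31
```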

Interestingly, the authors recognise that 7 models is too small a sample to truly confirm their hypothesis. So they vary hyperparameters to create multiple versions of each model, obtaining 230 model variants. Each variant is evaluated under a pair of splitting strategies for each dataset on the NDCG metric.

The agreement between the resulting rankings is measured with the Kendall rank correlation coefficient (Kendall's τ): the closer the value is to 1.0, the more similar the rankings. The resulting coefficients fall between 0.52 and 0.76, which means the rankings differ considerably. Moreover, some of these scores indicate that the splitting strategy changes which aspect of the recommendations is actually being evaluated.
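
As a small illustration of this kind of comparison, here is a sketch using SciPy's kendalltau with made-up NDCG scores for the same five model variants under two different splits; the numbers are hypothetical and not taken from the paper.

```python
from scipy.stats import kendalltau

# Hypothetical NDCG scores for the same five model variants under two splits.
ndcg_leave_one_last = [0.31, 0.27, 0.42, 0.35, 0.22]
ndcg_global_temporal = [0.25, 0.29, 0.33, 0.40, 0.21]

tau, p_value = kendalltau(ndcg_leave_one_last, ndcg_global_temporal)
print(f"Kendall's tau = {tau:.2f}")  # ~0.60 for these made-up scores; 1.0 means identical rankings
```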

This simple but revealing analysis yields some key points about data splitting strategies. They:

  • Strongly affect model performance and ranking.
  • Change what is evaluated in the recommendations.
  • Reveal that current research models are not directly comparable.

To begin countering this prevalent issue, the authors recommend that research works in recommendation:

  • Report their splitting strategies and further statistics.
  • Evaluate their models using the global temporal split, the most realistic and widely accepted setting.
  • Publicly release the data splits for reuse and independent testing (a minimal example follows below).
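
As a small illustration of the last two recommendations, and reusing the global temporal split sketch from earlier (the file names and quantile boundary are arbitrary choices, not prescribed by the paper), one could materialise and publish the exact split used:

```python
# Reusing the earlier sketch: materialise a single global temporal split for release.
train, test = global_temporal_split(interactions, quantile=0.8)

# Write the exact train/test sets to disk so others can reuse and verify them.
train.to_csv("train_global_temporal.csv", index=False)
test.to_csv("test_global_temporal.csv", index=False)
print(f"train: {len(train)} interactions, test: {len(test)} interactions")
```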

For a more detailed look at the results and the charts plotting these metrics, you can go through the original paper, "Exploring Data Splitting Strategies for the Evaluation of Recommendation Models", cited as: Zaiqiao Meng, Richard McCreadie, Craig Macdonald, and Iadh Ounis. 2020. Exploring Data Splitting Strategies for the Evaluation of Recommendation Models. In Proceedings of the 14th ACM Conference on Recommender Systems (RecSys '20). Association for Computing Machinery, New York, NY, USA, 681–686. https://doi.org/10.1145/3383313.3418479.

Conclusion

In this article, we go over a straightforward but revealing paper showing that the strategy used to split a dataset into train/validation/test sets strongly affects the measured performance of recommendation models. We present the different splitting strategies commonly used in recommendation systems research along with their key limitations, and finally we briefly discuss the reported experiments, which show the extent to which these choices affect research into recommendation models.
