Bubble tea🧋 and RecSys metrics
The first thing you need to know is that all these metrics exist to provide a quantifiable measure of performance. Let’s imagine that you are a machine learning developer building a bubble tea delivery app. These days there are many different vendors offering a wide variety of options, but you want to make sure your users buy their tea through your app. To make that happen, you want to suggest the best bubble tea options based on each user’s preferred flavors, tea type, sugar level, and time of delivery (gotta keep it fresh). All of this can be framed as a typical recommendation problem.
For simplicity, let’s assume that you are just starting out, so you offer only two types of bubble tea: one with tapioca balls and one with coconut jelly. After tracking and recording user orders, you build a dataset of customers who prefer one of these two types of tea. Next, you want to classify them, so that your general recommendation model can infer from a user’s data that they would prefer, say, bubble tea chains and products that contain tapioca. This type of problem is known as binary classification; in simple RecSys terms, the two things you are trying to predict are a user’s likes and dislikes.
In the real world, this task gets more complex, but these evaluation metrics work the same way across a multitude of options and scenarios. So let’s take our example and explain all the metrics with it.
ROC - Receiver Operating Characteristic
We have trained our model and now we need to test it out, so where do we start? We begin at ROC. ROC stands for Receiver Operating Characteristic. It's a graphical representation of the performance of a binary classification model. It plots the true positive rate (TPR) against the false positive rate (FPR) at different threshold settings.
To break it down let’s explain those terms:
TPR is the percentage of correctly predicted positive examples out of all the actual positive examples. In our case, this means the model correctly predicts that a user likes the type of bubble tea we recommended.
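In terms of prediction counts, with TP being the number of true positives and FN the number of false negatives:

TPR = TP / (TP + FN)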
Now what about FPR?
FPR is the percentage of incorrectly predicted positive examples out of all the actual negative examples. So in our case, the model predicts that a customer likes a type of bubble tea that they actually don’t.
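Again in terms of prediction counts, with FP being the number of false positives and TN the number of true negatives:

FPR = FP / (FP + TN)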
As we can see, FPR and TPR are closely related: they are simply different ratios built from the same four prediction outcomes (TP, FP, TN, FN).
A perfect model would have a ROC curve that hugs the top-left corner of the plot, meaning that it would have a high TPR and a low FPR at all threshold settings. A model that makes random predictions would have a ROC curve that is a diagonal line from the bottom-left to the top-right corner of the plot, meaning that its TPR and FPR would be equal at all threshold settings. A threshold in this case is the value above which a prediction is counted as positive or negative, depending on your setup. For example, a threshold of 0.5 means that every sample scoring at or above 50% for the target class is assigned to that class.
Threshold settings are used to adjust the balance between true positives and false positives by changing the criterion that determines when an example is classified as positive. These adjustments are exactly what the ROC curve captures across its whole range.
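As a quick illustration, here is a minimal sketch of how such a curve could be plotted with scikit-learn and matplotlib, assuming we already have the true labels y_true (did the user actually like the tea?) and the model’s predicted scores y_score; the toy values below are made up for this example:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# 1 = the user liked the recommended bubble tea, 0 = they did not
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
# The model's predicted probability that each user likes it
y_score = [0.9, 0.4, 0.65, 0.8, 0.55, 0.2, 0.7, 0.35]

# TPR and FPR computed at every useful threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)

plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="random")  # diagonal baseline
plt.xlabel("False positive rate (FPR)")
plt.ylabel("True positive rate (TPR)")
plt.legend()
plt.show()
```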
Here we can see the ROC curve and a dashed line labeled random. This represents the ROC curve of a classifier that makes random predictions.
AUC - Area Under the Curve
ROC seems sufficient, so why use AUC? AUC, or Area Under the Curve, summarizes the ROC curve across all thresholds; it is, quite literally, the area under the ROC curve.
AUC is calculated by measuring the area under the ROC curve, which plots the true positive rate (TPR) against the false positive rate (FPR) at different threshold settings, thereby combining all of those operating points into a single score.
A perfect model would have an AUC of 1, meaning that all the positive examples would be ranked higher than all the negative examples, while a model that makes random predictions would have an AUC of 0.5. This detail is important because it makes AUC a more informative metric than a plain accuracy score, which is just the percentage of correct predictions out of all the examples in the dataset.
AUC can be interpreted as the probability that a randomly chosen positive example will be ranked higher than a randomly chosen negative example. When plotted, AUC is the shaded blue area under the ROC curve. If we were to shade everything below the random line, we would get an AUC of 0.5, a result signifying random predictions coming from our classifier model.
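For a standard classification setup, computing this score is a single function call. Below is a minimal sketch with scikit-learn, reusing the made-up y_true and y_score from the ROC sketch and contrasting AUC with plain accuracy:

```python
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.4, 0.65, 0.8, 0.55, 0.2, 0.7, 0.35]

# AUC works directly on the raw scores, no threshold required
auc = roc_auc_score(y_true, y_score)

# Accuracy needs a hard decision at some threshold (0.5 here)
y_pred = [1 if s >= 0.5 else 0 for s in y_score]
acc = accuracy_score(y_true, y_pred)

print(f"AUC: {auc:.2f}, accuracy: {acc:.2f}")
```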
This understanding of AUC is the traditional one and is very suitable for classification. However, it is not ideal for most recommendation systems and requires some adaptation. If you were to look up implementations of AUC online, you would likely find that they expect two inputs: ids and scores. In essence these are the true ids of your items and the scores indicating how likely each item is to be picked or predicted (note how the idea of prediction scores fits the classification objective). For ranking, instead, we often get the predicted ids of items in the order the model believes is most relevant (from most relevant to least) and the true ids, which represent the true ranking, i.e. the user’s preferences. So to redefine relevance we need to incorporate a measure of relevant items appearing in the correct order. We can do it like so:
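$$ \text{AUC} = \frac{\sum_{t_0 \in M_0} \sum_{t_1 \in M_1} \mathbf{1}\left[ s(t_0) < s(t_1) \right]}{|M_0|\,|M_1|} $$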
In the numerator we have a sum over an indicator function that equals 1 if the score of an irrelevant item t0 is less than the score of a relevant item t1, and 0 otherwise. Note that the score s(t) is calculated from the inverse rank of the item in the predicted list: a higher rank (closer to the top) results in a higher score.
M0 is the set of irrelevant items that are present in the predicted list but not in the actual relevant set; M1 is the set of relevant items that are present both in the predicted list and in the actual relevant set. Hence, in our version, the AUC value is computed by counting how often a randomly selected relevant item is ranked higher than an irrelevant one, normalized by the total number of possible pairs of relevant and irrelevant items. The formula defaults to 0.5 when no relevant or no irrelevant items are found in the predicted list, indicating no discrimination between relevant and irrelevant items.
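Put into code, a minimal sketch of this ranking-flavored AUC could look as follows (the function name ranking_auc and the exact inverse-rank scoring 1 / rank are assumptions made for this illustration):

```python
def ranking_auc(predicted_ids, relevant_ids):
    """Probability that a relevant item is ranked above an irrelevant one
    within the predicted list; 0.5 means no discrimination at all."""
    relevant = set(relevant_ids)
    # s(t): score each predicted item by its inverse rank,
    # so the top of the list gets the highest score
    scores = {item: 1.0 / (rank + 1) for rank, item in enumerate(predicted_ids)}
    m1 = [scores[i] for i in predicted_ids if i in relevant]      # relevant items in the prediction
    m0 = [scores[i] for i in predicted_ids if i not in relevant]  # irrelevant items in the prediction
    if not m0 or not m1:
        return 0.5  # no relevant/irrelevant pairs to compare
    hits = sum(1 for s0 in m0 for s1 in m1 if s0 < s1)
    return hits / (len(m0) * len(m1))

# The model ranks relevant item 3 first but relevant item 5 only third
print(ranking_auc(predicted_ids=[3, 7, 5, 9], relevant_ids=[3, 5]))  # 0.75
```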
Recalling it all with Precision-Recall
By now we can clearly see that all of these metrics are related, precisely because they all rely on the same common variables: TP, TN, FP, and FN. So what’s so different about precision-recall?
In essence, the task PR fulfills is very similar, but with a slight difference. While ROC and AUC measure the classifier’s ability to separate positive and negative examples and are a good choice when the dataset is balanced (an equal number of both classes), PR measures a model’s ability to identify positive samples while minimizing false positives at the same time. It is considered a better choice for imbalanced datasets and a good option when you are mostly interested in the positive examples.
To better understand this concept, let’s recall the equations for TPR and FPR above and see how they differ from the equations for precision and recall. For precision we have:
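Precision = TP / (TP + FP)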
And for recall:
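Recall = TP / (TP + FN)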
Notice anything in common? That’s right! TPR and recall are the same. There is one more important detail: notice how true negatives are missing from the equations above? As mentioned, PR focuses on the positive examples, since under these metrics they are of the greatest interest to us.
A perfect model would have a precision of 1 and a recall of 1, meaning that every example predicted as positive is actually positive and every actual positive example is predicted as positive. In practice, there is a trade-off between precision and recall, and a model with high precision might have a low recall and vice versa. PR helps us understand how well the model identifies relevant examples while minimizing the number of irrelevant ones.
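To close the loop, here is one more minimal sketch showing how precision and recall could be computed for our bubble tea classifier with scikit-learn, again using made-up labels and thresholded predictions like in the earlier examples:

```python
from sklearn.metrics import precision_score, recall_score

# 1 = the user liked the recommended bubble tea, 0 = they did not
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
# Hard predictions obtained by thresholding the model's scores at 0.5
y_pred = [1, 0, 1, 1, 1, 0, 1, 0]

# Precision: of all the teas we predicted the user would like, how many they actually liked
precision = precision_score(y_true, y_pred)
# Recall: of all the teas the user actually liked, how many we managed to catch
recall = recall_score(y_true, y_pred)

print(f"Precision: {precision:.2f}, recall: {recall:.2f}")
```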
This sums up our journey into popular ML metrics!