This document provides detailed descriptions of model strength quality scores.
The quality of any machine learning model created using Tealium Predict or any other Machine Learning tool will vary depending on several factors, including (but not limited to) the following:
To determine whether a given model is of high enough quality for real-world use, machine learning experts typically use a combination of interrelated metrics. The specific combination varies depending on the type of model being evaluated.
Tealium Predict enables non-experts to create models, evaluate their quality, and take real-world actions in a streamlined manner, while retaining enough transparency for an expert to see the typical metrics they expect.
The following sections provide explanations of the metrics and ratings available with Tealium Predict to help you understand the quality of your models. For a more comprehensive explanation of other scoring reports, such as the F1 Score, the Confusion Matrix, the ROC/AUC Curve, and the Probability Distribution, see Evaluating Success.
In the context of Tealium Predict, accuracy is simply the ratio of correct predictions to the total number of predictions. Accuracy alone can give you a false sense of security and is therefore not sufficient on its own to judge the quality of a machine learning model.
As an example, assume that you have red and green apples and 99% of the apples are green. In this case, a model could easily obtain a 99% accuracy score by simply predicting that every single apple is green, since that prediction would be correct 99% of the time. Despite its 99% accuracy, this model would be useless for identifying red apples.
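The apples example above can be illustrated with a short Python sketch (the data and the always-green model are hypothetical, not Tealium Predict code):

```python
# Hypothetical dataset: 99 green apples and 1 red apple.
labels = ["green"] * 99 + ["red"]

# A naive model that always predicts "green", ignoring the input entirely.
predictions = ["green"] * 100

# Accuracy: correct predictions divided by total predictions.
correct = sum(1 for predicted, actual in zip(predictions, labels) if predicted == actual)
accuracy = correct / len(labels)
print(accuracy)  # 0.99, yet the model never identifies a single red apple
```

The model scores 99% accuracy while having no predictive value for the minority class, which is exactly why accuracy alone is an unreliable quality signal.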
The model strength ratings used in Tealium Predict provide an easy-to-understand rating of the quality (strength) of each version of each trained model. The rating system consists of four categorical labels. These labels are based on the F1 Score, which is the typical metric used to evaluate the quality of a propensity model.
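The F1 Score is the harmonic mean of precision and recall, which is why it penalizes models like the always-green predictor above. A minimal sketch of the standard calculation (the counts below are illustrative values, not Tealium Predict output):

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Standard F1: the harmonic mean of precision and recall."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts: 8 positives correctly flagged, 2 flagged in error,
# and 2 real positives missed.
print(round(f1_score(8, 2, 2), 2))  # 0.8
```

Because both precision and recall must be reasonably high for F1 to be high, a model that ignores the minority class scores poorly even when its raw accuracy looks impressive.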
F1 Score values are categorized using the following scale:
The quality of any model is a relative judgment, not an absolute fact. Different teams have different needs and goals for their models, as well as different levels of sophistication in their modeling and testing abilities and varying quality in their input datasets. For these reasons, model strength ratings are not regarded as absolute. The intention is to use the ratings as a general guideline for quality.
There are two types of strength ratings in Predict: static ratings for each trained version, and ongoing dynamic ratings for each deployed model.
The strength rating for each trained version of a model is intended to provide an easy-to-understand rating of the quality (strength) of the training and of the model that resulted from it. Models that have not yet been deployed are not assigned a strength rating.
Training is a one-time event. Each retraining is a new, separate event, which makes this type of strength rating static and specific to each version. The rating for a version does not change over time.
The strength rating is visible next to each version on the Model Explorer page, in the Training Details panel, and in the Tiles view on the Overview page (for the latest trained version).
All machine learning models naturally degrade in quality over time because the real world continually changes. A model eventually stagnates and loses its ability to make accurate predictions about the changing environment. Models must therefore be retrained periodically to ensure that they are based on the freshest information available.
For deployed models, Tealium Predict recalculates the ongoing F1 Score daily using the most recent time window available. This window is defined by the Prediction Timeframe, which is the number of days specified in the model as "in the next x days". Before this nightly recalculation can begin, the model must have been deployed for one full Prediction Timeframe (for example, 30 days) so that actual true/false results are known and can be used in the calculation. The calculation window then moves forward one day, every day.
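The moving-window idea can be sketched as follows. This is an illustrative implementation only, assuming hypothetical per-day records of (predicted, actual) outcome pairs; Tealium Predict's internal calculation is not public:

```python
from datetime import date, timedelta

def rolling_f1(outcomes_by_day, as_of, window_days=30):
    """Recompute F1 over the most recent `window_days` of known outcomes.

    `outcomes_by_day` maps a date to a list of (predicted, actual)
    boolean pairs whose true/false results are now known.
    """
    window_start = as_of - timedelta(days=window_days)
    tp = fp = fn = 0
    for day, pairs in outcomes_by_day.items():
        if window_start <= day < as_of:  # only the most recent window counts
            for predicted, actual in pairs:
                if predicted and actual:
                    tp += 1
                elif predicted and not actual:
                    fp += 1
                elif not predicted and actual:
                    fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical outcome data for two days within a 30-day window.
outcomes = {
    date(2024, 1, 1): [(True, True), (True, False)],
    date(2024, 1, 15): [(False, True), (True, True)],
}
print(rolling_f1(outcomes, as_of=date(2024, 1, 31)))  # 2/3
```

Each night the `as_of` date advances by one day, so the window slides forward and the score always reflects the freshest outcomes available.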
The deployed live performance score allows you to understand the ongoing quality of your model and know when it has degraded, which will inevitably happen over time.