This article provides detailed descriptions of model strength quality scores.
The quality of any machine learning model created in Tealium Predict or any other machine learning tool varies depending on several factors, including (but not limited to) the following:
In order to understand whether a given model has high quality for use in the real world, machine learning experts typically use a combination of interrelated metrics. The specific combination varies depending on the type of model being evaluated.
With Tealium Predict, non-experts can create models, evaluate their quality, and take real-world actions in a streamlined manner, while the models remain transparent enough for an expert to see the metrics they typically expect.
The following sections provide explanations of the metrics and ratings available when using Tealium Predict that help you understand the quality of your models. For a more comprehensive explanation of other scoring reports, such as the F1 Score, Confusion Matrix, the ROC/AUC Curve, and Probability Distribution, see Model Scores and Ratings.
In Tealium Predict, accuracy equates to the ratio of correct predictions to the total number of predictions. However, accuracy alone cannot be used to determine the quality of a machine learning model.
As an example, assume that you have red and green apples and that 99% of the apples are green. In this case, a model could reach a 99% accuracy score simply by predicting that every apple is green, since that prediction is correct 99% of the time. Despite its 99% accuracy, this model would never identify a red apple, so it is useless as a predictor of which apples are red.
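The apple example can be checked with a few lines of Python. This is a minimal sketch, not Tealium Predict code: the function names (`confusion_counts`, `summarize`) and the 100-apple dataset are hypothetical, chosen only to reproduce the 99% green scenario and to show precision, recall, and F1 alongside accuracy.

```python
def confusion_counts(actual, predicted, positive):
    """Count true/false positives and negatives for the chosen positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return tp, fp, fn, tn


def safe_ratio(numerator, denominator):
    """Return 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def summarize(actual, predicted, positive):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    tp, fp, fn, tn = confusion_counts(actual, predicted, positive)
    precision = safe_ratio(tp, tp + fp)
    recall = safe_ratio(tp, tp + fn)
    return {
        "accuracy": safe_ratio(tp + tn, len(actual)),
        "precision": precision,
        "recall": recall,
        "f1": safe_ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical dataset: 99 green apples and 1 red apple.
actual = ["green"] * 99 + ["red"]
# A "model" that always predicts the majority class.
predicted = ["green"] * 100

scores = summarize(actual, predicted, positive="red")
print(scores)
```

Accuracy is high because the majority class dominates the count of correct predictions, while recall for the red class is zero because the one red apple is never found. This is why the ratings described in this article rely on precision and recall rather than accuracy alone.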
The model strength ratings provide a label for the quality (strength) of a version. There are two types of strength ratings: deployed and training. The deployed strength is an ongoing (dynamic) rating for each deployed model. The training strength is a static rating for each trained version.
The rating system for deployed versions uses four labels, which are based on the Precision and Recall scores and assigned using the following scale:
The quality of a model is a relative judgment, not an absolute fact. The makeup of each team differs as do goals, testing abilities, and dataset quality. For these reasons, model strength ratings are not regarded as absolute. The intention is to use the ratings as a general guideline for quality.
The strength rating for each trained version of each model provides a rating of the quality (strength) of the training and the model that resulted from the training. Models not yet deployed are not yet assigned a strength score.
The rating system for training versions uses four labels, which are based on the Precision and Recall scores and assigned using the following scale:
Each retraining represents a new and separate event, which makes this type of strength rating static and specific to each version. The rating for a version does not change over time.
The strength rating displays next to each version on the Model Explorer page, in the Training Details panel, and in the overview page (for the latest trained version).
The deployed live performance score allows you to understand the ongoing quality of your model and know when it has degraded, which inevitably happens over time.
In Tealium Predict, the F1 score is automatically recalculated daily for deployed models using the most recent time window available. The length of this window is the Prediction Timeframe, which is the number of days specified in the model as "in the next x days". Before the daily recalculation can begin, the model must have been deployed for one full Prediction Timeframe, so that actual true/false results are known and can be used in the calculation. After that, the calculation window moves forward one day, every day.
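The moving window described above can be sketched as follows. This illustrates the idea only and is not Tealium's implementation: the record layout `(resolved_on, predicted, actual)`, the function names, and the sample data are all assumptions. In particular, the sketch assumes each prediction is counted on the day its actual outcome becomes known.

```python
from datetime import date, timedelta


def f1_from_pairs(pairs):
    """F1 for (predicted, actual) boolean pairs; 0.0 when undefined."""
    tp = sum(1 for predicted, actual in pairs if predicted and actual)
    fp = sum(1 for predicted, actual in pairs if predicted and not actual)
    fn = sum(1 for predicted, actual in pairs if not predicted and actual)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def window_f1(outcomes, as_of, timeframe_days):
    """F1 over outcomes resolved in the timeframe_days ending on as_of."""
    start = as_of - timedelta(days=timeframe_days)
    in_window = [(p, a) for resolved_on, p, a in outcomes
                 if start < resolved_on <= as_of]
    return f1_from_pairs(in_window)


def daily_f1_series(outcomes, deployed_on, today, timeframe_days):
    """One F1 value per day, starting one full timeframe after deployment."""
    series = {}
    day = deployed_on + timedelta(days=timeframe_days)
    while day <= today:
        series[day] = window_f1(outcomes, day, timeframe_days)
        day += timedelta(days=1)  # the window moves forward one day
    return series


# Hypothetical resolved predictions: (resolved_on, predicted, actual).
outcomes = [
    (date(2024, 1, 8), True, True),
    (date(2024, 1, 8), True, False),
    (date(2024, 1, 9), True, True),
    (date(2024, 1, 9), False, True),
    (date(2024, 1, 9), False, True),
]

series = daily_f1_series(
    outcomes,
    deployed_on=date(2024, 1, 1),
    today=date(2024, 1, 9),
    timeframe_days=7,  # "in the next 7 days"
)
for day, score in series.items():
    print(day, round(score, 3))
```

No score exists for the first seven days after deployment in this sketch, which mirrors the requirement that a full Prediction Timeframe must pass before actual results are available.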
All machine learning models naturally degrade in quality over time, just as the real world continually changes. Models eventually stagnate and degrade in their ability to make accurate predictions about the changing environment. Therefore, models must be retrained periodically to ensure that they are based on the most current dataset available.