This article provides detailed information about the model scoring techniques and formulas used to assign scores and ratings to trained and deployed models in the Tealium Predict ML product.
The quality of any machine learning model created in Tealium Predict or any other machine learning tool varies depending on several factors, including (but not limited to) the following:
To understand whether a given model has high quality for use in the real world, machine learning experts typically use a combination of interrelated metrics. The specific combination varies depending on the type of model being evaluated.
Using Tealium Predict, non-experts can create models, evaluate model quality, and take real-world actions in a streamlined manner, while the models retain enough transparency for an expert to see the standard metrics they expect.
The strength score for each trained version of each model provides a rating of the quality (strength) of the training and of the model that resulted from it. Models that have not yet been deployed are not yet assigned a strength score. Strength scores are listed for trained models in the Training Details screen and for deployed models in the Live Performance screen.
Strength Scores for Trained Models
Strength Scores for Deployed Models
Each retraining represents a new and separate event, which makes this type of strength rating static and specific to each version. The rating for a version does not change over time.
The proportion of true cases in the data that were correctly selected by the model. For example, measuring the model’s ability not to miss high-value visitors. Recall is a helpful metric to reduce false negatives and can be thought of as the capture rate for the behavior you are trying to predict.
The Recall score is displayed next to each version in the Model Dashboard and in the Training Details screen.
The proportion of predicted true cases that are actually true. For example, measuring the model’s ability to predict the highest ratio of high-value visitors. Precision is a helpful metric to reduce false positives and can be thought of as the conversion rate for the behavior you are trying to predict.
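The two definitions above can be sketched directly from raw prediction counts. This is a minimal illustration, not part of the Tealium Predict product; the visitor counts used in the example are hypothetical.

```python
# Precision and Recall computed from raw prediction counts.
def precision(true_positives, false_positives):
    """Proportion of predicted true cases that are actually true."""
    return true_positives / (true_positives + false_positives)

def recall(true_positives, false_negatives):
    """Proportion of actual true cases the model correctly selected."""
    return true_positives / (true_positives + false_negatives)

# Hypothetical example: of 100 visitors predicted to be high-value,
# 80 actually were (20 false positives), and the model missed another
# 40 high-value visitors (false negatives).
p = precision(80, 20)  # 0.80 -- the "conversion rate" of predictions
r = recall(80, 40)     # ~0.67 -- the "capture rate" of true cases
```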
The harmonic mean of Precision and Recall, which balances the two into a single measure of model accuracy. To calculate the F1 Score, the Precision and Recall values are input into the following formula:
F1 Score = 2 * ( (Precision * Recall) / (Precision + Recall) )
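The formula above translates directly into code. This is a minimal sketch; the 0.80 and 0.60 inputs are illustrative values, not product defaults.

```python
# F1 Score: harmonic mean of Precision and Recall.
def f1_score(precision, recall):
    return 2 * ((precision * recall) / (precision + recall))

# A model with 0.80 Precision but only 0.60 Recall scores ~0.686,
# noticeably below the simple average of 0.70 -- the harmonic mean
# penalizes imbalance between the two metrics.
score = f1_score(0.80, 0.60)
```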
The measure of how many predicted cases were correct over the total number of predicted cases. Though a common metric, accuracy is not an ideal model health measurement for imbalanced data sets.
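A short illustration of why accuracy misleads on imbalanced data sets (the visitor counts here are assumed for the example): a model that never predicts the rare class can still report high accuracy.

```python
# Assume 1,000 visitors, of whom only 20 (2%) are high-value.
total = 1000
positives = 20

# A trivial model predicts "not high-value" for every visitor.
correct = total - positives       # all 980 negatives are "correct"
accuracy = correct / total        # 0.98 -- looks excellent
capture_rate = 0 / positives      # 0.0  -- Recall: every high-value
                                  #         visitor was missed
```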
As an example, assume you have two colors of apples, red and green, and your model seeks to predict which apples are red.
If your model has high Precision, the model is usually correct when it predicts that an apple is red. If the model creates a list of supposedly red apples, high Precision means that the list is mostly accurate and that the apples on it are actually red.
If your model has high Recall (Sensitivity), the model identifies most of the red apples. A model with high Recall does a good job of creating a thorough list of the red apples.
Using the same example of red and green apples, the following list describes expected results based on high or low Precision or Recall.
The ideal model clearly has both high Precision and high Recall. The concept of potential trade-offs, between the volume of apples predicted to be red and the accuracy of those predictions, is a recurring theme that affects how machine learning models are used in the real world.
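The apples example can be worked through numerically. The small batch of actual colors and predictions below is invented for illustration; any real model would produce its own predictions.

```python
# Hypothetical batch: true colors vs. the model's predictions.
actual    = ["red", "red", "red", "red", "green", "green", "green", "green"]
predicted = ["red", "red", "red", "green", "red", "green", "green", "green"]

# Count true positives, false positives, and false negatives for "red".
tp = sum(a == "red" and p == "red" for a, p in zip(actual, predicted))    # 3
fp = sum(a == "green" and p == "red" for a, p in zip(actual, predicted))  # 1
fn = sum(a == "red" and p == "green" for a, p in zip(actual, predicted))  # 1

precision = tp / (tp + fp)  # 0.75: 3 of the 4 apples predicted red are red
recall = tp / (tp + fn)     # 0.75: 3 of the 4 truly red apples were found
```

Shrinking the predicted-red list tends to raise Precision while lowering Recall, and vice versa, which is the trade-off described above.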