This article provides detailed information about the model scoring techniques and formulas used to assign scores and ratings to trained and deployed models in the Tealium Predict ML product.

Understanding the Quality of Your Model

The quality of any machine learning model created in Tealium Predict or any other machine learning tool varies depending on several factors, including (but not limited to) the following:

  • The attributes that comprise the dataset and how complete a picture the combined attributes provide.
  • The daily volume of visitors and visits in the dataset. A larger volume means there is more data available for the model to learn from.
  • The training date range selected for the model. Training a model over a longer time period generally means more data is available for the model to learn from.

To understand whether a given model has high quality for use in the real world, machine learning experts typically use a combination of interrelated metrics. The specific combination varies depending on the type of model being evaluated.

Using Tealium Predict, non-experts can create models, evaluate their quality, and take real-world actions in a streamlined manner, while the models retain enough transparency for an expert to see the standard metrics they expect.

Strength Scores

The strength score for each trained version of a model provides a rating of the quality (strength) of the training and of the model that resulted from that training. Models that are not yet deployed are not assigned a strength score. Strength scores are listed for trained models in the Training Details screen and for deployed models in the Live Performance screen.

Strength Scores for Trained Models

[Image: predictV2_training_strength_score.png]

Strength Scores for Deployed Models

[Image: predictV2_live_scores.png]

Each retraining represents a new and separate event, which makes this type of strength rating static and specific to each version. The rating for a version does not change over time.

Recall

The proportion of true cases in the data that were correctly selected by the model. For example, measuring the model’s ability not to miss high-value visitors. Recall is a helpful metric to reduce false negatives and can be thought of as the capture rate for the behavior you are trying to predict.

The Recall score is displayed next to each version in the Model Dashboard and in the Training Details screen.

Precision

The proportion of predicted true cases that are actually true. For example, measuring the model’s ability to predict the highest ratio of high-value visitors. Precision is a helpful metric to reduce false positives and can be thought of as the conversion rate for the behavior you are trying to predict.
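The following Python sketch shows how Precision and Recall are typically computed from raw prediction counts. It is a minimal illustration of the standard formulas, not Tealium Predict's internal implementation; the variable names and example counts are hypothetical.

    # Hypothetical counts from comparing predictions against actual outcomes.
    true_positives = 80    # predicted high-value, actually high-value
    false_positives = 20   # predicted high-value, actually not high-value
    false_negatives = 40   # actual high-value visitors the model missed

    # Precision: of the visitors predicted to be high-value, how many really were.
    precision = true_positives / (true_positives + false_positives)   # 0.80

    # Recall: of all actual high-value visitors, how many the model captured.
    recall = true_positives / (true_positives + false_negatives)      # about 0.67

    print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")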

F1 Score

The harmonic mean of Precision and Recall, combining the two into a single measure of model quality. To calculate the F1 Score, the Precision and Recall values are input into the following formula:

F1 Score = 2 * ( (Precision * Recall) / (Precision + Recall) )
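The short Python sketch below applies this formula to hypothetical Precision and Recall values (for example, the ones from the sketch above); it is an illustration only, not the product's internal code.

    # Hypothetical Precision and Recall values.
    precision = 0.80
    recall = 0.67

    # F1 Score: harmonic mean of Precision and Recall.
    f1_score = 2 * ((precision * recall) / (precision + recall))
    print(f"F1 Score: {f1_score:.2f}")   # approximately 0.73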

Accuracy

The proportion of all predictions that were correct, calculated as the number of correct predictions divided by the total number of predictions. Although it is a common metric, accuracy is not an ideal measure of model health for imbalanced data sets.
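The sketch below illustrates why accuracy can be misleading on imbalanced data; the counts are hypothetical and chosen only to show the effect.

    # Hypothetical imbalanced data set: 980 low-value visitors, 20 high-value visitors.
    # A model that simply predicts "not high-value" for everyone is useless in practice,
    # yet it still reaches 98% accuracy because the classes are so imbalanced.
    total_visitors = 1000
    high_value_visitors = 20
    correct_predictions = total_visitors - high_value_visitors  # every "no" prediction is correct

    accuracy = correct_predictions / total_visitors
    print(f"Accuracy: {accuracy:.0%}")  # 98%, despite capturing zero high-value visitors (Recall = 0)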

Precision vs. Recall Analogy

As an example, assume you have two colors of apples, red and green, and your model seeks to predict which apples are red.

If your model has high Precision, this means that the model is usually correct when it predicts an apple is red. If the model creates a list of apples which are supposedly red, high Precision means that this list is mostly accurate and that the apples on the list are actually red.

If your model has high Recall (Sensitivity), the model is able to identify most of the red apples. A model with high Recall does a good job of creating a thorough list of the red apples.

Using the same example of red and green apples, the following list describes expected results based on high or low Precision or Recall.

  • High Precision, low Recall = a short list of red apples that is fairly accurate.
  • High Precision, high Recall = a longer list of red apples that is fairly accurate.
  • Low Precision, low Recall = a short list of red apples that is fairly inaccurate. This list contains many green apples.
  • Low Precision, high Recall = a long list of red apples that is fairly inaccurate. This list also contains green apples.

The ideal model should clearly have both high Precision and high Recall. The potential trade-off between the volume of apples predicted to be red and the accuracy of those predictions is a recurring theme that affects how machine learning models are used in the real world.
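To make the trade-off concrete, the hypothetical sketch below computes Precision and Recall for a model that labels apples as red or green; the apple data and predictions are made up for this example and are not from Tealium Predict.

    # Hypothetical apples: each pair is (actual_color, predicted_color).
    apples = [
        ("red", "red"), ("red", "red"), ("red", "green"), ("red", "green"),
        ("green", "green"), ("green", "green"), ("green", "red"), ("green", "green"),
    ]

    true_positives = sum(1 for actual, predicted in apples if actual == "red" and predicted == "red")
    false_positives = sum(1 for actual, predicted in apples if actual == "green" and predicted == "red")
    false_negatives = sum(1 for actual, predicted in apples if actual == "red" and predicted == "green")

    # Precision: of the apples predicted to be red, how many really are red.
    precision = true_positives / (true_positives + false_positives)   # 2 / 3, about 0.67

    # Recall: of all the red apples, how many the model found.
    recall = true_positives / (true_positives + false_negatives)      # 2 / 4 = 0.50

    print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")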
