This document provides detailed information about the model scoring techniques and formulas used to assign scores and ratings to deployed models in the Tealium Predict product.

Target Attribute Health Ratings

A target attribute is an AudienceStream attribute that defines or signals the visitor behavior that you seek to predict with any Predict model. The target attribute must be either a Boolean/Flag or a Badge type attribute and be Visit or Visitor-scoped.

Healthy vs. Unhealthy Attributes

When you create a model in Predict, potential target attributes from your AudienceStream dataset display. To the right of each attribute, a health rating of “Healthy” or “Unhealthy” displays. This health rating simplifies model creation by clarifying which of your flags and badges are ready or “healthy” and can be used to create models. A rating of “Unhealthy” does not necessarily mean that the attribute is problematic in other contexts. It simply means that this attribute is currently deemed insufficient for successful training of a Predict model.

Machine learning technology generally requires a relatively high volume of data to succeed, and machine learning models provide better results when trained on larger amounts of data.

Predict uses the following two factors to define Healthy vs. Unhealthy target attributes:

  • The volume of data for the attribute
  • The distribution of that volume between true and false values

Both the true and false groups must be above a minimum threshold. For example, for the dates that span the Training Date Range, the median daily counts of True and False visitors must be greater than or equal to 200. This threshold is intentionally set as low as possible to provide the most options possible for target attributes. A model with an Unhealthy target attribute will likely fail in training due to insufficient data for the model to learn from.
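
As an illustration of the two factors above, the following sketch rates a target attribute from its daily counts of True and False visitors across the Training Date Range. The 200-visitor median threshold comes from this document; the function and its inputs are hypothetical and do not represent Tealium's internal implementation.

```python
from statistics import median

def health_rating(daily_true_counts, daily_false_counts, threshold=200):
    """Illustrative only: both classes must have a median daily count at or
    above the threshold across the Training Date Range to rate "Healthy"."""
    if median(daily_true_counts) >= threshold and median(daily_false_counts) >= threshold:
        return "Healthy"
    return "Unhealthy"

# Hypothetical week of daily visitor counts for one target attribute.
true_counts = [250, 310, 290, 275, 260, 300, 280]   # median 280
false_counts = [120, 150, 140, 130, 145, 160, 155]  # median 145, below 200
print(health_rating(true_counts, false_counts))      # "Unhealthy"
```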

If the target attribute you want to use is unhealthy, try one of the following common solutions:

  • Wait for more data to accumulate. The problem often solves itself over a longer time period.
  • Identify ways to drive additional traffic to the AudienceStream data source.
  • Add additional data sources to your AudienceStream profile so that the volume of daily data increases for this target.

Strength Scores (F1 Scores)

An F1 Score is the typical metric used by experts to evaluate the quality of the type of Machine Learning models used by Predict. The F1 Score strikes a balance between two metrics: Precision and Recall.

To calculate the F1 Score, Precision and Recall values are input into the following formula:

F1 Score = 2 * ( (Precision * Recall) / (Precision + Recall) )
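
Expressed as code, this is simply the harmonic mean of the two metrics (a generic restatement of the formula, not Tealium-specific code):

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 Score: the harmonic mean of Precision and Recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)

# Example: Precision 0.80 and Recall 0.60 give an F1 Score of about 0.686.
print(round(f1_score(0.80, 0.60), 3))
```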

High Precision Model

As an example, assume you have two colors of apples, red and green, and your model seeks to predict which apples are red. If your model has high Precision, this means that the model is usually correct when it predicts an apple is red. If the model creates a list of apples that are supposedly red, high Precision means that this list is mostly accurate and that the apples on the list are actually red.

High Recall Model

Using the same example, if your model has high Recall (Sensitivity), the model will be able to identify most of the red apples. A model with high recall does a good job of creating a thorough list of the red apples.

Precision vs. Recall Analogy

Using the same example of red and green apples, the following list describes expected results based on high or low Precision or Recall.

  • High Precision, low Recall = a short list of red apples that is fairly accurate.
  • High Precision, high Recall = a longer list of red apples that is fairly accurate.
  • Low Precision, low Recall = a short list of red apples that is fairly inaccurate. This list contains many green apples.
  • Low Precision, high Recall = a long list of red apples that is fairly inaccurate. This list contains many green apples.

The ideal model clearly should have both high precision and high recall. The concept of potential trade-offs (between the volume of apples predicted to be red, and the accuracy of those predictions) is a recurring theme that impacts how machine learning models are used in the real world.

The Confusion Matrix

A Confusion Matrix is a key tool used to evaluate a trained model. During the training and testing process that runs automatically when you create or retrain a model, Predict attempts to “classify” the visitors during the Training Date Range into two groups: true and false. These two groups reflect whether a visitor actually performed the behavior signaled by the Target Attribute of your model, such as making a purchase or signing up for your email list.

The Confusion Matrix allows you to easily view the accuracy of these predictions by comparing the predicted true or false value with the actual true or false value. There are four possible scenarios, as described in the quadrants section below.

This comparison is made possible by the fact that the model is training on historical data (the Training Date Range). Once your model is deployed, the scenario changes. If your deployed model makes a prediction for a particular visitor today and the prediction timeframe is “in the next 10 days”, you will not know for up to 10 days whether the actual value ends up being true or false.

Quadrants of the Confusion Matrix

The four quadrants of the Confusion Matrix are:

  • True Positive
    Correctly predicted true (predicted true and was actually true)
  • True Negative
    Correctly predicted false (predicted false and was actually false)
  • False Positive
    Incorrectly predicted true (predicted true but was actually false)
  • False Negative
    Incorrectly predicted false (predicted false but was actually true)

You can use the values of the quadrants to calculate Accuracy as well as the two constituent parts of the F1 Score (Precision and Recall).

  • Accuracy = (TP + TN) / (TP + FP + FN + TN)
    The accuracy is the number of correct predictions divided by the total number of predictions made.
  • Precision = TP / (TP + FP)
    Shows the proportion of positive predictions that were actually positive.
  • Recall = TP / (TP + FN)
    Shows the proportion of actual positives that were correctly predicted.

The Confusion Matrix uses a threshold value of 0.5 to differentiate between predicted positive and predicted negative values.
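
As an illustration of the quadrants and metrics above, the following sketch derives the four counts from a set of prediction scores and actual outcomes using the 0.5 threshold, then computes Accuracy, Precision, Recall, and the F1 Score. The data is made up and the code is a generic illustration, not Predict's internal implementation.

```python
def confusion_matrix(predictions, actuals, threshold=0.5):
    """Count the four quadrants by applying the threshold to each prediction."""
    tp = fp = tn = fn = 0
    for score, actual in zip(predictions, actuals):
        predicted_true = score >= threshold
        if predicted_true and actual:
            tp += 1          # True Positive
        elif predicted_true and not actual:
            fp += 1          # False Positive
        elif not predicted_true and not actual:
            tn += 1          # True Negative
        else:
            fn += 1          # False Negative
    return tp, fp, tn, fn

# Hypothetical prediction scores and actual outcomes for eight visitors.
tp, fp, tn, fn = confusion_matrix(
    predictions=[0.9, 0.8, 0.7, 0.3, 0.6, 0.2, 0.4, 0.1],
    actuals=[True, True, False, False, True, True, False, False],
)
accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)
print(tp, fp, tn, fn)                   # 3 1 3 1
print(accuracy, precision, recall, f1)  # 0.75 0.75 0.75 0.75
```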

Formula Definition Reference

The following list provides a descriptive reference for the elements used in the Predict modeling formulas referred to in the Confusion Matrix section and other portions of this document.

  • Visitor's Visit (Vn)
    The act of a Visitor visiting a website or the triggering of one or more data layer enriched events.
  • Prediction Time Window (Wn)
    The Prediction Time Window starts at the end of each Visitor's visit, is measured in days, and is used to determine the Prediction Outcome.
  • Prediction (Pn)
    The score generated by a Deployed Model. This score represents the likelihood that the Target Attribute Value will be set to True during the Visitor's next Visit (assuming the next visit occurs within the Prediction Time Window). This value is stored in the corresponding Prediction Attribute for the model.
  • Prediction Attribute
    The numeric Visit scoped data-layer attribute that stores the Prediction value generated by a corresponding Deployed Model. The Prediction Attribute is created by default when a new model is created.
  • Prediction Threshold (PT)
    The numeric threshold value selected to measure a Prediction value against the Target Attribute Value. For example, if the Prediction Threshold chosen is 0.5 and a Prediction value is set to 0.51, it is assumed that the Target Attribute Value will be set to True.
  • Target Attribute
    The Flag or Badge attribute selected to represent the action to be predicted by the model.
  • Target Attribute Value (AVn)
    The True or False value of the Target Attribute at the end of a visitor's visit.
  • Prediction Outcome (O)
    The accuracy of the Prediction as determined by the Target Attribute Value, Prediction Threshold, and Prediction Time Window. If a Visitor returns within the Prediction Time Window, the Prediction Outcome is measured by the Target Attribute Value. If a visitor does not return within the chosen Prediction Time Window, the Target Attribute Value is ignored.
  • Prediction Classification (PC)
    The True or False value that results from applying the Prediction Threshold to the Prediction.
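
As an interpretation of the definitions above (not Tealium's internal logic; all names are hypothetical), the following sketch applies a Prediction Threshold to a Prediction to obtain a Prediction Classification, then measures the Prediction Outcome only when the visitor returns within the Prediction Time Window.

```python
def prediction_classification(prediction: float, prediction_threshold: float) -> bool:
    """Prediction Classification (PC): True when the Prediction meets or
    exceeds the Prediction Threshold (boundary behavior assumed here)."""
    return prediction >= prediction_threshold

def prediction_outcome(prediction: float,
                       prediction_threshold: float,
                       returned_within_window: bool,
                       target_attribute_value: bool):
    """Prediction Outcome (O): True if the classification matches the Target
    Attribute Value. If the visitor did not return within the Prediction Time
    Window, the Target Attribute Value is ignored and no outcome is measured."""
    if not returned_within_window:
        return None
    return prediction_classification(prediction, prediction_threshold) == target_attribute_value

# Example from the definitions above: a Prediction of 0.51 against a 0.5
# threshold is classified as True; the visitor returned within the window
# and the Target Attribute Value was set to True, so the outcome is correct.
print(prediction_outcome(0.51, 0.5, returned_within_window=True, target_attribute_value=True))
```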

The ROC/AUC Curve

In Tealium Predict, the ROC/AUC (area under the curve) is a performance measurement reported for a trained model in the Model Explorer. In industry terms, the true positive rate plotted on the ROC curve is calculated as the number of true positives divided by the sum of the number of true positives and the number of false negatives. It describes how well a model predicts the positive class when the actual outcome is positive. The true positive rate is also referred to as Sensitivity.

The Receiver Operating Characteristics (ROC) curve and the area under this curve (referred to as AUC, for area under curve) are common tools in the machine learning community for evaluating the performance of a classification model.

The ROC curve shows the trade-offs between different thresholds and consists of a plot of True Positive Rate (y-axis) against the False Positive Rate (x-axis).

  • True Positive Rate = Sensitivity
  • False Positive Rate = (1 - Specificity)
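
As a generic illustration (not a description of how Predict computes the value), the following sketch builds the ROC points by sweeping the threshold over a set of scored examples and estimates the AUC with the trapezoidal rule. The scores and outcomes are made up.

```python
def roc_points(scores, actuals):
    """(False Positive Rate, True Positive Rate) at each candidate threshold."""
    positives = sum(actuals)
    negatives = len(actuals) - positives
    points = [(0.0, 0.0), (1.0, 1.0)]
    for threshold in set(scores):
        tp = sum(1 for s, a in zip(scores, actuals) if s >= threshold and a)
        fp = sum(1 for s, a in zip(scores, actuals) if s >= threshold and not a)
        points.append((fp / negatives, tp / positives))
    return sorted(points)

def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical scores and actual outcomes for eight visitors.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
actuals = [True, True, False, True, True, False, False, False]
print(auc(roc_points(scores, actuals)))  # 0.875
```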

Ideally, your model will be able to fully distinguish between the True and False classes, always predicting the correct answer.

ROC Examples

The Probability Distribution and ROC curve for a perfect model have the following characteristics:

  • For the perfect model's ROC curve, AUC = 1. The entire graph area falls underneath and to the right of the curve (boundary).

For an extreme contrast, consider the Probability Distribution and ROC curve for a model that always predicts the wrong answer, always labeling True as False and vice versa.

  • For this ROC curve, AUC = 0.

A poor scenario is defined as a model that is incapable of distinguishing between True and False classes. In this scenario, the Probability Distribution displays two large curves directly on top of each other.

  • The ROC curve is a diagonal line; and AUC = 0.5.

For a realistic ROC curve, where 0.5 < AUC < 1.0, smaller values on the x-axis of the plot indicate fewer false positives and more true negatives, while larger values on the y-axis indicate more true positives and fewer false negatives.

When predicting a binary outcome, it is either a correct prediction (true positive) or not (false positive).

Probability Distribution

You can go to the Training Details panel for any version of any trained model to view a probability distribution of the predictions made by the model during training.

The two colored curves of this chart represent the distributions of true and false predictions that the model made during training. Because the model training process uses historical data and you know whether each visitor actually performed the target behavior, it is possible to test the model by comparing the predictions for historical visitors against the actual outcomes. To make this comparison possible, a portion of the training dataset is set aside as the test subset.

The probability distribution compares predictions against actual values for the visitors in the test subset. Visitors who were part of the True class (did perform the behavior) are displayed as part of the teal-colored curve and visitors who were part of the False class are part of the orange-colored curve.
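
The chart itself is produced by Predict, but the underlying idea can be sketched in a few lines: split the test-subset prediction scores by actual class and compare the two distributions. The scores below are made up, and the bucketing is a generic illustration rather than Predict's charting logic.

```python
import numpy as np

# Hypothetical test-subset scores, split by whether the visitor actually
# performed the target behavior (True class) or not (False class).
true_scores = np.array([0.82, 0.91, 0.77, 0.66, 0.88, 0.73, 0.95, 0.61])
false_scores = np.array([0.12, 0.25, 0.31, 0.18, 0.44, 0.22, 0.37, 0.09])

# Bucket each class's predictions into the same bins to compare the two
# distributions, as the teal (True) and orange (False) curves do in the chart.
bins = np.linspace(0.0, 1.0, 11)
true_hist, _ = np.histogram(true_scores, bins=bins)
false_hist, _ = np.histogram(false_scores, bins=bins)
print(true_hist)   # counts of True-class predictions per 0.1-wide score bucket
print(false_hist)  # counts of False-class predictions per 0.1-wide score bucket
```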

Ideal Probability Distribution

The following list describes characteristics of an ideal probability distribution:

  • There is clear separation between the True and False classes, as depicted by the teal and orange curves. This shows the model can easily and accurately classify visitors.
  • Each class (curve) only covers one portion of the predicted value range. The False class is ideally focused in the lower range on the left, and the True class is focused in the upper range on the right.
