3.4 Prediction
The quantity to estimate is net profit \(\hat{\Pi}\). To predict it, a two-step procedure was implemented, using the response indicator \(\text{TARGET}_B\) and the continuous target \(\text{TARGET}_D\), respectively. A model was trained for each step: a classifier predicting \(\hat{y}_b\), the probability of donating, and a regressor predicting the donation amount \(\hat{y}_d\). The classifier was trained on the complete learning data set, while the regressor was trained on \(\mathbf{X}_d = \{x_i \mid y_{b,i} = 1, i=1 \ldots n\}\) and \(\mathbf{y}_d = \{y_{d,i} \mid y_{b,i} = 1, i=1 \ldots n\}\). Because this sample is non-random, the predicted donation amount is conditional on donating, which introduces a bias that must be corrected for. This approach resembles the two-stage procedure of Heckman (1976), which is widely used in econometrics. Heckman's procedure was introduced for the question of wage offers for women: wages were observed only for working women, so predicting wage offers on this sample alone would be biased. Whether a woman works is treated as a selection mechanism, which Heckman (1976) models in a first selection stage with a probit. The inverse Mills ratio of the probit, \(\frac{\phi(\mathbf{X}\beta)}{\Phi(\mathbf{X}\beta)}\), is then included as an additional regressor in the second, observation stage, an OLS regression.
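The bias-correction term of Heckman's first stage can be sketched as follows; this is a minimal illustration of the inverse Mills ratio using SciPy, not part of the procedure implemented here:

```python
import numpy as np
from scipy.stats import norm

def inverse_mills_ratio(z):
    """Inverse Mills ratio phi(z) / Phi(z), the bias-correction regressor
    added to the second (observation) stage of Heckman's procedure."""
    return norm.pdf(z) / norm.cdf(z)

# The ratio shrinks as the selection probability Phi(z) grows, i.e. the
# correction matters most for observations that were unlikely to be selected.
ratios = inverse_mills_ratio(np.array([-2.0, 0.0, 2.0]))
```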
3.4.1 Setup of the Two-Stage Prediction
For the first stage, a classifier is used to predict the probability that example \(x_i\) donates. Unlike in Heckman's procedure, this need not be a probit model (\(P(Y=1 \mid X) = \Phi(X^T\beta)\)); the resulting distribution depends on the chosen classifier.
\[\begin{equation} \hat{y}_b = f(\mathbf{X}) \tag{2.1} \end{equation}\]
where \(\hat{y}_b\) is the vector of predicted probabilities of donating and \(f\) is the classifier.
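The first stage can be sketched as below, assuming scikit-learn and synthetic stand-in data; the logistic regression and all variable names are illustrative, not the study's actual classifier or features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic stand-in for the learning data: X are features,
# y_b is the binary response indicator TARGET_B.
X = rng.normal(size=(500, 4))
y_b = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)

# Any probabilistic classifier can play the role of f; logistic
# regression is used here purely for illustration.
f = LogisticRegression().fit(X, y_b)
y_b_hat = f.predict_proba(X)[:, 1]  # predicted donation probabilities
```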
The second stage is performed on \(\mathbf{X}_d\) and consists of predicting the donation amount using a regression model:
\[\begin{equation} \hat{y}_{dt} = g(\mathbf{X_d}) \tag{2.2} \end{equation}\]
where \(\hat{\mathbf{y}}_{dt}\) is the vector of predicted donation amounts and \(g\) is the regression model learned on the conditional sample. Since \(g\) is learned on the Box-Cox transformed target \(\mathbf{y}_d\), the predictions \(\hat{y}_{dt}\) are Box-Cox transformed with parameter \(\lambda\) as well.
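The second stage, fit only on the donors, can be sketched as follows; the OLS regressor, the synthetic data, and all names are assumptions for illustration (SciPy's `boxcox` estimates \(\lambda\) by maximum likelihood):

```python
import numpy as np
from scipy.stats import boxcox
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
# Synthetic stand-in: y_b marks donors, y_d their (positive) amounts.
X = rng.normal(size=(500, 4))
y_b = (rng.random(500) < 0.3).astype(int)
y_d = np.where(y_b == 1,
               np.exp(1.0 + 0.5 * X[:, 0] + 0.1 * rng.normal(size=500)),
               0.0)

# The second stage is fit only on the donors, i.e. the non-random
# subsample X_d = {x_i | y_b,i = 1}.
donors = y_b == 1
X_d = X[donors]
y_d_t, lam = boxcox(y_d[donors])  # Box-Cox transformed target and lambda

# Any regressor can play the role of g; OLS is used for illustration.
g = LinearRegression().fit(X_d, y_d_t)
y_dt_hat = g.predict(X_d)  # predictions on the Box-Cox transformed scale
```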
Whether example \(i\) is included in the promotion is governed by the following indicator function, in which \(\alpha^*\) corrects for the introduced bias: every example whose bias-corrected expected donation exceeds the unit cost is included.
\[\begin{equation} \mathbb{1}_{\hat{y}_{i,b} \cdot \exp(\hat{y}_{i,dt}) \cdot \alpha^* > \exp(u_t)}(\hat{y}_{i,dt}) \tag{2.3} \end{equation}\]
where \(\alpha^* \in [0,1]\) is a factor correcting for the bias introduced by the non-randomness of \(\mathbf{X}_d\), \(\hat{y}_{i,dt}\) is the predicted donation amount on the Box-Cox scale used to normalize the target distribution, and \(u_t\) is the unit cost, Box-Cox transformed with the same parameter \(\lambda\). The exponential maps the possibly negative Box-Cox transformed values back to positive ones.
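The decision rule of equation (2.3) can be written as a small helper; the function name and toy values are illustrative:

```python
import numpy as np

def include_in_promotion(y_b_hat, y_dt_hat, alpha, u_t):
    """Indicator of equation (2.3): include example i iff the bias-corrected
    expected donation exceeds the unit cost. Both sides are exponentiated to
    undo the negative values the Box-Cox transformation can produce."""
    return y_b_hat * np.exp(y_dt_hat) * alpha > np.exp(u_t)

# Toy check: probability 0.4 of donating, transformed amount 2.0,
# transformed unit cost 0.5, no bias correction (alpha = 1):
# 0.4 * e^2 ~ 2.96 exceeds e^0.5 ~ 1.65, so the example is included.
decision = include_in_promotion(np.array([0.4]), np.array([2.0]), 1.0, 0.5)
```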
Finally, the estimated quantity is net profit \(\hat{\Pi}\). It is calculated by summing the product of the indicator function (2.3) and the net profit over examples \(1 \ldots n\). For unseen data, equation (2.4) is altered by using the estimated donation amount \(\hat{y}_{d,i}\) in the product.
\[\begin{equation} \hat{\Pi}_\alpha = \sum_{i=1}^n \mathbb{1}_{\hat{y}_{i,b} \cdot \exp(\hat{y}_{i,dt}) \cdot \alpha^* > \exp(u_t)}(\hat{y}_{i,dt}) \cdot (y_{i,d} - u) \tag{2.4} \end{equation}\]
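Equation (2.4) can be sketched as a function of \(\alpha\); the toy arrays below are hypothetical and only serve to show the mechanics:

```python
import numpy as np

def estimated_profit(alpha, y_b_hat, y_dt_hat, y_d, u, u_t):
    """Equation (2.4): net profit y_d - u summed over the examples selected
    by the indicator at bias-correction factor alpha (unit cost u, its
    Box-Cox transformed counterpart u_t)."""
    selected = y_b_hat * np.exp(y_dt_hat) * alpha > np.exp(u_t)
    return float(np.sum((y_d - u)[selected]))

# Toy data: two likely high-value donors and one unlikely low-value one.
y_b_hat = np.array([0.8, 0.7, 0.05])
y_dt_hat = np.array([2.0, 1.8, 0.2])
y_d = np.array([9.0, 7.0, 0.0])
profit_full = estimated_profit(1.0, y_b_hat, y_dt_hat, y_d, u=0.68, u_t=0.5)
```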
3.4.2 Optimization of \(\alpha^*\)
With equation (2.4), the estimated profit \(\hat{\Pi}_\alpha\) was calculated on the learning data for a grid of values \(\alpha \in [0,1]\). The optimal value was then \(\alpha^{*} = \underset{\alpha}{\operatorname{argmax}} f(\alpha)\), where \(f\) is a function fit to the estimated profits.
For \(f\), a cubic spline \(s\) was used. \(\alpha^*\) was determined as follows:
- Fit \(s(\Pi)\), the cubic spline on the estimated profits for the grid of \(\alpha\) values
- Differentiate it to obtain \(ds = \frac{d}{d\alpha} s\)
- Find the finite roots of \(ds\), \(\alpha_{\text{candidates}}\), representing candidates for \(\alpha^*\)
- Determine \(\alpha^* = \underset{\alpha \in \alpha_{\text{candidates}}}{\operatorname{argmax}}\, s(\alpha)\)
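The steps above can be sketched with SciPy's `CubicSpline`, whose piecewise-polynomial derivative exposes a `roots` method; the toy profit curve (a quadratic peaking near \(\alpha = 0.63\)) stands in for the grid of profits from equation (2.4):

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Toy profit curve with an interior maximum near alpha = 0.63; in the
# application these values come from evaluating equation (2.4) on the grid.
alphas = np.linspace(0.0, 1.0, 21)
profits = 1.0 - (alphas - 0.63) ** 2

s = CubicSpline(alphas, profits)          # cubic spline fit to the profits
ds = s.derivative()                       # ds/d(alpha)
candidates = ds.roots(extrapolate=False)  # finite roots: candidates for alpha*
alpha_star = candidates[np.argmax(s(candidates))]
```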