3.3 Data Preprocessing

The necessary preprocessing was guided by practical necessity (input errors, inconsistent categories), by the requirements of the algorithms examined (Section 3.5), and by the requirements set out in the cup documentation:

  • Only numeric features (required by some algorithms)
  • Imputation of missing values (required by cup documentation)
  • Removal of constant and sparse features (required by cup documentation)

The transformations were established interactively in Jupyter notebooks. Once finalized, they were implemented in the Python package kdd98.

3.3.1 Cleaning

The transformations applied can be studied in the Jupyter notebook 1_Preprocessing.ipynb.

The cleaning stage of preprocessing encompassed the following transformations:

  • Removing noise: Input errors, inconsistent encoding of binary / categorical features
  • Dropping constant features and sparse features (i.e. those for which only few examples have a value set)
  • Imputation of values missing at random (MAR)

MAR values in the sense of Rubin (1976) are missing conditionally on other features in the data. For example, there are three related features from the promotion and giving history: ADATE, the date of mailing a promotion, RDATE, the date of receiving a donation in response to the promotion and RAMOUNT, the amount received. For missing RAMOUNT values, it can be checked if RDATE is non-missing. If RDATE is missing, then the example most likely has not donated and RAMOUNT can be set to zero. If, on the other hand, both date features have a value, RAMOUNT is truly missing.
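The conditional imputation rule just described can be sketched in pandas; the toy data below is purely illustrative, only the column names are taken from the description above:

```python
import numpy as np
import pandas as pd

# Toy data mirroring the promotion-history features described above.
df = pd.DataFrame({
    "ADATE": ["9701", "9701", "9701"],    # date the promotion was mailed
    "RDATE": ["9703", None, "9704"],      # response date (missing = no gift)
    "RAMOUNT": [25.0, np.nan, np.nan],    # amount received
})

# If RDATE is missing, the example most likely has not donated:
# RAMOUNT can safely be set to zero.
no_response = df["RDATE"].isna()
df.loc[no_response & df["RAMOUNT"].isna(), "RAMOUNT"] = 0.0

# If both date features have a value but RAMOUNT does not, the value
# is truly missing and is left to the imputation stage (Section 3.3.3).
```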

3.3.2 Feature Engineering

The transformations applied during feature engineering are described in detail in the Jupyter notebook 2_Feature_Engineering.ipynb. The result of this step was an all-numeric data set usable for downstream learning.

All non-numeric (i.e. categorical) features were encoded as numeric representations. For ordinal features, manual mappings from alphanumeric levels to integer numbers were specified. For nominal features, two encoding techniques provided in the package categorical-encoding were employed, depending on the number of levels:

  • One-hot encoding for \(\leq 10\) levels: For each level of a categorical feature, a new feature is created. An additional feature may be added to indicate missing values. Exactly one of these new features is set to \(1\), indicating the original level.

  • Binary encoding for \(> 10\) levels: The levels are first transformed to ordinal values (i.e. to a sequence of integer numbers). These numbers are then written in base 2 (a \(5\), for example, becomes \(101\)), and one new feature is created per binary digit. To represent 60 levels, for example, 6 features are required (\(2^6 = 64\)).
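The mechanics of binary encoding can be illustrated with a few lines of plain Python (a sketch of the technique only, not the categorical-encoding implementation):

```python
import math

def binary_encode(levels):
    """Map each level to an ordinal integer, then spread the integer's
    base-2 digits over ceil(log2(n_levels + 1)) new binary features."""
    ordinal = {level: i for i, level in enumerate(sorted(levels), start=1)}
    n_digits = math.ceil(math.log2(len(levels) + 1))
    return {
        level: [(code >> d) & 1 for d in reversed(range(n_digits))]
        for level, code in ordinal.items()
    }

# Five levels fit into 3 binary features (2^3 = 8 >= 6).
encoding = binary_encode(["A", "B", "C", "D", "E"])
print(encoding["E"])  # ordinal 5 -> binary 101 -> [1, 0, 1]
```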

In addition, several features were transformed into more usable formats (e.g. converting dates to time deltas and zip codes to coordinates). Care was taken to keep the dimensionality of the data set as low as possible.

3.3.3 Imputation

Three different approaches were evaluated. The details are shown in Jupyter notebook 4_Imputation.ipynb.

3.3.3.1 K-Nearest Neighbors

kNN imputation uses those examples that are “near” an example with a missing value in feature \(f\) to impute the value. In short, the kNN algorithm by Troyanskaya et al. (2001) implemented in missingpy works as follows:

  1. Construct the distance matrix \(D\) with distances between examples.
  2. Order all features descending by number of missing values.
  3. Starting with the feature with the most missing values, impute each missing value with the mean or median of the example's \(k\) nearest neighbors.

The algorithm runs until all values are imputed.
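A minimal sketch of the idea; for brevity this uses scikit-learn's KNNImputer, which follows the same approach as the missingpy implementation used in the thesis:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Row 2 misses its first feature; judged by the remaining feature, its
# 2 nearest neighbors are rows 1 and 3, so the mean of 3.0 and 5.0
# is imputed.
X = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [np.nan, 6.0],
    [5.0, 8.0],
])

imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
print(X_imputed[2, 0])  # 4.0
```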

While the approach is very attractive because it is intuitive and preserves data characteristics (continuous / discrete, categorical) of the features, the distance matrix is very memory-intensive for large data sets. Also, features with > 80 % missing values have to be removed first.

3.3.3.2 Iterative imputation

Iterative imputation, implemented in the package fancyimpute, works similarly to the R package mice (see van Buuren and Groothuis-Oudshoorn (2011)). Before imputation, all features have to be transformed to numerical data types and normalized because the underlying model is a linear regression.

In short, the algorithm works as follows:

  1. Features are ordered by the fraction of missing values.
  2. Starting with the feature with the most missing values, build a linear model on the other features with the current feature as the dependent variable, and predict its missing values.
  3. Repeat step 2 until all features are complete.
  4. Repeat steps 2 – 3 \(n\) times; here, \(n = 5\) was chosen.
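The loop above can be sketched with scikit-learn's IterativeImputer, a sibling of the fancyimpute implementation whose default estimator is a Bayesian linear regression:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# All features must already be numeric; the columns here are (roughly)
# linearly related, so the linear model recovers plausible values.
X = np.array([
    [1.0, 2.0, np.nan],
    [2.0, np.nan, 6.0],
    [3.0, 6.0, 9.0],
    [4.0, 8.0, 12.0],
])

imputer = IterativeImputer(max_iter=5, random_state=0)  # n = 5 rounds
X_imputed = imputer.fit_transform(X)
print(np.isnan(X_imputed).any())  # False: all values imputed
```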

3.3.3.3 Simple Imputation and Categorical Indicator

This approach is straightforward: numeric features are imputed with their median value, which makes the imputation robust to skewed distributions.

As the algorithm used (sklearn.impute.SimpleImputer) only supports numerical data types, categorical features were treated separately during feature engineering (Section 3.3.2): the one-hot or binary encoded categoricals received one additional feature indicating missing values.
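A minimal sketch; the add_indicator option of SimpleImputer shown here appends a binary missingness column, analogous to the indicator features added for the categoricals:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Median imputation is robust to the outlier 100.0: the missing value
# becomes 2.0 (the median), not 34.3 (the mean).
X = np.array([[1.0], [2.0], [np.nan], [100.0]])

imputer = SimpleImputer(strategy="median", add_indicator=True)
X_imputed = imputer.fit_transform(X)
print(X_imputed)  # second column flags where the value was missing
```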

3.3.4 Feature Selection

One of the biggest caveats in dealing with high-dimensional data is the infamous “Curse of Dimensionality” coined by Bellman (1966). The curse stems from the fact that with an increasing number of dimensions of the feature space, the number of possible combinations grows exponentially. To cover all possible combinations with several examples each, a huge amount of data would therefore be required. In machine learning, high dimensionality frequently manifests in the form of overfitting, which leads to an unacceptably large generalization error (Goodfellow, Bengio, and Courville 2016). Hughes (1968) showed that for a fixed number of examples, model performance first increases with an increasing number of dimensions but then decreases again.

Boruta, introduced by Kursa, Rudnicki, and others (2010) in the form of an R package, performs feature selection by solving the so-called all-relevant feature problem. The algorithm was found to perform very well regarding selection of relevant features in Kursa and Rudnicki (2011).

In short, the algorithm works as follows:

  1. The input matrix \(\mathbf{X}\) of dimension \(n \times p\) is extended with \(p\) so-called shadow features. The shadow features are permuted copies of the features in \(\mathbf{X}\) and are therefore uncorrelated with the target.
  2. On the resulting matrix \(\mathbf{X^*}\), a random forest classifier is trained, and the Z-score \(\frac{\overline{\text{loss}}}{\text{sd}}\) is calculated for each of the \(2p\) features.
  3. The highest Z-score among the shadow features \(MZSA\) is determined.
  4. All original features are compared against \(MZSA\), and those with a higher score are selected as important.
  5. With the remaining features, a two-sided test for equality of the Z-scores with \(MZSA\) is performed, and all features with a significantly lower score are deemed unimportant.
  6. All shadow features are removed; the procedure continues with step 1.

The algorithm terminates when all attributes have been marked as either important or unimportant, or when the maximum number of iterations is reached.
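A single round of the shadow-feature comparison (steps 1 – 4) can be sketched as follows; this simplified illustration uses Gini importances in place of the Z-score test and omits the iteration, so it is not the BorutaPy implementation itself:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n, p = 500, 4
X = rng.normal(size=(n, p))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # only features 0 and 1 matter

# Step 1: extend X with permuted "shadow" copies of every feature,
# which are by construction uncorrelated with the target.
X_star = np.hstack([X, rng.permuted(X, axis=0)])

# Step 2: train a random forest on the extended matrix.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_star, y)
importances = rf.feature_importances_

# Steps 3-4: features beating the best shadow are deemed important.
max_shadow = importances[p:].max()
selected = np.where(importances[:p] > max_shadow)[0]
print(selected)  # the informative features 0 and 1 should be selected
```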

For this thesis, a Python implementation was used.

All models were trained on the same data set resulting from the Boruta selection.