4.1 Preprocessing With Package kdd98

The self-written package kdd98 ensures consistent data provisioning. It handles downloading and preprocessing of the data for both the learning and the validation set. All preprocessing steps are trained on the learning data set only, and each trained transformation is persisted on disk. After training, the persisted transformations can be applied to the validation data. This process is transparent to the user: it is enough to instantiate the data provider for either the learning or the validation data set and request the data. Usage examples can be found in the Jupyter notebooks.
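The train-persist-apply cycle described above can be illustrated with a minimal, standard-library sketch. It is not the code of kdd98: the class MedianImputer, its save and load methods, and the JSON file format are assumptions made for illustration only, and the package may persist its transformations differently.

```python
import json
import tempfile
from pathlib import Path
from statistics import median


class MedianImputer:
    """Hypothetical transformer: fills missing values (None) with a learned median."""

    def __init__(self, fill_value=None):
        self.fill_value = fill_value

    def fit(self, values):
        # Learn the fill value from the observed (non-missing) entries only.
        self.fill_value = median(v for v in values if v is not None)
        return self

    def transform(self, values):
        return [self.fill_value if v is None else v for v in values]

    def save(self, path):
        # Persist only the trained state (assumed format: a small JSON file).
        Path(path).write_text(json.dumps({"fill_value": self.fill_value}))

    @classmethod
    def load(cls, path):
        return cls(**json.loads(Path(path).read_text()))


learning = [1.0, None, 3.0, 5.0]
validation = [None, 2.0, None]

with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "median_imputer.json"

    # Train on the learning set only, then persist the trained transformer.
    MedianImputer().fit(learning).save(state_file)

    # Later: restore the transformer and apply it to the validation set.
    restored = MedianImputer.load(state_file)
    validation_imputed = restored.transform(validation)

print(validation_imputed)  # [3.0, 2.0, 3.0]
```

The point of the design is that the validation data never influences the learned fill value: it is computed from the learning set and merely reused afterwards.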

The data can be obtained from kdd98.data_handler.KDD98DataProvider at the following intermediate steps:

raw: as imported through pandas.read_csv()
preprocessed: input errors removed, correct data types for all features, missing at random (MAR) imputations applied
numeric: after feature engineering (encoded categories, date and zip code transformations)
imputed: remaining missing values imputed
all-relevant: filtered down to a set of relevant features
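How such staged access might work can be sketched as follows. This is a toy model and not the interface of KDD98DataProvider: the class StagedProvider, its get method, the example rows, and the four step functions are all hypothetical. The sketch only shows the idea that each stage is derived from the previous one and cached once computed.

```python
class StagedProvider:
    """Hypothetical provider: each stage is derived from the one before it and cached."""

    STAGES = ("raw", "preprocessed", "numeric", "imputed", "all-relevant")

    def __init__(self, raw_rows, steps):
        self._steps = steps              # stage name -> function(previous stage rows)
        self._cache = {"raw": raw_rows}  # stages computed so far

    def get(self, stage):
        if stage not in self.STAGES:
            raise ValueError(f"unknown stage: {stage!r}")
        if stage not in self._cache:
            previous = self.STAGES[self.STAGES.index(stage) - 1]
            self._cache[stage] = self._steps[stage](self.get(previous))
        return self._cache[stage]


def preprocess(rows):
    # Fix input errors and data types: strip whitespace, parse age as int or None.
    return [{**r, "gender": r["gender"].strip(),
             "age": int(r["age"]) if r["age"] else None} for r in rows]


def to_numeric(rows):
    # Encode the categorical feature as a number.
    return [{**r, "gender": 1 if r["gender"] == "F" else 0} for r in rows]


def impute(rows):
    # Fill missing ages with the mean of the observed ages.
    ages = [r["age"] for r in rows if r["age"] is not None]
    mean_age = sum(ages) / len(ages)
    return [{**r, "age": mean_age if r["age"] is None else r["age"]} for r in rows]


def select_relevant(rows):
    # Keep only the features considered relevant (here: an arbitrary choice).
    return [{k: r[k] for k in ("age", "gender")} for r in rows]


raw = [{"age": "42", "gender": " F", "id": "a1"},
       {"age": "", "gender": "M", "id": "a2"}]

provider = StagedProvider(raw, {"preprocessed": preprocess, "numeric": to_numeric,
                                "imputed": impute, "all-relevant": select_relevant})

numeric_rows = provider.get("numeric")        # missing age is still None here
relevant_rows = provider.get("all-relevant")  # reuses the cached "numeric" stage
```

Requesting a late stage computes all earlier stages on demand, so a user interested only in the final data never has to run the intermediate steps by hand.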

The behavior of some transformers can be controlled through parameters, which have to be set in the code. The package’s architecture makes it easy to implement additional transformation steps.
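A minimal sketch of both ideas, a transformer configured through a parameter and an additional step added to a sequence of transformations, is given below. The classes RareCategoryGrouper and Uppercase, the parameter min_count, and the helper fit_transform_all are hypothetical and do not correspond to names in kdd98; the only assumption about the architecture is that every step offers the same fit and transform methods.

```python
from collections import Counter


class RareCategoryGrouper:
    """Hypothetical step: categories seen fewer than min_count times become 'other'."""

    def __init__(self, min_count=2):
        self.min_count = min_count  # behavior is controlled by this parameter

    def fit(self, values):
        counts = Counter(values)
        self.keep_ = {c for c, n in counts.items() if n >= self.min_count}
        return self

    def transform(self, values):
        return [v if v in self.keep_ else "other" for v in values]


class Uppercase:
    """Hypothetical additional step: it only needs the same fit/transform interface."""

    def fit(self, values):
        return self  # nothing to learn

    def transform(self, values):
        return [v.upper() for v in values]


def fit_transform_all(steps, values):
    # Run the steps in order, feeding each step the output of the previous one.
    for step in steps:
        values = step.fit(values).transform(values)
    return values


states = ["CA", "CA", "TX", "ny", "TX", "wa"]

steps = [RareCategoryGrouper(min_count=2)]  # parameter set in the code
steps.append(Uppercase())                   # adding a further transformation step

result = fit_transform_all(steps, states)
print(result)  # ['CA', 'CA', 'TX', 'OTHER', 'TX', 'OTHER']
```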

The source code, along with a short introduction, is available online¹⁶.

¹⁶ https://github.com/datarian/thesis-msc-statistics/tree/master/code/kdd98