5 Conclusions

The effort put into preprocessing and development of the package kdd98 laid a basis for future work on the data set. The architecture of the package makes modifications to existing transformers or implementation of additional transformations time-efficient. Care was taken to document the code so that future users can quickly add their own modifications.
Regarding model learning, the current approach using a two-step prediction is promising in principle. Shortcomings in terms of classifier and regressor performance are suspected to mainly result from the remaining high dimensionality, even after reducing the number features by 90%. A logical next step would be to reiterate over feature engineering, imputation, and feature selection:
- In feature engineering, the promotion- and giving history features should be carefully reexamined. Given the selection of various RFA_ features for promotions in the past through boruta, these could be distilled into single R, F, A summaries.
- Imputation with a CART method like in R’s caret package is another imputation strategy to consider. The simple strategy implemented in this thesis might have led to bias during model learning.
- Feature selection using boruta could be combined with a manual step afterwards, informed by domain knowledge.
The \(4^{th}\) rank achieved proves that selection of examples works in principle. The high error on predicted net profit however suggests a thorough reconsideration of the prediction method is necessary. Again, the implementation in kdd98 can serve as a basis, providing the necessary “plumbing” for efficient examination and future work.
The result in comparison with the performance of specialized software coupled with expertise in the field employed by the cup winners emphasizes the power of domain knowledge. The progress made in machine learning over the past 20 years since the cup does not automatically lead to superior results when strictly taking only the information contained in the data itself.