5 Conclusions
The effort put into preprocessing and into developing the package kdd98 laid a basis for future work on the data set. The architecture of the package makes it time-efficient to modify existing transformers or to implement additional transformations. Care was taken to document the code so that future users can quickly add their own modifications.

Regarding model learning, the current approach using a two-step prediction is promising in principle. Shortcomings in classifier and regressor performance are suspected to result mainly from the remaining high dimensionality, even after reducing the number of features by 90%. A logical next step would be to iterate again over feature engineering, imputation, and feature selection:
- In feature engineering, the promotion and giving history features should be carefully reexamined. Given that boruta selected various RFA_ features for past promotions, these could be distilled into single R, F, and A summaries.
- Imputation with a CART method, like in R’s caret package, is another strategy to consider. The simple strategy implemented in this thesis might have led to bias during model learning.
- Feature selection using boruta could be combined with a subsequent manual step informed by domain knowledge.
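
The idea of distilling the per-promotion RFA_ features into single R, F, and A summaries could be sketched as follows. This is a minimal illustration and not part of kdd98: the function names `split_rfa` and `summarize_rfa` are hypothetical, and the sketch assumes that each RFA code is a three-character string consisting of a recency letter, a frequency digit, and an amount letter from A to G. The choice of summary (most common recency status, mean frequency, mean ordinal amount) is likewise only one of several plausible options.

```python
from collections import Counter
from statistics import mean

# Assumption: the amount component is an ordered letter code A..G.
AMOUNT_ORDER = "ABCDEFG"


def split_rfa(code):
    """Split a three-character RFA code into (recency, frequency, amount).

    Returns None for missing or malformed codes so that they can be skipped.
    """
    if not isinstance(code, str):
        return None
    code = code.strip().upper()
    if len(code) != 3 or code[1] not in "0123456789" or code[2] not in AMOUNT_ORDER:
        return None
    return code[0], int(code[1]), AMOUNT_ORDER.index(code[2])


def summarize_rfa(codes):
    """Distill one donor's per-promotion RFA codes into single R, F, A values."""
    parts = [p for p in map(split_rfa, codes) if p is not None]
    if not parts:
        return {"R": None, "F": None, "A": None}
    recency, frequency, amount = zip(*parts)
    return {
        "R": Counter(recency).most_common(1)[0][0],  # most common recency status
        "F": mean(frequency),                        # average frequency code
        "A": mean(amount),                           # average ordinal amount level
    }


# One donor with codes from several past promotions; missing values are ignored.
summary = summarize_rfa(["L4E", "L2E", "A1C", None, " "])
```

Applied per donor, such a function would replace the many RFA_ columns with three features, which may also help with the remaining dimensionality discussed above.
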
The \(4^{th}\) rank achieved shows that the selection of examples works in principle. The high error on predicted net profit, however, suggests that a thorough reconsideration of the prediction method is necessary. Again, the implementation in kdd98 can serve as a basis, providing the necessary “plumbing” for efficient examination and future work.

The comparison with the cup winners, who combined specialized software with expertise in the field, emphasizes the power of domain knowledge. The progress made in machine learning in the 20 years since the cup does not automatically lead to superior results when relying strictly on the information contained in the data itself.
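
To make the distinction between selecting examples and predicting net profit concrete, the following sketch shows one common reading of a two-step prediction: a classifier supplies a donation probability, a regressor supplies a conditional donation amount, and an example is selected when the expected net profit is positive. It is an illustration under stated assumptions, not the implementation in kdd98. All function names are hypothetical, the inputs are made-up numbers, and the unit cost of 0.68 per mailing is an assumed parameter value.

```python
def expected_net_profit(p_donate, predicted_amount, unit_cost):
    """Expected profit of mailing one example: P(donate) * amount - cost."""
    return p_donate * predicted_amount - unit_cost


def select_examples(p_donate, predicted_amount, unit_cost):
    """Return the indices of examples with positive expected net profit."""
    return [
        i
        for i, (p, a) in enumerate(zip(p_donate, predicted_amount))
        if expected_net_profit(p, a, unit_cost) > 0
    ]


def realized_net_profit(selected, actual_amount, unit_cost):
    """Net profit actually obtained from mailing the selected examples."""
    return sum(actual_amount[i] - unit_cost for i in selected)


# Made-up outputs of a classifier (step 1) and a regressor (step 2).
p_donate = [0.05, 0.02, 0.10]
predicted_amount = [20.0, 15.0, 5.0]
actual_amount = [0.0, 0.0, 10.0]
UNIT_COST = 0.68  # assumed cost per mailing

selected = select_examples(p_donate, predicted_amount, UNIT_COST)
profit = realized_net_profit(selected, actual_amount, UNIT_COST)
```

In this toy case only the first example is selected, and it does not donate, so the realized profit is negative although the expected profit was positive. The sketch illustrates why a selection rule can rank well overall while the predicted net profit for individual examples remains far from the realized value.
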