A.2 Package kdd98
The source code is at: https://github.com/datarian/thesis-msc-statistics/tree/master/code/kdd98
A.2.1 Usage
A.2.1.1 Data Provisioning
from kdd98 import data_handler as dh
provider = dh.KDD98DataProvider(["cup98LRN.txt", "cup98VAL.txt"])
# Get data at several intermediate steps as necessary
provider.raw_data
provider.preprocessed_data
provider.numeric_data
provider.imputed_data
provider.all_relevant_data
Requesting all_relevant_data applies every transformation step in sequence.
Request the data for cup98LRN.txt first, so that all transformers involved are learned on the learning set.
Only then can the provider be instantiated with cup98VAL.txt, so that exactly the same transformations are applied to the validation set.
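The ordering requirement above (learn on the learning set, then reuse on the validation set) can be illustrated with a minimal, standard-library-only sketch. It does not show the internals of kdd98: the helper names fit_median and apply_median, the JSON file name, and the toy data are all hypothetical and only stand in for the package's persisted transformers.

```python
import json
import statistics
import tempfile
from pathlib import Path


def fit_median(values):
    """Learn the imputation parameter (hypothetical helper, not part of kdd98)."""
    observed = [v for v in values if v is not None]
    return {"median": statistics.median(observed)}


def apply_median(params, values):
    """Replace missing values with the previously learned median."""
    return [params["median"] if v is None else v for v in values]


learning = [1.0, None, 3.0, 5.0]   # toy stand-in for cup98LRN.txt
validation = [None, 10.0]          # toy stand-in for cup98VAL.txt

with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "median_imputer.json"  # hypothetical file name

    # Step 1: learn the parameters on the learning set and persist them.
    store.write_text(json.dumps(fit_median(learning)))

    # Step 2: reload the persisted parameters and apply them unchanged
    # to the validation set; nothing is re-learned from validation data.
    params = json.loads(store.read_text())
    validation_imputed = apply_median(params, validation)

print(validation_imputed)
```

The missing validation value is filled with the median of the learning set, not of the validation set, which is the point of processing cup98LRN.txt first.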
A.2.1.2 Predictions
from kdd98.prediction import Kdd98ProfitEstimator
# These are found beforehand and expected to implement scikit-learn's API
classifier = SomeClassifier() # predicts donation probability
regressor = SomeRegressor() # conditionally predicts donation amount
estimator = Kdd98ProfitEstimator(classifier, regressor)
estimator.fit(X_learn, y_learn)
# Returns a tuple (indicator, net_profit):
estimator.predict(X_test, y_test)
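How the tuple (indicator, net_profit) might be derived from the two models can be sketched as follows. This is an assumption about the decision rule, not the documented behaviour of Kdd98ProfitEstimator: an example is selected when its expected donation (probability times conditional amount) exceeds the KDD Cup 1998 mailing cost of 0.68 USD per piece, and the net profit sums the actual donations minus that cost over the selected examples. The function name select_and_profit and the toy numbers are hypothetical.

```python
UNIT_COST = 0.68  # cost per mailing in USD, as defined for the KDD Cup 1998


def select_and_profit(probabilities, amounts, actual_donations, unit_cost=UNIT_COST):
    """Hypothetical sketch of a two-stage profit rule (not the kdd98 API).

    probabilities:    predicted donation probability per example
    amounts:          predicted donation amount, conditional on donating
    actual_donations: observed donation per example (0.0 if none)
    """
    # Select an example when its expected donation exceeds the mailing cost.
    indicator = [p * a > unit_cost for p, a in zip(probabilities, amounts)]
    # Net profit: observed donation minus mailing cost, over selected examples only.
    net_profit = sum(
        donation - unit_cost
        for donation, selected in zip(actual_donations, indicator)
        if selected
    )
    return indicator, net_profit


indicator, net_profit = select_and_profit(
    probabilities=[0.05, 0.20, 0.02],
    amounts=[20.0, 5.0, 15.0],
    actual_donations=[0.0, 10.0, 0.0],
)
print(indicator, round(net_profit, 2))
```

In this toy case the first two examples are selected; the first costs a mailing without yielding a donation, the second more than pays for both.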
A.2.2 Installation
Clone the repository, then, from the base directory:
cd code
python setup.py install # for direct install
# for a conda package to be installed manually afterwards:
python setup.py bdist_conda
A.2.2.1 Project structure
Once installed, the package creates several folders and files in the working directory when it is used:
data: contains the data files and documentation of the original KDD Cup 1998
data/data_frames: contains pickled pandas data frames with the data at the corresponding transformation step
models/internal: contains persisted transformers for data set transformations
out.log: reports transformation progress and potential problems