KDD Cup 2009
In this project, "KDD Cup 2009", we have two datasets: a training set and a test set. Each has 50,000 examples, and each example has 14,740 numerical variables and 260 categorical variables. There are three target values: Churn, Appetency and Up-selling. Our task is to train a model on the training set with the given training targets, and then use it to predict the targets of the test set.
1. The first view of the dataset
Because no other useful information is provided, I first make some observations on the given dataset itself. The observations are listed below:
(A) First, this dataset has 15,000 variables. This number is so large that it is hard to do even basic analysis by eye.
(B) The meanings of the variables are not given (they are only named "Var?????"), so we cannot do any preprocessing based directly on their meanings.
(C) This dataset has missing values.
(D) There are both numerical and categorical variables in it.
The above points are basic properties of this dataset. There are further observations:
(E) There are many zeros in this dataset.
(F) The nonzero numerical values span a wide range. Some values are greater than 10^8, and some are less than -10^6. There are also many floating-point numbers.
(G) I cannot interpret the meanings of the categorical values (strings such as "a5n2" or "_lnU"), and the categorical variables have many missing values.
These observations provide some ideas for preprocessing.
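As a first step toward these observations, a minimal sketch of a per-variable statistics scan is shown below. The toy rows and the `column_stats` helper are illustrative assumptions, not part of the actual dataset; the real data has about 15,000 columns per example, with missing values.

```python
import math

# Hypothetical toy rows standing in for the real 15,000-column data.
# None marks a missing value.
rows = [
    [0.0, 5.0, None],
    [0.0, 7.0, 2.0],
    [1.0, 0.0, None],
]

def column_stats(rows, col):
    """Count zero/nonzero/missing entries of one column and
    compute its min, max, mean, and standard deviation."""
    vals = [r[col] for r in rows if r[col] is not None]
    zeros = sum(1 for v in vals if v == 0.0)
    mean = sum(vals) / len(vals)
    std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
    return {
        "zero": zeros,
        "nonzero": len(vals) - zeros,
        "missing": len(rows) - len(vals),
        "min": min(vals),
        "max": max(vals),
        "mean": mean,
        "std": std,
    }

print(column_stats(rows, 0))  # column 0 is mostly zero
```

A column whose "zero" count dominates is a candidate for treating zero/nonzero as a feature in itself, while min/max/mean/std feed the normalization step.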
2. Problems and ideas
From the above observations, we can identify some problems and ideas:
(A) Observation 1.(E) can help us in two ways:
a. We can save memory by storing only the nonzero values together with their column numbers. This can be helpful for such a large dataset.
b. We can count the numbers of zero and nonzero values for each variable. If a variable usually has nonzero values, its zeros may be outliers given its distribution. If a variable is usually zero, then being zero or nonzero may itself be an important feature of that variable.
(B) Because of 1.(F), we need to normalize or we will not get good results. Because of 1.(B), we cannot normalize according to the meanings of the variables, so we can only compute statistics to obtain the scale of each variable.
(C) From 1.(G), I think the missingness of categorical variables is meaningful. I will map missing categorical values to a special "?" value, and fill missing numerical values with the variable's average (or another statistic).
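Ideas (B) and (C) can be sketched together as a small preprocessing step. This is a minimal illustration assuming per-column minimum, maximum, and mean have already been computed; the helper names are my own, not from the original programs.

```python
def normalize(v, lo, hi):
    """Linearly map a value from [lo, hi] to [-1, 1].
    A constant column (lo == hi) is mapped to 0."""
    if hi == lo:
        return 0.0
    return 2.0 * (v - lo) / (hi - lo) - 1.0

def impute_numeric(v, mean):
    """Replace a missing numerical value with the column average."""
    return mean if v is None else v

def impute_categorical(v):
    """Treat a missing categorical value as its own "?" category,
    so missingness stays visible as a feature."""
    return "?" if v is None else v

# Example: a column ranging over [0, 10] with mean 3.5.
x = impute_numeric(None, 3.5)   # -> 3.5
y = normalize(x, 0.0, 10.0)     # -> -0.3
```

Min-max scaling is a simple choice here; with outliers as extreme as 10^8, a robust scale (e.g. based on standard deviation) could be substituted without changing the structure.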
I wrote all programs myself without any existing library. I basically use a backpropagation neural network, with the ideas and modifications below:
- A program that computes the zero count, nonzero count, missing count, minimum, maximum, average, and standard deviation of the training data.
- Use this information to normalize all values to [-1, 1].
- Another program that collects all categorical values and keeps them in a hash table.
- To treat all categorical values as equal in rank, use a "multi-weight" scheme for categorical input: each categorical input value uses its own weight for backpropagation.
- To avoid running out of memory, store the data as (position, value) pairs and discard missing values. This saves about 3/5 of the memory (for the large dataset).
- Even with these measures, memory is still slightly insufficient.
- Such a huge dataset takes a long time to train.
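The sparse storage and the categorical hash table described above can be sketched as follows. This is an illustrative reconstruction, not the original program: the `encode_row` function and the convention of appending a category id as the value of its (position, value) pair are my own simplifications; in the "multi-weight" network each such id would select its own weight.

```python
# Hash table mapping each categorical string to an integer id.
cat_index = {}

def encode_row(numeric, categorical):
    """Store one example as sparse (position, value) pairs,
    dropping missing numerical entries to save memory.
    Categorical columns are appended after the numeric ones;
    their "value" is the category id from the hash table."""
    pairs = []
    for j, v in enumerate(numeric):
        if v is not None:            # discard missing values
            pairs.append((j, v))
    base = len(numeric)
    for k, s in enumerate(categorical):
        s = "?" if s is None else s  # missing stays a visible category
        vid = cat_index.setdefault(s, len(cat_index))
        pairs.append((base + k, vid))
    return pairs

row = encode_row([0.5, None, 3.0], ["a5n2", None])
# -> [(0, 0.5), (2, 3.0), (3, 0), (4, 1)]
```

With roughly 3/5 of the entries missing in the large dataset, dropping them from the (position, value) representation gives the memory saving mentioned above.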