M9715027 鍾至衡

In this project, KDD Cup 2009, we have two datasets: a training dataset and a test dataset. Each has 50000 examples, and each example has 14740 numerical variables and 260 categorical variables. There are three target values: Churn, Appetency and Up-selling. Our task is to train a model on the training dataset with its given target values, and then use that model to predict the target values of the test dataset.

1. A first look at the dataset

Because no other useful information is provided, I first made some observations on the given dataset alone. They are listed below:

(A)   First, this dataset has 15000 variables. This number is so large that it is hard to do even basic analysis by eye.

(B)   The meanings of the variables are not given (they are only named “Var?????”), so we can’t do any preprocessing based directly on their meanings.

(C)   This dataset has missing values.

(D)   There are both numerical and categorical variables in it.

The points above are basic properties of this dataset. There are some further observations:

(E)   There are very many zeros in this dataset.

(F)   The nonzero numerical values span a wide range. Some values are greater than 10^8, and some are less than -10^6. There are also many floating-point numbers.

(G)   I can’t interpret the values of the categorical variables (which look like “a5n2” or “_lnU”, etc.), but many of them are missing.

These observations can be used to give us some ideas for preprocessing.

2. Problems and ideas

From the above observations, we can identify some problems and ideas:

(A)   The property in 1.(E) can help us in two ways:

a. We can save memory if we store only the nonzero values together with their column numbers. This may be helpful for such a large dataset.

b. We can count the zero and nonzero values of every variable. If a variable is usually nonzero, its zeros may be outliers, depending on the distribution of its values. If a variable is usually zero, whether it is zero or nonzero may itself be an important feature of this variable.

(B)   Because of 1.(F), we need to normalize the values, or we can’t expect good results. Because of 1.(B), we can’t normalize according to the meanings of the variables, so we can only compute statistics to get the scale of each variable.

(C)   According to 1.(G), I think that a missing value in a categorical variable is itself meaningful. I will set the missing values of categorical variables to a “?” value, and fill the missing values of numerical variables with the average (or a value chosen in some other way).
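
Idea (A) above can be sketched in a few lines of Python. This is only a minimal illustration, not the program used in the project: the function names to_sparse_row and column_counts and the tiny example data are made up here, and missing values are assumed to have been read in as None. Missing entries are kept in the sparse row so that they are not confused with zeros.

```python
def to_sparse_row(values):
    """Store one example as {column_index: value}, dropping exact zeros.

    Missing values (None) are kept so they are not confused with zeros.
    """
    return {col: v for col, v in enumerate(values) if v is None or v != 0}


def column_counts(sparse_rows, n_columns):
    """Return per-column (zero, nonzero, missing) counts."""
    nonzero = [0] * n_columns
    missing = [0] * n_columns
    for row in sparse_rows:
        for col, v in row.items():
            if v is None:
                missing[col] += 1
            else:
                nonzero[col] += 1
    n_rows = len(sparse_rows)
    # Whatever is neither stored as nonzero nor as missing must be a zero.
    zero = [n_rows - nonzero[c] - missing[c] for c in range(n_columns)]
    return zero, nonzero, missing


# Tiny made-up example: 3 examples, 4 variables (hypothetical data).
rows = [
    [0, 1.5, None, 0],
    [0, 2.0, 3.0, 0],
    [0, 0, None, 7.0],
]
sparse_rows = [to_sparse_row(r) for r in rows]
zero, nonzero, missing = column_counts(sparse_rows, n_columns=4)
print(sparse_rows[0])  # {1: 1.5, 2: None}
print(zero)            # [3, 1, 0, 2]
print(nonzero)         # [0, 2, 1, 1]
print(missing)         # [0, 0, 2, 0]
```

In this toy data the first variable is always zero, while the second is usually nonzero, so its single zero is the kind of value that idea (A) b. would flag for a closer look.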

3. Steps

The above problems and ideas suggest some steps that can be taken (though some points are still open):

(A) Preprocessing

a. Write a program to compute some statistics (count the zero/nonzero values and find the distribution of each variable).

b. Based on these statistics, fill in the missing values, remove the outliers, and then normalize.

c. How should the categorical values be preprocessed?

d. Maybe apply Principal Component Analysis to reduce the number of variables?
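
Part of step b. can be sketched as below, assuming each variable is held as a Python list with None for missing values. The function names are hypothetical. Standardization to zero mean and unit variance is only one possible choice of normalization, since the text does not fix one. Outlier removal is left out because no rule for it has been decided yet.

```python
import statistics

MISSING_CATEGORY = "?"  # marker for missing categorical values, as in 2.(C)


def fill_categorical(column):
    """Replace missing categorical values with the "?" marker."""
    return [MISSING_CATEGORY if v is None or v == "" else v for v in column]


def fill_numerical(column):
    """Replace missing numerical values with the mean of the observed ones."""
    observed = [v for v in column if v is not None]
    # Assumption: a column with no observed value at all is filled with 0.0.
    mean = statistics.fmean(observed) if observed else 0.0
    return [mean if v is None else v for v in column]


def standardize(column):
    """Rescale a fully filled numerical column to zero mean, unit variance."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        # A constant column carries no scale information.
        return [0.0 for _ in column]
    return [(v - mean) / std for v in column]


# Made-up example columns (hypothetical data).
filled = fill_numerical([1.0, None, 3.0])
print(filled)                                       # [1.0, 2.0, 3.0]
print([round(v, 3) for v in standardize(filled)])   # [-1.225, 0.0, 1.225]
print(fill_categorical(["a5n2", None, "_lnU"]))     # ['a5n2', '?', '_lnU']
```

Because the scale is computed per variable from the data itself, this does not need the meanings of the variables, which matches the constraint in 2.(B).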

(B) Modeling

a. We will, of course, use neural networks.

b. Should we use separate hidden nodes dedicated to the numerical variables and to the categorical variables?
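
One possible reading of question b. is a network whose hidden layer is split into two groups, one connected only to the numerical inputs and one connected only to the categorical inputs, with both groups feeding three outputs (Churn, Appetency, Up-selling). The sketch below shows only the forward pass of such a network. Everything in it is an assumption for illustration: the layer sizes, the tanh and sigmoid activations, the random untrained weights, and the premise that the categorical variables have already been turned into numbers somehow (which 3.(A) c. leaves open).

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for a fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(inputs, weights, biases):
    """Weighted sum per output unit; weights[j] belongs to output unit j."""
    return [
        sum(w * x for w, x in zip(w_row, inputs)) + b
        for w_row, b in zip(weights, biases)
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(num_x, cat_x, params):
    """Forward pass with separate hidden groups for the two input types."""
    h_num = [math.tanh(z) for z in dense(num_x, *params["num"])]
    h_cat = [math.tanh(z) for z in dense(cat_x, *params["cat"])]
    # The output layer sees both hidden groups side by side.
    out = dense(h_num + h_cat, *params["out"])
    return [sigmoid(z) for z in out]


rng = random.Random(0)
n_num, n_cat = 5, 4      # hypothetical input sizes
h_num_size, h_cat_size = 3, 2  # hypothetical hidden group sizes
params = {
    "num": init_layer(n_num, h_num_size, rng),
    "cat": init_layer(n_cat, h_cat_size, rng),
    "out": init_layer(h_num_size + h_cat_size, 3, rng),
}
num_x = [0.5, -1.2, 0.0, 2.0, 0.1]  # normalized numerical values (made up)
cat_x = [1.0, 0.0, 0.0, 1.0]        # encoded categorical values (made up)
scores = forward(num_x, cat_x, params)
print(len(scores))  # 3
```

Training is not shown. Whether the split hidden layer actually helps compared with one shared hidden layer would have to be checked experimentally.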
