**KDD Cup 2009**

**M9715027**

**I. Project description**

In this project, "KDD Cup 2009", we have two datasets: a training set and a test set. Each has 50,000 examples, and each example has 14,740 numerical variables and 260 categorical variables. There are three target values: Churn, Appetency and Up-selling. Our task is to train a model on the training set with its given target values, and to use that model to predict the target values of the test set.

**II. Project content**

1. A first look at the dataset

Because no other useful information is provided, I first made some observations on the given dataset alone. They are listed below:

*(A) First, this dataset has 15,000 variables. This number is so large that it is hard to do basic analysis by eye.*

*(B) The meanings of the variables are not given (they are only named "Var?????"), so we can't do any preprocessing based directly on their meanings.*

*(C) This dataset has
missing values.*

*(D) There are both
numerical and categorical variables in it.*

The points above are the basic properties of this dataset. There are further observations:

*(E) There are very many zeros in this dataset.*

*(F) The nonzero numerical values span a wide range. Some values are greater than 10^{8} and some are less than -10^{6}. There are also many floating-point numbers.*

*(G) I can't understand the meanings of the categorical values (which look like "a5n2" or "_lnU", etc.), but there are many missing values in the categorical variables.*

These observations give us some ideas for preprocessing.

2. Problems and ideas

From the above observations, we can identify some problems and ideas:

*(A) From 1.(E), the large number of zeros can help us in two ways:*

*a. We can save memory by storing only the nonzero values together with their column numbers. This may be helpful for such a large dataset.*

*b. We can count the zeros and nonzeros of every variable. If a variable is usually nonzero, its zeros may be outliers, depending on the distribution. If a variable is usually zero, whether it is "zero" or "nonzero" may itself be an important feature of this variable.*

*(B) Because of 1.(F), we need to normalize the values or we can't get good results. Because of 1.(B), we can't normalize according to the meanings of the variables, so we can only compute statistics to get the scale of each variable.*

*(C) According to 1.(G), I think that a missing categorical value is meaningful. I will set the missing values of categorical variables to a "?" value, and fill the missing values of numerical variables with the average (or a value obtained some other way).*
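Ideas (A) to (C) can be sketched in a few lines of plain Python. This is only an illustration, not the program written for this project: the function names are hypothetical, a missing cell is assumed to be represented as `None` after parsing, and the column mean is assumed as the fill value for numerical variables.

```python
MISSING = None  # assumption: missing cells are parsed into None


def count_cells(rows, n_cols):
    """Count zero, nonzero and missing cells for every column."""
    counts = [{"zero": 0, "nonzero": 0, "missing": 0} for _ in range(n_cols)]
    for row in rows:
        for j, v in enumerate(row):
            if v is MISSING:
                counts[j]["missing"] += 1
            elif v == 0:
                counts[j]["zero"] += 1
            else:
                counts[j]["nonzero"] += 1
    return counts


def fill_missing(rows, categorical_cols):
    """Fill missing categorical cells with "?" and missing numerical
    cells with the mean of the observed values in that column."""
    n_cols = len(rows[0])
    sums = [0.0] * n_cols
    seen = [0] * n_cols
    for row in rows:
        for j, v in enumerate(row):
            if v is not MISSING and j not in categorical_cols:
                sums[j] += v
                seen[j] += 1
    means = [sums[j] / seen[j] if seen[j] else 0.0 for j in range(n_cols)]
    return [
        [("?" if j in categorical_cols else means[j]) if v is MISSING else v
         for j, v in enumerate(row)]
        for row in rows
    ]


def to_sparse(row):
    """Keep only nonzero, non-missing cells as (column, value) pairs."""
    return [(j, v) for j, v in enumerate(row) if v is not MISSING and v != 0]


# Tiny example: two numerical columns and one categorical column (index 2).
rows = [[0, 5.0, "a"], [None, 0, None], [4.0, 0, "b"]]
filled = fill_missing(rows, categorical_cols={2})
sparse_rows = [to_sparse(r) for r in filled]
```

Filling the missing values before the sparse encoding keeps "missing" and "zero" distinguishable: only true zeros are dropped from the stored pairs.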

3. My methods

I wrote all the programs myself without any existing library. Basically I use a backpropagation neural network, with the following ideas and modifications:

- A program checks the zero, nonzero and missing counts, the minimum, maximum, average and standard deviation of the training data.

- The above information is used to normalize all values to [-1, 1].

- Another program collects all categorical values and keeps them in a hash table.

- To treat all categorical values as being of equal rank, a "multi-weight" scheme is used for categorical inputs: each categorical input value uses a different weight in backpropagation.

- To avoid running out of memory, data is stored as (position, value) pairs and missing values are discarded. This saves about 3/5 of the memory (large case).
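The normalization and "multi-weight" ideas above might look roughly like the following sketch. It is not the original program. The min-max formula is an assumption (the report only states the target range [-1, 1]), and the class name `CategoryTable`, the `tanh` activation and the initial weight range are all hypothetical choices made for illustration.

```python
import math
import random


def scale(value, lo, hi):
    """Min-max scale a value from [lo, hi] to [-1, 1] (assumed formula).
    A constant column (hi == lo) is mapped to 0."""
    if hi == lo:
        return 0.0
    return 2.0 * (value - lo) / (hi - lo) - 1.0


class CategoryTable:
    """Hash table that gives every (column, category) pair its own
    weight vector, so no ordering is imposed on the categories."""

    def __init__(self, n_hidden, seed=0):
        self.n_hidden = n_hidden
        self.index = {}    # (column, category) -> position in self.weights
        self.weights = []  # one list of n_hidden weights per category
        self.rng = random.Random(seed)

    def weights_for(self, column, category):
        key = (column, category)
        if key not in self.index:
            self.index[key] = len(self.weights)
            self.weights.append(
                [self.rng.uniform(-0.1, 0.1) for _ in range(self.n_hidden)]
            )
        return self.weights[self.index[key]]


def hidden_layer(numeric, categorical, num_weights, table):
    """Forward pass into the hidden layer.

    numeric:     sparse list of (column, scaled_value) pairs
    categorical: list of (column, category) pairs
    num_weights: dict mapping column -> list of n_hidden weights
    """
    total = [0.0] * table.n_hidden
    for col, x in numeric:
        for h in range(table.n_hidden):
            total[h] += num_weights[col][h] * x
    for col, cat in categorical:
        w = table.weights_for(col, cat)
        for h in range(table.n_hidden):
            total[h] += w[h]  # the category's weight is added as-is
    return [math.tanh(t) for t in total]
```

Giving each category its own weight vector is equivalent to a one-hot encoding of the categorical input, but only the weights of the categories actually present in an example are touched, which fits the sparse (position, value) storage.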

4. Difficulties

- Despite these measures, memory is still slightly insufficient.

- Training on such a large dataset takes a long time.

5. References

**[1] T. Cox and M. Cox. Multidimensional Scaling. 1994.**

**[2] Example source code at http://www.neural-networks-at-your-fingertips.com/**