**Academic Year 97, Second Semester: Neural Networks Research Proposal**

**M9715027 鍾至衡**

**I. Research Project Abstract (Chinese and English):**

In this project, “KDD Cup 2009”, we have two datasets: a training dataset and a test dataset. Each has 50000 examples, and each example has 14740 numerical variables and 260 categorical variables. There are three target values: Churn, Appetency and Up-selling. Our task is to train a model on the training dataset with the given training target values, and then use it to predict the target values of the test dataset.

**II. Research Project Content:**

1. A first look at the dataset

Because no other useful information is provided, I first make some observations using only the given dataset. The observations are listed below:

*(A) First, this dataset has 15000 variables, far too many for basic analysis by eye.*

*(B) The meanings of the variables are not given (they are only named “Var?????”), so we can’t do any preprocessing based directly on their meanings.*

*(C) This dataset has missing values.*

*(D) There are both numerical and
categorical variables in it.*

The points above are basic properties of this dataset. There are also further observations:

*(E) There are very many zeros in this dataset.*

*(F) The nonzero numerical values span a wide range. Some values are greater than 10^{8}, and others are less than -10^{6}. There are also many floating point numbers.*

*(G) I can’t interpret the values of the categorical variables (which look like “a5n2” or “_lnU”, etc.), but the categorical variables have many missing values.*

These observations can be used to give us some ideas for preprocessing.
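Observations such as (C), (F) and (G) could be checked mechanically rather than by eye. The following is a minimal sketch, assuming each column has already been read into a list of raw strings and that an empty string marks a missing value (both are assumptions, as is the helper name `profile_column`). The sample values are made up for illustration.

```python
def profile_column(values):
    """Summarise one column given as raw strings ('' is assumed to mean missing)."""
    present = [v for v in values if v != ""]
    summary = {"missing": len(values) - len(present), "numeric": True}
    numbers = []
    for v in present:
        try:
            numbers.append(float(v))
        except ValueError:
            # A value such as "a5n2" cannot be parsed, so treat the column as categorical.
            summary["numeric"] = False
            break
    if summary["numeric"] and numbers:
        summary["min"] = min(numbers)
        summary["max"] = max(numbers)
    elif not summary["numeric"]:
        summary["distinct"] = len(set(present))
    return summary


# Made-up rows: one numerical column and one categorical column.
rows = [["1.5e8", "a5n2"], ["", "_lnU"], ["-2000000", ""], ["0", "a5n2"]]
columns = [list(c) for c in zip(*rows)]
profiles = [profile_column(c) for c in columns]
for name, p in zip(["Var1", "Var2"], profiles):
    print(name, p)
```

With 15000 variables, a summary like this per column would be one way to get the range and missing-value counts without inspecting the data manually.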

2. Problems and ideas

From the above observations, we can identify some problems and ideas:

*(A) The property in 1.(E) can help us in two ways:*

*a. We can save memory if we store only the nonzero values together with their column numbers. This may be helpful for such a large dataset.*

*b. We can count the zero and nonzero values of each variable. If a variable is usually nonzero, then depending on its distribution the zeros may be outliers. If a variable is usually zero, whether a value is zero or nonzero may itself be an important feature of that variable.*

*(B) Because of 1.(F), we need to normalize the values or we can’t get good results. Because of 1.(B), we can’t normalize according to the meanings of the variables, so we can only compute statistics to obtain the scale of each variable.*

*(C) According to 1.(G), I think that missingness in the categorical variables is meaningful. I will set the missing values of categorical variables to a “?” value, and fill the missing values of numerical variables with the average (or a value obtained in some other way).*
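The two ideas in (A) can be sketched as follows. This is only an illustration, assuming each example is held as a dense list of numbers with missing values already handled separately; the helper names `to_sparse` and `zero_fraction` and the tiny sample matrix are hypothetical.

```python
def to_sparse(row):
    """Keep only the nonzero entries of a dense row as {column_index: value}."""
    return {j: x for j, x in enumerate(row) if x != 0}


def zero_fraction(sparse_rows, n_columns):
    """Fraction of rows in which each column is zero (absent from the dict)."""
    nonzero_counts = [0] * n_columns
    for row in sparse_rows:
        for j in row:
            nonzero_counts[j] += 1
    n_rows = len(sparse_rows)
    return [1 - c / n_rows for c in nonzero_counts]


# Made-up 4 x 4 example; the real dataset would have 15000 columns.
dense = [
    [0, 3.5, 0, 0],
    [0, 1.0, 0, 7],
    [0, 2.0, 0, 0],
    [0, 0, 0, 0],
]
sparse = [to_sparse(r) for r in dense]
fractions = zero_fraction(sparse, 4)
print(sparse)
print(fractions)
```

A column whose zero fraction is close to 1 is a candidate for the "zero or nonzero" indicator feature described in (A) b, while a column whose zero fraction is close to 0 is one where the few zeros might be examined as possible outliers.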

3. Steps

The above problems and ideas suggest some steps that can be done (though some points are still pending):

(A) Preprocessing

a. Write a program to compute some statistics (count the zero/nonzero values and find the distribution of each variable).

b. According to the statistics, fill in the missing values, remove the outliers and then normalize.

c. How should the categorical values be preprocessed?

d. Maybe apply Principal Component Analysis [1] to reduce the number of variables?
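Steps b and c could look roughly like the sketch below. It assumes missing values are represented as `None`, uses mean imputation followed by z-score standardization for numerical variables, and uses one-hot encoding with an extra “?” category for categorical variables. One-hot encoding is only one possible answer to the open question in step c, not a settled choice, and outlier removal is left out. The function names are hypothetical.

```python
from statistics import mean, pstdev


def impute_and_standardize(column):
    """Replace None with the column mean, then rescale to zero mean and unit variance."""
    observed = [x for x in column if x is not None]
    if not observed:
        # Nothing was observed, so there is no scale to estimate.
        return [0.0 for _ in column]
    mu = mean(observed)
    filled = [mu if x is None else x for x in column]
    sigma = pstdev(filled)
    if sigma == 0:
        return [0.0 for _ in filled]
    return [(x - mu) / sigma for x in filled]


def one_hot_with_missing(column, missing_token="?"):
    """Map None to the '?' category, then one-hot encode over the sorted categories."""
    filled = [missing_token if x is None else x for x in column]
    categories = sorted(set(filled))
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for x in filled:
        vec = [0] * len(categories)
        vec[index[x]] = 1
        encoded.append(vec)
    return categories, encoded


z = impute_and_standardize([1.0, None, 3.0])
categories, encoded = one_hot_with_missing(["a5n2", None, "_lnU", "a5n2"])
print(z)
print(categories)
print(encoded)
```

Because the imputed value equals the mean, it maps to 0 after standardization. For categorical variables with many distinct values, one-hot encoding would increase the number of inputs considerably, which is one reason the question in step c remains open.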

(B) Modeling

a. Of course we use Neural Networks.

b. Should we use separate hidden nodes dedicated only to the numerical variables and only to the categorical variables?
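One way to read question b is a network in which one group of hidden nodes sees only the numerical inputs, another group sees only the categorical inputs, and a single output node combines both groups. The sketch below shows only the forward pass of such a network with random, untrained weights. The layer sizes, the sigmoid activation and all names are assumptions made for illustration, and training is not shown.

```python
import math
import random


def dense(inputs, weights, biases):
    """One fully connected layer with a sigmoid activation."""
    return [
        1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(row, inputs)) + b)))
        for row, b in zip(weights, biases)
    ]


def random_layer(n_in, n_out, rng):
    """Small random weights and zero biases for a layer of n_out nodes."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_split_network(n_num, n_cat, n_hidden_num, n_hidden_cat, seed=0):
    rng = random.Random(seed)
    return {
        "num": random_layer(n_num, n_hidden_num, rng),  # sees numerical inputs only
        "cat": random_layer(n_cat, n_hidden_cat, rng),  # sees categorical inputs only
        "out": random_layer(n_hidden_num + n_hidden_cat, 1, rng),
    }


def forward(net, x_num, x_cat):
    h_num = dense(x_num, *net["num"])
    h_cat = dense(x_cat, *net["cat"])
    # The output node is the only place where the two groups are combined.
    return dense(h_num + h_cat, *net["out"])[0]


net = build_split_network(n_num=3, n_cat=4, n_hidden_num=2, n_hidden_cat=2)
p = forward(net, [0.1, -1.2, 0.0], [0, 0, 1, 0])
print(p)
```

Compared with a fully connected hidden layer, this split uses fewer weights, at the cost of not modeling interactions between numerical and categorical variables before the output node. Whether that trade-off helps on this dataset would have to be tested.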

4. References

**[1] T. Cox and M. Cox. Multidimensional Scaling. 1994.**