Research on KDD Cup 2009 Based on BP Neural Network
D9715009 Hsu Fu-Yuan
kevin@bctest.ntnu.edu.tw
1. Introduction
KDD Cup is the well-known annual data mining competition of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), the leading professional organization of data miners. Customer Relationship Management (CRM) is a key element of modern marketing strategies. This research uses the KDD Cup 2009 (http://www.kddcup-orange.com/) dataset, a large marketing database from the French telecom company Orange, to predict the propensity of customers to switch provider (churn), buy new products or services (appetency), or buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling).
This research must deal with a very large database containing heterogeneous noisy data (numerical and categorical variables) and unbalanced class distributions. Time efficiency is often a crucial point; therefore, part of the KDD Cup 2009 was time-constrained to test the ability of the participants to deliver solutions quickly.
2. Related work
2.1 CRM Applications with data mining techniques
E.W.T. Ngai et al. (2009) reviewed the academic literature published between 2000 and 2006, covering 24 journals, and proposed a classification scheme for the articles. Nine hundred articles were identified and reviewed for their direct relevance to applying data mining techniques to CRM; eighty-seven articles were subsequently selected, reviewed, and classified. Each of the 87 selected papers was categorized on four CRM dimensions (Customer Identification, Customer Attraction, Customer Retention and Customer Development) and seven data mining functions (Association, Classification, Clustering, Forecasting, Regression, Sequence Discovery and Visualization).
S.-Y. Hung et al. (2006) summarize data mining functionalities, techniques, and applications in the CRM space; see Table 1 below.
Table 1: Data Mining Functionalities, Techniques, and CRM Applications
2.2 Artificial neural network
Frank Rosenblatt's (1958) perceptron was one of the earliest neural network models. The perceptron consists of weights, a summation processor, and an adjustable threshold processor.
Figure 1: Artificial neuron (perceptron)
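As an illustration, the perceptron's three components can be sketched in a few lines of Python; the AND-gate data, learning rate, and epoch count below are illustrative choices, not taken from any source cited here.

```python
# A minimal perceptron sketch: weights, a summation processor, and an
# adjustable threshold (held as a bias term), trained with the classic
# perceptron learning rule on a toy AND-gate problem.

def perceptron_train(samples, labels, lr=0.1, epochs=20):
    n = len(samples[0])
    w = [0.0] * n        # weights
    b = 0.0              # adjustable threshold, as a bias
    for _ in range(epochs):
        for x, t in zip(samples, labels):
            s = sum(wi * xi for wi, xi in zip(w, x)) + b  # summation processor
            y = 1 if s > 0 else 0                         # threshold processor
            err = t - y                                   # learning rule update
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

def perceptron_predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# AND gate: linearly separable, so the perceptron learning rule converges.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
T = [0, 0, 0, 1]
w, b = perceptron_train(X, T)
```

Because the AND function is linearly separable, the learned weights classify all four inputs correctly.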
In recent years, ANNs have attracted much attention because of their distinctive information-processing and computational abilities, which give them broad application prospects. ANNs have features that other models do not possess; the merits of the ANN model compensate for the shortcomings of existing early-warning systems, and ANN theory and methods offer early-warning systems a way to overcome the limitations of traditional approaches.
2.3 Backpropagation network algorithm
Rumelhart et al. (1986) introduced the back-propagation algorithm (abbreviated as BP network), a multilayer perceptron with a learning rule called the generalised delta rule and a sigmoid transfer function. By 1990, the BPN had become one of the most popular and most widely used neural network models. It is an example of a supervised neural network. It has a strong ability to model non-linear systems and is the most widely applied artificial neural network. The BP algorithm is characterized by forward propagation of the signal and backward propagation of the error; Figure 2 shows the signal flow direction of the algorithm.
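To make the forward/backward signal flow concrete, here is a minimal sketch of a 2-2-1 BP network with sigmoid transfer functions trained by the generalised delta rule. The XOR data, random seed, learning rate, and epoch count are illustrative assumptions; the network may settle in a local minimum, but the squared error decreases from its initial value.

```python
# A minimal back-propagation sketch: forward pass of the signal, then
# backward propagation of the error via delta terms (generalised delta rule).
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# XOR problem: not linearly separable, so a hidden layer is required.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
T = [0, 1, 1, 0]

# Small random initial weights: w1 is input->hidden, w2 is hidden->output.
w1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
b1 = [0.0, 0.0]
w2 = [random.uniform(-1, 1) for _ in range(2)]
b2 = 0.0
lr = 0.5

def forward(x):
    h = [sigmoid(sum(w1[j][i] * x[i] for i in range(2)) + b1[j]) for j in range(2)]
    y = sigmoid(sum(w2[j] * h[j] for j in range(2)) + b2)
    return h, y

def loss():
    return sum((forward(x)[1] - t) ** 2 for x, t in zip(X, T))

loss_before = loss()
for _ in range(5000):
    for x, t in zip(X, T):
        h, y = forward(x)
        # Backward pass: error signals (deltas) for each layer.
        delta_out = (y - t) * y * (1 - y)
        delta_hid = [delta_out * w2[j] * h[j] * (1 - h[j]) for j in range(2)]
        # Gradient-descent weight updates.
        for j in range(2):
            w2[j] -= lr * delta_out * h[j]
            for i in range(2):
                w1[j][i] -= lr * delta_hid[j] * x[i]
            b1[j] -= lr * delta_hid[j]
        b2 -= lr * delta_out
loss_after = loss()
```

The derivative terms `y * (1 - y)` and `h * (1 - h)` are the sigmoid's own derivative, which is what makes the delta rule take this compact form.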
3. Research methodology
3.1 Procedure
This research is divided into four steps: (1) data cleaning & preprocessing, (2) data reduction & transformation, (3) classification, (4) evaluation.
Data cleaning & preprocessing includes sample quality control and assurance, normalization, peak detection, peak alignment, etc. Preprocessing is necessary for further analysis. Data reduction & transformation aims to filter out data noise and select the most discriminative features from the samples; several statistical methods are applied here.
After data reduction & transformation, a classification algorithm builds a model that discriminates one class of samples from the other. Many machine learning algorithms could be applied, such as decision trees, support vector machines, and neural networks. This research selects the neural network (back-propagation network, BPN) as the data mining technique to build predictive models. Finally, we evaluate the results.
3.2 Schedule
3.3 Research tool
We will use MATLAB 2008a with the Neural Network Toolbox as the modeling tool to design and simulate neural networks. Neural Network
Toolbox extends MATLAB with tools for designing, implementing, visualizing, and
simulating neural networks. Neural networks are invaluable for applications
where formal analysis would be difficult or impossible, such as pattern
recognition and nonlinear system identification and control. Neural Network
Toolbox software provides comprehensive support for many proven network
paradigms, as well as graphical user interfaces (GUIs) that enable you to
design and manage your networks. The modular, open, and extensible design of
the toolbox simplifies the creation of customized functions and networks.
Figure 3: The Neural Network Fitting Tool (top) and a performance plot.
3.4 Evaluation
The performances are evaluated according to the
arithmetic mean of the AUC for the three tasks (churn, appetency, and
up-selling). The prediction of each target
variable is thought of as a separate classification problem. The results of
classification, obtained by thresholding the prediction score, may be
represented in a confusion matrix, where tp (true positive), fn (false
negative), tn (true negative) and fp (false positive) represent the number of
examples falling into each possible outcome:
                      Prediction
                  Class +1   Class -1
Truth  Class +1      tp         fn
       Class -1      fp         tn
The sensitivity (also called true positive rate or hit rate) and the specificity (true negative rate) are defined as:
Sensitivity = tp/pos
Specificity = tn/neg
where pos=tp+fn is the total number of positive examples and neg=tn+fp the total
number of negative examples.
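These definitions translate directly into code; the confusion-matrix counts below are made-up examples, not results from the dataset.

```python
# Sensitivity and specificity from confusion-matrix counts,
# following the definitions above.

def sensitivity_specificity(tp, fn, tn, fp):
    pos = tp + fn          # total positive examples
    neg = tn + fp          # total negative examples
    return tp / pos, tn / neg

# Hypothetical counts: 50 positives (40 caught), 100 negatives (80 caught).
sens, spec = sensitivity_specificity(tp=40, fn=10, tn=80, fp=20)
# sens = 40/50 = 0.8, spec = 80/100 = 0.8
```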
The results are evaluated using the Area Under the Curve (AUC). It corresponds to the area under the curve obtained by plotting sensitivity against specificity while varying a threshold on the prediction values to determine the classification result.
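One way to compute this area is via the rank interpretation of the AUC: the probability that a randomly chosen positive example receives a higher prediction score than a randomly chosen negative one, with ties counted as half. This is a sketch of that computation, not the official KDD Cup scoring code; the scores and labels below are made-up examples.

```python
# AUC as the fraction of (positive, negative) pairs ranked correctly,
# which equals the area obtained by sweeping a threshold over the scores.

def auc(scores, labels):
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == -1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 2 positives (+1) and 2 negatives (-1).
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, -1, 1, -1]
# Correctly ranked pairs: 3 of 4, so the AUC is 0.75.
```

A random predictor scores 0.5 by this measure, and a perfect ranking scores 1.0, which is why the mean AUC over the three tasks is a reasonable overall criterion.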
4. References
[1] S.-Y. Hung, D.C. Yen, H.-Y. Wang, "Applying data mining to telecom churn management", Expert Systems with Applications, vol. 31, pp. 515-524, 2006.
[2] Rumelhart, D. E., McClelland, J. L., & The PDP Research Group (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. London: The MIT Press.
[3] Rumelhart, D.E., Hinton, G.E., Williams, R.J., 1986. Learning representations by back-propagating errors. Nature 323, 533-536.