Contents

- Abstract
- About the proposal research
- Reference
- Schedules



The first task in KDD Cup 2009 concerns Customer Relationship Management (CRM), which is a key element of modern marketing strategies. The KDD Cup 2009 offers the opportunity to work on large marketing databases from the French telecom company Orange.

We will estimate the churn, appetency, and up-selling probability of customers:

1. To predict the propensity of customers to switch provider (churn)

2. To predict whether customers will buy new products or services (appetency)

3. To predict whether customers will buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling)

This dataset has a large number of variables (15,000) and an abundance of instances (50,000), so it is very difficult to process. We therefore had to make a choice: either use a more powerful machine, such as a server or a supercomputer, or work on a sample of the original data, obtained by random sampling or uniform sampling. Of course, I chose the latter.


About the proposal research


SIGKDD is the Association for Computing Machinery's Special Interest Group on Knowledge Discovery and Data Mining. Since 1995, SIGKDD has hosted the annual SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD 2009 will take place in Paris, France. SIGKDD sponsors the KDD Cup competition every year in conjunction with the annual conference. It is aimed at members of the industry and academia, particularly students, interested in KDD.







This is my main flowchart:



There are five steps in my approach.

Step 1 Feature Selection

In this step, I will explain why the selected features are better than the others. There are a large number of attributes, so deciding which attributes to use is important, and it is also crucial to determine how many attributes to keep, because we have to balance performance against hardware resources.
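As one possible illustration of such a filter, attributes can be ranked by variance and only the top k kept. This is a minimal pure-Python sketch under my own assumptions; the function names, the toy data, and the choice of variance as the ranking criterion are hypothetical, not part of the actual KDD Cup pipeline:

```python
def column_variance(rows, j):
    """Variance of column j, ignoring missing values (None)."""
    vals = [r[j] for r in rows if r[j] is not None]
    if len(vals) < 2:
        return 0.0
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)

def select_features(rows, k):
    """Keep the indices of the k columns with the highest variance."""
    n_cols = len(rows[0])
    ranked = sorted(range(n_cols),
                    key=lambda j: column_variance(rows, j),
                    reverse=True)
    return sorted(ranked[:k])

# Toy data: column 1 is constant, so it carries no information.
data = [
    [1.0, 5.0, 0.1],
    [2.0, 5.0, 0.2],
    [3.0, 5.0, None],
    [4.0, 5.0, 0.1],
]
print(select_features(data, 2))  # → [0, 2]
```

A filter like this is cheap enough to run on all 15,000 variables before any sampling, which is why it comes first in the flow.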


Step 2 Preprocessing Procedure

Because the data set is large, it is a good idea to take a sample. We consider two sampling methods: random sampling (choosing instances at random) and uniform sampling (choosing instances at uniform intervals). I do not know which one is the better approach, so I must run experiments to find out. In addition, normalization is indispensable, because the data values may fluctuate abnormally. Normalization maps each value into the range [0, 1], and missing values are filled with -1.
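The two sampling methods and the normalization rule described above can be sketched as follows. This is a minimal pure-Python illustration; the helper names and the toy data are my own assumptions, and each column is assumed to have at least one observed value:

```python
import random

def random_sample(rows, n, seed=0):
    """Random sampling: pick n instances uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(rows, n)

def uniform_sample(rows, n):
    """Uniform sampling: take every k-th instance at fixed intervals."""
    step = len(rows) // n
    return rows[::step][:n]

def normalize(rows):
    """Scale each column into [0, 1]; fill missing values (None) with -1."""
    n_cols = len(rows[0])
    out = [list(r) for r in rows]
    for j in range(n_cols):
        vals = [r[j] for r in rows if r[j] is not None]
        lo, hi = min(vals), max(vals)
        span = (hi - lo) or 1.0  # guard against constant columns
        for r in out:
            r[j] = -1.0 if r[j] is None else (r[j] - lo) / span
    return out

print(normalize([[0, 10], [5, None], [10, 20]]))
# → [[0.0, 0.0], [0.5, -1.0], [1.0, 1.0]]
```

Fixing the random seed makes the experiments comparing the two sampling methods repeatable.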


Step 3 Training Data

In the training procedure, I will use the data set orange_large_train.data.chunk, with the MultilayerPerceptron tool in Weka or the corresponding tools in MATLAB. By comparing the neural approaches, we can also find out which training rules are better. Once the network is stable, the weights no longer need to be adjusted.
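To make the idea of a training rule and of a "stable" network concrete, here is a single-unit perceptron trained with the classic perceptron update, written in pure Python. This is only a conceptual stand-in for Weka's MultilayerPerceptron, not its implementation; the toy OR data, the learning rate, and the function names are all hypothetical:

```python
def train_perceptron(samples, labels, epochs=20, lr=0.1):
    """Perceptron rule: adjust weights on each misclassified sample."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        changed = False
        for x, y in zip(samples, labels):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred
            if err:
                changed = True
                w = [wi + lr * err * xi for wi, xi in zip(w, x)]
                b += lr * err
        if not changed:
            # Stable: a full pass changed no weights, so training can stop.
            break
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Toy linearly separable data (logical OR).
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 1]
w, b = train_perceptron(X, y)
print([predict(w, b, x) for x in X])  # → [0, 1, 1, 1]
```

The early-stopping check mirrors the observation above: when the network is stable, the weights need no further adjustment.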


Step 4 Prediction

In the prediction step, I will use the data set orange_large_test.data.chunk.


Step 5 Analysis

This is the last step; here we can analyze many things:

1.   The distribution of the data set

2.   The problems of selecting a few attributes instead of many

3.   Biased sampling problems

4.   Time complexity

5.   Neural network approaches

(1) perceptron

(2) Linear Filters

(3) Backpropagation

(4) Radial Basis Networks

(5) Self-Organizing Map

(6) Learning Vector Quantization Network

(7) Recurrent Network

(8) Adaptive Linear Network

Backpropagation (BP) is the most commonly used.

6.   Which approach is better

7.   The precision and recall

8.   Is the result good or bad, and why?
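For the precision and recall analysis in item 7, both measures follow directly from the confusion counts (precision = TP/(TP+FP), recall = TP/(TP+FN)). A minimal sketch, with hypothetical toy labels:

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy example: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1]
print(precision_recall(y_true, y_pred))
```

For rare targets such as churn, precision and recall are more informative than plain accuracy, since predicting "no churn" for everyone already scores high accuracy.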




Items list                        Due date

1. Survey related papers

2. Make the feature selection

3. Preprocessing procedure

4. Training data

5. Predict data