M9715042 何紹威, Computer Science, first-year master's student
Neural Networks Research Proposal, Second Semester, Academic Year 97 (Spring 2009)
This proposal outlines an approach to the KDD Cup 2009 challenge. The KDD Cup 2009 provides a large marketing database from the French telecom company Orange, posing a problem in Customer Relationship Management (CRM), a key element of modern marketing strategies. Orange Labs wants to predict the propensity of customers to switch providers (churn), to buy new products or services (appetency), and to buy proposed upgrades or add-ons that make a sale more profitable (up-selling).
The KDD Cup 2009 dataset is very large: 15,000 attributes and 50,000 items. It contains many noisy numerical and categorical variables, and the class distribution is unbalanced. Moreover, customer relationship management is time-constrained, so we need an efficient way to handle the data; efficiency is itself a crucial point in CRM.
Since the dataset is offered by the telecom company Orange, it is reasonable to guess that it is derived from phone logs, which likely contain a lot of noisy data. I therefore propose the following six steps to solve the problem.
This is a classification problem with three labels:
Churn: customers who tend to switch providers.
Appetency: customers who tend to buy new products or services.
Up-selling: customers who tend to buy proposed upgrades or add-ons that make a sale more profitable.
Fig 1. Flow chart
Step 1: removing noisy attributes
Some attributes have the same value in every item. These attributes carry no information we want, and keeping them would only add noise and reduce accuracy, so I will remove them.
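The removal of constant attributes can be sketched as follows; this is a minimal illustration in plain Python, and the variable names (`rows`, `remove_constant_attributes`) are my own, not from the actual KDD Cup data files.

```python
def remove_constant_attributes(rows):
    """rows: list of equal-length lists, one list per item.
    Returns (filtered_rows, kept_attribute_indices)."""
    if not rows:
        return rows, []
    n_attrs = len(rows[0])
    # keep only attributes that take more than one distinct value
    kept = [j for j in range(n_attrs)
            if len({row[j] for row in rows}) > 1]
    return [[row[j] for j in kept] for row in rows], kept

rows = [[1, 'a', 5],
        [2, 'a', 5],
        [3, 'a', 7]]
filtered, kept = remove_constant_attributes(rows)
# attribute 1 ('a' in every item) is dropped; attributes 0 and 2 remain
```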
Step 2: missing-value handling and data standardization
Data standardization raises two problems.
First, missing values have no defined representation, so I need to give them one. I will replace each missing value with a generic value: the average value of its attribute.
Second, the ranges of the attributes differ; some are large and some are small, so the values need to be standardized. For every attribute I find the maximum and minimum value, and map each value x to (x - minimum) / (maximum - minimum). After standardizing the dataset, I can perform data reduction.
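The two operations of Step 2, mean imputation followed by min-max scaling, can be sketched per attribute as below; the function name and the use of `None` for missing values are illustrative assumptions.

```python
def impute_and_scale(column):
    """Replace missing values (None) with the attribute mean,
    then min-max scale the column to [0, 1]."""
    present = [x for x in column if x is not None]
    mean = sum(present) / len(present)
    filled = [mean if x is None else x for x in column]
    lo, hi = min(filled), max(filled)
    if hi == lo:                      # constant attribute: map everything to 0.0
        return [0.0] * len(filled)
    return [(x - lo) / (hi - lo) for x in filled]

col = [2.0, None, 4.0, 6.0]           # mean of the present values is 4.0
print(impute_and_scale(col))          # [0.0, 0.5, 0.5, 1.0]
```

Note that imputing before scaling places every imputed value at the attribute mean, which keeps it inside the [0, 1] range by construction.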
Step 3: sampling
Because the dataset is very large, I want to obtain a small sample that represents the whole dataset, so I need to choose a representative subset of the data. Simple random sampling can perform very poorly in the presence of skew.
After inspecting the dataset, I found that the KDD Cup 2009 items appear to be distributed uniformly with respect to class label, so I chose uniform (systematic) sampling: I keep an item if its index modulo 10 is zero. Although sampling already removes much of the complexity, 15,000 attributes remain a big problem for a neural model.
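The systematic "index modulo 10" rule described above amounts to keeping every tenth item, a one-liner sketch:

```python
def systematic_sample(items, step=10):
    """Keep every item whose index is divisible by `step` (index % step == 0)."""
    return [item for i, item in enumerate(items) if i % step == 0]

# On 50,000 items this yields a 5,000-item sample (indices 0, 10, 20, ...).
sample = systematic_sample(list(range(50000)))
print(len(sample))  # 5000
```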
Step 4: attribute selection
Fifteen thousand attributes are too many; the time complexity would be too high, so I need to select the important attributes.
Since the data were already standardized in Step 2, I can now discretize them and use entropy to identify the important attributes. I expect to choose 10 to 50 attributes for classification.
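Entropy-based attribute ranking can be sketched as information gain over a discretized attribute: the reduction in class entropy obtained by splitting on that attribute. This is a minimal stdlib-only illustration with made-up toy values, not the actual Orange data.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attr_values, labels):
    """Class entropy minus the weighted entropy after splitting on the attribute."""
    n = len(labels)
    remainder = 0.0
    for v in set(attr_values):
        subset = [lab for a, lab in zip(attr_values, labels) if a == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# A perfectly predictive attribute has gain equal to the class entropy (here 1 bit):
attr   = ['low', 'low', 'high', 'high']
labels = ['churn', 'churn', 'stay', 'stay']
print(information_gain(attr, labels))   # 1.0
```

Ranking all attributes by this score and keeping the top 10 to 50 implements the selection described above.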
Step 5: classification
This is the most important part of this work. I will use WEKA to classify the data, with Naïve Bayes as the baseline, compared against a Multilayer Perceptron and J48. The most difficult part here is getting WEKA to handle such large data.
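The three classifiers above can also be run from WEKA's command line, which avoids the GUI's memory overhead on large data. The sketch below assumes `weka.jar` is in the current directory and the data has already been converted to ARFF format; `train.arff` is a placeholder filename.

```shell
# Baseline: Naive Bayes, 10-fold cross-validation on the training file.
# Increase the JVM heap (-Xmx) as needed for large data.
java -Xmx2g -cp weka.jar weka.classifiers.bayes.NaiveBayes -t train.arff -x 10

# Comparison models: Multilayer Perceptron and the J48 decision tree.
java -Xmx2g -cp weka.jar weka.classifiers.functions.MultilayerPerceptron -t train.arff -x 10
java -Xmx2g -cp weka.jar weka.classifiers.trees.J48 -t train.arff -x 10
```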
Step 6: data analysis
After classification, whether the results are good or not, I will find the reasons behind them and analyze them.
These are my results on the small dataset of KDD Cup 2009. For feature selection I used Gain Ratio, keeping 40 to 60 numeric variables and 34 categorical variables; the model was trained with a Bayes Net classifier.
KDD cup 2009 http://www.kddcup-orange.com/