Computer Science, first-year master's student   M9715044   王瑞遠

Neural Networks Course Research Proposal, Spring Semester of the 2008–2009 Academic Year

Outline

    Abstract

    My solution

    References

 

I. Abstract of the research proposal (Chinese and English): Give an overview of the main points of this proposal and define keywords appropriate to its nature. (Within 500 words)

Abstract

This is a proposal for the homework assignment in the Neural Networks course. The homework is based on KDD Cup 2009: I will use neural network methods to classify the KDD Cup 2009 dataset, which concerns Customer Relationship Management (CRM). CRM is a key element of modern marketing strategies. KDD Cup 2009 offers the opportunity to work on large marketing databases from the French telecom company Orange to predict the propensity of customers to switch providers (churn), buy new products or services (appetency), or buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling).

In a CRM system, building knowledge about customers amounts to producing scores. The output of a model is an evaluation, for all instances, of a target variable to explain. Tools that produce scores make it possible to project quantifiable information onto a given population. The score is computed from input variables that describe the instances. Scores are then used by the information system, for example, to personalize the customer relationship. Orange Labs has developed an industrial customer analysis platform able to build prediction models with a very large number of input variables. This platform implements several processing methods for instance and variable selection, prediction, and indexation, based on an efficient model combined with variable selection regularization and model averaging. The main characteristic of this platform is its ability to scale to very large datasets with hundreds of thousands of instances and thousands of variables. The rapid and robust detection of the variables that contribute most to the output prediction can be a key factor in a marketing application.

Because there are 15,000 attributes and 50,000 instances in the dataset, the challenge is to deal with a very large database containing both numerical and categorical variables and unbalanced class distributions. Time efficiency is also a crucial point, so part of the competition is time-constrained to test the participants' ability to deliver solutions quickly.

 

II. Research proposal content:

(1) Background and objectives of the research. Describe in detail the background, objectives, and importance of this research, the state of related research in Taiwan and abroad, and a review of important references.

(2) Research methods, steps, and schedule. List: 1. the research methods adopted in this project and the reasons for choosing them; 2. the difficulties expected and the ways to resolve them.

(3) Expected work items and results. List: 1. the work items expected to be completed.

My solution

  The KDD Cup 2009 dataset is about Customer Relationship Management, which is an important issue for any company. Handling CRM well helps a company recognize whether a customer will keep buying its products or switch to another company. Working through this dataset will therefore help me learn how to handle this kind of problem; moreover, because the dataset is very large, I must use a different approach to deal with it.

  Here is my solution for dealing with the KDD dataset; I separate it into five steps:

1. Feature Selection

2. Data Preprocessing and Data Sampling

3. Classification

4. Prediction

5. Analyze the Prediction Result

Figure 1. The workflow of my solution.

 

1. Feature Selection

Because there are 15,000 attributes in the dataset, there must be some useless attributes among them, so the first thing that comes to my mind is feature selection. After inspecting the dataset, I do not see any attribute names or labels, and I also find that some attributes have no values at all, so I delete those first. However, about 14,000 attributes still remain, which is still a lot of work. Because I do not know the name of each attribute, I cannot tell how many important attributes the dataset contains.

My first idea is that, after deleting the attributes with no values, I will keep one attribute out of every ten. In other words, fewer than 1,400 attributes will remain.

After doing this, I found that the data distributions in the small datasets are similar to those in the full dataset. I then ran feature selection on the small datasets in Weka and selected 61 attributes for appetency, 61 for churn, and 85 for up-selling.
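
As an illustrative sketch of this step, the attribute selection could also be run programmatically through the Weka Java API. The file name, the CfsSubsetEval evaluator with a GreedyStepwise search, and the assumption that the class label is stored as the last attribute are my own choices for the example, not fixed parts of the plan.

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.GreedyStepwise;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.RemoveUseless;

public class SelectFeatures {
    public static void main(String[] args) throws Exception {
        // Load one of the small datasets (hypothetical file name).
        Instances data = DataSource.read("appetency_small.arff");
        data.setClassIndex(data.numAttributes() - 1); // assume the label is the last attribute

        // Drop attributes that do not vary at all (e.g. constant columns).
        RemoveUseless clean = new RemoveUseless();
        clean.setInputFormat(data);
        data = Filter.useFilter(data, clean);

        // Correlation-based subset evaluation with a greedy stepwise search.
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval());
        selector.setSearch(new GreedyStepwise());
        selector.SelectAttributes(data);

        // Indices of the kept attributes (the class index is included in the result).
        int[] kept = selector.selectedAttributes();
        System.out.println("Selected " + (kept.length - 1) + " attributes");
    }
}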

 

2. Data Preprocessing and Data Sampling

After feature selection, there are still 50,000 instances in the dataset.

To reduce the size of the dataset, I will sample it. First I will cut the dataset into 10 small datasets of 5,000 instances each. I will then sample each of the 10 small datasets and combine the sampled subsets back into a single dataset. This might not be the best sampling strategy, but I will try it first and look for a better approach if needed.
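
The following is a minimal sketch, using the Weka API, of how this chunk-and-sample idea could look; the file name, the 20% sample size, and the bias toward a uniform class distribution are illustrative assumptions rather than settled choices.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.Resample;

public class ChunkedSampling {
    public static void main(String[] args) throws Exception {
        // Load the feature-selected training data (hypothetical file name).
        Instances full = DataSource.read("train_selected.arff");
        full.setClassIndex(full.numAttributes() - 1); // assume the label is the last attribute

        int chunks = 10;
        int chunkSize = full.numInstances() / chunks;  // 5,000 instances per chunk for 50,000 rows
        Instances combined = new Instances(full, 0);   // empty set with the same header

        for (int i = 0; i < chunks; i++) {
            // Copy one 5,000-instance slice of the full dataset.
            Instances chunk = new Instances(full, i * chunkSize, chunkSize);

            // Keep 20% of the slice, biased toward a uniform class distribution
            // (both numbers are illustrative, not fixed by the plan).
            Resample sampler = new Resample();
            sampler.setSampleSizePercent(20.0);
            sampler.setBiasToUniformClass(1.0);
            sampler.setInputFormat(chunk);
            Instances sampled = Filter.useFilter(chunk, sampler);

            // Merge the sampled slice back into one combined dataset.
            for (int j = 0; j < sampled.numInstances(); j++) {
                combined.add(sampled.instance(j));
            }
        }
        System.out.println("Combined sample size: " + combined.numInstances());
    }
}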

3. Classification

In this phase, I will use the MultilayerPerceptron method in Weka, or other tools in Matlab, to train on the preprocessed dataset, and I will use Naïve Bayes as my baseline.

   While doing this, I used Bayesian networks to train on the small data; when I wanted to train the dataset with MultilayerPerceptron, my computer could not handle that much data, so I may need to do more sampling or use fewer attributes.
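
A sketch of this training step with the Weka API is shown below; the file name and every MultilayerPerceptron parameter are illustrative values for the example, not tuned settings, and class index 1 is assumed to be the positive class when reporting AUC.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainModels {
    public static void main(String[] args) throws Exception {
        // Sampled, feature-selected training data (hypothetical file name).
        Instances train = DataSource.read("train_sampled.arff");
        train.setClassIndex(train.numAttributes() - 1); // assume the label is the last attribute

        // Naive Bayes baseline, estimated with 10-fold cross-validation.
        NaiveBayes baseline = new NaiveBayes();
        Evaluation nbEval = new Evaluation(train);
        nbEval.crossValidateModel(baseline, train, 10, new Random(1));
        System.out.println("Naive Bayes AUC: " + nbEval.areaUnderROC(1));

        // Multilayer perceptron; these parameter values are illustrative only.
        MultilayerPerceptron mlp = new MultilayerPerceptron();
        mlp.setHiddenLayers("20");  // one hidden layer with 20 units
        mlp.setLearningRate(0.1);
        mlp.setMomentum(0.2);
        mlp.setTrainingTime(100);   // number of training epochs

        Evaluation mlpEval = new Evaluation(train);
        mlpEval.crossValidateModel(mlp, train, 10, new Random(1));
        System.out.println("MLP AUC: " + mlpEval.areaUnderROC(1));

        // Train the final model on all of the sampled data and save it for prediction.
        mlp.buildClassifier(train);
        weka.core.SerializationHelper.write("mlp.model", mlp);
    }
}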

4. Prediction

After training, I will obtain several models (neural network methods and Bayesian networks). The KDD Cup 2009 test set will then be predicted with these models.
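
As a small sketch of this step (again with the Weka API), the saved model can be loaded and applied to the test instances; the file names and the use of class index 1 as the positive class are assumptions made for the example.

import weka.classifiers.Classifier;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class PredictTestSet {
    public static void main(String[] args) throws Exception {
        // The test set must have the same attribute structure as the training data
        // (hypothetical file name).
        Instances test = DataSource.read("test_selected.arff");
        test.setClassIndex(test.numAttributes() - 1);

        // Reload the model saved after training.
        Classifier model = (Classifier) SerializationHelper.read("mlp.model");

        for (int i = 0; i < test.numInstances(); i++) {
            // Estimated probability of the positive class for each test instance.
            double[] dist = model.distributionForInstance(test.instance(i));
            System.out.println(dist[1]);
        }
    }
}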

    Prediction results (to be filled in after the experiments):

      Small data, training results:    Appetency      Churn      Upselling

      Small data, testing results:     Appetency      Churn      Upselling

 

5. Analyze the Prediction Result

After training and testing, analyzing the prediction results is an important way to tell whether the methods I propose are good or not. In this phase I will check the prediction results; if the precision is not good enough, I will try to find out why the method produced poor results. This will also help me improve at finding and solving problems.
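
As an illustrative sketch of this analysis step, the Weka Evaluation class can score a saved model on a labelled hold-out split and print the per-class precision, recall, and AUC that this comparison needs; the file names here are hypothetical.

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class AnalyzeResults {
    public static void main(String[] args) throws Exception {
        // Training data and a labelled hold-out split kept aside from training
        // (hypothetical file names).
        Instances train = DataSource.read("train_sampled.arff");
        Instances holdout = DataSource.read("holdout.arff");
        train.setClassIndex(train.numAttributes() - 1);
        holdout.setClassIndex(holdout.numAttributes() - 1);

        // Load the trained model and evaluate it on the hold-out data.
        Classifier model = (Classifier) SerializationHelper.read("mlp.model");
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(model, holdout);

        // Per-class precision, recall, and AUC show whether the method works well.
        System.out.println(eval.toClassDetailsString());
        System.out.println("AUC: " + eval.areaUnderROC(1));
    }
}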

 

References

KDD Cup 2009 (Orange): http://www.kddcup-orange.com/

Weka machine learning toolkit: http://www.cs.waikato.ac.nz/ml/weka/