Neural Networks Research Proposal, Second Semester of Academic Year 97 (Computer Science, first-year master's student M9715053, 王奕勛)

I. Abstract of the research proposal (Chinese and English): give an overview of the main points of this proposal, and define keywords according to its nature. (500 words or fewer)

 

1.       Background

Because the customer base is the key factor in company profitability, every company must do its best to retain old customers and find new ones. Many marketing strategies address this, such as CRM (Customer Relationship Management), in which a company interacts closely with its customers to understand and influence their behavior. It is a business-management approach for raising customer acquisition, customer retention, customer loyalty, and customer profitability.

2.       Task Description

The task is to estimate the probabilities of the three types of customer behavior: there are three target values to be predicted, and a large number of variables (15,000) are made available for prediction. The data set includes numerical and categorical variables and unbalanced class distributions, so time efficiency is a crucial point.

3.       Data Set

(1)     Instances: 50,000 (in both the training set and the testing set)

(2)     Variables: 14,740 numerical and 260 categorical

(3)     Types: churn, appetency, up-selling

(4)     Labels: 1 refers to positive and -1 refers to negative for each type

(5)     The data set contains missing values

(6)     Churn, appetency, and up-selling are three separate binary classification problems.

4.       Three types of customers

(1)   Churn: one of the two primary factors that determine the steady-state number of customers a business will support. In its broadest sense, churn rate measures the number of individuals or items moving into or out of a collection over a specific period of time.

(2)   Appetency: the propensity to buy a service or a product.

(3)   Up-selling: a sales technique in which a salesperson encourages the customer to purchase more expensive items, upgrades, or other add-ons in order to make a more profitable sale.


II. Content of the research proposal:

(1) Background and objectives of the research proposal. Describe in detail the background, objectives, and importance of this research, the state of related research at home and abroad, and a review of key references.

(2) Research methods, procedure, and schedule. List: 1. the research methods adopted in this proposal and the reasons for them; 2. anticipated difficulties and the means of resolving them.

(3) Expected work items and results. List: 1. the expected work items.

1.       Task Description

(1)   The task is to estimate the probabilities of the three types of customer behavior. The data set comes from the large marketing databases of the French telecom company Orange and is used to predict the propensity of customers to switch providers (churn), buy new products or services (appetency), or buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling). Hence, there are three target values to be predicted, and a large number of variables (15,000) are made available for prediction. The data set includes numerical and categorical variables and unbalanced class distributions.

(2)   The main objective is to make good predictions of the target variables.

(3)   Submissions are evaluated according to the arithmetic mean of the AUC for the three labels (churn, appetency, and up-selling).

l  Sensitivity and specificity: the official rules define sensitivity (also called true positive rate or hit rate) and specificity (true negative rate) as follows:

 

Sensitivity = tp/pos

Specificity = tn/neg

 

where pos = tp + fn is the total number of positive examples and neg = tn + fp is the total number of negative examples.

l   Area Under Curve (AUC): It corresponds to the area under the curve obtained by plotting sensitivity against specificity by varying a threshold on the prediction values to determine the classification result.

[Figure: ROC curve]
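These definitions can be computed directly from prediction scores. The sketch below (pure Python; function and variable names are my own) implements sensitivity, specificity, and the rank-statistic form of AUC, which equals the probability that a random positive example is scored above a random negative one:

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = tp/pos and specificity = tn/neg for +1/-1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """AUC as a rank statistic: the fraction of positive/negative pairs
    in which the positive example receives the higher score (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == -1]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The competition score is then the arithmetic mean of `auc` over the churn, appetency, and up-selling tasks.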

2.       Method

To process the large data set, the first thing to do is to reduce its size. By analyzing the data set, I made some discoveries. The whole method includes data cleaning, uniform sampling, feature selection, and model training, which are described below.

(1)     Data Cleaning: reduce noisy data

l   The data set has 67 instances that contain no values at all.

l   1,000 entire columns contain no values at all.
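A minimal sketch of this cleaning step, assuming each instance is a list of strings with the empty string marking a missing value (names are illustrative):

```python
def clean(rows):
    """Drop instances with no values at all, then columns with no values at all."""
    # Remove instances in which every field is missing.
    rows = [r for r in rows if any(v != "" for v in r)]
    # Keep only columns that contain at least one value.
    keep = [j for j in range(len(rows[0])) if any(r[j] != "" for r in rows)]
    return [[r[j] for j in keep] for r in rows]
```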

(2)     Uniform Sampling

l  After examining the five chunked training data files, I found that they in fact have similar ratios of positive to negative examples, about 1:2.

l  By uniform sampling, I draw one instance out of every 10. This yields 5,000 instances.

[Figures: label distributions of the whole label set and of the chunk1 through chunk5 files, ordered from the upper left corner to the lower right corner.]
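The uniform-sampling step amounts to a simple stride over the instances; a sketch (names are illustrative):

```python
def uniform_sample(rows, step=10):
    """Keep every step-th instance in original order, which roughly
    preserves the class ratio of the full file."""
    return rows[::step]
```

Applied to a 50,000-instance file with step 10, this keeps 5,000 instances.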

 

(3)     Feature Selection

l   I will choose five features each from the numerical and the categorical attributes.

l  So far, the five candidate attributes among the first 100 attributes of each of the five chunked files are as below.

Chunk File Number     Candidate Attributes
1                     34, 32, 33, 37, 35
2                     10, 35, 34, 37, 38
3                     76, 33, 32, 36, 34
4                     100, 34, 32, 33, 37
5                     34, 32, 33, 37, 35

l  However, I will reselect features after reducing the data set size.
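The proposal does not fix a ranking criterion for choosing candidate attributes. One simple possibility, sketched here for numerical attributes only, is to rank them by absolute Pearson correlation with the +1/-1 label and keep the top five (function names are my own; `columns` is a list of per-attribute value lists):

```python
def pearson(xs, ys):
    """Pearson correlation; returns 0.0 for a constant column."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def top_k_features(columns, labels, k=5):
    """Rank columns by |correlation| with the label; return the top-k indices."""
    scored = [(abs(pearson(col, labels)), j) for j, col in enumerate(columns)]
    return [j for _, j in sorted(scored, reverse=True)[:k]]
```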

(4)     Training Model

l   MultilayerPerceptron

A classifier that uses backpropagation to classify instances. This network can be built by hand, created by an algorithm or both. The network can also be monitored and modified during training time. The nodes in this network are all sigmoid.

l   VotedPerceptron

An implementation of the voted perceptron algorithm by Freund and Schapire. It globally replaces all missing values and transforms nominal attributes into binary ones.

l   Features: five numerical and five categorical attributes.
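As a sketch of the second model, here is a minimal voted perceptron in the style of Freund and Schapire. This is an illustrative reimplementation, not Weka's VotedPerceptron, and it assumes features are already numeric with missing values filled:

```python
def train_voted_perceptron(X, y, epochs=10):
    """Train a voted perceptron on +1/-1 labels.
    Returns a list of (weight_vector, survival_count) pairs."""
    d = len(X[0])
    w = [0.0] * d
    c = 1                                   # how long the current weights survive
    V = []
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * sum(wj * xj for wj, xj in zip(w, xi)) <= 0:
                V.append((w[:], c))         # store old weights with their vote
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                c = 1
            else:
                c += 1
    V.append((w[:], c))
    return V


def predict_voted(V, x):
    """Sign of the vote-weighted sum of the stored perceptrons' predictions."""
    s = sum(c * (1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else -1)
            for w, c in V)
    return 1 if s > 0 else -1
```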

(5)     Evaluation

l   The baseline is Naïve Bayes, whose performance is shown below.

[Figure: AUC of the Naïve Bayes baseline]
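For reference, such a baseline can be produced with a minimal Gaussian Naïve Bayes over numerical features; the sketch below (illustrative names, not the exact baseline implementation) returns a log-odds score whose ranking yields the AUC:

```python
import math


def fit_gaussian_nb(X, y):
    """Per-class feature means/variances plus class priors for +1/-1 labels."""
    model = {}
    for cls in set(y):
        rows = [x for x, t in zip(X, y) if t == cls]
        means = [sum(col) / len(rows) for col in zip(*rows)]
        varis = [max(sum((v - m) ** 2 for v in col) / len(rows), 1e-9)
                 for col, m in zip(zip(*rows), means)]
        model[cls] = (len(rows) / len(y), means, varis)
    return model


def nb_score(model, x):
    """Log-odds of the positive class, usable as a ranking score."""
    def loglik(cls):
        prior, means, varis = model[cls]
        return math.log(prior) + sum(
            -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
            for xi, m, v in zip(x, means, varis))
    return loglik(1) - loglik(-1)
```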

(6)     Challenge

l   It is difficult to process the large data set efficiently, even with simple analyses such as feature distributions. A personal computer cannot handle such a large data set in a short time, so I will reduce the data size first.

l   Second, I have no CRM background knowledge; in fact, selecting features at random could result in bad performance.

l   Hence, I must address both data bias and correct feature selection to achieve good performance.