Academic Year 97 (2008–09), Second Semester — Neural Networks Research Proposal (Computer Science Master's Student, Year 1, M9715053, Wang Yi-Hsun)

Part I. Abstract of the research proposal (Chinese and English): Please give an overview of the main points of this proposal and define keywords according to its nature. (Within 500 words)

 

1.       Background

Because the customer base is the key factor in company profitability, every company must do its best to keep old customers and find new ones. There are many marketing strategies for this, such as CRM (Customer Relationship Management), in which a company interacts closely with its customers to understand and influence their behavior. It is a business-management approach for raising customer acquisition, customer retention, customer loyalty, and customer profitability.

2.       Task Description

The task is to estimate the probabilities of the three types of customers: there are three target values to be predicted, and a large number of variables (15,000) are made available for prediction. The data set includes numerical and categorical variables and has unbalanced class distributions, so time efficiency is a crucial point.

3.       Data Set

(1)     Instances: 50,000 (in both the training set and the testing set)

(2)     Variables: 14,740 numerical and 260 categorical

(3)     Types: churn, appetency, up-selling

(4)     Labels: 1 refers to positive and -1 to negative for each type

(5)     There are some missing values in the data set

(6)     Churn, appetency, and up-selling are three separate binary classification problems.

4.       Three types of customers

(1)   Churn: churn rate is one of the two primary factors that determine the steady-state level of customers a business will support. In its broadest sense, it is a measure of the number of individuals or items moving into or out of a collection over a specific period of time.

(2)   Appetency: The appetency is the propensity to buy a service or a product.

(3)   Up-selling: a sales technique whereby a salesperson attempts to have the customer purchase more expensive items, upgrades, or other add-ons in order to make a more profitable sale.

 

Part II. Research proposal content:

(a) Background and objectives of the research proposal. Please describe in detail the background, objectives, and importance of this research, the state of related research domestically and abroad, and a review of important references.

(b) Research methods, procedure, and schedule. Please list: 1. the research methods adopted in this proposal and the reasons for them; 2. anticipated difficulties and how they will be resolved.

(c) Expected work items and results. Please list: 1. the work items expected to be completed.

1.       Task Description

(1)   The task is to estimate the probabilities of the three types of customers. The data set comes from large marketing databases of the French telecom company Orange; the goal is to predict the propensity of customers to switch provider (churn), buy new products or services (appetency), or buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling). Hence, there are three target values to be predicted, and a large number of variables (15,000) are made available for prediction. The data set includes numerical and categorical variables and has unbalanced class distributions.

(2)   The main objective is to make good predictions of the target variables.

(3)   Predictions are evaluated according to the arithmetic mean of the AUC for the three labels (churn, appetency, and up-selling).

l  Sensitivity and specificity: the official rules define sensitivity (also called true positive rate or hit rate) and specificity (true negative rate) as follows:

 

Sensitivity = tp/pos

Specificity = tn/neg

 

where pos = tp + fn is the total number of positive examples and neg = tn + fp is the total number of negative examples.

l   Area Under Curve (AUC): it corresponds to the area under the curve obtained by plotting sensitivity against specificity while varying a threshold on the prediction values that determines the classification result.

[Figure: ROC curve]
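These two quantities and the final score can be sketched in a few lines of plain Python. The labels and prediction scores below are made up purely for illustration; the rank-based AUC used here is equivalent to the area under the ROC curve.

```python
def sensitivity_specificity(labels, preds, threshold):
    # Labels use the challenge convention: 1 = positive, -1 = negative.
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p >= threshold)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p < threshold)
    tn = sum(1 for y, p in zip(labels, preds) if y == -1 and p < threshold)
    fp = sum(1 for y, p in zip(labels, preds) if y == -1 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)   # tp/pos, tn/neg

def auc(labels, preds):
    # Rank-based AUC: the probability that a random positive example
    # is scored above a random negative one (ties count half).
    pos = [p for y, p in zip(labels, preds) if y == 1]
    neg = [p for y, p in zip(labels, preds) if y == -1]
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

# Toy example (illustrative data, not from the challenge):
labels = [1, 1, -1, -1, -1]
preds = [0.9, 0.4, 0.6, 0.2, 0.1]
sens, spec = sensitivity_specificity(labels, preds, 0.5)
# The final challenge score would be the arithmetic mean of the three
# per-task AUCs: (auc_churn + auc_appetency + auc_upselling) / 3
```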

2.       Method

To process the large data set, the first thing to do is to reduce the data size. By analyzing the data set, I discovered several things, described below.

(1)     Data Sets

n   Small Training Set

l   Appetency: 890

l   Churn: 3672

l   Up-selling: 3682

l   Non-target: 41756 (83%)

l   Null column: 18

l   Each instance has a label for each of the three targets

l   In some columns, almost all values are divisible by a common integer

l   Many columns are positively correlated

n   Large Training Set

l   Appetency: 890

l   Churn: 3672

l   Up-selling: 3682

l   Non-target: 41756

l   Null column: 107

l   Each instance has a label for each of the three targets

n   The bias of the training data set is very obvious.
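The imbalance noted above can be checked directly from the listed counts (treating the positive classes as disjoint, which the counts suggest, since they sum exactly to 50,000):

```python
# Label counts of the training set, as listed above.
counts = {"appetency": 890, "churn": 3672,
          "up-selling": 3682, "non-target": 41756}

total = sum(counts.values())                      # 50000 instances
fractions = {k: v / total for k, v in counts.items()}
# Non-target instances alone account for roughly 83.5% of the set,
# which is the bias referred to above.
```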

(2)     Data Preprocessing

n   Data Cleaning

l   Delete null columns

n   Nominal to Numeric

l   Convert each string by concatenating the ASCII codes of its characters

l   Ex: ra_5c → 11497955399

n   Normalization

l   Standard deviation normalization

l   To fill missing values

n   Continuous values divided by some integer

l   Some attributes' values can be divided evenly by an integer
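The nominal-to-numeric conversion and the normalization step above can be sketched in plain Python. The column values are invented, and filling missing values with the column mean (zero after normalization) is one simple assumed strategy:

```python
import math

def nominal_to_numeric(s):
    # Concatenate the ASCII code of each character, e.g. "ra_5c" -> 11497955399.
    return int("".join(str(ord(ch)) for ch in s))

def normalize(column):
    # Standard-deviation (z-score) normalization over the known values;
    # missing values (None) are filled with the column mean, which becomes
    # 0 after normalization (an assumed but common filling choice).
    known = [v for v in column if v is not None]
    mean = sum(known) / len(known)
    std = math.sqrt(sum((v - mean) ** 2 for v in known) / len(known))
    return [0.0 if v is None else (v - mean) / std for v in column]

print(nominal_to_numeric("ra_5c"))   # 11497955399, as in the example above
print(normalize([1.0, None, 3.0]))   # [-1.0, 0.0, 1.0]
```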

(3)     Feature Selection

n   Various attributes: 51 attributes

[Figure: distribution of Var38, an example of a various attribute]

n   Monotonous attributes: 161 attributes

[Figure: distribution of Var40, an example of a monotonous attribute]

n   GainRatio score

l   pick the top 15 attributes

n   Normalized numeric attributes only + 3 labels

l   174 attributes

n   For Large data set

l   Min, Max, Mean, Stdev

l   Null count, Numeric count, Nominal count
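The GainRatio scoring can be illustrated on a toy example; this mirrors what Weka's GainRatioAttributeEval computes, but the attributes below are invented for illustration:

```python
import math
from collections import Counter

def entropy(values):
    # Shannon entropy of a list of discrete values, in bits.
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(attr_values, labels):
    # GainRatio = InformationGain / SplitInformation.
    n = len(labels)
    groups = {}
    for v, y in zip(attr_values, labels):
        groups.setdefault(v, []).append(y)
    info_gain = entropy(labels) - sum(
        len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy(attr_values)   # intrinsic information of the split
    return info_gain / split_info if split_info > 0 else 0.0

# Invented toy attributes: a1 separates the classes perfectly, a2 is noise.
labels = [1, 1, -1, -1]
scores = {"a1": gain_ratio(["x", "x", "y", "y"], labels),
          "a2": gain_ratio(["x", "y", "x", "y"], labels)}
top = sorted(scores, key=scores.get, reverse=True)   # rank, then keep top k
```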

(4)     Training Model

l   MultilayerPerceptron

l   A classifier that uses backpropagation to classify instances. This network can be built by hand, created by an algorithm or both. The network can also be monitored and modified during training time. The nodes in this network are all sigmoid.

l   BayesNet

(5)     Evaluation

l   The baseline is Naïve Bayes, whose performance is shown below.

[Figure: AUC results of the Naïve Bayes baseline]
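The kind of Naïve Bayes baseline used here can be sketched as a stdlib-only Gaussian Naive Bayes; the one-feature toy data is invented, and the real baseline was of course run on the Orange data with an off-the-shelf implementation:

```python
import math

def fit_nb(X, y):
    # Per class: prior, plus per-feature mean and variance (Gaussian assumption).
    stats = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        cols = list(zip(*rows))
        means = [sum(c) / len(c) for c in cols]
        vars_ = [max(sum((v - m) ** 2 for v in c) / len(c), 1e-9)
                 for c, m in zip(cols, means)]
        stats[label] = (len(rows) / len(X), means, vars_)
    return stats

def predict_nb(stats, x):
    # Pick the class with the highest log posterior (up to a constant).
    def log_posterior(prior, means, vars_):
        lp = math.log(prior)
        for v, m, s2 in zip(x, means, vars_):
            lp += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return lp
    return max(stats, key=lambda label: log_posterior(*stats[label]))

# Invented one-feature toy data with the challenge's 1/-1 labels:
stats = fit_nb([(0.0,), (0.2,), (5.0,), (5.2,)], [-1, -1, 1, 1])
```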

(6)     Challenge

l   It is difficult to process the large data set efficiently, even with simple analyses such as feature distributions. A personal computer cannot handle the large data in a short time, so I will reduce the data size first.

l   Second, I have no CRM background knowledge; in fact, selecting features at random could result in bad performance.

l   Hence, I must deal with both the data bias and correct feature selection to achieve good performance.

3.       Experiment Environment & Results

(1)     Preprocessing:

l   Re-label: appetency → a, churn → c, up-selling → u, the others → n

(2)     Results

l   51 (various) / 161 (monotonous) attributes: the effective attributes are hidden among the various and monotonous attributes.

l   MultiLayerPerceptron(1): 50000 instances, 212 normalized attributes

l   MultiLayerPerceptron(2): 50000 instances, 174 normalized numerical attributes

l   It seems that converting nominal values to numeric with ASCII codes does not work.

 

l   BayesNet(1): more information is hidden within the various attributes.

l   BayesNet(2)

l   Obviously, BayesNet is better adapted to the data set than MultiLayerPerceptron.

l   BayesNet(3): The result is amazing, but I can't apply the same processes to the testing data set. Besides, it is obviously over-fitting.

l   MultiLayerPerceptron(4): Large Training Data Set: My strategy doesn’t work.

4.       Conclusion

(1)     Maybe categorical attributes are the key factor

(2)     Overcome the bias of the data set without any background knowledge

(3)     Adjust parameters to adapt to the data set

(4)     Maybe a Probabilistic Neural Network (PNN) is applicable

(5)     Different target, Different model

(6)     Understanding the tool's parameters is important
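Conclusion point (4) can be made concrete: a Probabilistic Neural Network is essentially a Parzen-window classifier. A minimal sketch, with invented training points and an arbitrarily chosen smoothing parameter sigma:

```python
import math

def pnn_predict(train, x, sigma=1.0):
    # train: list of (feature_vector, label) pairs.
    # Each class's summation unit adds up Gaussian kernels centred on its
    # training points; the class with the highest average activation wins.
    sums, counts = {}, {}
    for xi, y in train:
        d2 = sum((a - b) ** 2 for a, b in zip(xi, x))
        sums[y] = sums.get(y, 0.0) + math.exp(-d2 / (2 * sigma ** 2))
        counts[y] = counts.get(y, 0) + 1
    return max(sums, key=lambda y: sums[y] / counts[y])

# Invented toy points with the challenge's 1/-1 labels:
train = [((0.0, 0.0), -1), ((0.0, 1.0), -1), ((5.0, 5.0), 1)]
print(pnn_predict(train, (4.0, 4.0)))   # 1
```

Unlike a backpropagation MLP, a PNN needs no iterative training, which fits the time-efficiency concern, at the cost of keeping the training points in memory.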

5.       References

(1)     Applications and Implementation of Neural Network Models (類神經網路模式應用與實作), I-Cheng Yeh, Rulin Books (in Chinese)

(2)     Training set optimization methods for a probabilistic neural network, Mark H. Hammond, Chemometrics and Intelligent Laboratory Systems 71, 2004, pp. 73–78

(3)     MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, OSDI'04: Sixth Symposium on Operating System Design and Implementation, 2004, pp. 107–113