Neural Networks, Spring Semester, Academic Year 97 — Research Proposal (CS Master's Program, Year 1, M9715053, Wang)

I. Abstract of the research proposal (Chinese and English): please give an overview of the key points of this proposal, and define keywords appropriate to its nature. (Within 500 words)

1. Background

Because the customer base is a key factor in company profitability, every company must do its best to retain existing customers and attract new ones. Many marketing strategies address this, such as CRM (Customer Relationship Management), in which a company interacts closely with its customers to understand and influence their behavior. CRM is a business management approach aimed at improving customer acquisition, customer retention, customer loyalty, and customer profitability.

2. Task Description

The task is to estimate the probabilities of the three types of customers: there are three target values to be predicted, and a large number of variables (15,000) are made available for prediction. The data set includes numerical and categorical variables and has unbalanced class distributions, so time efficiency is a crucial point.

3. Data Set

(1) Instances: 50,000 (in both the training set and the testing set)

(2) Variables: 14,740 numerical and 260 categorical

(3) Targets: churn, appetency, up-selling

(4) Labels: 1 refers to positive and -1 to negative for each target

(5) The data set contains missing values

(6) Churn, appetency, and up-selling are three separate binary classification problems

4. Three Types of Customers

(1) Churn: one of the two primary factors that determine the steady-state number of customers a business will support. In its broadest sense, the churn rate measures the number of individuals or items moving into or out of a collection over a specific period of time.

(2) Appetency: the propensity to buy a service or a product.

(3) Up-selling: a sales technique whereby a salesperson attempts to have the customer purchase more expensive items, upgrades, or other add-ons in order to make a more profitable sale.

II. Research Proposal Content:

(i) Background and objectives of the research. Please describe in detail the background, objectives, and importance of this research, the state of related research at home and abroad, and reviews of the key references.

(ii) Research methods, steps, and execution schedule. Please list: 1. the research methods adopted in this project and the reasons for them; 2. anticipated difficulties and how they will be resolved.

(iii) Expected work items and results. Please list: 1. the work items expected to be completed.

1. Task Description

(1) The task is to estimate the probabilities of the three types of customers. The data set comes from the large marketing databases of the French telecom company Orange, and the goal is to predict the propensity of customers to switch provider (churn), buy new products or services (appetency), or buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling). Hence, there are three target values to be predicted, and a large number of variables (15,000) are made available for prediction. The data set includes numerical and categorical variables and has unbalanced class distributions.

(2) The main objective is to make good predictions of the target variables.

(3) Submissions are evaluated according to the arithmetic mean of the AUC for the three labels (churn, appetency, and up-selling).

- Sensitivity and specificity: the official rules define sensitivity (also called true positive rate or hit rate) and specificity (true negative rate) as follows:

Sensitivity = tp/pos

Specificity = tn/neg

where pos = tp + fn is the total number of positive examples and neg = tn + fp is the total number of negative examples.
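The two definitions above can be written out directly; a minimal sketch (the function name is mine, not part of the official scoring kit):

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Compute sensitivity (true positive rate) and specificity
    (true negative rate) from confusion-matrix counts."""
    pos = tp + fn  # total number of positive examples
    neg = tn + fp  # total number of negative examples
    return tp / pos, tn / neg

# Example: 3 true positives, 1 false negative, 4 true negatives, 2 false positives
sens, spec = sensitivity_specificity(tp=3, fn=1, tn=4, fp=2)
# sens = 3/4 = 0.75, spec = 4/6
```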

- Area Under Curve (AUC): it corresponds to the area under the curve obtained by plotting sensitivity against specificity while varying a threshold on the prediction values to determine the classification result.
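An equivalent, threshold-free way to compute this area is the pairwise (Mann-Whitney) formulation: the AUC equals the probability that a randomly drawn positive example is scored above a randomly drawn negative one. A sketch, assuming scores and -1/+1 labels as in this data set (my own helper, not the official scoring code):

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive example receives the higher score (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A perfect ranking of positives above negatives gives AUC = 1.0
print(auc([0.9, 0.8, 0.4, 0.3], [1, 1, -1, -1]))  # 1.0
```

The challenge score would then be the arithmetic mean of this value over the churn, appetency, and up-selling predictions.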

2. Method

To process the large data set, the first step is to reduce its size. Analyzing the data set yielded several observations. The overall method consists of data cleaning, uniform sampling, feature selection, and model training, described below.

(1) Data Cleaning: reduce noisy data

- The data set has 67 instances that contain no values at all.

- About 1,000 entire columns contain no values.
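The two cleaning steps above, dropping all-empty instances and all-empty columns, can be sketched as follows, with missing values represented as None (a simplified stand-in for the real file format):

```python
def drop_empty(rows):
    """Remove instances (rows) and attributes (columns) that contain
    no values at all, keeping everything else in order."""
    # 1) drop instances with no values (like the 67 empty instances)
    rows = [r for r in rows if any(v is not None for v in r)]
    # 2) drop columns with no values (like the ~1,000 empty columns)
    keep = [j for j in range(len(rows[0])) if any(r[j] is not None for r in rows)]
    return [[r[j] for j in keep] for r in rows]

data = [[1, None, 3],
        [None, None, None],   # empty instance -> dropped
        [4, None, 6]]         # middle column is empty -> dropped
print(drop_empty(data))  # [[1, 3], [4, 6]]
```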

(2) Uniform Sampling

- After inspecting the five chunked training data files, they in fact have similar ratios of positive to negative examples, about 1:2.

- By uniform sampling, I draw one instance out of every 10, which yields 5,000 instances.

[Figure: label distributions, ordered from upper left to lower right: whole label set, chunk1 file, chunk2 file, chunk3 file, chunk4 file, chunk5 file.]

(3) Feature Selection

- I will choose five features each from the numerical and the categorical attributes.

- So far, the five candidate attributes among the first 100 attributes of each of the five chunked files are as below.

Chunk File Number | Candidate Attributes
------------------|---------------------
1                 | 34, 32, 33, 37, 35
2                 | 10, 35, 34, 37, 38
3                 | 76, 33, 32, 36, 34
4                 | 100, 34, 32, 33, 37
5                 | 34, 32, 33, 37, 35

- However, I will re-select the features after reducing the data set size.
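One simple way to rank candidate attributes is by their absolute correlation with the target label; the sketch below uses this as an illustrative criterion (my own choice, since the proposal does not specify the scoring function behind the table above):

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0

def top_k_features(columns, labels, k=5):
    """Indices of the k columns most correlated (in absolute value)
    with the labels."""
    scored = sorted(((abs(pearson(col, labels)), i)
                     for i, col in enumerate(columns)), reverse=True)
    return [i for _, i in scored[:k]]

cols = [[1, 2, 3, 4], [4, 3, 2, 1], [1, 1, 1, 1], [2, 1, 4, 3]]
print(top_k_features(cols, labels=[1, 2, 3, 4], k=2))
```

Columns 0 and 1 are perfectly (anti-)correlated with the labels, so they are the two selected indices.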

(4) Training Model

- MultilayerPerceptron

A classifier that uses backpropagation to classify instances. The network can be built by hand, created by an algorithm, or both; it can also be monitored and modified during training. The nodes in this network are all sigmoid units.

- VotedPerceptron

An implementation of the voted perceptron algorithm of Freund and Schapire. It globally replaces all missing values and transforms nominal attributes into binary ones.

- Features: five numerical and five categorical attributes.
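The voted perceptron mentioned above can be sketched in a few lines of pure Python, as a minimal version of Freund and Schapire's algorithm (without Weka's missing-value and nominal-attribute preprocessing):

```python
def train_voted_perceptron(data, epochs=10):
    """data: list of (x, y) pairs, x a list of floats, y in {-1, +1}.
    Returns a list of (weight_vector, survival_count) pairs."""
    w, c, survivors = [0.0] * len(data[0][0]), 1, []
    for _ in range(epochs):
        for x, y in data:
            s = sum(wi * xi for wi, xi in zip(w, x))
            if y * s <= 0:                 # mistake: retire w, start a new vector
                survivors.append((list(w), c))
                w = [wi + y * xi for wi, xi in zip(w, x)]
                c = 1
            else:                          # correct: w survives one more example
                c += 1
    survivors.append((list(w), c))
    return survivors

def predict(survivors, x):
    """Weighted vote of all intermediate perceptrons."""
    vote = sum(c * (1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1)
               for w, c in survivors)
    return 1 if vote > 0 else -1

# Toy linearly separable data: label is +1 when the first coordinate dominates
toy = [([2, 0], 1), ([0, 2], -1), ([3, 1], 1), ([1, 3], -1)]
model = train_voted_perceptron(toy)
print(predict(model, [5, 1]), predict(model, [1, 5]))  # 1 -1
```

Vectors that survive many examples without a mistake get a larger vote, which is what distinguishes the voted perceptron from the plain perceptron.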

(5) Evaluation

- The baseline is Naïve Bayes, whose performance is shown below.

(6) Challenges

- It is difficult to process the large data set efficiently, even for simple analyses such as feature distributions. A personal computer cannot handle the full data in a short time, so I will reduce the data size first.

- Second, I have no CRM background knowledge; selecting features at random could therefore give poor performance.

- Hence, I must take into account both data bias and correct feature selection in order to obtain good performance.