Neural Networks for KDD Cup 2009

Abstract

SIGKDD is the Association for Computing Machinery's Special Interest Group on Knowledge Discovery and Data Mining, and it has hosted the annual SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) since 1995. Every year SIGKDD sponsors the KDD Cup competition in conjunction with the conference. This year, KDD Cup 2009 offers the opportunity to work on large marketing databases from the French telecom company Orange to predict the propensity of customers to switch provider (churn), buy new products or services (appetency), or buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling). In this course project, I study how neural network methods can be applied to this year's KDD Cup tasks, using the databases provided by the KDD Cup to predict the three targets: churn, appetency and up-selling.

 

Past Tasks

- KDD-Cup 2008, Breast cancer

- KDD-Cup 2007, Consumer recommendations

- KDD-Cup 2006, Pulmonary embolisms detection from image data

- KDD-Cup 2005, Internet user search query categorization

- KDD-Cup 2004, Particle physics; plus Protein homology prediction

- KDD-Cup 2003, Network mining and usage log analysis

- KDD-Cup 2002, BioMed document; plus Gene role classification

- KDD-Cup 2001, Molecular bioactivity; plus Protein locale prediction

- KDD-Cup 2000, Online retailer website clickstream analysis

- KDD-Cup 1999, Computer network intrusion detection

- KDD-Cup 1998, Direct marketing for profit optimization

- KDD-Cup 1997, Direct marketing for lift curve optimization

Introduction of KDD Cup 2009

Customer Relationship Management (CRM) is a key element of modern marketing strategies. The KDD Cup 2009 offers the opportunity to work on large marketing databases from the French Telecom company Orange to predict the propensity of customers to switch provider (churn), buy new products or services (appetency), or buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling).

In a CRM system, the most practical way to build knowledge about customers is to produce scores. A score (the output of a model) is an evaluation, over all instances, of a target variable to explain (i.e. churn, appetency or up-selling). Tools which produce scores allow quantifiable information to be projected onto a given population. The score is computed using input variables which describe instances. Scores are then used by the information system (IS), for example, to personalize the customer relationship. Orange Labs has developed an industrial customer analysis platform able to build prediction models with a very large number of input variables. This platform implements several processing methods for instance and variable selection, prediction and indexation, based on an efficient model combined with variable selection regularization and a model averaging method. The main characteristic of this platform is its ability to scale to very large datasets with hundreds of thousands of instances and thousands of variables. The rapid and robust detection of the variables that contribute most to the output prediction can be a key factor in a marketing application.

The challenge is to beat the in-house system developed by Orange Labs. It is an opportunity to prove that you can deal with a very large database, including heterogeneous noisy data (numerical and categorical variables) and unbalanced class distributions. Time efficiency is often a crucial point; therefore, part of the competition will be time-constrained to test the ability of the participants to deliver solutions quickly.

- CRM (from Wikipedia)
Customer relationship management (CRM) consists of the processes a company uses to track and organize its contacts with its current and prospective customers. CRM software is used to support these processes; information about customers and customer interactions can be entered, stored and accessed by employees in different company departments. Typical CRM goals are to improve services provided to customers, and to use customer contact information for targeted marketing.

While the term CRM generally refers to a software-based approach to handling customer relationships, most CRM software vendors stress that a successful CRM effort requires a holistic approach. CRM initiatives often fail because implementation was limited to software installation, without providing the context, support and understanding for employees to learn and take full advantage of the information systems. CRM can be implemented without major investments in software, but software is often necessary to explore the full benefits of a CRM strategy.

- Churn Rate (from Wikipedia)
Churn rate is also sometimes called attrition rate. It is one of two primary factors that determine the steady-state level of customers a business will support. In its broadest sense, churn rate is a measure of the number of individuals or items moving into or out of a collection over a specific period of time. The term is used in many contexts, but is most widely applied in business with respect to a contractual customer base. For instance, it is an important factor for any business with a subscriber-based service model, including mobile telephone networks and pay TV operators. The term is also used to refer to participant turnover in peer-to-peer networks.
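As a concrete formula (a common definition, not one taken from the quoted article), the churn rate over a period can be computed as

  \text{churn rate} = \frac{\text{number of customers lost during the period}}{\text{number of customers at the start of the period}}

For example, a mobile operator that starts a month with 1,000 subscribers and loses 20 of them during that month has a monthly churn rate of 20 / 1000 = 2%.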

- Appetency
In this context, appetency is the propensity to buy a service or a product.



- Up-selling (from Wikipedia)
Up-selling is a sales technique whereby a salesman attempts to have the customer purchase more expensive items, upgrades, or other add-ons in an attempt to make a more profitable sale. Up-selling usually involves marketing more profitable services or products, but up-selling can also be simply exposing the customer to other options he or she may not have considered previously. Up-selling can imply selling something additional, or selling something that is more profitable or otherwise preferable for the seller instead of the original sale.

Study Background

Each year the KDD Cup competition applies data mining to a different topic with a different large database, predicting a different set of target tasks. So the first step is to understand what data mining is. Wikipedia defines data mining as the process of extracting hidden patterns from large amounts of data. Data mining is becoming an increasingly important tool to transform data into information. It is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection and scientific discovery. Knowledge Discovery in Databases (KDD) is the name coined by Gregory Piatetsky-Shapiro in 1989 to describe the process of finding interesting, interpreted, useful and novel data. There are many nuances to this process, but roughly the steps are to preprocess raw data, mine the data, and interpret the results. Data mining involves three processes (from Wikipedia):

I. Pre-processing
Once the objective for the KDD process is known, a target data set must be assembled. As data mining can only uncover patterns already present in the data, the target dataset must be large enough to contain these patterns while remaining concise enough to be mined in an acceptable timeframe. A common source for data is a datamart or data warehouse.

The target set is then cleaned. Cleaning removes the observations with noise and missing data.

The clean data is reduced into feature vectors, one vector per observation. A feature vector is a summarized version of the raw data observation. For example, a black and white image of a face which is 100px by 100px would contain 10,000 bits of raw data. This might be turned into a feature vector by locating the eyes and mouth in the image. Doing so would reduce the data for each vector from 10,000 bits to three codes for the locations, dramatically reducing the size of the dataset to be mined, and hence reducing the processing effort. The feature(s) selected will depend on what the objective(s) is/are; obviously, selecting the "right" feature(s) is fundamental to successful data mining.

 

The feature vectors are divided into two sets, the "training set" and the "test set". The training set is used to "train" the data mining algorithm(s), while the test set is used to verify the accuracy of any patterns found.
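As a minimal sketch of such a split with the Weka Java API (the ARFF file name and the 2/3 ratio are illustrative assumptions, not part of the text above):

    import java.util.Random;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TrainTestSplit {
        public static void main(String[] args) throws Exception {
            // Load the feature vectors (hypothetical file name).
            Instances data = DataSource.read("features.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Shuffle first so both sets follow the same distribution.
            data.randomize(new Random(1));

            // Keep 2/3 of the instances for training, the rest for testing.
            int trainSize = (int) Math.round(data.numInstances() * 2.0 / 3.0);
            Instances train = new Instances(data, 0, trainSize);
            Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);
            System.out.println("Train: " + train.numInstances() + ", Test: " + test.numInstances());
        }
    }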

 

 

II. Data mining

Data mining commonly involves four classes of tasks:

- Classification - Arranges the data into predefined groups. For example, an email program might attempt to classify an email as legitimate or spam. Common algorithms include nearest neighbor, naive Bayes classifiers and neural networks.

- Clustering - Like classification, but the groups are not predefined, so the algorithm tries to group similar items together.

- Regression - Attempts to find a function which models the data with the least error. A common method is to use genetic programming.

- Association rule learning - Searches for relationships between variables. For example, a supermarket might gather data on what each customer buys. Using association rule learning, the supermarket can work out which products are frequently bought together, which is useful for marketing purposes. This is sometimes referred to as "market basket analysis".

III. Interpreting the results

The final step of knowledge discovery from data is to evaluate the patterns produced by the data mining algorithms. Not all patterns found by the data mining algorithms are necessarily valid. It is common for the data mining algorithms to find patterns in the training set which are not present in the general data set; this is called overfitting. To overcome this, the evaluation uses a "test set" of data on which the data mining algorithm was not trained. The learnt patterns are applied to this test set and the resulting output is compared to the desired output. For example, a data mining algorithm trying to distinguish spam from legitimate emails would be trained on a training set of sample emails. Once trained, the learnt patterns would be applied to a test set of emails on which it had not been trained, and the accuracy of the patterns can then be measured by how many emails they classify correctly. A number of statistical methods may be used to evaluate the algorithm, such as ROC curves.
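A minimal sketch of this hold-out evaluation with Weka's Java API (the classifier choice and the ARFF file names are illustrative assumptions, not part of the text above):

    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class HoldOutEvaluation {
        public static void main(String[] args) throws Exception {
            // Hypothetical ARFF files holding the training and test sets.
            Instances train = DataSource.read("train.arff");
            Instances test  = DataSource.read("test.arff");
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            // Learn patterns from the training set only.
            NaiveBayes model = new NaiveBayes();
            model.buildClassifier(train);

            // Apply the learnt patterns to data the model has never seen.
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(model, test);
            System.out.println(eval.toSummaryString());
            // ROC area for the first class value, one of the statistics mentioned above.
            System.out.println("ROC area: " + eval.areaUnderROC(0));
        }
    }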

 

If the learnt patterns do not meet the desired standards, then it is necessary to re-evaluate and change the preprocessing and data mining steps. If the learnt patterns do meet the desired standards, then the final step is to interpret the learnt patterns and turn them into knowledge.

 

This year's KDD Cup competition is about CRM, using data mining methods to predict the three tasks: churn, appetency and up-selling. According to Wikipedia, there are several different approaches to CRM, with different software packages focusing on different aspects. In general, Customer Service, Campaign Management and Sales Force Automation form the core of the system. One approach is analytical CRM, which analyzes customer data for a variety of purposes:

- Designing and executing targeted marketing campaigns

- Designing and executing campaigns, e.g. customer acquisition, cross-selling, up-selling

- Analysing customer behavior in order to make decisions relating to products and services (e.g. pricing, product development)

- Management information systems (e.g. financial forecasting and customer profitability analysis)

The objectives of a CRM strategy must consider a company’s specific situation and its customers' needs and expectations. Information gained through CRM initiatives can support the development of marketing strategy by developing the organization's knowledge in areas such as identifying customer segments, improving customer retention, improving product offerings (by better understanding customer needs), and by identifying the organization's most profitable customers. CRM strategies can vary in size, complexity, and scope. Some companies consider a CRM strategy only to focus on the management of a team of salespeople. However, other CRM strategies can cover customer interaction across the entire organization.

 

Study Methods

 

I. Database

The fast challenge provides train and test sets, each containing 50,000 examples described by 14,740 numerical variables and 260 categorical variables. Toy target values are available for practice purposes only; the prediction of the toy target values will not be part of the final evaluation.

II. Results File Format

The prediction values should be real numbers corresponding to a score, small for the negative class and large for the positive class (a sketch of producing such scores follows this list). Particular cases of valid scores include:

- Binary {-1, +1} values indicating class membership.

- Discriminant values, negative for the negative class and positive for the positive class.

- A score between 0 and 1, interpretable as the probability that the example belongs to the positive class.

- A rank, with the smallest values representing the examples classified with highest confidence as members of the negative class.
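As a sketch of how such a score file could be produced with Weka (the file names are placeholders, and the index of the positive class value is an assumption), the predicted probability of the positive class is itself a valid score between 0 and 1:

    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ScoreFile {
        public static void main(String[] args) throws Exception {
            Instances train = DataSource.read("train.arff"); // placeholder file names
            Instances test  = DataSource.read("test.arff");
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            NaiveBayes model = new NaiveBayes();
            model.buildClassifier(train);

            int positive = 1; // assumed index of the positive class value
            for (int i = 0; i < test.numInstances(); i++) {
                // distributionForInstance returns one probability per class value;
                // print the positive-class probability, one score per line.
                double[] dist = model.distributionForInstance(test.instance(i));
                System.out.println(dist[positive]);
            }
        }
    }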


III. Neural Network Model
I think the toy target can be predicted by the support vector machine (SVM) method, because the toy database only contains the binary values 1 and -1, so an SVM can be used for classification and regression. SVMs are a set of related supervised learning methods used for classification and regression. Viewing input data as two sets of vectors in an n-dimensional space, an SVM constructs a separating hyperplane in that space, one which maximizes the margin between the two data sets. To calculate the margin, two parallel hyperplanes are constructed, one on each side of the separating hyperplane, which are "pushed up against" the two data sets. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the neighboring data points of both classes, since in general the larger the margin, the lower the generalization error of the classifier.
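A minimal sketch of training an SVM on the toy target with Weka's SMO implementation (the file name is a placeholder; SMO's default kernel is a degree-1 polynomial, i.e. linear):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.SMO;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ToySvm {
        public static void main(String[] args) throws Exception {
            // Hypothetical ARFF with the binary toy target as the last (nominal) attribute.
            Instances data = DataSource.read("toy_train.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // 10-fold cross-validation to estimate the generalization error
            // that the margin argument above is meant to keep low.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new SMO(), data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }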

In addition, the large database has to be predicted for all three tasks, but I do not yet understand what the variables in that database and the three tasks stand for. For this reason, I have not yet decided what kind of neural network model to use.

IV. Tool
I chose the Weka (Waikato Environment for Knowledge Analysis) software for KDD Cup 2009 because it is free software. Weka is a popular suite of machine learning software written in Java, developed at the University of Waikato. The history of Weka (from Wikipedia):

- In 1993, the University of Waikato in New Zealand started development of the original version of Weka (which became a mixture of TCL/TK, C, and Makefiles).

- In 1997, the decision was made to redevelop Weka from scratch in Java, including implementations of modelling algorithms.

- In 2005, Weka received the SIGKDD Data Mining and Knowledge Discovery Service Award.

- In 2006, Pentaho Corporation acquired an exclusive licence to use Weka for business intelligence. It forms the data mining and predictive analytics component of the Pentaho business intelligence suite.

Pre-processing

The large database is too big for Weka to load, so it had to be down-sampled. I reduced the 15,000 attributes to slightly more than 200, because the data contain many useless attributes, i.e. attributes with more than 90% missing values, attributes whose values are more than 90% zeros, and attributes with only two distinct values. For the small database, I reduced the data to 73 attributes.
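A sketch of this kind of attribute screening with the Weka API (the 90% missing-value threshold follows the text above; the file name and the exact set of criteria are my assumptions):

    import java.util.ArrayList;
    import java.util.List;
    import weka.core.AttributeStats;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    public class DropSparseAttributes {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("large_train.arff"); // placeholder
            data.setClassIndex(data.numAttributes() - 1);

            // Collect the indices of attributes that are more than 90% missing.
            List<Integer> toDrop = new ArrayList<>();
            for (int i = 0; i < data.numAttributes() - 1; i++) {
                AttributeStats stats = data.attributeStats(i);
                if ((double) stats.missingCount / data.numInstances() > 0.9) {
                    toDrop.add(i);
                }
            }

            // Remove them in one pass with the Remove filter.
            Remove remove = new Remove();
            int[] indices = new int[toDrop.size()];
            for (int i = 0; i < indices.length; i++) indices[i] = toDrop.get(i);
            remove.setAttributeIndicesArray(indices);
            remove.setInputFormat(data);
            Instances reduced = Filter.useFilter(data, remove);
            System.out.println("Kept " + (reduced.numAttributes() - 1) + " attributes.");
        }
    }

The same loop could be extended to the other two criteria (mostly-zero attributes and attributes with too few distinct values) using the value counts that AttributeStats exposes.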

 

I used Weka to make predictions on the large and small data with the naive Bayes algorithm. A naive Bayes classifier is a term in Bayesian statistics for a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model".
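A minimal sketch of this naive Bayes run with the Weka Java API (the report evaluates on the training data and on a 10% test split; here, as a sketch, 10-fold cross-validation is used, and the reduced file name is a placeholder). toClassDetailsString() prints per-class statistics of the kind reported in the tables below (TP/FP rate, precision, recall, F-measure, ROC area), and toMatrixString() prints the confusion matrix:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class NaiveBayesRun {
        public static void main(String[] args) throws Exception {
            // Hypothetical reduced data set with the target as the last attribute.
            Instances data = DataSource.read("churn_reduced.arff");
            data.setClassIndex(data.numAttributes() - 1);

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));
            System.out.println(eval.toClassDetailsString());
            System.out.println(eval.toMatrixString());
        }
    }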

 

 

 

Naive Bayes Classifier

 

- The naive Bayes classifier assigns an instance s_k with attribute values (A_1 = v_1, A_2 = v_2, ..., A_m = v_m) to the class C_i with maximum Prob(C_i | (v_1, v_2, ..., v_m)) over all i.

- The naive Bayes classifier exploits Bayes' rule and assumes independence of attributes.

- Likelihood of s_k belonging to C_i:

  P(C_i \mid (v_1, \ldots, v_m)) = \frac{P((v_1, \ldots, v_m) \mid C_i) \, P(C_i)}{P((v_1, \ldots, v_m))}

- Likelihood of s_k belonging to C_j:

  P(C_j \mid (v_1, \ldots, v_m)) = \frac{P((v_1, \ldots, v_m) \mid C_j) \, P(C_j)}{P((v_1, \ldots, v_m))}

- Therefore, when comparing Prob(C_i | (v_1, v_2, ..., v_m)) and Prob(C_j | (v_1, v_2, ..., v_m)), we only need to compute P((v_1, ..., v_m) | C_i) P(C_i) and P((v_1, ..., v_m) | C_j) P(C_j), since the denominator is identical.

- Under the assumption of independent attributes:

  P((v_1, \ldots, v_m) \mid C_j) = \prod_{k=1}^{m} P(A_k = v_k \mid C_j)

- Furthermore, P(C_j) can be computed by

  P(C_j) = \frac{\text{number of training samples of class } C_j}{\text{total number of training samples}}

- If the data set has numerical attribute values, let x_1, x_2, ..., x_n be the values of the numerical attribute in the training data set; the attribute is modeled with a normal distribution whose parameters are estimated from these values:

  \mu = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu)^2, \qquad P(x \mid C_j) = g(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x - \mu)^2}{2\sigma^2}}
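A tiny worked example of these estimates (the numbers are mine, purely illustrative): for attribute values x_1 = 2, x_2 = 4, x_3 = 6 observed in class C_j,

  \mu = \frac{2 + 4 + 6}{3} = 4, \qquad \sigma^2 = \frac{(2-4)^2 + (4-4)^2 + (6-4)^2}{3 - 1} = 4

so for a new value x = 5,

  P(5 \mid C_j) = g(5; 4, 2) = \frac{1}{\sqrt{2\pi} \cdot 2} \, e^{-\frac{(5-4)^2}{2 \cdot 4}} \approx 0.176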



 

 

 

 

 

 

 

 

 

Large

Churn Prediction (large)

TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.681     0.499     0.945       0.681    0.792       0.624      -1
0.501     0.319     0.111       0.501    0.181       0.624      1

Churn (large) confusion matrix:

                   Predicted -1   Predicted 1
Truth: Class -1        31571         14757
Truth: Class 1          1834          1838

Churn Prediction (large – 10% test)

TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.690     0.496     0.942       0.690    0.797       0.640      -1
0.504     0.310     0.122       0.504    0.197       0.640      1

Churn (large – 10% test) confusion matrix:

                   Predicted -1   Predicted 1
Truth: Class -1         3178          1427
Truth: Class 1           196           199

Appetency Prediction (large)

TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.566     0.335     0.989       0.566    0.720       0.643      -1
0.665     0.434     0.027       0.665    0.052       0.645      1

Appetency (large) confusion matrix:

                   Predicted -1   Predicted 1
Truth: Class -1        27786         21324
Truth: Class 1           298           592

Appetency Prediction (large – 10% test)

TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.555     0.418     0.986       0.555    0.710       0.591      -1
0.582     0.445     0.024       0.582    0.045       0.587      1

Appetency (large – 10% test) confusion matrix:

                   Predicted -1   Predicted 1
Truth: Class -1         2723          2186
Truth: Class 1            38            53

Up-selling Prediction (large)

TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.555     0.190     0.973       0.555    0.707       0.702      -1
0.810     0.445     0.126       0.810    0.218       0.709      1

Up-selling (large) confusion matrix:

                   Predicted -1   Predicted 1
Truth: Class -1        25692         20626
Truth: Class 1           701          2981

Up-selling Prediction (large – 10% test)

TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.557     0.194     0.973       0.557    0.709       0.706      -1
0.806     0.443     0.127       0.806    0.220       0.706      1

Up-selling (large – 10% test) confusion matrix:

                   Predicted -1   Predicted 1
Truth: Class -1         2580          2049
Truth: Class 1            72           299

Small

Churn Prediction (small)

TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.325     0.219     0.949       0.325    0.484       0.584      -1
0.781     0.675     0.084       0.781    0.152       0.584      1

Churn (small) confusion matrix:

                   Predicted -1   Predicted 1
Truth: Class -1        15058         31270
Truth: Class 1           805          2867

Churn Prediction (small – 10% test)

TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.250     0.194     0.943       0.250    0.395       0.544      -1
0.806     0.750     0.077       0.806    0.141       0.544      1

Churn (small – 10% test) confusion matrix:

                   Predicted -1   Predicted 1
Truth: Class -1         1159          3480
Truth: Class 1            70           291

Appetency Prediction (small)

TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.907     0.863     0.983       0.907    0.943       0.673      -1
0.137     0.093     0.026       0.137    0.044       0.673      1

Appetency (small) confusion matrix:

                   Predicted -1   Predicted 1
Truth: Class -1        44541          4569
Truth: Class 1           768           122

Appetency Prediction (small – 10% test)

TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.912     0.853     0.981       0.912    0.945       0.634      -1
0.147     0.088     0.034       0.147    0.055       0.634      1

Appetency (small – 10% test) confusion matrix:

                   Predicted -1   Predicted 1
Truth: Class -1         4467           431
Truth: Class 1            87            15

Up-selling Prediction (small)

TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.318     0.158     0.962       0.318    0.479       0.661      -1
0.842     0.682     0.089       0.842    0.162       0.661      1

Up-selling (small) confusion matrix:

                   Predicted -1   Predicted 1
Truth: Class -1        14751         31567
Truth: Class 1           581          3101

Up-selling Prediction (small – 10% test)

TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.348     0.230     0.948       0.348    0.509       0.634      -1
0.770     0.652     0.089       0.770    0.160       0.634      1

Up-selling (small – 10% test) confusion matrix:

                   Predicted -1   Predicted 1
Truth: Class -1         1605          3012
Truth: Class 1            88           295

Results

Large        Churn              Appetency          Up-selling         Score
             train    10% test  train    10% test  train    10% test
             0.591    0.597     0.616    0.569     0.683    0.682     0.616

Small        Churn              Appetency          Up-selling         Score
             train    10% test  train    10% test  train    10% test
             0.554    0.528     0.522    0.530     0.580    0.559     0.539