Neural Networks Research Proposal (Academic Year 97, Second Semester)

1. Abstract (Chinese and English): summarize the main points of the proposal and provide keywords appropriate to its nature. (500 words or fewer)

Outline

- Abstract

- About the proposed research

- References

- Contents

- Schedule

 

Abstract

The first task in KDD Cup 2009 concerns Customer Relationship Management (CRM), a key element of modern marketing strategies. The KDD Cup 2009 offers the opportunity to work on large marketing databases from the French telecom company Orange.

We will estimate the churn, appetency, and up-selling probabilities of customers:

1. Churn: the propensity of customers to switch provider.

2. Appetency: the propensity to buy new products or services.

3. Up-selling: the propensity to buy upgrades or add-ons proposed to them, making the sale more profitable.

This dataset has a large number of variables (15,000) and an abundance of instances (50,000), so it is very difficult to process. We therefore had to make a choice: one option is a more powerful machine, such as a server or supercomputer; the other is to work on a sample of the original data, obtained by random or uniform sampling. I chose the latter.
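As a sketch of the sampling option, the snippet below draws a uniform random sample of instances from a large data file while keeping the header line. The file names, sampling rate, and seed are illustrative, not the ones actually used in the project:

```python
import random

def sample_rows(path, out_path, rate=0.1, seed=42):
    """Write a random sample of data rows to out_path, keeping the header.

    Reads the file line by line, so the full dataset never has to fit
    in memory. `rate` is the approximate fraction of instances kept.
    """
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    with open(path) as src, open(out_path, "w") as dst:
        dst.write(next(src))          # copy the header line unchanged
        for line in src:
            if rng.random() < rate:   # keep each instance with prob. `rate`
                dst.write(line)
```

Because the decision is made per line, this streams through arbitrarily large files, which is the whole point of sampling instead of buying a bigger machine.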

 

2. Research Proposal Content:

(1) Background and objectives: describe the background, objectives, and importance of the research, the state of related research at home and abroad, and a review of key references.

(2) Research methods, procedure, and schedule: list (a) the methods adopted and the reasons for them, and (b) anticipated difficulties and how they will be resolved.

(3) Expected work items and results: list the work items expected to be completed.

 

About the Proposed Research

 

SIGKDD is the Association for Computing Machinery's Special Interest Group on Knowledge Discovery and Data Mining. Since 1995, SIGKDD has hosted the annual SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD 2009 will take place in Paris, France. SIGKDD sponsors the KDD Cup competition every year in conjunction with the annual conference. It is aimed at members of the industry and academia, particularly students, interested in KDD.

 

References

http://www.kddcup-orange.com/index.php

http://www.cs.waikato.ac.nz/ml/weka/

http://www.cs.nthu.edu.tw/~jang/mlbook/

 

Contents

This is my main flowchart:

 

 

There are five steps in my procedure.

Step 1: Feature Selection

In this step, I use the large data set to select the best 750 attributes, using Gain Ratio and Information Gain as the ranking criteria.
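Weka computes these rankings internally; as an illustration of what the two criteria measure, here is a minimal sketch for a single discrete attribute (function names are mine, not Weka's API):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    """Information gain: entropy of the labels minus the weighted
    entropy of the labels after splitting on the attribute's values."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(feature_values):
        subset = [y for f, y in zip(feature_values, labels) if f == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

def gain_ratio(feature_values, labels):
    """Gain ratio: information gain normalized by the split information
    (the entropy of the attribute itself), which penalizes attributes
    with many distinct values."""
    split_info = entropy(feature_values)
    return info_gain(feature_values, labels) / split_info if split_info else 0.0
```

Ranking the 15,000 attributes by either score and keeping the top 750 is the selection step described above.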

 

 

Step 2: Preprocessing Procedure

Because the data set is large, the first step is to divide all the data into 20 groups and train on each group to find the best 750 attributes.

This takes a lot of time, so for now I only process the appetency label.
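The grouping step can be sketched as follows: split the data file into 20 roughly equal chunks, each carrying a copy of the header so it can be processed independently. The file-naming scheme is hypothetical:

```python
def split_into_groups(path, n_groups=20, prefix="group"):
    """Split a data file into n_groups chunk files of roughly equal size.

    Each chunk file repeats the header line so it is a valid,
    self-contained data set. Returns the list of chunk file names.
    """
    with open(path) as src:
        header = next(src)
        rows = src.readlines()
    size = -(-len(rows) // n_groups)   # ceiling division: rows per group
    names = []
    for i in range(n_groups):
        name = f"{prefix}_{i:02d}.data"
        with open(name, "w") as dst:
            dst.write(header)
            dst.writelines(rows[i * size:(i + 1) * size])
        names.append(name)
    return names
```

Each chunk can then be fed to the attribute-selection step on its own, which is what makes the 20-group scheme tractable on one machine.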

 

 

Step 3: Training Data

In the training procedure, I will use the data set orange_large_train.data.chunk and the BayesNet classifier in Weka. By comparing neural approaches against it, we can also look for better training rules. Once the model is stable, the weights no longer need to be adjusted.
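Weka's BayesNet learns a full Bayesian network; as a much-simplified stand-in for the idea of Bayesian classification used here, the sketch below is a naive Bayes classifier with Laplace smoothing over discrete attributes. All names are illustrative, not Weka's API:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Toy naive Bayes with Laplace smoothing for discrete attributes.

    A simplified stand-in for Weka's BayesNet: it assumes attributes
    are conditionally independent given the class, whereas a Bayesian
    network also learns dependencies between attributes.
    """

    def fit(self, rows, labels):
        self.classes = Counter(labels)
        self.n = len(labels)
        # counts[class][attribute_index][value] = occurrence count
        self.counts = defaultdict(lambda: defaultdict(Counter))
        self.values = defaultdict(set)
        for row, y in zip(rows, labels):
            for i, v in enumerate(row):
                self.counts[y][i][v] += 1
                self.values[i].add(v)
        return self

    def predict(self, row):
        best, best_lp = None, -math.inf
        for y, cy in self.classes.items():
            lp = math.log(cy / self.n)  # log prior
            for i, v in enumerate(row):
                k = len(self.values[i])  # distinct values, for smoothing
                lp += math.log((self.counts[y][i][v] + 1) / (cy + k))
            if lp > best_lp:
                best, best_lp = y, lp
        return best
```

Log-probabilities are used instead of raw products so the score does not underflow when there are hundreds of attributes, which matters at this data set's scale.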

These are the best 100 attributes and their rank scores:

Score          Attribute
0.3621643      Var7234
0.3621643      Var7433
0.340879       Var395
0.340879       Var3106
0.340879       Var3546
0.340879       Var3300
0.340879       Var3562
0.340879       Var4991
0.340879       Var6086
0.340879       Var7724
0.340879       Var7983
0.340879       Var9917
0.340879       Var9774
0.340879       Var11009
0.340879       Var11042
0.340879       Var13042
0.340878834    Var14484
0.34087883     Var2086
0.34087883     Var11568
0.3408788      Var7475
0.3408788      Var9490
0.3408788      Var9360
0.3408788      Var12603
0.3408788      Var13902
0.2370573      Var1293
0.237057       Var10806
0.1917614      Var14091
0.171619       Var4829
0.093005       Var7549
0.093005       Var10545
0.09300474     Var11884
0.0930047      Var8661
0.0883914      Var13589
0.088391       Var10284
0.071665       Var11003
0.06417201     Var1630
0.05731        Var6227
0.0570607      Var12069
0.049261       Var9923
0.0394611      Var12192
0.039461       Var13486
0.035116       Var3556
0.035116       Var11222
0.0336098      Var4384
0.0336098      Var9139
0.033076       Var10876
0.031793       Var13188
0.028996       Var8126
0.02892801     Var11735
0.028928       Var8456
0.028928       Var12536
0.028318       Var624
0.0266745      Var8292
0.0266405      Var12617
0.02643682     Var11939
0.0264368      Var9513
0.0256115      Var7249
0.0226002      Var5282
0.0226         Var13480
0.021365667    Var14358
0.0206238      Var12599
0.020216       Var10619
0.0202156      Var8460
0.017716236    Var14686
0.017716       Var259
0.017363       Var532
0.017363       Var10923
0.017356       Var7542
0.0167824      Var5673
0.0167702      Var5512
0.0164242      Var9549
0.0164242      Var9356
0.0163071      Var9071
0.016307       Var3635
0.016081       Var3226
0.016081       Var7721
0.0160233      Var9243
0.0156786      Var5637
0.0156544      Var9322
0.015654       Var3644
0.015412       Var5075
0.0154119      Var7037
0.0152617      Var4037
0.01484646     Var11573
0.014846       Var8066
0.0145575      Var7185
0.0140737      Var14224
0.0138846      Var4194
0.013853       Var8508
0.013828       Var3611
0.0137494      Var3764
0.0135955      Var8575
0.013595       Var10959
0.013369       Var7825
0.013291061    Var14269
0.01314        Var629
0.013043       Var3315
0.013043       Var10090
0.0129981      Var4426
0.0129981      Var5428

 

Step 4: Prediction

In prediction, I will use the data set - orange_large_test.data.chunk.

Because the data set is large and its header differs from the training set's, so far I have only run attribute selection on the test data; the prediction results are not ready yet.

 

Step 5: Analysis

This is the result for the first group:

=== Summary ===

 

Correctly Classified Instances       38638               77.276  %

Incorrectly Classified Instances     11362               22.724  %

Kappa statistic                          0.0339

Mean absolute error                      0.2339

Root mean squared error                  0.427

Relative absolute error                668.4436 %

Root relative squared error            322.9157 %

Total Number of Instances            50000    

 

=== Detailed Accuracy By Class ===

 

               TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class

                 0.449     0.221      0.035     0.449     0.066      0.675    1

                 0.779     0.551      0.987     0.779     0.871      0.675    -1

Weighted Avg.    0.773     0.545      0.97      0.773     0.856      0.675

 

=== Confusion Matrix ===

 

     a     b   <-- classified as

   400   490 |     a = 1

 10872 38238 |     b = -1
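Weka's detailed-accuracy figures can be recomputed directly from the confusion matrix above. This small helper (names are mine, not Weka's) derives the per-class TP rate (recall) and precision from the raw counts:

```python
def class_metrics(tp, fn, fp):
    """Per-class TP rate (recall) and precision from confusion-matrix counts.

    tp: instances of the class classified correctly
    fn: instances of the class classified as the other class
    fp: other-class instances classified as this class
    """
    tp_rate = tp / (tp + fn)
    precision = tp / (tp + fp)
    return tp_rate, precision

# Class 1 in the first run: 400 hits, 490 misses, 10872 false positives.
tp_rate, precision = class_metrics(400, 490, 10872)
# Rounded to three decimals these give 0.449 and 0.035,
# matching the detailed accuracy table above.
```

The very low precision despite a decent TP rate is a direct symptom of the class imbalance discussed below: almost all positive predictions come from the huge majority class.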

 

And this is the result for the group with the best 750 attributes I chose:

=== Summary ===

 

Correctly Classified Instances       32924               65.9034 %

Incorrectly Classified Instances     17034               34.0966 %

Kappa statistic                          0.0358

Mean absolute error                      0.3411

Root mean squared error                  0.5776

Relative absolute error                974.091  %

Root relative squared error            436.6799 %

Total Number of Instances            49958    

Ignored Class Unknown Instances                 42    

 

=== Detailed Accuracy By Class ===

 

               TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class

                 0.703     0.342      0.036     0.703     0.068      0.742    1

                 0.658     0.297      0.992     0.658     0.791      0.731    -1

Weighted Avg.    0.659     0.297      0.975     0.659     0.778      0.731

 

=== Confusion Matrix ===

 

     a     b   <-- classified as

   626   264 |     a = 1

 16770 32298 |     b = -1

 

 

We can observe several things in these results:

1. The rate of correctly classified instances goes down, because the class labels are unbalanced.

2. The TP rate goes up, because we chose the best attributes.

3. The ROC area goes up.