Academic Year 97 (2008-09), Second Semester, Artificial Neural Networks: Research Project Proposal

I. Abstract of the research project (Chinese and English): give an overview of the main points of this project, and define keywords according to its nature. (Within 500 words)

 

Abstract

 

The goal of this proposal is to participate in, and ideally win, the KDD Cup 2009 competition. The KDD Cup 2009 is a competition held by the annual ACM SIGKDD conference, which is the premier international forum for data mining researchers and practitioners from academia, industry, and government to share their ideas, research results, and experiences.

Introduction to KDD Cup 2009

 

Customer Relationship Management (CRM) is a key element of modern marketing strategies. The KDD Cup 2009 offers the opportunity to work on large marketing databases from the French Telecom company Orange to predict the propensity of customers to switch provider (churn), buy new products or services (appetency), or buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling).

 

Task Description

 

The task is to estimate the churn, appetency and up-selling probability of customers, hence there are three target values to be predicted. The challenge is staged in phases to test the rapidity with which each team is able to produce results. A large number of variables (15,000) is made available for prediction. However, to engage participants having access to less computing power, a smaller version of the dataset with only 230 variables will be made available in the second part of the challenge.
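
All three targets are binary, so each prediction is a score per customer per task, and the KDD Cup 2009 ranks entries by the area under the ROC curve (AUC) on each task. Below is a minimal C sketch of the rank-sum AUC computation we would use to track our own progress; the sample scores and labels are illustrative only, and tied scores (whose ranks should be averaged) are ignored for brevity.

#include <stdio.h>
#include <stdlib.h>

typedef struct { double score; int label; } Pred;

static int by_score(const void *a, const void *b)
{
    double d = ((const Pred *)a)->score - ((const Pred *)b)->score;
    return (d > 0) - (d < 0);
}

/* AUC = probability that a random positive outscores a random negative,
   computed with the rank-sum (Mann-Whitney) formulation. */
static double auc(Pred *p, int n)
{
    qsort(p, n, sizeof(Pred), by_score);   /* ascending by score */
    double rank_sum = 0.0;
    long npos = 0;
    for (int i = 0; i < n; i++) {
        if (p[i].label) {
            rank_sum += i + 1;             /* 1-based rank */
            npos++;
        }
    }
    long nneg = n - npos;
    return (rank_sum - npos * (npos + 1) / 2.0) / ((double)npos * nneg);
}

int main(void)
{
    /* Illustrative scores and labels for one task (e.g. churn). */
    Pred demo[] = {{0.9,1},{0.8,0},{0.7,1},{0.3,0},{0.2,0},{0.6,1}};
    printf("AUC = %.3f\n", auc(demo, 6));
    return 0;
}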

Churn (Wikipedia definition): Churn rate is also sometimes called attrition rate. It is one of two primary factors that determine the steady-state level of customers a business will support. In its broadest sense, churn rate is a measure of the number of individuals or items moving into or out of a collection over a specific period of time. The term is used in many contexts, but is most widely applied in business with respect to a contractual customer base. For instance, it is an important factor for any business with a subscriber-based service model, including mobile telephone networks and pay TV operators. The term is also used to refer to participant turnover in peer-to-peer networks.

Appetency: In this context, appetency is the propensity to buy a service or a product.

Up-selling (Wikipedia definition): Up-selling is a sales technique whereby a salesman attempts to have the customer purchase more expensive items, upgrades, or other add-ons in an attempt to make a more profitable sale. Up-selling usually involves marketing more profitable services or products, but it can also be simply exposing the customer to other options he or she may not have considered previously. Up-selling can imply selling something additional, or selling something that is more profitable or otherwise preferable for the seller instead of the original sale.

 

II. Contents of the research project:

(1) Background and objectives of the research project. Describe in detail the project's background, objectives, and importance, the state of related research at home and abroad, and a review of the key references.

(2) Research methods, steps, and schedule. Please list: 1. the research methods adopted in this project and the reasons for them; 2. anticipated difficulties and ways to resolve them.

(3) Expected work items and results. Please list: 1. the work items expected to be completed.

 

A. Background

The project is designed as a hands-on exercise for the ANN course and serves as the major grading reference. Professor Lee will introduce many ANN structures and learning algorithms this semester, and we can apply these skills to improve and evaluate our performance in the course.

 

B. Method

1. Obtain the required resources.

a. Hardware:

(1) An account on the computing cluster.

(2) Sufficient hard disk quota (10 GB to begin with).

b. Algorithms/libraries/source code:

(1) ANN algorithms suited to the computing cluster.

(2) Adapt the cluster ANN algorithms to the cluster computer.

(3) Cluster-based A* + ANN algorithms.

(4) Data-set analysis algorithms/tools.

2. Become familiar with the resources.

a. Write a simple C program that distributes tasks to the 300 cluster nodes and gathers the results on the cluster computer (see the MPI sketch after this list).

b. Write C code to implement the cluster ANN algorithms, or port and reuse an existing C implementation (see the backpropagation sketch after this list).

c. Try some small data sets to verify the correctness of the implementation.
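
For item 2.a, the sketch below shows the distribute-and-gather pattern in C, under the assumption (not stated in this proposal) that the cluster provides an MPI implementation; the per-node workload is a placeholder for the real computation.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this node's id        */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* e.g. 300 nodes total  */

    /* Each node works on its own interleaved slice of the task;
       the loop body is a stand-in for real per-node work. */
    double local_result = 0.0;
    for (int i = rank; i < 3000; i += size)
        local_result += (double)i * i;

    /* Gather one result per node back to rank 0. */
    double *all = NULL;
    if (rank == 0)
        all = malloc(size * sizeof(double));
    MPI_Gather(&local_result, 1, MPI_DOUBLE,
               all, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        double total = 0.0;
        for (int i = 0; i < size; i++)
            total += all[i];
        printf("combined result from %d nodes: %f\n", size, total);
        free(all);
    }

    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched with, for example, mpirun -np 300, each rank computes its slice and rank 0 combines the results.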
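
For items 2.b and 2.c, the following self-contained sketch implements a one-hidden-layer network trained by plain backpropagation and verifies it on XOR as a small correctness check; the layer size, learning rate, and epoch count are illustrative assumptions, and a cluster version would distribute this training loop across nodes.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define NIN  2   /* inputs                             */
#define NHID 4   /* hidden units (illustrative choice) */

static double sigmoid(double x) { return 1.0 / (1.0 + exp(-x)); }

int main(void)
{
    /* XOR training set: four patterns with binary targets. */
    const double x[4][NIN] = {{0,0},{0,1},{1,0},{1,1}};
    const double t[4] = {0, 1, 1, 0};

    double w1[NHID][NIN + 1], w2[NHID + 1];  /* last slot is the bias */
    srand(1);
    for (int j = 0; j < NHID; j++) {
        for (int i = 0; i <= NIN; i++)
            w1[j][i] = rand() / (double)RAND_MAX - 0.5;
        w2[j] = rand() / (double)RAND_MAX - 0.5;
    }
    w2[NHID] = 0.0;

    const double lr = 0.5;  /* illustrative step size */
    for (int epoch = 0; epoch < 20000; epoch++) {
        for (int n = 0; n < 4; n++) {
            /* Forward pass. */
            double h[NHID];
            for (int j = 0; j < NHID; j++) {
                double s = w1[j][NIN];  /* bias */
                for (int i = 0; i < NIN; i++)
                    s += w1[j][i] * x[n][i];
                h[j] = sigmoid(s);
            }
            double o = w2[NHID];
            for (int j = 0; j < NHID; j++)
                o += w2[j] * h[j];
            o = sigmoid(o);

            /* Backward pass: output delta, then hidden deltas. */
            double dout = (o - t[n]) * o * (1.0 - o);
            for (int j = 0; j < NHID; j++) {
                double dh = dout * w2[j] * h[j] * (1.0 - h[j]);
                w2[j] -= lr * dout * h[j];
                for (int i = 0; i < NIN; i++)
                    w1[j][i] -= lr * dh * x[n][i];
                w1[j][NIN] -= lr * dh;  /* hidden bias */
            }
            w2[NHID] -= lr * dout;      /* output bias */
        }
    }

    /* Verify on the training patterns. */
    for (int n = 0; n < 4; n++) {
        double o = w2[NHID];
        for (int j = 0; j < NHID; j++) {
            double s = w1[j][NIN];
            for (int i = 0; i < NIN; i++)
                s += w1[j][i] * x[n][i];
            o += w2[j] * sigmoid(s);
        }
        printf("%g XOR %g -> %.3f (target %g)\n",
               x[n][0], x[n][1], sigmoid(o), t[n]);
    }
    return 0;
}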

3. Take part in the KDD Cup 2009.

a. Run on the KDD Cup 2009 large data set.

b. Analyze the data set.

c. Sample the data set (see the reservoir-sampling sketch after this list).
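
For item 3.c, reservoir sampling draws a uniform sample of rows in a single pass without ever loading the whole data set, which matters when each row has 15,000 fields. The file name and sample size below are hypothetical placeholders.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SAMPLE  1000       /* rows to keep (illustrative)         */
#define MAXLINE 1048576    /* rows with 15,000 fields can be wide */

int main(void)
{
    static char *reservoir[SAMPLE];
    char *line = malloc(MAXLINE);
    /* Hypothetical file name standing in for the real training file. */
    FILE *fp = fopen("orange_large_train.data", "r");
    if (!fp || !line) { perror("setup"); return 1; }

    srand(1);
    long seen = 0;
    while (fgets(line, MAXLINE, fp)) {
        if (seen < SAMPLE) {
            reservoir[seen] = strdup(line);  /* fill phase */
        } else {
            /* Keep the new row with probability SAMPLE/(seen+1). */
            long j = (long)(rand() / ((double)RAND_MAX + 1.0) * (seen + 1));
            if (j < SAMPLE) {
                free(reservoir[j]);
                reservoir[j] = strdup(line);
            }
        }
        seen++;
    }
    fclose(fp);

    for (long i = 0; i < SAMPLE && i < seen; i++)
        fputs(reservoir[i], stdout);
    free(line);
    return 0;
}

Each row survives with equal probability at the end of the pass, so the sample is uniform even though the row count is unknown in advance.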

4. Fine-tune the result.

a. A* + ANN (see the search sketch below).
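
The proposal does not specify how A* combines with the ANN. One plausible reading, sketched below purely as our assumption, is a best-first search over hyperparameter configurations ordered by f = g + h, where g is the training effort already spent and h is the current validation error; evaluate_config() is a hypothetical stub standing in for a full train-and-validate cycle.

#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int hidden_units;      /* network width to try             */
    double learning_rate;  /* step size to try                 */
    double g;              /* cost spent reaching this config  */
    double h;              /* heuristic: validation error      */
} Config;

/* Hypothetical stand-in: train briefly and return validation error.
   The quadratic error surface below is for illustration only. */
static double evaluate_config(const Config *c)
{
    double u = (c->hidden_units - 64) / 64.0;
    double r = (c->learning_rate - 0.05) / 0.05;
    return u * u + r * r;
}

static int cmp_f(const void *a, const void *b)
{
    double fa = ((const Config *)a)->g + ((const Config *)a)->h;
    double fb = ((const Config *)b)->g + ((const Config *)b)->h;
    return (fa > fb) - (fa < fb);
}

int main(void)
{
    /* Frontier: start from a seed config, expand neighbours, and
       always visit the config with the lowest f = g + h next. */
    Config frontier[256];
    int n = 0;
    frontier[n++] = (Config){32, 0.01, 0.0, 0.0};
    frontier[0].h = evaluate_config(&frontier[0]);

    Config best = frontier[0];
    for (int step = 0; step < 20 && n > 0; step++) {
        qsort(frontier, n, sizeof(Config), cmp_f);
        Config cur = frontier[0];        /* lowest f */
        frontier[0] = frontier[--n];     /* pop      */
        if (cur.h < best.h) best = cur;

        /* Expand neighbours: double the width, scale the step size. */
        Config nb[2] = { cur, cur };
        nb[0].hidden_units *= 2;
        nb[1].learning_rate *= 2.0;
        for (int i = 0; i < 2 && n < 256; i++) {
            nb[i].g = cur.g + 1.0;       /* one more training round */
            nb[i].h = evaluate_config(&nb[i]);
            frontier[n++] = nb[i];
        }
    }
    printf("best config: %d hidden units, lr = %.4f, validation error %.4f\n",
           best.hidden_units, best.learning_rate, best.h);
    return 0;
}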

C. Anticipated difficulties and possible solutions

1. We cannot make proper use of the computing cluster; for example, distributing tasks to the 300 nodes fails, or gathering results from the 300 nodes fails.

Solution: Study working sample code or contact HP support.

2. A single round takes too much time.

Solution: Use a profiling tool to identify the bottleneck. Check whether the bottleneck code can be optimized or parallelized. Use the compiler's optimization options. Rewrite hot spots in assembly. Consider a better algorithm. Check whether the implementation suits the architecture of the current cluster.

3. The error rate on the test set is not good enough.

Solution: Try the A* + ANN approach.

D. Achievement estimate

We expect to complete items 1.a.(1) through 2.c, or as far as 3.a; time is the critical constraint.