Neural Network – Homework 3 (KDD Cup 2011 Final Report)

M9915009 林雅惠

• Issue
• Proposed approach
• Experiments & Results

Issue:

My main goal is to speed up the computation, so I have to study the following issues:

• Dimension reduction
    ▪ Instance reduction
    ▪ Feature selection
    ▪ Missing data
• Choosing a low-complexity algorithm


Proposed approach:

Statistical Analysis:

For each user, I use the average and standard deviation of that user's scores to define the high-score and low-score ranges:

• High score range: [Average + 1 * standard deviation, 100]
• Low score range: [0, Average - 1 * standard deviation]
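The per-user ranges above could be computed roughly as follows. This is only a minimal sketch: the function name `user_thresholds`, the `(user_id, item_id, score)` row layout, and the use of the population standard deviation (`statistics.pstdev`) are assumptions, not details taken from the report.

```python
from collections import defaultdict
from statistics import mean, pstdev


def user_thresholds(ratings, k=1.0):
    """Return {user_id: (low_threshold, high_threshold)}.

    Low score range is [0, low_threshold]; high score range is
    [high_threshold, 100], following the definitions above.
    """
    scores_by_user = defaultdict(list)
    for user_id, _item_id, score in ratings:
        scores_by_user[user_id].append(score)

    thresholds = {}
    for user_id, scores in scores_by_user.items():
        avg = mean(scores)
        sd = pstdev(scores)  # assumption: population standard deviation
        thresholds[user_id] = (avg - k * sd, avg + k * sd)
    return thresholds


# Hypothetical toy data: one user with three scores.
ratings = [(0, 10, 90), (0, 11, 50), (0, 12, 10)]
thresholds = user_thresholds(ratings)
low, high = thresholds[0]
print(f"low range: [0, {low:.2f}], high range: [{high:.2f}, 100]")
```

With `k=1.0` this matches the "1 * standard deviation" factor used in the ranges; keeping it as a parameter makes it easy to try wider or narrower bands.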

 

Dimension Reduction – Instance reduction

• Based on the statistical results, remove the instances of users whose score standard deviation is lower than 10.
• Using the definitions of high and low scores, label each instance's score, then remove the normal-score instances.
    ▪ 1: high score
    ▪ 0: normal score (instances of this kind are removed)
    ▪ -1: low score
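The two instance-reduction rules can be sketched as below. The helper names `label_score` and `reduce_instances`, the toy data, and the assumption that per-user statistics are already available as `(average, standard deviation)` pairs are all illustrative, not the report's actual implementation.

```python
def label_score(score, avg, sd):
    """Return 1 (high), -1 (low), or 0 (normal) for one score."""
    if score >= avg + sd:
        return 1
    if score <= avg - sd:
        return -1
    return 0


def reduce_instances(ratings, user_stats, min_sd=10.0):
    """Keep only high/low instances of users whose standard deviation is at least min_sd."""
    kept = []
    for user_id, item_id, score in ratings:
        avg, sd = user_stats[user_id]
        if sd < min_sd:
            continue  # scores of this user barely vary: drop the instance
        label = label_score(score, avg, sd)
        if label != 0:  # normal-score instances are removed
            kept.append((user_id, item_id, label))
    return kept


# Hypothetical toy data: (user_id, item_id, score) rows and per-user (avg, sd).
ratings = [(0, 10, 90), (0, 11, 50), (0, 12, 10), (1, 10, 70), (1, 13, 72)]
user_stats = {0: (50.0, 32.66), 1: (71.0, 1.0)}
reduced = reduce_instances(ratings, user_stats)
print(reduced)  # [(0, 10, 1), (0, 12, -1)]
```

User 1 is dropped entirely because the standard deviation is below 10, and the score of 50 from user 0 is dropped because it falls in the normal range.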

 

Dimension Reduction – Feature reduction

• Keep genre and artist as the main features and remove the other features: track, album, date, and time.
    ▪ Most items in the training data are tracks and albums, so each track ID is used to look up the corresponding artist ID.
        ◆ For example:
          Original instance: 0 (userID), 0 (trackID), 1 (label)
          Mapped instance: 0 (userID), 587636 (artistID), 1 (label)
• Merge artist and genre into a single feature.
    ▪ For example:
      Instance: 0, X, 1
      where X can be either an artist ID or a genre ID.
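One possible reading of the mapping and merging steps is sketched below: every track instance is rewritten into rows whose single feature column X holds either the artist ID or a genre ID. The function `map_instances`, the lookup dictionaries, and the genre ID `1001` are hypothetical; only the track 0 to artist 587636 pair comes from the example above. The sketch also assumes artist IDs and genre IDs do not collide.

```python
def map_instances(instances, track_to_artist, track_to_genres):
    """Rewrite (user_id, track_id, label) rows as (user_id, X, label) rows,
    where X is the track's artist ID or one of its genre IDs."""
    mapped = []
    for user_id, track_id, label in instances:
        artist_id = track_to_artist.get(track_id)
        if artist_id is not None:  # assumption: tracks without a known artist yield no artist row
            mapped.append((user_id, artist_id, label))
        for genre_id in track_to_genres.get(track_id, ()):
            mapped.append((user_id, genre_id, label))
    return mapped


track_to_artist = {0: 587636}   # pair taken from the example above
track_to_genres = {0: [1001]}   # hypothetical genre ID for illustration
instances = [(0, 0, 1)]         # (userID, trackID, label)

mapped = map_instances(instances, track_to_artist, track_to_genres)
print(mapped)  # [(0, 587636, 1), (0, 1001, 1)]
```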

 

Missing data

• If an instance in the dataset has missing data, I remove that instance.
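As a small illustration of this rule, the sketch below drops every instance that has a missing field. It assumes missing values are represented as `None` after parsing; the function name `drop_incomplete` and the toy rows are hypothetical.

```python
def drop_incomplete(instances):
    """Keep only the instances in which no field is missing (None)."""
    return [row for row in instances if all(field is not None for field in row)]


# Hypothetical rows: (user_id, X, label); the second row has no artist/genre ID.
instances = [(0, 587636, 1), (0, None, -1), (1, 587636, -1)]
complete = drop_incomplete(instances)
print(complete)  # [(0, 587636, 1), (1, 587636, -1)]
```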

 

Choosing an algorithm

• A decision tree is a low-cost algorithm in terms of computing resources, so I chose it.


Experiments & Results:

Statistical Analysis:

Figure 1. Track1 – training data standard deviation

Figure 2. Track2 – training data standard deviation

For roughly one third of the users, the scores they give vary little, whether or not they like an item.

• Track1: 38.97% of users have a score standard deviation between 0 and 10.
• Track2: 31.46% of users have a score standard deviation between 0 and 10.
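A share like the ones reported above could be obtained from the per-user standard deviations as sketched here. The function name `share_below`, the half-open interval [0, 10), and the sample values are assumptions for illustration; they are not the Track1 or Track2 data.

```python
def share_below(user_sds, limit=10.0):
    """Fraction of users whose score standard deviation lies in [0, limit)."""
    if not user_sds:
        raise ValueError("no users given")
    return sum(1 for sd in user_sds if 0 <= sd < limit) / len(user_sds)


# Hypothetical per-user standard deviations.
user_sds = [0.0, 4.2, 9.9, 10.0, 25.3, 31.8, 12.5, 40.1]
share = share_below(user_sds)
print(f"{share:.2%}")  # 37.50%
```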

 

Dimension Reduction:

(Track1 training data)

Officially provided by KDD Cup: 5.55 GB

After pre-processing and feature selection: 3.68 GB

After all dimension reduction processing: 0.99 GB

Summary: the Track1 training data shrinks from 3.68 GB to 0.99 GB, about 27% of its size before dimension reduction.

 

Cost Time with Prediction Computation:

Original dataset:

Total: 252,800,275 instances

Cost time: 1,507,995 ms

KDD submission score: 29.9577

After dimension reduction:

Total: 67,414,254 instances

Cost time: 435,037 ms

KDD submission score: 29.9577

Summary (Track1): the computation cost time drops from 1,507,995 ms to 435,037 ms, about 29% of the original, while the KDD submission score stays the same (29.9577 in both cases).
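As a quick check of the reduction ratios, the snippet below recomputes them from the figures reported in this section. The helper `percent_of` is a hypothetical name; the file sizes are given to two decimals in the report, so the size ratio is approximate.

```python
def percent_of(after, before):
    """Return `after` as a percentage of `before`."""
    return 100.0 * after / before


instances_pct = percent_of(67_414_254, 252_800_275)  # instance counts
time_pct = percent_of(435_037, 1_507_995)            # cost time in ms
size_pct = percent_of(0.99, 3.68)                    # file size in GB
speedup = 1_507_995 / 435_037

print(f"instances: {instances_pct:.1f}%")  # instances: 26.7%
print(f"cost time: {time_pct:.1f}%")       # cost time: 28.8%
print(f"file size: {size_pct:.1f}%")       # file size: 26.9%
print(f"speed-up:  {speedup:.2f}x")        # speed-up:  3.47x
```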

 

Figure 3. Scores

 
