Neural Network - Homework 3 (KDD Cup 2011 Final Report)


M9915009 林雅惠


Issue:
My main goal is to speed up the computation, so there are several issues I have to study:
- Dimension reduction
  - Instance reduction
  - Feature selection
- Missing data
- Choosing a low-complexity algorithm

Proposed approach:
Statistical analysis: For each user, use the average and standard deviation of that user's scores to define the high and low score ranges.
- High score range: [average + 1 * standard deviation, 100]
- Low score range: [0, average - 1 * standard deviation]

Dimension reduction - Instance reduction
- Based on the statistics, remove the instances of users whose score standard deviation is lower than 10.
- Using the definitions of high and low score, label each instance's score and remove the normal-score instances.
  - 1: high score
  - 0: normal score (remove this kind of instance)
  - -1: low score

Dimension reduction - Feature reduction
- Keep genre and artist as the main features and remove the others: track, album, date, time.
  - Most items in the training data are tracks and albums, so each track ID is mapped to its corresponding artist ID.
    - For example: original instance: 0 (userID), 0 (trackID), 1 (label); mapped instance: 0, 587636 (artistID), 1
- Merge artist and genre into a single feature.
  - For example: in the instance 0, X, 1, X can be either an artist ID or a genre ID.

Missing data
- If an instance contains missing data, I remove that instance.

Choosing an algorithm
- A decision tree is cheap in computing resources, so I chose this algorithm.
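
The preprocessing steps described above (per-user statistics, high/low labeling, instance reduction, track-to-artist mapping, and removal of missing data) can be sketched roughly as follows. This is only an illustrative sketch, not the code used for the report. The function names `label_and_reduce` and `map_to_artist` and the `track_to_artist` dictionary are hypothetical. It also assumes a population standard deviation, inclusive range boundaries, and that items without an artist mapping count as missing data; the report does not specify any of these.

```python
from collections import defaultdict
from statistics import mean, pstdev

MIN_STD = 10.0  # threshold from the report: drop users whose score std is below 10


def label_and_reduce(ratings, min_std=MIN_STD):
    """Label scores per user and drop uninformative instances.

    ratings: iterable of (user_id, item_id, score), score in [0, 100] or None.
    Returns a list of (user_id, item_id, label) with label 1 (high) or -1 (low).
    """
    by_user = defaultdict(list)
    for user_id, item_id, score in ratings:
        if score is None:  # missing data: remove the instance
            continue
        by_user[user_id].append((item_id, score))

    reduced = []
    for user_id, rows in by_user.items():
        scores = [score for _, score in rows]
        avg = mean(scores)
        std = pstdev(scores)  # assumption: population standard deviation
        if std < min_std:     # user's scores barely vary: remove all instances
            continue
        high, low = avg + std, avg - std
        for item_id, score in rows:
            if score >= high:
                reduced.append((user_id, item_id, 1))
            elif score <= low:
                reduced.append((user_id, item_id, -1))
            # otherwise the score is "normal" (label 0) and is removed
    return reduced


def map_to_artist(instances, track_to_artist):
    """Replace each track ID with its artist ID.

    Assumption: an item with no entry in track_to_artist is treated as
    missing data and dropped.
    """
    mapped = []
    for user_id, item_id, label in instances:
        artist_id = track_to_artist.get(item_id)
        if artist_id is None:
            continue
        mapped.append((user_id, artist_id, label))
    return mapped
```

Grouping the ratings by user first means the average and standard deviation are computed once per user, so the high and low thresholds adapt to each user's own rating habits rather than to a global cutoff.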

Experiments & Results:
Statistical analysis:
Figure 1. Track1 - training data standard deviation
Figure 2. Track2 - training data standard deviation
For roughly one third of the users, the scores they give do not change much whether they like an item or not.
- Track1: 38.97% of users have a score standard deviation between 0 and 10.
- Track2: 31.46% of users have a score standard deviation between 0 and 10.

Dimension reduction (Track1 training data):
- Official data provided by KDD Cup: 5.55 GB
- After preprocessing and feature selection: 3.68 GB
- After all dimension reduction processing: 0.99 GB

Time cost of the prediction computation:
Figure 3. Scores