Homework 3_1_M9907C07

Final Report KDD CUP 2011

---

Outline: Introduction | Evaluation criterion | Flow chart | Issues & Solutions | Result | Reference

---

1. Introduction

In KDD Cup 2011, Yahoo! Music offers a wealth of information and services related to many aspects of music.

The goal is to train a model on the training set and to predict the relevant target for each item in the test set: in other words, to find which songs users would like to listen to.

There are two tracks:

1. The first track aims at predicting the scores users gave to various items, i.e., learning to predict users' ratings of musical items.

2. The second track requires separating loved songs from other songs, i.e., learning to separate tracks scored highly by specific users from tracks not scored by them.

 

(a) Training data

The released data represents a sampled snapshot of the Yahoo! Music community's preferences for various musical items.

A distinctive feature of this dataset is that user ratings are given to entities of four different types: tracks, albums, artists, and genres.

In addition, the items are tied together within a hierarchy.

 

(b) Datasets

The two tracks of the competition employ two different datasets, which we describe below:

Track 1

The dataset is split into three subsets:

- Train data: in the file trainIdx1.txt

- Validation data: in the file validationIdx1.txt

- Test data: in the file testIdx1.txt

 

For each subset, user rating data is grouped by user.

The first line for a user is formatted as:

<UserId>|<#UserRatings>\n

Each of the next <#UserRatings> lines describes a single rating by <UserId>, sorted in chronological order. The rating line format is:

<ItemId>\t<Score>\t<Date>\t<Time>\n

The scores are integers between 0 and 100. All user IDs and item IDs are consecutive integers, both starting at zero. Dates are integers giving the number of days elapsed since an undisclosed date.
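The user-grouped format above can be parsed with a short loop. This is a minimal Python sketch (the report itself handled the files with LabVIEW and MATLAB), assuming fields are separated exactly as described:

```python
def parse_track1(lines):
    """Parse the user-grouped Track 1 rating format.

    Each user starts with a header line '<UserId>|<#UserRatings>',
    followed by that many lines '<ItemId>\t<Score>\t<Date>\t<Time>'.
    Returns a dict mapping user id -> list of (item, score, date, time).
    """
    ratings = {}
    it = iter(lines)
    for header in it:
        user_id, n = header.rstrip("\n").split("|")
        user_ratings = []
        for _ in range(int(n)):
            item, score, date, time = next(it).rstrip("\n").split("\t")
            user_ratings.append((int(item), int(score), int(date), time))
        ratings[int(user_id)] = user_ratings
    return ratings
```

The same loop works for Track 2 files after dropping the date and time fields.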

An item has at least 20 ratings in the total dataset (including train, validation, and test sets).

The dataset statistics are as follows:

#Users               1,000,990
#Items                 624,961
#Ratings           262,810,175
#TrainRatings      252,800,275
#ValidationRatings   4,003,960
#TestRatings         6,005,940

 

Track 2

The dataset is split into two subsets:

- Train data: in the file trainIdx2.txt

- Test data: in the file testIdx2.txt

 

In each subset, user rating data is grouped by user.

The first line for a user is formatted as:

<UserId>|<#UserRatings>\n

Each of the next <#UserRatings> lines describes a single rating by <UserId>. The rating line format is:

<ItemId>\t<Score>\n

The scores are integers between 0 and 100, and are withheld in the test set. All user IDs and item IDs are consecutive integers, both starting at zero.

 

An item has at least 20 ratings in the total dataset (including train and test sets), and each user has at least 17 ratings in the training data.

 

For each user participating in the test set, six items are listed. All these items must be tracks (not albums, artists, or genres). Three of these six items have never been rated by the user, whereas the other three were rated "highly" by the user, that is, scored 80 or higher.

 

The three items rated highly by the user were chosen at random from the user's highly rated items, without considering rating time. The three test items not rated by the user are picked at random with probability proportional to their odds of receiving "high" (80 or higher) ratings in the overall population.
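The negative-sampling rule above can be sketched in Python. Here `high_rating_counts` is a hypothetical precomputed map from item id to its number of high (>= 80) ratings in the overall population:

```python
import random

def sample_negatives(high_rating_counts, rated_items, k=3, rng=None):
    """Pick k distinct items the user never rated, with probability
    proportional to how often each item received a 'high' (>= 80)
    rating in the overall population."""
    rng = rng or random.Random(0)
    candidates = [i for i in high_rating_counts if i not in rated_items]
    weights = [high_rating_counts[i] for i in candidates]
    chosen = set()
    while len(chosen) < k:
        chosen.add(rng.choices(candidates, weights=weights)[0])
    return chosen
```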

 

The dataset statistics are as follows:

#Users            249,012
#Items            296,111
#Ratings       62,551,438
#TrainRatings  61,944,406
#TestRatings      607,032

 

A hierarchy of items similar to the one used in Track 1 is also given for Track 2. However, timestamps of the ratings, which are given in Track 1, are withheld for Track 2.

 

(c) Log Fields

1. Track, stored in “trackData.txt” - Track information formatted as:

<TrackId>|<AlbumId>|<ArtistId>|<Optional GenreId_1>|...|<Optional GenreId_k>\n

 

2. Album, stored in “albumData.txt” - Album information formatted as:

<AlbumId>|<ArtistId>|<Optional GenreId_1>|...|<Optional GenreId_k>\n

 

3. Artist, stored in “artistData.txt” - Artist listing formatted as:

<ArtistId>\n

 

4. Genre, stored in “genreData.txt” - Genre listing formatted as:

<GenreId>\n
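Loading the hierarchy files is a matter of splitting each line on '|'. A minimal Python sketch for trackData.txt, assuming missing album or artist fields appear as the literal string 'None':

```python
def parse_track_data(lines):
    """Parse trackData.txt lines of the form
    '<TrackId>|<AlbumId>|<ArtistId>|<GenreId_1>|...|<GenreId_k>'.
    Returns a dict: track id -> (album id, artist id, [genre ids]),
    with None for missing album/artist."""
    tracks = {}
    for line in lines:
        fields = line.rstrip("\n").split("|")
        track = int(fields[0])
        album = None if fields[1] == "None" else int(fields[1])
        artist = None if fields[2] == "None" else int(fields[2])
        genres = [int(g) for g in fields[3:]]
        tracks[track] = (album, artist, genres)
    return tracks
```

albumData.txt parses the same way with one fewer leading field; artistData.txt and genreData.txt are plain id listings.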

 

2. Evaluation criterion

Track 1:

Track 1 will be evaluated using the Root Mean Square Error (RMSE):

RMSE = sqrt( (1/n) * Σ_{i=1..n} (ŷ_i − y_i)² )

where n is the number of test examples, ŷ_i ∈ [0, 1] are the predictions, and y_i ∈ {0, 1} are the true answers.

 

Track 2:

Track 2 will be evaluated using the error rate, i.e., the fraction of misclassifications:

ErrorRate = (1/n) * Σ_{i=1..n} 1[ŷ_i ≠ y_i]
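Both criteria are easy to compute once predictions and true answers are aligned; a minimal Python sketch:

```python
import math

def rmse(predictions, targets):
    """Root mean square error between predicted and true scores (Track 1)."""
    n = len(predictions)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n)

def error_rate(predictions, targets):
    """Fraction of misclassified items (Track 2); labels are 0 or 1."""
    wrong = sum(1 for p, t in zip(predictions, targets) if p != t)
    return wrong / len(targets)
```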

 

3. Flow chart

4. Issues & Solutions

(1) How to read the dataset?

The dataset is too large to open in common programs such as WordPad or Excel; with my computer's limited memory, those programs simply stop responding.

So I used LabVIEW to split the dataset into several smaller parts.

  

Then I could use those smaller files to train the model.
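The splitting step can be reproduced in a few lines of Python (the report did it in LabVIEW):

```python
def split_file(lines, chunk_size):
    """Yield successive chunks of at most chunk_size lines, so each
    piece is small enough to open and process in memory."""
    chunk = []
    for line in lines:
        chunk.append(line)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk
```

In practice the cut should fall on user boundaries (a header line plus its <#UserRatings> rating lines) so no user record is torn across two files.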

 

(2) Data preprocessing and sampling.

There are many noisy records and empty fields in the dataset.

Because I cannot reliably decide what counts as noise, incomplete records are removed as noise.

The database is also too big to read at once, so I loaded all the data into MySQL and joined the related elements together.
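The "remove incomplete records as noise" rule can be sketched as a simple filter in Python:

```python
def drop_incomplete(rows, n_fields):
    """Remove rows that have the wrong number of fields or any empty
    field, treating incompleteness as noise as described above."""
    return [r for r in rows if len(r) == n_fields and all(f != "" for f in r)]
```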

(3) How to select features?

An artist's genre is usually fixed, and users usually buy CDs by artist.

Therefore, the artist is used as the primary index here, followed by the album and the track classification, and finally combined with the genre and the user.

Most people have similar tastes in musical style and favorite artists, so tracks are categorized directly by genre.

Then the album and artist are scored by the proportion of their tracks that belong to each genre.

The higher that proportion, the higher the users' predicted ratings for the album and artist.
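The genre-proportion computation described above can be sketched in Python; `track_genres` is a hypothetical map from track id to its genre ids (as parsed from trackData.txt):

```python
from collections import Counter

def genre_proportions(track_genres, tracks):
    """For the set of tracks belonging to one album (or artist),
    compute the fraction of those tracks carrying each genre id."""
    counts = Counter()
    for t in tracks:
        for g in track_genres.get(t, []):
            counts[g] += 1
    total = len(tracks)
    return {g: c / total for g, c in counts.items()}
```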

 

(4) Training the model.

I used the MATLAB Neural Network Toolbox to train the model.

Note that the data matrix is large, so training time can be unbearable.

Therefore, the computation must be parallelized or run on a powerful computer.

At first I found that my computer's memory was too small to train the model, so I read the MATLAB help to learn how to raise MATLAB's memory limit, but it still did not work.

In the end, I simply trained on a sampled subset of the data.

In addition, the scores are integers between 0 and 100.

Training on the full 0-100 range is too complicated, and training time would be long.

So I round every score to the nearest multiple of 10 (0, 10, 20, ..., 90, 100), leaving only eleven discrete grades to train on. Training time becomes much shorter.
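The score coarsening can be written as one Python function:

```python
def coarsen_score(score):
    """Round a 0-100 score to the nearest multiple of 10, leaving
    only eleven discrete grades (0, 10, ..., 100)."""
    return min(100, max(0, round(score / 10) * 10))
```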

 

(5) Prediction.

To improve classification or prediction accuracy, ensemble learning may be a good choice.

Its main idea is to combine classifiers of the same type or of different types for classification or prediction, because each classifier tends to be especially accurate on certain kinds of data.

Thus, several different learning algorithms can be trained, and at the end each classifier's output is combined by voting or by a weighted average of the predictions, yielding a more accurate result.
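A weighted-average ensemble, the scheme described above, takes only a few lines of Python:

```python
def weighted_ensemble(predictions, weights):
    """Combine per-model predictions by a weighted average.
    predictions is a list of lists (one inner list per model);
    weights need not sum to 1."""
    total_w = sum(weights)
    n = len(predictions[0])
    return [sum(w * preds[i] for preds, w in zip(predictions, weights)) / total_w
            for i in range(n)]
```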

 

5. Result

My best score is 37.4272.

 

6. Reference

http://kddcup.yahoo.com/

https://pslcdatashop.web.cmu.edu/KDDCup/

Advanced MATLAB and Engineering Applications (MATLAB進階與工程問題應用), 楊智旭 / 張嘉峰 / 楊政達, 高立.

Introduction to LabVIEW Programming (LabVIEW 程式設計入門), 徐瑞隆, 新文京.

PHP 5 & MySQL 5 Beginner's Learning Guide (PHP 5 & MySQL 5入門學習指南), 凱文瑞克, 旗標.