992 NN HW#3 ( KDD Cup 2011 Final Report )



Prof Hahn-Ming, Lee

Student ID


Student Name

Dong-Jie, Wu



Dataset Description

Issues & Survey

Proposed approach





Introduction [Top]

In KDD Cup 2011, the topic this year is “Learn the rhythm, predict the musical scores”, Yahoo! Music has amassed billions of user ratings for musical pieces. When properly analyzed, the raw ratings encode information on how songs are grouped, which hidden patterns link various albums, which artists complement each other, and above all, which songs users would like to listen to.


The KDD Cup contest releases over 300 million ratings performed by over 1 million anonymized users. The ratings are given to different types of items-songs, albums, artists, genres-all tied together within a known taxonomy.


The competition is divided into two tracks:


Track1: Learning to predict users' ratings of musical items. Items can be tracks, albums, artists and genres. Items form a hierarchy, such that each track belongs to an album, albums belong to artists, and together they are tagged by genres.


Track2: Learning to separate tracks scored highly by specific users from tracks not scored by them. In track2 the test set includes six items per user (all are tracks), three of which were rated highly (score 80 or higher) by the user and three were not rated by the user. The three unrated items are sampled with a probability proportional to number of their high (>=80) ratings. The task is to classify each item as either rated or not rated by the user (1 or 0 respectively). A hierarchy of items similar to the one used in Track 1 is also given for Track 2. However, timestamps of the ratings, which are given in Track 1, are withheld for Track 2.


The targets on test sets are not given. The goal is to train a model on the training set and to predict the relevant targets for each item on the test set.




Dataset Description [Top]


-         Train data: trainIdx1.txt

-         Validation data: validationIdx1.txt

-         Test data: testIdx1.txt


For each subset, user rating data is grouped by user. First line for a user is formatted as:



Each of the next <#UserRatings> lines describes a single rating by <UsedId>, sorted in chronological order. Rating line format is:



UsedId & ItemId: Consecutive integer, both starting at zero

Score: Integer between 0 and 100

Date: Integer describing number of days elapsed since an undisclosed date


Example: ( There is no any score information in test data )


507696    90   5103      07:34:00

137915    90   5103      07:39:00


The dataset statistics are as follows:















-         Train data: trainIdx2.txt

-         Test data: testIdx2.txt


For each subset, user rating data is grouped by user. First line for a user is formatted as:



Each of the next <#UserRatings> lines describes a single rating by <UsedId>. Rating line format is:



UsedId & ItemId: Consecutive integer, both starting at zero

Score: Integer between 0 and 100


Example: ( There is no any score information in test data )


28341       90

51210       90


The dataset statistics are as follows:












Item Taxonomy

A unique feature of the datasets is a taxonomy annotating known relations between the items. Such a taxonomy is expected to be particularly useful here, due to the large number of items and the sparseness of data per item (mostly attributed to "tracks" rather than to "artists").


Recall that item id's can represent tracks, albums, artists or genres. The type of each item, including a hierarchical structure linking tracks, albums, artists and genres, is stored (separately for each the two datasets) in the following four files:


trackData.txt - Track information formatted as:

<TrackId>|<AlbumId>|<ArtistId>|<Optional GenreId_1>|...|<Optional GenreId_k>\n


albumData.txt - Album information formatted as:

<AlbumId>|<ArtistId>|<Optional GenreId_1>|...|<Optional GenreId_k>\n


artistData.txt - Artist listing formatted as:



genreData.txt - Genre listing formatted as:



Within these files, missing values are encoded as the string "None".




Issues & Survey [Top]

1.     Large amounts of data ( over 300 million ratings and over 600K items ) need to be handled and analyzed.


Here I use “Divide-and-Conquer” and combine both database and file to handle such a large dataset, because I have some observation as follow:

-    Search amounts of data: The performance of the database is better than the file

-    Search a little data: The performance of the file is better than the database


2.     There are some different categories of items, which all linked together within a defined hierarchy, need to be considered.


Here I link related items for each item, for each item, there is a file named as the item id with

related items inside.


3.     There are some missing values that need to be handled.


Although there are some missing values, I didn’t just ignore them, I try to recover them by another linking information, if there’s no any linking information with the missing value, I’ll choose to ignore it at final.

4.     Collaborative filtering: ( The following refer from wiki )


Collaborative filtering (CF) is the process of filtering for information or patterns using techniques involving collaboration among multiple agents, viewpoints, data sources, etc.


Collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting taste information from many users (collaborating). The underlying assumption of the CF approach is that those who agreed in the past tend to agree again in the future. For example, a collaborative filtering or recommendation system for television tastes could make predictions about which television show a user should like given a partial list of that user's tastes (likes or dislikes). Note that these predictions are specific to the user, but use information gleaned from many users. This differs from the simpler approach of giving an average (non-specific) score for each item of interest, for example based on its number of votes.


Model-Based: Models are developed using data mining, machine learning algorithms to find patterns based on training data. These are used to make predictions for real data. There are many model based CF algorithms. These include Bayesian Networks, clustering models, latent semantic models such as singular value decomposition, probabilistic latent semantic analysis, Multiple Multiplicative Factor, Latent Dirichlet allocation and markov decision process based models. There are several advantages with this paradigm. It handles the sparsity better than memory based ones. This helps with scalability with large data sets. It improves the prediction performance. It gives an intuitive rationale for the recommendations. The disadvantages with this approach are in the expensive model building. One needs to have a tradeoff between prediction performance and scalability. One can lose useful information due to reduction models. A number of models have difficulty explaining the predictions.


5.     The rating scores are unbalanced. ( See Experiment )


Although the rating scores are unbalanced, I’ll still consider all of them, and train by users, analyzing the user’s hobby.




Proposed Approach [Top]

1.     Feature selection: Try to select the features from the large amounts of dataset, and also find the relation between features, make feature combination to reduce features’ dimension.

Features may be items’ id ( artist, album, track, genre ).

2.     Build models: I’ll apply SVM and SVR tools to build the training model by using LIBSVM.


3.     After building models, I will use those models to predict the testing data, and then compare the accuracy between different algorithms.


4.     Repeatedly tune the training models by adjusting the parameters and combining the related features.


5.     I will also analyze that for each user, which items ( artist, album, track, genre ) the user is interested in ( with high score ), here I define the score which is higher than 80 as “high score”.

On the contrary, the item with low score ( here I define the score which is lower than 20 as “low score” ) will also be analyzed.


6.     Choose the best one result as my final submission.


7.     Another approaches:

Some well-known classification algorithms may not easily apply on such a big dataset, so I just try some following simple approaches on the dataset.


User rating average:

Calculate each user’s average rating score, then use them to predict new items.


Item rating average:

Calculate each item’s average score rated by users, then use them to predict new items.


Enhanced user rating average:

Find all relations between each item rated by user and its related items ( artist -> albums -> tracks -> genres ), and parse the training data giving those related items the same score as the main ( original ) one, then use “User rating average” to predict new items.





Experiment [Top]

Due to the large data problem, I couldn’t apply any well-known machine learning, data mining or neuron network algorithms to train models from the training data and do the prediction, until now, I just make some analysis and simple experiments, I will keep trying to handle that problem ( maybe only use the part of data for training, or find better features and do normalization ), some experiments I have done yet are described as follow :


1.     Insert training data and other information data into database to help myself quickly retrieve some useful information for experiments.

( Compare with file I/O, database solution is usually more faster than file I/O. )

( Here I cooperate with M9915080 )

描述: 描述: 描述: 描述: 描述: 描述: 描述: C:\Users\Dongjie\Desktop\DB.bmp


2.     Calculate the distribution of each score ( 0-100 ), I found that the training data is an unbalanced data. ( See result )

描述: 描述: 描述: 描述: 描述: 描述: C:\Users\Dongjie\Desktop\未命名.bmp

( Complete version see here )


3.     User-oriented experiment

Calculate the average score of each user’s rating records, then use these scores to predict the testing data for each user, for example, if a user’s average rating score is 70 in training data, then in the predicting state, all of scores the user rates would be 70, the result ( RMSE: 29.9377 ) seems not too bad, but I don’t satisfy with that, I think this is only a simple experiment, I should find some features, build models and then do the prediction, the result may be better than now.


4.     Item-oriented experiment

Calculate the average score of each item rated by users, then use these scores to predict the testing data for each item, for example, if an item’s average score is 80 in training data, then in the predicting state, all of that item’s score would be 80, the result ( RMSE: 31.6449 ) seems not good enough, and also not better than previous one ( User-oriented experiment )


5.     Enhanced User-oriented experiment

The approach is described as above, because of the large dataset, I’m now continually linking the items’ relations, I expect it may take 2~3 days to finish ( before I tried to improve the code, it may take over 2 weeks ), after that, once an item need to be rated by a user, I’ll search all of related items which the user have rated before, then give the item average score of them.

( You may find that it is similar to “User-oriented experiment”, so here I called “Enhanced” User-oriented experiment, just do some improvements. )


All related items are saved in the file named by item id ( Continually linking )

描述: 描述: 描述: 描述: C:\Users\Dongjie\Desktop\未命名.bmp


Take a genre item for example, you can see there are totally 45239 items related to genre id 178912

描述: 描述: 描述: 描述: C:\Users\Dongjie\Desktop\未命名1.bmp


6.     Using Libsvm tool

After building all of the item relation mappings, I have tried to make the input file of Libsvm tool, but I encountered a problem that the input file is too big to save in the computer because of high complexity, so that I couldn’t successfully apply the Libsvm tool, I am now continue to try dimension reduction, hoping that I can successfully apply the Libsvm tool and get the predicting result in the final.


Following is the part of input file which the training score is 0

描述: 描述: 描述: C:\Users\Dongjie\Desktop\未命名.bmp




Result [Top]

Track1 :

描述: 描述: 描述: 描述: C:\Users\Dongjie\Desktop\未命名.bmp




Reference [Top]

1.      KDD 2011, http://www.kdd.org/kdd2011/kddcup.shtml

2.      KDD CUP, http://kddcup.yahoo.com

3.      Collaborative filtering, http://en.wikipedia.org/wiki/Collaborative_filtering

4.      A Survey of Collaborative Filtering Techniques, http://www.hindawi.com/journals/aai/2009/421425/

5.      Weka, http://www.cs.waikato.ac.nz/ml/weka/

6.      LIBSVM, http://www.csie.ntu.edu.tw/~cjlin/libsvm/