Problem Description

Yahoo! Music has amassed billions of user ratings for musical pieces. When properly analyzed, the raw ratings encode information on how songs are grouped, which hidden patterns link various albums, which artists complement each other, and above all, which songs users would like to listen to.


A distinctive feature of the dataset used in KDD Cup 2011 is that user ratings are given to entities of four different types: tracks, albums, artists, and genres. In addition, the items are tied together within a hierarchy. That is, for a track we know the identity of its album, performing artist, and associated genres. Similarly, we have artist and genre annotations for the albums.


The competition is divided into two tracks:


Track 1: Learning to predict users' ratings of musical items.

Items can be tracks, albums, artists, and genres. The items form a hierarchy, such that each track belongs to an album, albums belong to artists, and together they are tagged by genres.


The given training data contains a number of ratings, each of which carries a UserId, an ItemId (representing a track), a Score, a Date, and a Time. The scores are integers between 0 and 100. This training data is processed to produce a pattern of how tracks are rated by users.


The test data holds the UserId, ItemId, Date, and Time of each rating. The model built from the training data must predict the score a user will give to a track at a certain time, since the rating pattern is also influenced by which music is popular at that time.


The evaluation criterion is the root mean squared error (RMSE) between predicted ratings and true ones.
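The RMSE criterion is straightforward to compute. A minimal sketch in Java (the project's stated language; the class and method names are illustrative, not part of the competition code):

```java
// Illustrative helper: root mean squared error between predicted and true scores.
public class Rmse {
    public static double rmse(double[] predicted, double[] actual) {
        double sumSq = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            double diff = predicted[i] - actual[i];
            sumSq += diff * diff;      // accumulate squared errors
        }
        return Math.sqrt(sumSq / predicted.length);
    }
}
```

For example, predictions {50, 70, 90} against true scores {40, 70, 100} give an RMSE of sqrt(200/3), roughly 8.165.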



The dataset has three subsets:

  1. Train data: in the file trainIdx1.txt
  2. Validation data: in the file validationIdx1.txt
  3. Test data: in the file testIdx1.txt


All of these subsets are formatted as follows:

The first line contains a UserId and the number of ratings by that user:

<UserId>|<#UserRatings>\n

The following lines contain that user's ratings, one rating per line, giving the ItemId, Score, Date, and Time.

The exception is testIdx1.txt, whose rating lines do not include the Score value.
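Assuming a '|' separator in the header line and tab-separated fields in the rating lines (the exact separators are not spelled out above), a minimal Java parser for one user block might look like this; all sample values are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a parser for the per-user block layout described above.
// The '|' and tab separators are assumptions about the file format.
public class RatingParser {
    public static class Rating {
        public final int userId, itemId, score;
        public final String date, time;
        Rating(int u, int i, int s, String d, String t) {
            userId = u; itemId = i; score = s; date = d; time = t;
        }
    }

    public static List<Rating> parseUserBlock(String block) {
        String[] lines = block.split("\n");
        String[] header = lines[0].split("\\|");          // <UserId>|<#UserRatings>
        int userId = Integer.parseInt(header[0].trim());
        int count = Integer.parseInt(header[1].trim());
        List<Rating> out = new ArrayList<>();
        for (int i = 1; i <= count; i++) {                // one rating per line
            String[] f = lines[i].split("\t");
            out.add(new Rating(userId, Integer.parseInt(f[0]),
                    Integer.parseInt(f[1]), f[2], f[3]));
        }
        return out;
    }
}
```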


Item Taxonomy

A unique feature of the datasets is a taxonomy annotating known relations between the items. The taxonomy of the dataset is stored in the following four files:


trackData.txt - Track information formatted as:

<TrackId>|<AlbumId>|<ArtistId>|<Optional GenreId_1>|...|<Optional GenreId_k>\n
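A trackData.txt line can be parsed as below. This is a sketch only: the 'None' placeholder for a missing album or artist id is an assumption, and the class name is illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: parse one trackData.txt line of the form
// <TrackId>|<AlbumId>|<ArtistId>|<GenreId_1>|...|<GenreId_k>
// The "None" placeholder for missing ids is an assumption.
public class TrackRecord {
    public final int trackId;
    public final Integer albumId, artistId;     // null when missing
    public final List<Integer> genreIds = new ArrayList<>();

    public TrackRecord(String line) {
        String[] f = line.trim().split("\\|");
        trackId = Integer.parseInt(f[0]);
        albumId = parseId(f[1]);
        artistId = parseId(f[2]);
        for (int i = 3; i < f.length; i++) {    // optional genre ids
            Integer g = parseId(f[i]);
            if (g != null) genreIds.add(g);
        }
    }

    private static Integer parseId(String s) {
        s = s.trim();
        return s.isEmpty() || s.equals("None") ? null : Integer.valueOf(s);
    }
}
```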


albumData.txt - Album information formatted as:

<AlbumId>|<ArtistId>|<Optional GenreId_1>|...|<Optional GenreId_k>\n


artistData.txt - Artist listing formatted as:

<ArtistId>\n

genreData.txt - Genre listing formatted as:

<GenreId>\n

Because this taxonomy is sparse, and in order to preserve the relations of each item, it is better to store the data in an RDBMS. This also makes it easier to maintain the data and to retrieve subsets for training. The relationships between the imported tables are shown in the following diagram:


The dataset statistics are as follows:

  Users:              1,000,990
  Items:              624,961
  Ratings (total):    262,810,175
  Train ratings:      252,800,275
  Validation ratings: 4,003,960
  Test ratings:       6,005,940


Tools & Hardware

Database: PostgreSQL 8.4

Tools: WEKA as the training tool; WordPad and HxD as text editors

Programming Language: Java, to perform additional functions for importing the dataset into PostgreSQL.

Hardware Specification: Intel Pentium 4 CPU 2.80 GHz under Linux OS, with 2 GB RAM.


Issues & Handling

The issues related to this competition mostly concern the dataset. There are five issues:


1. High-dimensional data

Each rating instance is related to information about its albumId, genreId, and artistId, which may carry important value. To train on all of this information, we need a high-dimensional data scheme that stores all the related attributes. However, as the data dimensionality grows, the cost of processing the data also grows.

Solution: Reduce the data dimensionality. This can be done by training on combinations of features (but no more than three features), for example genre-artist, user-genre, or track-genre. The objective of this approach is to find the most important features that lead to better predictions, and to reduce the computational cost by ignoring the less important ones.
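As an illustration of the two-feature idea, the sketch below aggregates the mean score for each user-genre pair; the class and method names are illustrative, not the actual training code.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: accumulate the average score per (user, genre) feature combination.
public class PairAverager {
    private final Map<String, int[]> acc = new HashMap<>(); // key -> {sum, count}

    public void add(int userId, int genreId, int score) {
        int[] a = acc.computeIfAbsent(userId + ":" + genreId, k -> new int[2]);
        a[0] += score;   // running sum of scores
        a[1]++;          // number of ratings seen for this pair
    }

    public double average(int userId, int genreId) {
        int[] a = acc.get(userId + ":" + genreId);
        return a == null ? Double.NaN : (double) a[0] / a[1];
    }
}
```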


2. Large instance data

The total number of rating instances is more than 250 million records. It seems impossible to train on the whole data with limited hardware and tool capability; even if it could be done, it would take a long time to finish.

Solution: Use a sample that represents the most important records. Data sampling means generating a subset of the training data small enough to run in the training tools. My approach is to train on two kinds of subsets: the first contains the ratings of the top 100 raters, and the second the ratings of the top 100 rated items. Even this scheme is not easy to implement; it took about 6 hours to produce the two subsets with only minimal information (userId, itemId, and score).
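The top-raters selection can be sketched as below; this is a simplified illustration (given precomputed per-user rating counts), not the actual 6-hour pipeline.

```java
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch: keep only the N users with the most ratings.
public class TopRaters {
    public static Set<Integer> topN(Map<Integer, Integer> ratingCounts, int n) {
        return ratingCounts.entrySet().stream()
                .sorted(Map.Entry.<Integer, Integer>comparingByValue().reversed())
                .limit(n)                        // top n by rating count
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }
}
```

The training sample is then the set of ratings whose userId falls in this set.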

Here are the statistics of the samples:

  Top 100 Raters:      # instances to train = 4,324,217
  Top 100 Rated Items: # instances to train = 19,816,956


3. Unbalanced data


4. Noise

Some of the data given by KDD also contains null values; for example, some rating instances have a null userId, trackId, or even score. These noisy records are meaningless, since we cannot predict the null values with any method. For that reason, I decided to simply discard the noisy records from the training dataset.
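The discarding rule amounts to a validity check on each record before import; a hypothetical helper (not the actual code) could look like:

```java
// Sketch: a rating record is kept only when all required fields are present.
public class NoiseFilter {
    public static boolean isValid(String userId, String itemId, String score) {
        return notBlank(userId) && notBlank(itemId) && notBlank(score);
    }

    // Treats null, empty, and a literal "null" token as missing.
    private static boolean notBlank(String s) {
        return s != null && !s.trim().isEmpty() && !s.trim().equals("null");
    }
}
```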


5. Missing data (incompleteness)

Some data in the taxonomy subsets also contain null values. For example, some tracks have no information about their album or genre.


Such a null value can be replaced with another value. If the genreId of a track is missing, it can be replaced with the track's average score during training, as long as genre and track average score are not the combined features of that training run.
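A minimal sketch of this imputation rule follows; the names and the zero fallback (when even the track average is unknown) are assumptions.

```java
import java.util.Map;

// Sketch: substitute the track's average score when the genre id is missing.
public class GenreImputer {
    public static double featureValue(Integer genreId,
                                      int trackId,
                                      Map<Integer, Double> trackAvgScore) {
        if (genreId != null) return genreId;          // genre known: use it
        Double avg = trackAvgScore.get(trackId);      // genre missing: track average
        return avg == null ? 0.0 : avg;               // assumed fallback
    }
}
```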



1. Preprocessing Data

Data preprocessing means preparing the data before the training process. The preprocessing steps are listed below:


2. Training Method


Testing and Result

For testing purposes, I need to develop some baseline numbers for comparison.


1. Train Sample Data provided by KDD Cup

LibSVM seems promising for training the data, since it took less time than MLP to produce a result with a minimal RMSE value (0.356).


2. Simple Baseline: Always guess 50

The first predictor I wrote was Java code that simply always guesses a rating of 50 for unseen items, for all users. Running this code gave an RMSE of 37.8262.


