
Introduction

This year, the KDD Cup provides the Yahoo! Music dataset, which has attracted wide interest. Yahoo! Music has amassed billions of user ratings for musical pieces, covering different types of items: songs, albums, artists, and genres. All items are anonymized, with their names replaced by numeric IDs.

In the dataset, user ratings are given to entities of four different types: artists, albums, tracks, and genres. These items are related to each other: each artist has many genres and albums, and each album has many tracks. The contest provides training data, validation data, and test data. We must use the training data to predict which artists, albums, tracks, or genres each user is likely to prefer, and which ones the user is likely to dislike.

The competition is divided into two tracks:

Track1 : The main task is to predict the scores that users give to items.
This dataset has three subsets:
1. Training data (file "trainIdx1.txt").
2. Validation data (file "validationIdx1.txt"): contains four ratings per user.
3. Test data (file "testIdx1.txt"): contains six ratings per user.

In each subset, the rating data is grouped by user. Each user's block is formatted as:
<UserId>|<#UserRatings>\n
<ItemId>\t<Score>\t<Date>\t<Time>\n

<UserId>|<#UserRatings> gives the user's ID and the number of ratings by that user; this header line is followed by one line per rating.
<ItemId>, <Score>, <Date> and <Time> take numeric values.
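
The per-user layout above can be read with a small streaming parser. The following is a minimal sketch, not an official loader: the names `Rating` and `read_user_ratings` are hypothetical, the sample data is made up, and `<Time>` is kept as a raw string because its exact encoding is an assumption here.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO, Tuple


class Rating(NamedTuple):
    item_id: int
    score: int
    date: int
    time: str  # kept as raw text; the exact <Time> encoding is not assumed


def read_user_ratings(stream: TextIO) -> Iterator[Tuple[int, List[Rating]]]:
    """Yield (user_id, ratings) for each user block in the stream."""
    while True:
        header = stream.readline()
        if not header:
            break  # end of file
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between blocks
        user_id_text, count_text = header.split("|")
        ratings = []
        # The header says how many rating lines belong to this user.
        for _ in range(int(count_text)):
            fields = stream.readline().rstrip("\n").split("\t")
            item_id, score, date, time = fields
            ratings.append(Rating(int(item_id), int(score), int(date), time))
        yield int(user_id_text), ratings


# Made-up sample in the documented layout (values are illustrative only).
sample = (
    "0|2\n"
    "507696\t90\t5048\t22:45:00\n"
    "137915\t80\t5049\t01:12:00\n"
    "1|1\n"
    "42\t0\t5050\t10:00:00\n"
)
users = list(read_user_ratings(io.StringIO(sample)))
for user_id, ratings in users:
    print(user_id, len(ratings))
```

For a real file such as "trainIdx1.txt", the same function can be given an open file object instead of the in-memory sample.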

<ItemId> refers to an item of one of four types: artists, albums, tracks, or genres. Each type is described in its own file:
1. trackData.txt : (Track information), formatted as:
<TrackId>|<AlbumId>|<ArtistId>|<Optional GenreId_1>|...|<Optional GenreId_k>\n
2. albumData.txt : (Album information), formatted as:
<AlbumId>|<ArtistId>|<Optional GenreId_1>|...|<Optional GenreId_k>\n
3. artistData.txt : (Artist information), formatted as:
<ArtistId>\n
4. genreData.txt : (Genre information), formatted as:
<GenreId>\n
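
As an illustration of the item files listed above, here is a minimal sketch for parsing one line of trackData.txt. The names `Track`, `parse_optional_id`, and `parse_track_line` are hypothetical, and the sample lines are made up. The sketch assumes that a missing album or artist appears as the string "None", as these files encode missing values.

```python
from typing import List, NamedTuple, Optional


class Track(NamedTuple):
    track_id: int
    album_id: Optional[int]
    artist_id: Optional[int]
    genre_ids: List[int]


def parse_optional_id(field: str) -> Optional[int]:
    """Map the literal string "None" to Python's None, else parse an int."""
    return None if field == "None" else int(field)


def parse_track_line(line: str) -> Track:
    """Parse '<TrackId>|<AlbumId>|<ArtistId>|<GenreId_1>|...|<GenreId_k>'."""
    fields = line.rstrip("\n").split("|")
    track_id = int(fields[0])
    album_id = parse_optional_id(fields[1])
    artist_id = parse_optional_id(fields[2])
    # Genre fields are optional, so there may be zero or more of them.
    genre_ids = [int(g) for g in fields[3:] if g != "None"]
    return Track(track_id, album_id, artist_id, genre_ids)


# Made-up lines in the documented layout.
with_genres = parse_track_line("1|19|None|61|88\n")
without_genres = parse_track_line("2|None|None\n")
print(with_genres)
print(without_genres)
```

A line of albumData.txt could be handled the same way by dropping the leading track field.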

[Figure: relationship among the four item types]


<Score> is an integer between 0 and 100.
<Date> is an integer giving the number of days elapsed since an undisclosed date.

The evaluation criterion is the root mean squared error (RMSE) between the predicted ratings and the true ones. The contest website provides a script, "convertSubmissionTrack1.py", which quantizes each prediction to one of 256 evenly spaced numbers between 0 and 100.
(Within these files, missing values are encoded as the string "None".)
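
To make the evaluation criterion concrete, the sketch below computes the RMSE between predicted and true scores and rounds a prediction to the nearest of 256 evenly spaced values between 0 and 100. It is not the contents of "convertSubmissionTrack1.py"; the functions `rmse` and `quantize` are hypothetical, and the assumption that the 256 levels include both endpoints (0 and 100) is ours.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Root mean squared error between two equally long score sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error / len(predicted))


def quantize(score: float, levels: int = 256,
             low: float = 0.0, high: float = 100.0) -> float:
    """Round a score to the nearest of `levels` evenly spaced values.

    Assumption: the grid includes both endpoints, so the spacing is
    (high - low) / (levels - 1).
    """
    clipped = min(max(score, low), high)  # keep the score inside [low, high]
    step = (high - low) / (levels - 1)
    index = round((clipped - low) / step)
    return low + index * step


# Made-up predictions and true scores.
predictions = [87.3, 12.9, 150.0]
truth = [90.0, 0.0, 100.0]
quantized = [quantize(p) for p in predictions]
print(quantized)
print(rmse(quantized, truth))
```

Quantizing changes each prediction by at most half a grid step (about 0.2 on the 0 to 100 scale under the assumption above), so its effect on the RMSE is small.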