Homework3_1_M9915803

Final Report KDD Cup 2011 (due date: 12th June, 2011)

 

_____________________________________________________________________________________________

Outline:

1)         Introduction

2)         Issues & Solutions

3)         Tools and techniques

4)         Experiments & results

5)         Further improvements

6)         References

 

 

Introduction:

 

KDD Cup 2011 is a music rating contest organized by Yahoo! for participants all over the world. The contest released over 300 million ratings performed by over 1 million anonymized users. A distinctive feature of the dataset is that ratings are given to items of four different types, namely tracks, albums, artists, and genres, all tied together within a known taxonomy (for example, each track is linked to an album and an artist). The released data represents a sampled snapshot of the Yahoo! Music community's preferences for these musical items. The most interesting issue of this contest has been the sheer size of the dataset.

 

Significant characteristics of the dataset include:

·         It is of a larger scale compared to other datasets in the field (over 300M ratings).

·         It has a very large set of items (over 600K) - much larger than any similar dataset, where usually only the number of users is large.

·         There are four different categories of items, which are all linked together within a defined hierarchy. This is particularly important considering the large number of items.

 

The full dataset is divided into two distinct tracks, namely Track 1 and Track 2.

 

For Track 1, the dataset is split into three subsets:

·  Train data: in the file trainIdx1.txt

·  Validation data: in the file validationIdx1.txt

·  Test data: in the file testIdx1.txt

 

In each subset, user rating data is grouped by user. The first line for a user is formatted as:

<UserId>|<#UserRatings>\n

Each of the next <#UserRatings> lines describes a single rating by <UserId>, sorted in chronological order. The rating line format is:

<ItemId>\t<Score>\t<Date>\t<Time>\n
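To make the format concrete, here is a minimal Java sketch for reading these per-user blocks from trainIdx1.txt (the class name and variables are my own, not part of the official kit):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class Track1Reader {
        public static void main(String[] args) throws IOException {
            try (BufferedReader in = new BufferedReader(new FileReader("trainIdx1.txt"))) {
                String header;
                while ((header = in.readLine()) != null) {
                    // Header line: <UserId>|<#UserRatings>
                    String[] h = header.split("\\|");
                    int userId = Integer.parseInt(h[0]);
                    int numRatings = Integer.parseInt(h[1]);
                    for (int i = 0; i < numRatings; i++) {
                        // Rating line: <ItemId>\t<Score>\t<Date>\t<Time>
                        String[] r = in.readLine().split("\t");
                        int itemId = Integer.parseInt(r[0]);
                        int score = Integer.parseInt(r[1]);
                        // r[2] and r[3] (date and time) are read but unused here
                    }
                }
            }
        }
    }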

 

Each user has at least 20 ratings in the total dataset (including the train, validation, and test sets): at least 10 ratings in the training data, 4 ratings in the validation data, and 6 ratings in the test data.

 

The dataset statistics are as follows:

#Users      #Items    #Ratings      #TrainRatings  #ValidationRatings  #TestRatings
1,000,990   624,961   262,810,175   252,800,275    4,003,960           6,005,940

 

Similarly, for Track 2 the dataset is split into two subsets:

·         Train data: in the file trainIdx2.txt

·         Test data: in the file testIdx2.txt

 

In each subset, user rating data is again grouped by user. The first line for a user is formatted as:

<UserId>|<#UserRatings>\n

Each of the next <#UserRatings> lines describes a single rating by <UserId>. The rating line format is:

<ItemId>\t<Score>\n

Track 2 also guarantees 20 ratings per user across the train and test data: each user has at least 17 ratings in the train data and the rest in the test data.

The Track 2 dataset is significantly smaller than Track 1. Its statistics are as follows:

#Users    #Items    #Ratings     #TrainRatings  #TestRatings
249,012   296,111   62,551,438   61,944,406     607,032

 

 

Issues & Solutions:

 

Users give ratings to the music items, i.e. tracks, albums, artists, etc., and a rating is affected by the type of user and his interests. Even so, many items remain unrated. The main issue in such a scenario is to predict a user's rating for the unrated music items on the basis of the items that user has already rated. Besides that, the dataset is of a much larger scale than other datasets in the field, so we need to establish a proper relational schema and hierarchy between the different attributes of the music items.

 

Significant issues of KDDCUP 2011 have been:

1)    Large datasets

2)    Unbalanced dataset

3)    Feature selection

4)    Unknown data

5)    Sparse dataset

6)    Temporal dataset

7)    Normalization

 

1)    Large dataset:

This has been the biggest issue in KDD Cup 2011: over 300M user ratings are available. Because of this huge amount of data it is a challenge to train on the whole dataset; the limited memory of a PC and the time consumed often cause the machine to malfunction.

Having faced such challenges, I tried to figure out the relational schema between the different attributes of the dataset. Besides this, working on averaged data rather than individual ratings made handling this huge amount of data considerably less complex.

 

2)    Unbalanced dataset:

There have been some issues with the dataset being unbalanced between the train, validation, and test data. In this scenario the dataset was balanced by using attributes with null values. Sampling has also been a good solution to this issue.

 

3)    Feature selection:

Selecting good features of the dataset is also a challenge. As a form of feature selection, I tried to collect the userIds with consistent ratings over a period of time. Since feature selection is essential when there is so much data, we need to extract sample data on the basis of which we can give effective ratings to the unrated items.

 

4)    Unknown data:

Many parts of the dataset consist of unknown values; such unknown data were ignored to minimize the error rate.

 

5)    Sparse dataset:

In a multidimensional dataset, many combinations of members from the dataset's dimensions contain missing data. To handle this case I used PostgreSQL, which provides more ease of use than the usual horizontal and vertical storage systems, as illustrated in the sketch below.
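As an illustration of this sparse storage, the observed ratings can be kept as plain (user, item, score) rows in a PostgreSQL table, so unrated combinations occupy no space at all. A minimal JDBC sketch, assuming a local database named kddcup (the connection settings, table, and column names are my own placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class SparseStore {
        public static void main(String[] args) throws Exception {
            // Placeholder connection settings for a local PostgreSQL instance
            Connection con = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/kddcup", "postgres", "password");
            try (Statement st = con.createStatement()) {
                // Only observed (user, item, score) triples are stored;
                // missing user/item combinations take no space
                st.executeUpdate("CREATE TABLE ratings ("
                        + "user_id INTEGER, item_id INTEGER, score SMALLINT)");
                st.executeUpdate("CREATE INDEX ratings_item_idx ON ratings (item_id)");
            } finally {
                con.close();
            }
        }
    }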

 

6)    Temporal dataset:

This has been another significant issue, because music ratings are always affected by time constraints. The rating given to a piece of music just after its release differs from the rating given some months later, so the rating itself can seem a vague quantity. It is really difficult to differentiate user ratings by time frame in the provided dataset.

 

7)    Normalization:

In our dataset there is one character, "|", which needs to be stripped out during parsing. Performance can be improved only when the data is cleaned by removing such insignificant characters.

 

 

Tools and Techniques:

 

Tools:

·         PostgreSQL to handle the database

·         Java (NetBeans IDE) to build the application

·         WEKA tool (a multilayer perceptron to process the data, with the back-propagation algorithm to train it); a minimal usage sketch follows
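As a rough illustration, training WEKA's multilayer perceptron from Java might look like the sketch below. The ARFF file name is hypothetical, and the learning parameters shown are WEKA's usual back-propagation settings, not tuned values from my runs:

    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.MultilayerPerceptron;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TrainMlp {
        public static void main(String[] args) throws Exception {
            // "ratings_sample.arff" is a hypothetical ARFF export of the
            // sampled data, with the rating score as the last (class) attribute
            Instances train = DataSource.read("ratings_sample.arff");
            train.setClassIndex(train.numAttributes() - 1);

            MultilayerPerceptron mlp = new MultilayerPerceptron();
            mlp.setLearningRate(0.3);  // back-propagation learning rate
            mlp.setMomentum(0.2);      // back-propagation momentum
            mlp.setTrainingTime(500);  // number of training epochs
            mlp.buildClassifier(train);

            // Cross-validate on the sample and report the RMSE
            Evaluation eval = new Evaluation(train);
            eval.crossValidateModel(mlp, train, 10, new java.util.Random(1));
            System.out.println("RMSE = " + eval.rootMeanSquaredError());
        }
    }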

 

Techniques:

1)    Sampling:

First, a sample dataset was used to experiment with training and testing of the data. Since the dataset is huge, it was almost impossible to train on all of it with the limited memory of my PC; for that reason I tried to extract a sample from the original dataset on the basis of the best features. This is not reliable enough to call fully effective, but the best features were selected on the basis of consistent ratings provided by users, as in the sketch below.
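A minimal sketch of this "consistency" criterion, assuming ratings on the 0-100 scale and a one-pass variance computation (the class, method names, and threshold are my own illustrations, not the exact values I used):

    import java.util.HashMap;
    import java.util.Map;

    public class UserSampler {
        // Running per-user sums, enough to get the variance in one pass
        static class Stats { long n; double sum; double sumSq; }

        static final Map<Integer, Stats> stats = new HashMap<Integer, Stats>();

        static void observe(int userId, int score) {
            Stats s = stats.get(userId);
            if (s == null) { s = new Stats(); stats.put(userId, s); }
            s.n++;
            s.sum += score;
            s.sumSq += (double) score * score;
        }

        // A user is "consistent" if the variance of his scores is small,
        // e.g. isConsistent(id, 100.0) on the 0-100 rating scale
        static boolean isConsistent(int userId, double maxVariance) {
            Stats s = stats.get(userId);
            if (s == null || s.n < 2) return false;
            double mean = s.sum / s.n;
            return (s.sumSq / s.n - mean * mean) <= maxVariance;
        }
    }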

2)    Taking average:

Since a single user can give numerous ratings, and a single item likewise receives many ratings, I calculated average scores on the basis of this information, so that the data becomes relatively smaller and much easier to handle. A one-pass sketch of this reduction follows.
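This sketch accumulates per-item sums and counts while streaming through the training file, so the raw ratings never need to fit in memory at once (class and variable names are my own):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    public class ItemAverages {
        public static void main(String[] args) throws IOException {
            Map<Integer, long[]> acc = new HashMap<Integer, long[]>(); // itemId -> {sum, count}
            try (BufferedReader in = new BufferedReader(new FileReader("trainIdx1.txt"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // Header line: <UserId>|<#UserRatings>
                    int numRatings = Integer.parseInt(line.split("\\|")[1]);
                    for (int i = 0; i < numRatings; i++) {
                        String[] r = in.readLine().split("\t"); // <ItemId>\t<Score>\t...
                        int itemId = Integer.parseInt(r[0]);
                        long[] a = acc.get(itemId);
                        if (a == null) acc.put(itemId, new long[] { Long.parseLong(r[1]), 1 });
                        else { a[0] += Long.parseLong(r[1]); a[1]++; }
                    }
                }
            }
            // Average score for an item: acc.get(itemId)[0] / (double) acc.get(itemId)[1]
        }
    }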

3)    Feature selection:

Certain features that were more relevant to the rating were selected, while attributes like date and time were ignored. The date could have some impact, since music ratings are affected by time constraints, but given the huge amount of data and its small impact compared with the other attributes, the date was ignored as well.

 

Experiments & Results:

 

I worked on the Track 1 dataset to figure out an appropriate relational schema between the different attributes of the data, and used the WEKA tool to train on the dataset and calculate the RMSE (Root Mean Square Error).
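For reference, the RMSE used for scoring is computed over the set T of test ratings as

    RMSE = \sqrt{ \frac{1}{|T|} \sum_{(u,i) \in T} ( \hat{r}_{ui} - r_{ui} )^2 }

where r_{ui} is the score user u actually gave item i and \hat{r}_{ui} is the predicted score.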

 

[Figures: screenshots of the experiment setup, including the PostgreSQL queries]

 

 

After executing the queries I computed the average score values and converted them to binary format using the C++ converter provided with the KDD dataset. I could not submit many times, due to lack of time and to not having enough memory to make the system run faster. The RMSE value obtained from my predicted scores is shown below.

[Figure: screenshot of the submitted score]

 

A score of 37% is far behind the front-line competitors; however, there are still 3 weeks until the final deadline for rating submissions, and I will try to improve the score if I get time.

 

Further Improvements:

 

Although the use of a multilayer perceptron makes it easy to process the data, it is a really time-consuming technique. Looking at this scenario, an SVM could be a good option for working with this data.

 

References:

 

i) KDD Cup 2011, http://kddcup.yahoo.com/

ii) Weka (machine learning), http://en.wikipedia.org/wiki/Weka_(machine_learning)

iii) Jun Wang, Arjen P. de Vries, and Marcel J. T. Reinders. Unified relevance models for rating prediction in collaborative filtering. ACM Transactions on Information Systems (TOIS), 26(3):1-42, June 2008.

iv) Juan R. Rabunal and Julian Dorado. Artificial Neural Networks in Real Life Applications. Hershey, PA, USA, 2006.

v) Albert Metz and Richard Cantor. Moody's Credit Rating Prediction Model. 2006.

vi) Yew Jin Lim and Yee Whye Teh. Variational Bayesian Approach to Movie Rating Prediction. Proceedings of KDD Cup and Workshop, 2007.