992 NN HW#3 ( KDD Cup 2011 Plan )

Advisor

Prof Hahn-Ming, Lee

Student ID

M9915080

Student Name

Chien-Wei, Lee

 

 

 

Abstract

 

 

The first task in KDD Cup 2011 is "Learn the rhythm and predict the musical scores".

 

People have been fascinated by music since the dawn of humanity. A wide variety of music genres and styles has evolved, reflecting diversity in personalities, cultures and age groups.

 

Yahoo! Music has amassed billions of user ratings for musical pieces. When properly analyzed, the raw ratings encode information on how songs are grouped, which hidden patterns link various albums, which artists complement each other, and above all, which songs users would like to listen to.

 

Tasks

 

 

The competition is divided into two tracks:

 

Track 1: Learning to predict users' ratings of musical items.

 

Items can be tracks, albums, artists and genres. Items form a hierarchy, such that each track belongs to an album, albums belong to artists, and together they are tagged by genres.

 

Track 2: Learning to separate tracks scored highly by specific users from tracks not scored by them.

 

In Track 2, the test set includes six items per user (all are tracks), three of which were rated highly (score 80 or higher) by the user and three of which were not rated by the user. The three unrated items are sampled with a probability proportional to the number of their high (>=80) ratings. The task is to classify each item as either rated or not rated by the user (1 or 0 respectively). A hierarchy of items similar to the one used in Track 1 is also given for Track 2. However, timestamps of the ratings, which are given in Track 1, are withheld for Track 2.
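Because each user's six test tracks are known to contain exactly three rated items, one simple way to produce the 0/1 labels is to score all six tracks and mark the three highest as rated. The sketch below assumes some model has already produced a score per track; the function name and the score values are hypothetical and only illustrate the labeling step.

```python
def label_top_three(scores):
    """Label a user's six test tracks: 1 for the three highest scores, 0 otherwise.

    `scores` maps track id -> predicted preference score (any real number).
    The scoring model itself is outside this sketch.
    """
    if len(scores) != 6:
        raise ValueError("Track 2 gives exactly six test tracks per user")
    # Sort track ids from highest to lowest predicted score.
    ranked = sorted(scores, key=lambda track: scores[track], reverse=True)
    top = set(ranked[:3])
    return {track: (1 if track in top else 0) for track in scores}


# Hypothetical scores for one user's six test tracks.
example = {10: 0.9, 11: 0.1, 12: 0.5, 13: 0.7, 14: 0.2, 15: 0.3}
labels = label_top_three(example)
```

Forcing exactly three positives per user uses the known structure of the test set; a classifier that labels each track independently would not be guaranteed to do so.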

 

 

Evaluation

 

 

The test sets for both Track 1 and Track 2 are each divided into two disjoint, equal sets: Test1 and Test2. Examples in Test1 are used for calculating the scores shown on the Leaderboard: RMSE for Track 1 and error rate for Track 2. The examples of Test2 are reserved for choosing the winners of the competition. Hence, the possibility exists that team rankings on the Leaderboard will differ from the final results, which would be calculated on Test2.
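The two Leaderboard measures can be sketched as follows. The text above only names them, so this assumes the standard definitions: RMSE as the square root of the mean squared difference between predicted and true scores, and error rate as the fraction of wrongly classified items. The function names are my own.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length score lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def error_rate(predicted_labels, true_labels):
    """Fraction of items whose predicted 0/1 label differs from the true label."""
    if len(predicted_labels) != len(true_labels) or not true_labels:
        raise ValueError("need two non-empty lists of equal length")
    wrong = sum(1 for p, t in zip(predicted_labels, true_labels) if p != t)
    return wrong / len(true_labels)
```

For example, predictions of 50 and 70 against true scores of 40 and 80 are each off by 10, so the RMSE is 10.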

 

 

Datasets

 

 

Track 1

The dataset is split into three subsets:

      - Train data: in the file trainIdx1.txt

      - Validation data: in the file validationIdx1.txt

      - Test data: in the file testIdx1.txt

For each subset, user rating data is grouped by user. The first line for a user is formatted as:

      <UserId>|<#UserRatings>\n

Each of the next <#UserRatings> lines describes a single rating by <UserId>, sorted in chronological order. The rating line format is:

      <ItemId>\t<Score>\t<Date>\t<Time>\n

The scores are integers lying between 0 and 100. All user id's and item id's are consecutive integers, both starting at zero. Dates are integers describing number of days elapsed since an undisclosed date.
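A minimal reader for the Track 1 layout described above might look like the following sketch. The time field is kept as a plain string because its exact format is not stated here; the `hh:mm:ss` values in the sample are an assumption, as are the names `Rating` and `parse_track1`.

```python
from typing import Dict, Iterable, List, NamedTuple


class Rating(NamedTuple):
    item_id: int
    score: int
    date: int   # days elapsed since an undisclosed date
    time: str   # kept as text; the exact format is assumed, not documented here


def parse_track1(lines: Iterable[str]) -> Dict[int, List[Rating]]:
    """Parse '<UserId>|<#UserRatings>' headers, each followed by that many rating lines."""
    ratings: Dict[int, List[Rating]] = {}
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between user blocks
        user_id, count = header.split("|")
        user_ratings = []
        for _ in range(int(count)):
            item_id, score, date, time = next(it).rstrip("\n").split("\t")
            user_ratings.append(Rating(int(item_id), int(score), int(date), time))
        ratings[int(user_id)] = user_ratings
    return ratings


# Made-up sample in the documented layout.
sample = [
    "0|2\n",
    "507696\t90\t4582\t23:28:00\n",
    "12\t0\t4583\t08:00:00\n",
    "1|1\n",
    "5\t100\t10\t00:00:00\n",
]
parsed = parse_track1(sample)
```

To read the real file, the same function can be called on an open file object, for example `parse_track1(open("trainIdx1.txt"))`, since a file object iterates over its lines.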

----------------------------------------------------------------------

Track 2

The dataset is split into two subsets:

      - Train data: in the file trainIdx2.txt

      - Test data: in the file testIdx2.txt

For each subset, user rating data is grouped by user. The first line for a user is formatted as:

      <UserId>|<#UserRatings>\n

Each of the next <#UserRatings> lines describes a single rating by <UserId>. The rating line format is:

      <ItemId>\t<Score>\n

The scores are integers lying between 0 and 100, and are withheld from the test set. All user id's and item id's are consecutive integers, both starting at zero.

----------------------------------------------------------------------

Item Taxonomy

A unique feature of the datasets is a taxonomy annotating known relations between the items. Such a taxonomy is expected to be particularly useful here, due to the large number of items and the sparseness of data per item (mostly attributed to "tracks" rather than to "artists").

Recall that item id's can represent tracks, albums, artists or genres. The type of each item, including a hierarchical structure linking tracks, albums, artists and genres, is stored (separately for each of the two datasets) in the following four files:

      trackData.txt - Track information formatted as:

<TrackId>|<AlbumId>|<ArtistId>|<Optional GenreId_1>|...|<Optional GenreId_k>\n

      albumData.txt - Album information formatted as:

<AlbumId>|<ArtistId>|<Optional GenreId_1>|...|<Optional GenreId_k>\n

      artistData.txt - Artist listing formatted as:

<ArtistId>\n

      genreData.txt - Genre listing formatted as:

<GenreId>\n

Within these files, missing values are encoded as the string "None".
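The taxonomy lines can be read with a small helper that turns the string "None" into a missing value. This is only a sketch of one possible reader; the function names and the dictionary layout are my own choices.

```python
def _to_id(field):
    """Convert one field to an int id, mapping the string 'None' to a missing value."""
    return None if field == "None" else int(field)


def parse_track_line(line):
    """Parse '<TrackId>|<AlbumId>|<ArtistId>|<GenreId_1>|...|<GenreId_k>'."""
    track, album, artist, *genres = line.rstrip("\n").split("|")
    return {
        "track": int(track),
        "album": _to_id(album),
        "artist": _to_id(artist),
        "genres": [int(g) for g in genres if g != "None"],
    }


def parse_album_line(line):
    """Parse '<AlbumId>|<ArtistId>|<GenreId_1>|...|<GenreId_k>'."""
    album, artist, *genres = line.rstrip("\n").split("|")
    return {
        "album": int(album),
        "artist": _to_id(artist),
        "genres": [int(g) for g in genres if g != "None"],
    }
```

A track line such as `1|None|33|5|7` then yields a record with a missing album, artist 33 and genres 5 and 7.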

 

 

Methods & Tools

 

 

Step 1: Feature Selection

Step 2: Training data sampling, preprocessing and reduction

Step 3: Classification by using MATLAB and WEKA

Step 4: Prediction & Analysis

 

 

Schedules

 

 

 1. Planning

 2. Feature Selection

 3. Training data sampling and preprocessing

    Using a MySQL database with Java programming,

    learning how to use Java to insert the training data into the MySQL database

 

 4. Training data reduction

 5. Classification by using MATLAB and WEKA

 6. Analyze the results
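Step 3 of the schedule loads the training data into a database. The plan calls for Java and MySQL; the sketch below is not that setup. It uses Python's built-in `sqlite3` module as a stand-in so that it runs without a database server, and the table and column names are hypothetical. The same bulk-insert pattern (one parameterized INSERT executed for many rows) carries over to MySQL.

```python
import sqlite3


def load_ratings(conn, rows):
    """Create a rating table if needed and bulk-insert (user_id, item_id, score) rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS rating ("
        "user_id INTEGER, item_id INTEGER, score INTEGER)"
    )
    # Parameterized statement: the driver fills in the ? placeholders per row.
    conn.executemany("INSERT INTO rating VALUES (?, ?, ?)", rows)
    conn.commit()


# In-memory database with three made-up ratings.
conn = sqlite3.connect(":memory:")
load_ratings(conn, [(0, 507696, 90), (0, 12, 0), (1, 5, 100)])
count = conn.execute("SELECT COUNT(*) FROM rating").fetchone()[0]
```

Once the ratings are in a table, later steps such as sampling and reduction can be expressed as queries instead of repeated passes over the text files.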

 

 

Proposed Approach

 

 

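 1. Centralize management of the training data using MySQL

 2. Learn the similarity between individual records

      i. Use the idea of the KNN (K=1) algorithm to try to find the similarity between records

      ii. Compute distance factors, including artist, album, track, genre, date and time

      iii. Consider the weight of each factor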
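The nearest-neighbor idea in the list above can be sketched as a weighted mismatch distance followed by a K=1 lookup. This is an illustration under assumptions, not the implemented method: the weights are made up, a factor contributes its full weight whenever the two records differ (or the value is missing), and the date and time factors are left out for brevity.

```python
FACTORS = ("track", "album", "artist", "genre")


def distance(a, b, weights):
    """Weighted count of the factors on which two records disagree."""
    total = 0.0
    for factor in FACTORS:
        same = a[factor] is not None and a[factor] == b[factor]
        if not same:
            total += weights[factor]
    return total


def predict_1nn(query, rated, weights):
    """Return the score of the rated record closest to the query (K=1)."""
    nearest = min(rated, key=lambda record: distance(query, record, weights))
    return nearest["score"]


# Hypothetical weights and records.
weights = {"track": 4.0, "album": 3.0, "artist": 2.0, "genre": 1.0}
rated = [
    {"track": 1, "album": 10, "artist": 100, "genre": 7, "score": 90},
    {"track": 2, "album": 20, "artist": 200, "genre": 7, "score": 30},
]
query = {"track": 3, "album": 10, "artist": 100, "genre": 7}
prediction = predict_1nn(query, rated, weights)
```

Here the query shares album, artist and genre with the first record and only the genre with the second, so the first record is the nearest neighbor and its score is returned.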

 

Proposed Approach II ( Retry )

 

1. Centralize management of the training data using a database (MySQL)

2. Use the Weka tool

    i. Normalize each training record:

 

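Through database queries, each item id is replaced with the corresponding music attributes.

       From my own observation, the attributes are related hierarchically.

       After analyzing them, I represent the relationships in the following figure:

[Figure: hierarchy diagram of the attributes]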

 

        An arrow points to the attribute's child; the reverse direction is its parent.

        If the attribute is a leaf node, all of its successors are further retrieved from the database.

 

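        Examples:

After lookup, the item id turns out to be a track id, so its artist, album and genre are retrieved as well. Finally, a separate copy of the expanded data is stored.

If the item id is an artist, the album, track and genre fields are filled with None.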

 

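    ii. Collect the expanded data into a Weka input file.

Because the training data is very large, the collection program was run with a time limit for this attempt.

[Figure: the collected data arranged as Weka input]

   iii. Weka results: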

 

[Screenshots of the Weka output: Capture_06122011_025417 to Capture_06122011_025456]
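Steps i and ii of Proposed Approach II (expanding each item id into its attributes and collecting the result as Weka input) can be sketched as follows. Several details are assumptions: the in-memory lookup tables stand in for the database queries, only the track and artist cases described above are handled, the relation name is hypothetical, all attributes are declared numeric, and missing values are written as `?`, which is how the ARFF format marks them (in place of the None filled in above).

```python
def expand(item_id, tracks, artists):
    """Expand an item id into a (track, album, artist, genre) tuple.

    `tracks` maps track id -> (album, artist, genre); `artists` is a set of artist ids.
    Both stand in for database lookups.
    """
    if item_id in tracks:
        album, artist, genre = tracks[item_id]
        return (item_id, album, artist, genre)
    if item_id in artists:
        # An artist has no track, album or genre of its own: fill with None.
        return (None, None, item_id, None)
    raise KeyError(f"item {item_id} is neither a known track nor a known artist")


def to_arff(rows, relation="kddcup_ratings"):
    """Render (track, album, artist, genre, score) rows as ARFF text."""
    lines = [f"@relation {relation}", ""]
    for name in ("track", "album", "artist", "genre", "score"):
        lines.append(f"@attribute {name} numeric")
    lines += ["", "@data"]
    for row in rows:
        # ARFF uses '?' for a missing value.
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"


# Made-up lookup tables and two ratings: (item id, score).
tracks = {1: (10, 100, 7)}
artists = {100}
rows = [expand(item, tracks, artists) + (score,) for item, score in [(1, 90), (100, 50)]]
arff_text = to_arff(rows)
```

The resulting text can be saved with an `.arff` extension and opened in Weka. Treating ids as numeric keeps the sketch short; declaring them as nominal attributes may suit some classifiers better.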