2011 KDD Cup Plan: Learn the rhythm, predict the musical scores

by Chris Chien (簡延庭),張世穎 and 林祐靖 

1. Problem Understanding
2. Issue
3. Survey
4. Proposed Approach
5. Proof of Concept
6. Scheduling
7. Others
 

1. Problem Understanding

    The 2011 KDD Cup has started. For this year's contest, Yahoo! provides a large amount of music data covering four types of items: tracks, albums, artists and genres. We find this data exciting, but we have work to do with it. Yahoo! offers two datasets, Track1 and Track2. In Track1 we must predict users' ratings of musical items; in Track2 we must separate the tracks that specific users scored highly from the tracks those users did not score. So we have two datasets to process. Although the 2011 KDD Cup website already provides some information, we think we must spend some time studying these datasets ourselves.


2. Issue

    Based on the tips Yahoo! has given, we have put together the following questions.

1. Do we know the format of these datasets?

Ans: Yes; for more detail ( see )

2. How will we process the datasets?

Ans: We think the first step is to analyse the datasets: their format, their contents, the attributes of the items in them, and so on. We expect that the datasets will need preprocessing.

3. Which methods or algorithms can we use to process the datasets and achieve our goal?

Ans: To tell the truth, we do not have a method for processing the data yet, because we were late in starting to discuss this issue. This semester we have many courses and assignments, as well as tasks from our professor, but we have a schedule (see 6. Scheduling) and we must follow it.


3. Survey

    First, we will review the book "Data Mining: Concepts and Techniques, 2nd Ed.". We believe this book may give us some ideas.

Second, this year's problem is about music, so we can look for information on the Internet, for example with Google search. We think that will help us.

Third, we will use our school library to find papers about music search.


4. Proposed Approach (updated 5/31)

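1. Discussion up to 5/1
     An ItemId can be one of four types (track, artist, genre, album). In the train data each user rates a number of items, and each rated item may be a track, an artist, a genre, or an album. First, we believe that the ids of these four types are mutually exclusive. Next, we write a program that collects statistics by classifying every item a user has rated according to its type, whether track, artist, genre, or album.
     For example, think of each user as a locker, so every user has one locker. A locker has four drawers: one for tracks, one for albums, one for artists, and the last one for genres. We then have to decide which drawer each itemId in the train data goes into. Since a locker stands for a user, we only need to decide which drawer of which locker each of the user's items belongs in. In other words: user x goes into locker x; a (a track) goes into the track drawer; b (an album) goes into the album drawer; and so on.
     Once this classification is done, we process each drawer further. As an example, suppose we are processing user x. Let the items be a: track, b: album, c: artist and d: genre, and let the drawers be A: Track, B: Album, C: Artist, D: Genre. We then compute statistics over the items in A, B, C and D separately. For example, in drawer A, for each track a we look up its other information, such as which album it is associated with, which artist it belongs to, and which genres it falls under. We record this information for later processing.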
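The locker-and-drawer idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our actual program: the function name `build_lockers`, the `item_type` lookup table, and the sample ratings are all hypothetical.

```python
from collections import defaultdict

ITEM_TYPES = ("track", "album", "artist", "genre")

def build_lockers(ratings, item_type):
    """Group each user's rated items into four drawers, one per item type.

    ratings   -- iterable of (user_id, item_id, score) tuples
    item_type -- dict mapping item_id to one of ITEM_TYPES (hypothetical lookup)
    """
    # One locker per user; every locker starts with four empty drawers.
    lockers = defaultdict(lambda: {t: [] for t in ITEM_TYPES})
    for user_id, item_id, score in ratings:
        drawer = item_type[item_id]
        lockers[user_id][drawer].append((item_id, score))
    return dict(lockers)

# Made-up sample data, for illustration only.
item_type = {1: "track", 9: "album", 10: "artist", 11: "track", 20: "genre"}
ratings = [(0, 1, 90), (0, 9, 50), (0, 11, 70), (1, 10, 30), (1, 20, 80)]
lockers = build_lockers(ratings, item_type)
print(lockers[0]["track"])
```

Each drawer keeps the (item_id, score) pairs in the order they were rated, so later per-drawer statistics can be computed without rereading the train data.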
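2. Discussion up to 5/11
    Following the previous discussion, we started our work.
    The work was divided as follows:

    a. 世穎 organizes "albumData1.txt", grouping it album by album.
    b. 祐靖 organizes "genreData1.txt", grouping it genre by genre.
    c. 延庭 organizes "artistData1.txt", grouping it artist by artist.

    Before these three tasks began, 延庭 had already worked out how each itemId relates to artists, tracks, albums and genres, as explained below.

    First, we found that the ids within each type are not consecutive, for example: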



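    We found that although 0~7 are consecutive, 9 and 10 are missing between 8 and 11. After a closer look, we learned that 9 is an album itemId and 10 is an artist itemId. 延庭 therefore first reorganized these itemIds into the following form: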



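    How should this mapping_table be read? In the figure above there are red numbers and black numbers. The consecutive red numbers map one-to-one to the black numbers. The black numbers look non-consecutive, but within each range they all have the same type; for example, red numbers 1~507172 all map to black numbers that are trackIds. The red-number range of each type is as follows: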

trackId:1~507172


albumId:507173~596081


artistId:596082~623969


genreId:623970~624961
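A minimal sketch of how such a mapping table could be built and queried is shown below. The four range boundaries are taken from the list above; the function names `build_mapping` and `type_of_index`, the sorting of ids within each type, and the sample ids are assumptions made for illustration and may differ from what our program actually does.

```python
from bisect import bisect_right

# First consecutive ("red") index of each type, from the ranges listed above.
RANGE_STARTS = [1, 507173, 596082, 623970]
RANGE_TYPES = ["track", "album", "artist", "genre"]
LAST_INDEX = 624961

def type_of_index(index):
    """Return the item type that a consecutive ("red") index belongs to."""
    if not 1 <= index <= LAST_INDEX:
        raise ValueError(f"index {index} is outside 1~{LAST_INDEX}")
    return RANGE_TYPES[bisect_right(RANGE_STARTS, index) - 1]

def build_mapping(ids_by_type):
    """Map consecutive indices (starting at 1) to original itemIds.

    ids_by_type -- dict from type name to an iterable of original itemIds.
    Ids are numbered type by type, sorted within each type (an assumption).
    """
    mapping = {}
    next_index = 1
    for item_type in RANGE_TYPES:
        for item_id in sorted(ids_by_type[item_type]):
            mapping[next_index] = item_id
            next_index += 1
    return mapping

# Made-up sample: tracks 0, 1, 11; album 9; artist 10; genre 20.
sample = build_mapping({"track": [0, 1, 11], "album": [9], "artist": [10], "genre": [20]})
print(sample)
print(type_of_index(507172), type_of_index(507173))
```

Because the ranges are contiguous, a binary search over the four start values is enough to recover the type; no per-item lookup table is needed for that step.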
3. Discussion up to 5/18
    a. First, everyone reported their progress:
    世穎:
    Wrote a program that produces the file "albumDir.txt", shown below:



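    Each line between one A and the next A represents a track that belongs to that album. For example, album 9 has 112138 (trackId)|7863 (artistId)|600770|584872|247563 (all genreIds)... and so on; 12 lines in total hold the information for album 9.

    祐靖:
    Wrote a program that produces the file "genreDir.txt", shown below: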



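    Each line between one A and the next A represents a track that belongs to that genre. For example, genre 14741 has 437 (trackId)|586314 (albumId)|226747 (artistId)... and so on, all of which is information for genre 14741.

    延庭:
    Wrote a program that produces the file "artistDir.txt", shown below: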



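    Each line between one A and the next A represents a track that belongs to that artist. For example, artist 109 has 10745 (trackId)|319975 (artistId)|283375 (all genreIds)... and so on, all of which is information for artist 109.

    b. Next, we began discussing which method to use for predicting scores. Our basic idea comes from the 5/1 discussion, and we spent some time on it.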
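As an illustration of the directory files reported above, the sketch below groups track records by album and writes them out as blocks. It rests on assumptions that the text does not confirm: the input line layout `trackId|albumId|artistId|genreId|...`, the `A <albumId>` header line, and the function names `group_tracks_by_album` and `format_directory` are all hypothetical. Only the ids in the first and third sample lines come from the examples above; the second line is made up.

```python
from collections import defaultdict

def group_tracks_by_album(track_lines):
    """Group track records by album.

    track_lines -- lines assumed to look like 'trackId|albumId|artistId|genreId|...'
    Returns a dict from albumId to a list of 'trackId|artistId|genreId|...' strings.
    """
    albums = defaultdict(list)
    for line in track_lines:
        fields = line.strip().split("|")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        track_id, album_id, artist_id, *genre_ids = fields
        albums[album_id].append("|".join([track_id, artist_id, *genre_ids]))
    return dict(albums)

def format_directory(albums):
    """Render the grouping as text: one header line per album, then its tracks."""
    lines = []
    for album_id, members in albums.items():
        lines.append(f"A {album_id}")  # header layout is an assumption
        lines.extend(members)
    return lines

sample_lines = [
    "112138|9|7863|600770|584872|247563",
    "5|9|7863",
    "437|586314|226747|14741",
]
directory = format_directory(group_tracks_by_album(sample_lines))
print("\n".join(directory))
```

The same grouping works for "genreDir.txt" and "artistDir.txt" by keying on the genre or artist field instead of the album field.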
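4. Discussion up to 5/23
    a. Today we continued last week's discussion and reached a conclusion. We came up with a method for predicting scores. We think the method is quite intuitive, but we can also tell that it will take a long time to run.

    b. Finally, we drew the method as a flowchart, which we explain next.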




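    c. Work assignments after this discussion:

        i. 延庭 and 世穎 write the program for the method agreed on in this discussion;
        ii. 祐靖 writes a program that splits the files.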
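5. Discussion up to 5/27
    a. First, everyone reported their progress:
    世穎 and 延庭:
    Finished the program that implements the method from the last discussion, shown below: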


Input: the file to be predicted



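Output: the predicted scores
    b. We split trainIdx1.txt and testIdx1.txt into 100099 files each, then ran the program continuously for 4 days on 4 computers with four or more cores. This also produced 100099 files, with a total of 6005940 predicted scores. We then converted the format with the code provided by the organizers and uploaded the result.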
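The file splitting could look roughly like the sketch below. It is not our actual splitter: it assumes that each user's block starts with a header line containing "|" and that rating lines do not contain "|", and the name `split_by_user` and the `users_per_file` value are hypothetical. The totals quoted above work out to 6005940 / 100099 = 60 predicted scores per file.

```python
def split_by_user(lines, users_per_file):
    """Yield chunks of lines, each holding the blocks of at most users_per_file users.

    A line containing '|' is treated as a user header (an assumption about the format);
    every other line is treated as a rating line of the most recent user.
    """
    chunk, users_in_chunk = [], 0
    for line in lines:
        if "|" in line:
            if users_in_chunk == users_per_file:
                yield chunk
                chunk, users_in_chunk = [], 0
            users_in_chunk += 1
        chunk.append(line)
    if chunk:
        yield chunk

# Made-up sample: three users with 2, 1 and 1 rating lines.
sample = ["0|2", "1\t90", "9\t50", "1|1", "10\t30", "2|1", "20\t80"]
chunks = list(split_by_user(sample, users_per_file=2))
print(len(chunks))

# Consistency check on the totals quoted in the text.
assert 100099 * 60 == 6005940
```

Each chunk can be written to its own file, so that several machines can process different files independently and the outputs can be concatenated in order afterwards.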
 
 
 
 
 

 


5. Proof of Concept

    Submission on 6/1


6. Scheduling

    We have noted the important dates posted on the official website, and we keep preparing our work. Our team schedule is as follows:

Sunday | Monday             | Tuesday | Wednesday              | Thursday | Friday                 | Saturday
4/10   | 4/11 Group meeting | 4/12    | 4/13 Group meeting     | 4/14     | 4/15 Submission result | 4/16
4/17   | 4/18 Group meeting | 4/19    | 4/20 Group meeting     | 4/21     | 4/22 Submission result | 4/23
4/24   | 4/25 Group meeting | 4/26    | 4/27 Group meeting     | 4/28     | 4/29 Submission result | 4/30
5/1    | 5/2 Group meeting  | 5/3     | 5/4 Group meeting      | 5/5      | 5/6 Submission result  | 5/7
5/8    | 5/9 Group meeting  | 5/10    | 5/11 Group meeting     | 5/12     | 5/13 Submission result | 5/14
5/15   | 5/16 Group meeting | 5/17    | 5/18 Group meeting     | 5/19     | 5/20 Submission result | 5/21
5/22   | 5/23 Group meeting | 5/24    | 5/25 Group meeting     | 5/26     | 5/27 Submission result | 5/28
5/29   | 5/30 Group meeting | 5/31    | 6/1 Group meeting      | 6/2      | 6/3 Submission result  | 6/4
6/5    | 6/6 Group meeting  | 6/7     | 6/8 Group meeting      | 6/9      | 6/10 Submission result | 6/11
6/12   | 6/13 Group meeting | 6/14    | 6/15 Group meeting     | 6/16     | 6/17 Submission result | 6/18
6/19   | 6/20 Group meeting | 6/21    | 6/22 Group meeting     | 6/23     | 6/24 Submission result | 6/25
6/26   | 6/27 Group meeting | 6/28    | 6/29 Submission result | 6/30     | 7/1                    | 7/2

    Note: The competition ends on June 30, 2011.


7. Others

    Now we will introduce our team members. We come from the Speech and Voice Processing Lab at National Taiwan University of Science and Technology in Taiwan. We are all first-year master's students. The team has three members: Chris Chien (簡延庭), Lin Yu-Ching (林祐靖) and 張世穎.

Our assignments are as follows:

Chris Chien (簡延庭): Coding, updating the results and the website, researching open questions.

Lin Yu-Ching (林祐靖): Finding information about music, including papers; reading papers and reporting on what he finds.

張世穎: Coding, researching open questions, reading papers and reporting on what he finds.