Neural Network HW - KDD CUP 2011

B9632006許育翔

1. Problem Understanding

2. Issue

3. Survey

4. Proposed approach

5. Proof of concept

6. Scheduling

7. Progress report


1. Problem Understanding

In this year's KDD Cup, participants are asked to predict the ratings that Yahoo! Music users gave to musical items. The competition is divided into two tracks:

Track1: Learning to predict users' ratings of musical items. Items can be tracks, albums, artists and genres. Items form a hierarchy, such that each track belongs to an album, albums belong to artists, and together they are tagged by genres.

Track2: Learning to separate tracks rated highly by specific users from tracks not rated by them. In Track2 the test set includes six items per user (all tracks), three of which were rated highly (a score of 80 or higher) by the user and three of which were not rated by the user. The three unrated items are sampled with probability proportional to the number of their high (>=80) ratings. The task is to classify each item as either rated or not rated by the user (1 or 0, respectively). A hierarchy of items similar to the one used in Track 1 is also given for Track 2. However, the timestamps of the ratings, which are given in Track 1, are withheld for Track 2.
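The negative-sampling rule above (unrated tracks drawn with probability proportional to their number of high ratings) can be simulated with a short sketch. The function and data names here are illustrative, not part of the official competition tools:

```python
import random

def sample_negatives(high_rating_counts, rated_by_user, k=3, rng=None):
    """Sample k tracks the user has NOT rated, with probability
    proportional to each track's number of high (>=80) ratings."""
    rng = rng or random.Random(0)
    candidates = [t for t in high_rating_counts if t not in rated_by_user]
    weights = [high_rating_counts[t] for t in candidates]
    # random.choices samples with replacement, so resample until
    # k distinct tracks have been collected
    chosen = set()
    while len(chosen) < k:
        chosen.update(rng.choices(candidates, weights=weights,
                                  k=k - len(chosen)))
    return sorted(chosen)
```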


2. Issue

Because the datasets form a hierarchy, it is natural to pre-process them into tree structures. However, the datasets are extremely large (the Track 2 set alone exceeds 5 GB), so keeping all the data in memory is infeasible. Finding an efficient way to process such large datasets is an important issue at the beginning of the contest.
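One way around the memory limit is to stream the file line by line and keep only small aggregates. This sketch assumes a simplified 'itemID score' line format for illustration; the real KDD Cup files are laid out differently:

```python
from collections import defaultdict

def high_rating_counts(path, threshold=80):
    """Stream a ratings file line by line so a multi-gigabyte file
    never has to fit in memory; only per-item counters are kept.
    The 'itemID score' format is an assumption for illustration."""
    counts = defaultdict(int)
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2 or not parts[1].isdigit():
                continue  # skip headers / malformed lines
            if int(parts[1]) >= threshold:
                counts[parts[0]] += 1
    return counts
```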

Submissions are limited to one every 8 hours, so finding an offline validation method is also an important issue. The method must faithfully represent the whole dataset and must be designed to avoid overfitting.


3. Survey

From their experience in last year's competition [1], Professor Chih-Jen Lin and his team found that both sparse and condensed features can achieve good results. To improve further, they assigned weights to each method and merged the two results into one, which lowered the RMSE even more. They also used some tricks to explore the answers for the whole dataset. In the end, their effort (and some luck) won them first prize.

Although this year's topic is quite different from last year's, the techniques they tried are still worth attempting.

[1] Hsiang-Fu Yu, Hung-Yi Lo, Hsun-Ping Hsieh, Jing-Kai Lou, Todd G. McKenzie, Jung-Wei Chou, Po-Han Chung, Chia-Hua Ho, Chun-Fu Chang, Yin-Hsuan Wei, Jui-Yu Weng, En-Syu Yan, Che-Wei Chang, Tsung-Ting Kuo, Yi-Chen Lo, Po Tzu Chang, Chieh Po, Chien-Yuan Wang, Yi-Hung Huang, Chen-Wei Hung, Yu-Xun Ruan, Yu-Shi Lin, Shou-de Lin, Hsuan-Tien Lin, Chih-Jen Lin, Feature Engineering and Classifier Ensemble for KDD Cup 2010, submitted to JMLR Workshop and Conference Proceedings.


4. Proposed approach

I will use Weka, Matlab, and LIBSVM as my main classification tools, and try different machine learning methods such as SVMs and neural networks. I will find the best tuning configuration for each method, then combine the results to see whether an even better result emerges.

I will also try various encoding methods for the datasets, such as expanding dimensions and combining multiple features into one, to lower the RMSE.
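As a rough illustration of the two encoding ideas, dimension expansion and feature combination, sketched in Python rather than my actual tools:

```python
def one_hot(value, vocabulary):
    """Expand a categorical ID into a sparse 0/1 vector (dimension
    expansion), a common encoding for linear models and SVMs."""
    vec = [0] * len(vocabulary)
    vec[vocabulary.index(value)] = 1
    return vec

def cross_feature(a, b):
    """Combine two features into one by concatenating their IDs,
    so a model can learn interactions such as (artist, genre)."""
    return f"{a}_x_{b}"
```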


5. Proof of concept

Most of my approach comes from Feature Engineering and Classifier Ensemble for KDD Cup 2010 by Professor Chih-Jen Lin and his team. I believe their ideas from last year's KDD Cup may still work this year, though I will have to find further key points specific to this year's topic.


6. Scheduling

Date Tasks
2011.03.01 *Registration Opens
2011.03.15 *Competition Begins
2011.03.31 Account registered
2011.04.15 Data analysis complete
2011.04.30 Complete initial model
2011.05.10 Try other feature combinations
2011.05.25 Tune configurations
2011.06.10 Combine different results
2011.06.20 Result perfection
2011.06.30 *Competition Ends
2011.07.03 *Winners Notified


7. Progress report

== 2011.05.01 ==

I tried assigning a rating of 80 to every item in the test data, which returned an RMSE of 34.95. I also tried the sample code provided by Yahoo!. The code only uses the item ID, without the hierarchy structure. With a stochastic gradient descent algorithm it achieved an RMSE of 28.98, which is much better than assigning every rating to 80.
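I do not know the internals of Yahoo!'s sample code, but a global-mean-plus-item-bias model trained with SGD is a minimal guess at its spirit; here it is sketched together with the RMSE metric:

```python
import math

def rmse(preds, truth):
    """Root-mean-square error, the Track 1 evaluation metric."""
    return math.sqrt(sum((p - t) ** 2
                         for p, t in zip(preds, truth)) / len(truth))

def sgd_item_bias(ratings, n_epochs=20, lr=0.01, reg=0.01):
    """Predict global mean + per-item offset, trained with plain SGD.
    `ratings` is a list of (user, item, score) triples. This is an
    illustrative guess, not the actual sample code."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    bias = {}
    for _ in range(n_epochs):
        for _, item, r in ratings:
            b = bias.get(item, 0.0)
            err = r - (mu + b)           # prediction error
            bias[item] = b + lr * (err - reg * b)  # regularized update
    return mu, bias
```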

Future work: based on the sample code, take the hierarchy structure into consideration and tune the weights of artists, albums, and genres. Also, because of the one-upload-per-8-hours limit, I will combine the existing validation dataset with a randomly chosen 20% of the training dataset as an offline validation set.
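The planned offline split could look like the following sketch; the 20% fraction comes from the plan above, and everything else is illustrative:

```python
import random

def offline_split(train, frac=0.2, seed=0):
    """Hold out a random fraction of the training ratings, to be
    merged with the official validation set for offline RMSE checks."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    kept, held = [], []
    for row in train:
        (held if rng.random() < frac else kept).append(row)
    return kept, held
```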

== 2011.06.12 ==

I rearranged the training data into a text file with the format 'userID trackID albumID artistID 1st_genreID' and tried to import it into Matlab. But Matlab's text import tool seemed unable to handle such a large file; it just showed a blank screen and a disabled "Next" button. To work around this, I split each user's training data into a separate file, which produced 1,000,990 text files.

Figure: a modified training data file

While outputting those files, the output rate became slower and slower. By the fourth day it was progressing only about 3 percent per day. I suspect NTFS cannot handle so many files in one directory, so I halted the stuck program and modified the code to output the training files into a hierarchy of directories.
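The directory-hierarchy idea can be sketched as mapping each user ID to a two-level path, so no single directory ever holds more than about a thousand files; the fan-out of 1000 is an illustrative choice, not necessarily the value I used:

```python
import os

def user_path(root, user_id, fanout=1000):
    """Map a numeric user ID to a two-level directory so no single
    NTFS directory holds more than ~fanout files."""
    top = user_id // (fanout * fanout)   # first directory level
    mid = (user_id // fanout) % fanout   # second directory level
    return os.path.join(root, str(top), str(mid), f"{user_id}.txt")
```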

Figure: the resulting file hierarchy (fileHierachy.png)

I also split each user's training labels and test data into the same file structure, and the decaying-output-rate problem disappeared.

Now I am training with a batch script in Matlab. Because loading files frequently adds overhead and slows down training, I run four Matlab instances simultaneously and let them train on different parts of the data, like multithreading, to speed up the training.
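The same divide-the-data-and-run-in-parallel idea can be sketched in Python with a process pool instead of separate Matlab instances; train_shard is a placeholder that just sums its shard so the pattern is runnable on its own:

```python
from multiprocessing import Pool

def train_shard(shard):
    """Placeholder for training on one shard of users; summing the
    shard stands in for the real per-shard training work."""
    return sum(shard)

def train_all(shards, workers=4):
    """Run independent shards in parallel, mirroring four Matlab
    instances each working on a different part of the data."""
    with Pool(workers) as pool:
        return pool.map(train_shard, shards)
```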

Figure: training with four Matlab instances
