Progress Report KDD Cup Project 2011

Ratih NE Anggraini - M9915804

 

Problem Definition

Proposed Approach

Tools

Issues

Result

References

Problem Understanding

Yahoo! Music has collected billions of user ratings for musical pieces. When properly analyzed, the raw ratings encode information on how songs are grouped, which hidden patterns link various albums, which artists complement each other, and above all, which songs users would like to listen to. KDD Cup 2011 challenges contestans to find this patterns link. The KDD Cup contest releases over 300 million ratings performed by over 1 million anonymized users. The ratings are given to different types of items-songs, albums, artists, genres-all tied together within a known taxonomy.

Track1 :

In this track, we are asked to learn to predict users's ratings of musical items in test data. Items can be tracks, albums, artists and genres. Items form a hierarchy, such that each track belongs to an album, albums belong to artists, and together they are tagged by genres. It is aimed at predicting scores that users gave to various items. We can analyze the link patterns in the train data to predict the ratings in the test data.

Track2 :

We need to learn how to separate tracks scored highly by specific users from tracks not scored by them. In track2 the test set includes six items per user (all are tracks), three of which were rated highly (score 80 or higher) by the user and three were not rated by the user. The three unrated items are sampled with a probability proportional to number of their high (>=80) ratings. The task is to classify each item as either rated or not rated by the user (1 or 0 respectively). The goal of such a task would be differentiating high ratings from missing ones, which requires extending the generalization power of the learning algorithm to the truly missing entries, as required in real life scenarios. A hierarchy of items similar to the one used in Track 1 is also given for Track 2. However, timestamps of the ratings, which are given in Track 1, are withheld for Track 2. It requires separation of loved songs from other songs

Proposed Approach

Artificial Neural Networks

Artificial Neural Network (ANN), or usually just called neural network, is a mathematical or computational model inspired by the structure and functional aspects of biological neural networks. A neural network consists of an interconnected group of artificial neurons, and it processes information using a connectionist approach to computation. In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase.

The word network in the term 'artificial neural network' refers to the inter–connections between the neurons in the different layers of each system. An example system has three layers. The first layer has input neurons, which send data via synapses to the second layer of neurons, and then via more synapses to the third layer of output neurons. More complex systems will have more layers of neurons with some having increased layers of input neurons and output neurons. The synapses store parameters called "weights" that manipulate the data in the calculations.

An ANN is typically defined by three types of parameters:

  1. The interconnection pattern between different layers of neurons
  2. The learning process for updating the weights of the interconnections
  3. The activation function that converts a neuron's weighted input to its output activation.

 

Resilient Backpropagation

RPROP, short for resilient backpropagation, is a learning heuristic for supervised learning in feedforward artificial neural networks. This is a first-order optimization algorithm. This algorithm was created by Martin Riedmiller and Heinrich Braun in 1992.

The RPROP algorithm is usually the most efficient training algorithm provided by Encog for supervised feedforward neural networks. One particular advantage to the RPROP algorithm is that it requires no setting of parameters before using it. There are no learning rates, momentum values or update constants that need to be determined. This is good because it can be difficult to determine the exact learning rate that might be optimal.

The RPROP algorithms works similar to the Manhattan update rule, in that only the magnitude of the descent is used. However, rather than using a fixed constant to update the weights and threshold values, a much more granular approach is used. These deltas will not remain fixed, like in the Manhattan update rule or backpropagation algorithm. Rather these delta values will change as training progresses.

The RPROP algorithm does not keep one global update value, or delta. Rather, individual deltas are kept for every threshold and weight matrix value. These deltas are first initialized to a very small number. Every iteration through the RPROP algorithm will update the weight and threshold values according to these delta values. However, as previously mentioned, these delta values do not remain fixed. The gradient is used to determine how they should change, using the magnitude to determine how the deltas should be modified further. This allows every individual threshold and weight matrix value to be individually trained. This is an advantage that is not provided by either the backpropagation algorithm or the Manhattan update rule.

Tools

Encog Framework

Encog is an Artificial Intelligence (AI) Framework for Java and .NET. Though Encog supports several areas of AI outside of neural networks, the primary focus of Encog 2.x is Neural Networks Programming.

Other Tools

  1. MS Visual Studio 2010 C#.Net as the programming language
  2. XAMPP MySQL as the database

Issues

The main issues in KDD Cup are the dataset.

  1. High dimension of data
    • The data has many hierarchical features
  2. Large instance data
    • Data training, validation and test have millions instances.
    • I try to overcome this issue by training several data instead of all data. I trained only 10 data peruser.
    • But the computer resources still have some burden and can not handle massive amount of data. It encounter several times error during data training.
  3. Incomplete (missing data)
    • Some data have missing values that can affect training.
  4. Noise
  5. Unbalance data

Result

I encounter many times error during data training so I can not complete the training. Therefore I try another approch to get the score by calculate the average score from data training then applied it as the score in the data test. The following picture is the RMSE of test using average score from data training.

Result

References

  1. http://en.wikipedia.org/wiki/Rprop accessed on May 30, 2011
  2. "Introduction to Encog 2.5 for C#", Heaton Research, Inc, 2010
  3. http://www.heatonresearch.com/encog accessed on April 29, 2011