Netflix Prize

Overview

When I heard about the Netflix Prize, I have to admit that I couldn't resist joining. The stated goal of the contest is to help Netflix improve its movie recommendation system. The first team to beat Netflix's own home-grown collaborative filtering system by 10% wins a million dollars. Like many others, I have doubts as to whether this feat is possible given the sparsity of the data and its inherent noise. Some speculate that the real goal is to fend off several software patents by demonstrating prior art. Either way it doesn't matter much to me, as I don't plan on winning this one :).

I joined this contest purely as a learning experience and to test out several new tools on this new wealth of demo data. At the time of writing, I am back on the leaderboard at 4% better than Netflix, thanks largely to some gracious hints from other contestants. The methods and results of my investigation are presented below. Also check out this demo application. Hopefully this info and sample code will be of help to others.

The Basics

After you join the contest, you are presented with over 2 GB of CSV files containing 100 million ratings. These ratings were made by 480K (anonymous) users on a library of almost 18K movies. Your goal is to make sense of the data and spit out a 20 MB file of predictions for a set of ratings not contained in the sample. These are then compared against the users' actual ratings, and you are given a score based on your error.

While 100 million ratings may sound like a lot, more would be better. There would be 8.5 BILLION ratings if every user rated every movie, which means we have just over 1% of a complete sample. While that ratio might be workable in many problem domains, here we have many users with only a few ratings alongside others who have rated over 10,000.

The charts below show some of the interesting aspects of the source data. They are presented using a charting solution I am evaluating and require a Flash player (even a 3-year-old version will do).

Overall Rating Distribution

The following simple pie chart shows the relative mix of 1-5 ratings in the sample. The most common rating found in the history was 4 and it occurred 33% of the time.

Average Rating Distribution

The following chart shows the number of users and movies with a given average rating (rounded to the nearest 0.5). Here, roughly 6K movies and almost 200K customers had an average rating of 3.5.

Rating Count Distribution

This chart shows the percentage of users and movies having AT MOST a given number of ratings. For example: 50% of customers had fewer than 92 ratings, and 60% of movies had fewer than 1,000 ratings.

Average Movie Rating By Movie Count

Here we can see a positive correlation between a movie's score and its number of ratings. This makes sense, as not many people would waste their time renting a movie they know is bad :)

Standard Deviation By Movie Average

Interestingly, movies judged extremely good or bad are also the ones rated most consistently. This is one good indicator of a theoretical max to rating improvement: even when taking a random rating from the most consistently loved movie, the average error from its mean is still almost 0.72.

The First Attempt

Having experience with sales and labor forecasting, my first impulse was to load all of the data into a database. (Ironic side note: several years ago I worked on tools used by Netflix's brick-and-click rival.) Within a day of entering, I tried some basic techniques, including manually filtering out boundary cases (like jokers who rate every movie a 1) and applying some simple adjusted averages. This solution was clean and quick, but far from the most accurate. An RMSE of 0.98 (3% behind Netflix) propelled me to the top of the early leaderboard, but I soon dropped off the charts. The following is an example of the SQL I used:

DECLARE @movie_stdev FLOAT; SET @movie_stdev = (SELECT AVG(stdev) FROM Movie)

UPDATE p
   SET rating = (CASE WHEN c.stdev < 0.2 AND c.rating_count > 5 THEN c.rating_avg
                 ELSE m.rating_avg + c.rating_offset/(@movie_stdev/m.stdev) END)
  FROM Probe p
  JOIN Movie m
    ON p.movie_id = m.movie_id
  JOIN Customer c
    ON p.customer_id = c.customer_id

UPDATE Probe SET rating = 5 WHERE rating >= 5
UPDATE Probe SET rating = 1 WHERE rating <= 1
UPDATE Probe SET rating_diff = rating_actual - rating

/* Calculate RMSE */
SELECT SQRT(SUM(SQUARE(rating_diff)) / COUNT(*))
  FROM Probe p

A Second Try

Next I tried using correlation to find 'look-alike' movies and customers. The idea is that if someone really likes the first Spiderman movie, odds are they will like the sequel as well. Additionally, if two customers have given similar ratings in the past, there is a good chance they will continue to do so in the future. Since there are far more customers than movies, it proved very difficult to find enough common rental history to compare customers. I then switched to correlating movies and had better results. The following are some of the pairings found by the algorithm. The interesting thing to note is that the system did not use any data about the movies' titles, categories, or actors. Everything it concluded was based solely on the likelihood of the two movies appearing together and the similarity of their ratings.

Movie                         Look-alike                               Similarity Score
The Fellowship of The Ring    The Two Towers                           3.21760
The Fellowship of The Ring    The Return of the King                   2.69326
Friends: Season 1             Best of Friends: Season 2                1.94697
Friends: Season 1             Friends: Season 3                        1.92805
Friends: Season 1             Friends: Season 4                        1.76589
Blue's Clues: Bluestock       Blue's Clues: Blue Talks                 0.96546
Blue's Clues: Bluestock       Blue's Clues: Blue's Room Snacktime      0.95598
Blue's Clues: Bluestock       Blue's Clues: Blue Takes You to School   0.84640
Sixteen Candles               The Breakfast Club                       0.76837
Sixteen Candles               Pretty in Pink                           0.68470
Bowling for Columbine         Fahrenheit 9/11                          0.63799

On the probe set these kinds of correlations had an error of only 0.74. While that was great news, my algorithm only found a small number of unique pairings (28,000). When averaged with all of the unpaired movies, my overall score improved by less than a point. Others had more success with Slope One and Pearson coefficient approaches to correlation, so I would still like to pursue this path further in the future.

Third Try is the Charm

Based on my prior results, I knew I had to find a solution that could handle sparsity well. Ideally, I wanted to reduce the set of 17,770 unique movies to a small number of categories that would be easier to correlate. From there I started reading several academic papers on dimensionality-reduction techniques (including Principal Components Analysis, Singular Value Decomposition, and Latent Semantic Analysis). While I was very excited by the prospects, my math skills were a little rusty and I didn't have the time to get back up to speed. So I did the prudent thing for industry types and waited for others to publish their results :). Thankfully, with hints from several of the contest leaders (and some especially detailed descriptions from Simon Funk's blog), I was able to implement an algorithm good enough to get back onto the leaderboard (at 0.9132, 4% better than Netflix).

Below are some of the (dimensional) categories inferred by the algorithm. Customers who like movies on one extreme are more likely to hate movies on the other. While the dimensions do turn out to be statistically relevant, they don't map to what we'd normally think of as Dramas or Comedies. I have tried to label the movies on each end of the spectrum as best I can.

Dimension 1

Offbeat / Dark-Comedy                    Mass-Market / 'Beniffer' Movies
Lost in Translation                      Pearl Harbor
The Royal Tenenbaums                     Armageddon
Dogville                                 The Wedding Planner
Eternal Sunshine of the Spotless Mind    Coyote Ugly
Punch-Drunk Love                         Miss Congeniality

Dimension 2

Good                                Twisted
VeggieTales: Bible Heroes: Lions    The Saddest Music in the World
The Best of Friends: Season 3       Wake Up
Felicity: Season 2                  I Heart Huckabees
Friends: Season 4                   Freddy Got Fingered
Friends: Season 5                   House of 1

Dimension 3

What a 10 year old boy would watch                  What a liberal woman would watch
Dragon Ball Z: Vol. 17: Super Saiyan                Fahrenheit 9/11
Battle Athletes Victory: Vol. 4: Spaceward Ho!      The Hours
Battle Athletes Victory: Vol. 5: No Looking Back    Going Upriver: The Long War of John Kerry
Battle Athletes Victory: Vol. 7: The Last Dance     Sex and the City: Season 2
Battle Athletes Victory: Vol. 2: Doubt and Conflict Bowling for Columbine

I will be posting more details of my results as time goes on. The biggest challenge right now is the tedious trial-and-error tuning of parameters (a full run can take as long as 8 hours). The plot below shows the results of training the first four features, and the difference between starting training from an intelligent average vs. all 1's:

Source Code

This is probably what most of you are interested in, and rightfully so. :)
Please share your results, though; I'd like to plot a graph of all of the different tuning parameters and their outcomes. The current parameters will give you one of the sets of results I blended to get my score.

//=============================================================================
//
// SVD Sample Code
//
// Copyright (C) 2007 Timely Development (www.timelydevelopment.com)
//
// Special thanks to Simon Funk and others from the Netflix Prize contest 
// for providing pseudo-code and tuning hints.
//
// Feel free to use this code as you wish as long as you include 
// these notices and attribution. 
//
// Also, if you have alternative types of algorithms for accomplishing 
// the same goal and would like to contribute, please share them as well :)
//
// STANDARD DISCLAIMER:
//
// - THIS CODE AND INFORMATION IS PROVIDED "AS IS" WITHOUT WARRANTY
// - OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT
// - LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR
// - FITNESS FOR A PARTICULAR PURPOSE.
//
//=============================================================================

#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <stdio.h>
#include <math.h>
#include <tchar.h>
#include <map>
using namespace std;

//===================================================================
//
// Constants and Type Declarations
//
//===================================================================
#define TRAINING_PATH   L"C:\\Netflix\\training_set*.txt"
#define TRAINING_FILE   L"C:\\Netflix\\training_set\\%s"
#define FEATURE_FILE    L"C:\\Netflix\\features.txt"
#define TEST_PATH       L"C:\\Netflix\\%s"
#define PREDICTION_FILE L"C:\\Netflix\\prediction.txt"

#define MAX_RATINGS     100480508     // Ratings in entire training set (+1)
#define MAX_CUSTOMERS   480190        // Customers in the entire training set (+1)
#define MAX_MOVIES      17771         // Movies in the entire training set (+1)
#define MAX_FEATURES    64            // Number of features to use 
#define MIN_EPOCHS      120           // Minimum number of epochs per feature
#define MAX_EPOCHS      200           // Max epochs per feature

#define MIN_IMPROVEMENT 0.0001        // Minimum improvement required to continue current feature
#define INIT            0.1           // Initialization value for features
#define LRATE           0.001         // Learning rate parameter
#define K               0.015         // Regularization parameter used to minimize over-fitting

typedef unsigned char BYTE;
typedef map<int, int> IdMap;
typedef IdMap::iterator IdItr;

struct Movie
{
    int         RatingCount;
    int         RatingSum;
    double      RatingAvg;            
    double      PseudoAvg;            // Weighted average used to deal with small movie counts 
};

struct Customer
{
    int         CustomerId;
    int         RatingCount;
    int         RatingSum;
};

struct Data
{
    int         CustId;
    short       MovieId;
    BYTE        Rating;
    float       Cache;
};

class Engine 
{
private:
    int             m_nRatingCount;                                 // Current number of loaded ratings
    Data            m_aRatings[MAX_RATINGS];                        // Array of ratings data
    Movie           m_aMovies[MAX_MOVIES];                          // Array of movie metrics
    Customer        m_aCustomers[MAX_CUSTOMERS];                    // Array of customer metrics
    float           m_aMovieFeatures[MAX_FEATURES][MAX_MOVIES];     // Array of features by movie (using floats to save space)
    float           m_aCustFeatures[MAX_FEATURES][MAX_CUSTOMERS];   // Array of features by customer (using floats to save space)
    IdMap           m_mCustIds;                                     // Map for one time translation of ids to compact array index

    inline double   PredictRating(short movieId, int custId, int feature, float cache, bool bTrailing=true);
    inline double   PredictRating(short movieId, int custId);

    bool            ReadNumber(wchar_t* pwzBufferIn, int nLength, int &nPosition, wchar_t* pwzBufferOut);
    bool            ParseInt(wchar_t* pwzBuffer, int nLength, int &nPosition, int& nValue);
    bool            ParseFloat(wchar_t* pwzBuffer, int nLength, int &nPosition, float& fValue);

public:
    Engine(void);
    ~Engine(void) { };

    void            CalcMetrics();
    void            CalcFeatures();
    void            LoadHistory();
    void            ProcessTest(wchar_t* pwzFile);
    void            ProcessFile(wchar_t* pwzFile);
};


//===================================================================
//
// Program Main
//
//===================================================================
int _tmain(int argc, _TCHAR* argv[])
{
    Engine* engine = new Engine();

    engine->LoadHistory();
    engine->CalcMetrics();
    engine->CalcFeatures();
    engine->ProcessTest(L"qualifying.txt");

    wprintf(L"\nDone\n");
    getchar();
    
    return 0;
}


//===================================================================
//
// Engine Class 
//
//===================================================================

//-------------------------------------------------------------------
// Initialization
//-------------------------------------------------------------------

Engine::Engine(void)
{
    m_nRatingCount = 0;

    for (int f=0; f<MAX_FEATURES; f++)
    {
        for (int i=0; i<MAX_MOVIES; i++) m_aMovieFeatures[f][i] = (float)INIT;
        for (int i=0; i<MAX_CUSTOMERS; i++) m_aCustFeatures[f][i] = (float)INIT;
    }
}

//-------------------------------------------------------------------
// Calculations - This Paragraph contains all of the relevant code
//-------------------------------------------------------------------

//
// CalcMetrics
// - Loop through the history and pre-calculate metrics used in the training 
// - Also re-number the customer id's to fit in a fixed array
//
void Engine::CalcMetrics()
{
    int i, cid;
    IdItr itr;

    wprintf(L"\nCalculating intermediate metrics\n");

    // Process each row in the training set
    for (i=0; i<m_nRatingCount; i++)
    {
        Data* rating = m_aRatings + i;

        // Increment movie stats
        m_aMovies[rating->MovieId].RatingCount++;
        m_aMovies[rating->MovieId].RatingSum += rating->Rating;
        
        // Add customers (using a map to re-number id's to array indexes) 
        itr = m_mCustIds.find(rating->CustId); 
        if (itr == m_mCustIds.end())
        {
            cid = 1 + (int)m_mCustIds.size();

            // Reserve new id and add lookup
            m_mCustIds[rating->CustId] = cid;

            // Store off old sparse id for later
            m_aCustomers[cid].CustomerId = rating->CustId;

            // Init vars to zero
            m_aCustomers[cid].RatingCount = 0;
            m_aCustomers[cid].RatingSum = 0;
        }
        else
        {
            cid = itr->second;
        }

        // Swap sparse id for compact one
        rating->CustId = cid;

        m_aCustomers[cid].RatingCount++;
        m_aCustomers[cid].RatingSum += rating->Rating;
    }

    // Do a follow-up loop to calc movie averages (movie ids are 1-based)
    for (i=1; i<MAX_MOVIES; i++)
    {
        Movie* movie = m_aMovies+i;
        if (movie->RatingCount < 1) continue; // skip unused slots to avoid divide-by-zero
        movie->RatingAvg = movie->RatingSum / (1.0 * movie->RatingCount);
        movie->PseudoAvg = (3.23 * 25 + movie->RatingSum) / (25.0 + movie->RatingCount);
    }
}

//
// CalcFeatures
// - Iteratively train each feature on the entire data set
// - Once sufficient progress has been made, move on
//
void Engine::CalcFeatures()
{
    int f, e, i, custId, cnt = 0;
    Data* rating;
    double err, p, sq, rmse_last = 2.0, rmse = 2.0;
    short movieId;
    float cf, mf;

    for (f=0; f<MAX_FEATURES; f++)
    {
        wprintf(L"\n--- Calculating feature: %d ---\n", f);

        // Keep looping until you have passed a minimum number 
        // of epochs or have stopped making significant progress 
        for (e=0; (e < MIN_EPOCHS) || (rmse <= rmse_last - MIN_IMPROVEMENT); e++)
        {
            cnt++;
            sq = 0;
            rmse_last = rmse;

            for (i=0; i<m_nRatingCount; i++)
            {
                rating = m_aRatings + i;
                movieId = rating->MovieId;
                custId = rating->CustId;

                // Predict rating and calc error
                p = PredictRating(movieId, custId, f, rating->Cache, true);
                err = (1.0 * rating->Rating - p);
                sq += err*err;
                
                // Cache off old feature values
                cf = m_aCustFeatures[f][custId];
                mf = m_aMovieFeatures[f][movieId];

                // Cross-train the features
                m_aCustFeatures[f][custId] += (float)(LRATE * (err * mf - K * cf));
                m_aMovieFeatures[f][movieId] += (float)(LRATE * (err * cf - K * mf));
            }
            
            rmse = sqrt(sq/m_nRatingCount);
                  
            wprintf(L"     <set x='%d' y='%f' />\n",cnt,rmse);
        }

        // Cache off old predictions
        for (i=0; i<m_nRatingCount; i++)
        {
            rating = m_aRatings + i;
            rating->Cache = (float)PredictRating(rating->MovieId, rating->CustId, f, rating->Cache, false);
        }            
    }
}

//
// PredictRating
// - During training there is no need to loop through all of the features
// - Use a cache for the leading features and do a quick calculation for the trailing
// - The trailing can be optionally removed when calculating a new cache value
//
double Engine::PredictRating(short movieId, int custId, int feature, float cache, bool bTrailing)
{
    // Get cached value for old features or default to an average
    double sum = (cache > 0) ? cache : 1; //m_aMovies[movieId].PseudoAvg; 

    // Add contribution of current feature
    sum += m_aMovieFeatures[feature][movieId] * m_aCustFeatures[feature][custId];
    if (sum > 5) sum = 5;
    if (sum < 1) sum = 1;

    // Add up trailing defaults values
    if (bTrailing)
    {
        sum += (MAX_FEATURES-feature-1) * (INIT * INIT);
        if (sum > 5) sum = 5;
        if (sum < 1) sum = 1;
    }

    return sum;
}

//
// PredictRating
// - This version is used for calculating the final results
// - It loops through the entire list of finished features
//
double Engine::PredictRating(short movieId, int custId)
{
    double sum = 1; //m_aMovies[movieId].PseudoAvg;

    for (int f=0; f<MAX_FEATURES; f++) 
    {
        sum += m_aMovieFeatures[f][movieId] * m_aCustFeatures[f][custId];
        if (sum > 5) sum = 5;
        if (sum < 1) sum = 1;
    }

    return sum;
}

//-------------------------------------------------------------------
// Data Loading / Saving
//-------------------------------------------------------------------

//
// LoadHistory
// - Loop through all of the files in the training directory
//
void Engine::LoadHistory()
{
    WIN32_FIND_DATA FindFileData;
    HANDLE hFind;
    bool bContinue = true;
    int count = 0; // TEST

    // Loop through all of the files in the training directory
    hFind = FindFirstFile(TRAINING_PATH, &FindFileData);
    if (hFind == INVALID_HANDLE_VALUE) return;
    
    while (bContinue) 
    {
        this->ProcessFile(FindFileData.cFileName);
        bContinue = (FindNextFile(hFind, &FindFileData) != 0);

        //if (++count > 999) break; // TEST: Uncomment to only test with the first X movies
    } 

    FindClose(hFind);
}

//
// ProcessFile
// - Load a history file in the format:
//
//   <MovieId>:
//   <CustomerId>,<Rating>
//   <CustomerId>,<Rating>
//   ...
void Engine::ProcessFile(wchar_t* pwzFile)
{
    FILE *stream;
    wchar_t pwzBuffer[1000];
    wsprintf(pwzBuffer,TRAINING_FILE,pwzFile);
    int custId, movieId, rating, pos = 0;
    
    wprintf(L"Processing file: %s\n", pwzBuffer);

    if (_wfopen_s(&stream, pwzBuffer, L"r") != 0) return;

    // First line is the movie id
    fgetws(pwzBuffer, 1000, stream);
    ParseInt(pwzBuffer, (int)wcslen(pwzBuffer), pos, movieId);
    m_aMovies[movieId].RatingCount = 0;
    m_aMovies[movieId].RatingSum = 0;

    // Get all remaining rows
    fgetws(pwzBuffer, 1000, stream);
    while ( !feof( stream ) )
    {
        pos = 0;
        ParseInt(pwzBuffer, (int)wcslen(pwzBuffer), pos, custId);
        ParseInt(pwzBuffer, (int)wcslen(pwzBuffer), pos, rating);
        
        m_aRatings[m_nRatingCount].MovieId = (short)movieId;
        m_aRatings[m_nRatingCount].CustId = custId;
        m_aRatings[m_nRatingCount].Rating = (BYTE)rating;
        m_aRatings[m_nRatingCount].Cache = 0;
        m_nRatingCount++;

        fgetws(pwzBuffer, 1000, stream);
    }

    // Cleanup
    fclose( stream );
}

//
// ProcessTest
// - Load a sample set in the following format
//
//   <Movie1Id>:
//   <CustomerId>
//   <CustomerId>
//   ...
//   <Movie2Id>:
//   <CustomerId>
//
// - And write results:
//
//   <Movie1Id>:
//   <Rating>
//   <Rating>
//   ...
void Engine::ProcessTest(wchar_t* pwzFile)
{
    FILE *streamIn, *streamOut;
    wchar_t pwzBuffer[1000];
    int custId, movieId, pos = 0;
    double rating;
    bool bMovieRow;

    wsprintf(pwzBuffer, TEST_PATH, pwzFile);
    wprintf(L"\n\nProcessing test: %s\n", pwzBuffer);

    if (_wfopen_s(&streamIn, pwzBuffer, L"r") != 0) return;
    if (_wfopen_s(&streamOut, PREDICTION_FILE, L"w") != 0) return;

    fgetws(pwzBuffer, 1000, streamIn);
    while ( !feof( streamIn ) )
    {
        // A row containing ':' introduces the next movie id
        bMovieRow = false;
        for (int i=0; i<(int)wcslen(pwzBuffer); i++)
        {
            bMovieRow |= (pwzBuffer[i] == L':');
        }

        pos = 0;
        if (bMovieRow)
        {
            ParseInt(pwzBuffer, (int)wcslen(pwzBuffer), pos, movieId);

            // Write same row to results
            fputws(pwzBuffer,streamOut); 
        }
        else
        {
            ParseInt(pwzBuffer, (int)wcslen(pwzBuffer), pos, custId);
            custId = m_mCustIds[custId];
            rating = PredictRating(movieId, custId);

            // Write predicted value
            swprintf(pwzBuffer,1000,L"%5.3f\n",rating);
            fputws(pwzBuffer,streamOut);
        }

        //wprintf(L"Got Line: %d %d %d \n", movieId, custId, rating);
        fgetws(pwzBuffer, 1000, streamIn);
    }

    // Cleanup
    fclose( streamIn );
    fclose( streamOut );
}

//-------------------------------------------------------------------
// Helper Functions
//-------------------------------------------------------------------
bool Engine::ReadNumber(wchar_t* pwzBufferIn, int nLength, int &nPosition, wchar_t* pwzBufferOut)
{
    int count = 0;
    int start = nPosition;
    wchar_t wc = 0;

    // Find start of number
    while (start < nLength)
    {
        wc = pwzBufferIn[start];
        if ((wc >= L'0' && wc <= L'9') || (wc == L'-')) break;
        start++;
    }

    // Copy each character into the output buffer
    nPosition = start;
    while (nPosition < nLength && ((wc >= L'0' && wc <= L'9') || wc == L'E' || wc == L'e' || wc == L'-' || wc == L'.'))
    {
        pwzBufferOut[count++] = wc;
        wc = pwzBufferIn[++nPosition];
    }

    // Null terminate and return
    pwzBufferOut[count] = 0;
    return (count > 0);
}

bool Engine::ParseFloat(wchar_t* pwzBuffer, int nLength, int &nPosition, float& fValue)
{
    wchar_t pwzNumber[20];
    bool bResult = ReadNumber(pwzBuffer, nLength, nPosition, pwzNumber);
    fValue = (bResult) ? (float)_wtof(pwzNumber) : 0;
    return bResult;
}

bool Engine::ParseInt(wchar_t* pwzBuffer, int nLength, int &nPosition, int& nValue)
{
    wchar_t pwzNumber[20];
    bool bResult = ReadNumber(pwzBuffer, nLength, nPosition, pwzNumber);
    nValue = (bResult) ? _wtoi(pwzNumber) : 0;
    return bResult;
}
    

Suggested Reading