When I heard about the Netflix Prize,
I have to admit that I couldn't resist joining.
The stated goal of this contest is to help Netflix improve their movie recommendation system.
The team that can beat Netflix's own home-grown collaborative filtering system by 10% will win a million dollars.
Like many others, I have doubts as to whether this feat is possible given the sparsity of data and inherent noise.
Some speculate that the real goal is to fend off several software patents by demonstrating prior art.
It doesn't matter much in either case, as I don't plan on winning this one :).

I joined this contest purely as a learning experience and to test out several new tools using this new wealth of demo data. At the time of writing, I am back on the leaderboard at 4% better than Netflix, but only due to some gracious hints from some of the other contestants. The methods and results of my investigation are presented below. Also check out this demo application. Hopefully this info and sample code will be of help to others.

After you join the contest, you are presented with over 2 gigs of csv files containing 100 million ratings.
These ratings were made by 480K (anonymous) users on a library of almost 18K movies.
Your goal is to make sense of the data and spit out a 20 mb file of results for a set of ratings not contained in the sample.
These are then compared against the users' actual ratings and you are given a score based on your error.
### Overall Rating Distribution

The following simple pie chart shows the relative mix of 1-5 ratings in the sample.
The most common rating found in the history was 4, occurring 33% of the time.
### Average Rating Distribution

The following chart shows the number of users and movies with a given average (rounded to the nearest 0.5).
At the peak, roughly 6K movies and almost 200K customers had an average rating of 3.5.
### Rating Count Distribution

This chart shows the percentage of users and movies having at most a given number of ratings.
For example: 50% of customers had fewer than 92 ratings,
and 60% of movies had fewer than 1,000 ratings.
### Average Movie Rating By Movie Count

Here we can see that there is a positive correlation between a movie's average rating and its number of ratings.
This makes sense, as not many people would waste their time renting a movie they know is bad. :)
### Standard Deviation By Movie Average

Interestingly, we can see that movies judged extremely good or bad are rated **consistently** good or bad.
This is one good indicator of a theoretical ceiling on rating improvement:
even for the most consistently loved movie, a randomly chosen rating still differs from that movie's mean by almost 0.72 on average.

While 100 million ratings may sound like a lot, the more data the better. There would actually be 8.5 BILLION ratings if every user rated every movie. That means we have just over 1% of a complete sample. While that amount may be good for many problem domains, in this case we have many users with only a few ratings and others who have rated over 10,000.

The charts below show some of the interesting aspects of the source data.
They are presented using a charting solution I am evaluating, and they require a Flash player (even a 3-year-old version will do).

Having experience with sales and labor forecasting, my first impulse was to load all of the data into a database.
(Ironic side note: several years ago I worked on tools used by Netflix's brick-and-click rival).
Within a day of entering, I tried using some basic techniques including manually filtering out some boundary cases
(like jokers who rate all movies 1's)
and applying some simple adjusted averages.
This solution was very clean and quick, but not nearly the most accurate.
An RMSE of 0.98 (3% behind Netflix) was enough to propel me to the top of the leaderboard at the time, though I soon dropped off the charts.
The following is an example of the SQL I used:

```sql
DECLARE @movie_stdev FLOAT;
SET @movie_stdev = (SELECT AVG(stdev) FROM Movie)

UPDATE Probe
SET rating = (CASE WHEN c.stdev < 0.2 AND c.rating_count > 5
                   THEN c.rating_avg
                   ELSE m.rating_avg + c.rating_offset / (@movie_stdev / m.stdev)
              END)
FROM Probe p
JOIN Movie m ON p.movie_id = m.movie_id
JOIN Customer c ON p.customer_id = c.customer_id

UPDATE Probe SET rating = 5 WHERE rating >= 5
UPDATE Probe SET rating = 1 WHERE rating <= 1
UPDATE Probe SET rating_diff = rating_actual - rating

/* Calculate RMSE */
SELECT SQRT(SUM(SQUARE(rating_diff)) / COUNT(*)) FROM Probe p
```

Next I tried using correlation to find 'look-alike' movies and customers.
The idea is that if someone really likes the first Spiderman movie, odds are they will like the sequel as well.
Additionally, if two customers have given similar ratings in the past, there is a good chance they will continue to do so in the future.
As there are a much larger number of customers than movies,
it proved very difficult to find enough common rental history to compare customers.
I then switched to correlating movies and had better results.
The following are some of the examples found by the algorithm.
The interesting thing to note is that the system did not use any data about the movies' names, categories, or actors.
Everything it concluded was based solely on the likelihood of the two movies appearing together and the similarity of their ratings.

Movie | Look-alike | Similarity Score |
---|---|---|
The Fellowship of The Ring | The Two Towers | 3.21760 |
The Fellowship of The Ring | The Return of the King | 2.69326 |
Friends: Season 1 | Best of Friends: Season 2 | 1.94697 |
Friends: Season 1 | Friends: Season 3 | 1.92805 |
Friends: Season 1 | Friends: Season 4 | 1.76589 |
Blue's Clues: Bluestock | Blue's Clues: Blue Talks | 0.96546 |
Blue's Clues: Bluestock | Blue's Clues: Blue's Room Snacktime | 0.95598 |
Blue's Clues: Bluestock | Blue's Clues: Blue Takes You to School | 0.84640 |
Sixteen Candles | The Breakfast Club | 0.76837 |
Sixteen Candles | Pretty in Pink | 0.68470 |
Bowling for Columbine | Fahrenheit 9/11 | 0.63799 |
On the probe set these kinds of correlations had an error of only 0.74. While that was great news, my algorithm only found a small number of unique pairings (28,000). When averaged with all of the unpaired movies, my overall score improved by less than a point. Others had more success using Slope One and Pearson Coefficient approaches to correlation, so in the future I would still like to pursue this path further.

Based on my prior results, I knew I had to find a solution that could handle sparsity well.
Ideally I would like to reduce the set of 17770 unique movies to a small number of categories
that would be easier to correlate.
From there I started reading several academic papers on dimensionality reducing techniques (including
Principal Components Analysis,
Singular Value Decomposition,
and Latent Semantic Analysis).
While I was very excited by the prospects, my math skills were a little rusty and I didn't have the time to get back up to speed.
So I did the prudent thing for industry types and waited for others to publish their results :).
Thankfully, with hints from several of the contest leaders (and some especially detailed descriptions from Simon Funk's blog),
I was able to implement an algorithm good enough to get back onto the leaderboard (at 0.9132 and 4% better than Netflix).
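The heart of that approach is an incremental gradient-descent update: each observed rating nudges one customer feature and one movie feature in the direction that reduces the prediction error, while a regularization term K pulls them back toward zero to limit over-fitting. In the notation of the sample code at the bottom of this page (lrate = 0.001, K = 0.015), with c_f and m_f the customer and movie values for the feature currently being trained:

```latex
e   = r - \hat{r}                                   % error on one observed rating
c_f \leftarrow c_f + \mathrm{lrate}\,(e \cdot m_f - K \cdot c_f)   % customer feature update
m_f \leftarrow m_f + \mathrm{lrate}\,(e \cdot c_f - K \cdot m_f)   % movie feature update
```

Both updates use the values of c_f and m_f from before the step, which is why the code caches them into temporaries first.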
Below are some of the (dimensional) categories inferred by the algorithm. Customers who like movies at one extreme are more likely to dislike movies at the other. While the dimensions do turn out to be statistically relevant, they don't map to what we'd normally think of as Dramas or Comedies. I have tried to label the movies on each end of the spectrum as best I am able.

### Dimension 1

Offbeat / Dark-Comedy | Mass-Market / 'Bennifer' Movies |
---|---|
Lost in Translation | Pearl Harbor |
The Royal Tenenbaums | Armageddon |
Dogville | The Wedding Planner |
Eternal Sunshine of the Spotless Mind | Coyote Ugly |
Punch-Drunk Love | Miss Congeniality |

### Dimension 2

Good | Twisted |
---|---|
VeggieTales: Bible Heroes: Lions | The Saddest Music in the World |
The Best of Friends: Season 3 | Wake Up |
Felicity: Season 2 | I Heart Huckabees |
Friends: Season 4 | Freddy Got Fingered |
Friends: Season 5 | House of 1 |

### Dimension 3

What a 10-year-old boy would watch | What a liberal woman would watch |
---|---|
Dragon Ball Z: Vol. 17: Super Saiyan | Fahrenheit 9/11 |
Battle Athletes Victory: Vol. 4: Spaceward Ho! | The Hours |
Battle Athletes Victory: Vol. 5: No Looking Back | Going Upriver: The Long War of John Kerry |
Battle Athletes Victory: Vol. 7: The Last Dance | Sex and the City: Season 2 |
Battle Athletes Victory: Vol. 2: Doubt and Conflict | Bowling for Columbine |

I will be posting more details of my results as time goes on.
The biggest challenge right now is the tedious trial and error tuning of parameters (a full run can take as long as 8 hours).
The plot below shows the results of training the first four features, and the difference
between seeding the features with an intelligent average at the start of training vs. using all 1's:

This is probably what most of you are interested in, and rightfully so. :)

Though please share your results. I'd like to plot a graph of all of the different tuning parameters and their outcomes. The current parameters will give you one of the sets of results I blended to get my score.

```cpp
//=============================================================================
//
// SVD Sample Code
//
// Copyright (C) 2007 Timely Development (www.timelydevelopment.com)
//
// Special thanks to Simon Funk and others from the Netflix Prize contest
// for providing pseudo-code and tuning hints.
//
// Feel free to use this code as you wish as long as you include
// these notices and attribution.
//
// Also, if you have alternative types of algorithms for accomplishing
// the same goal and would like to contribute, please share them as well :)
//
// STANDARD DISCLAIMER:
//
// - THIS CODE AND INFORMATION IS PROVIDED "AS IS" WITHOUT WARRANTY
// - OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT
// - LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR
// - FITNESS FOR A PARTICULAR PURPOSE.
//
//=============================================================================

#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <stdio.h>
#include <math.h>
#include <tchar.h>
#include <map>

using namespace std;

//===================================================================
//
// Constants and Type Declarations
//
//===================================================================

#define TRAINING_PATH   L"C:\\Netflix\\training_set\\*.txt"
#define TRAINING_FILE   L"C:\\Netflix\\training_set\\%s"
#define FEATURE_FILE    L"C:\\Netflix\\features.txt"
#define TEST_PATH       L"C:\\Netflix\\%s"
#define PREDICTION_FILE L"C:\\Netflix\\prediction.txt"

#define MAX_RATINGS     100480508   // Ratings in entire training set (+1)
#define MAX_CUSTOMERS   480190      // Customers in the entire training set (+1)
#define MAX_MOVIES      17771       // Movies in the entire training set (+1)
#define MAX_FEATURES    64          // Number of features to use
#define MIN_EPOCHS      120         // Minimum number of epochs per feature
#define MAX_EPOCHS      200         // Max epochs per feature

#define MIN_IMPROVEMENT 0.0001      // Minimum improvement required to continue current feature
#define INIT            0.1         // Initialization value for features
#define LRATE           0.001       // Learning rate parameter
#define K               0.015       // Regularization parameter used to minimize over-fitting

typedef unsigned char BYTE;
typedef map<int, int> IdMap;
typedef IdMap::iterator IdItr;

struct Movie
{
    int    RatingCount;
    int    RatingSum;
    double RatingAvg;
    double PseudoAvg;   // Weighted average used to deal with small movie counts
};

struct Customer
{
    int CustomerId;
    int RatingCount;
    int RatingSum;
};

struct Data
{
    int   CustId;
    short MovieId;
    BYTE  Rating;
    float Cache;
};

class Engine
{
private:
    int      m_nRatingCount;                            // Current number of loaded ratings
    Data     m_aRatings[MAX_RATINGS];                   // Array of ratings data
    Movie    m_aMovies[MAX_MOVIES];                     // Array of movie metrics
    Customer m_aCustomers[MAX_CUSTOMERS];               // Array of customer metrics
    float    m_aMovieFeatures[MAX_FEATURES][MAX_MOVIES];     // Array of features by movie (using floats to save space)
    float    m_aCustFeatures[MAX_FEATURES][MAX_CUSTOMERS];   // Array of features by customer (using floats to save space)
    IdMap    m_mCustIds;                                // Map for one time translation of ids to compact array index

    inline double PredictRating(short movieId, int custId, int feature, float cache, bool bTrailing=true);
    inline double PredictRating(short movieId, int custId);

    bool ReadNumber(wchar_t* pwzBufferIn, int nLength, int &nPosition, wchar_t* pwzBufferOut);
    bool ParseInt(wchar_t* pwzBuffer, int nLength, int &nPosition, int& nValue);
    bool ParseFloat(wchar_t* pwzBuffer, int nLength, int &nPosition, float& fValue);

public:
    Engine(void);
    ~Engine(void) { };

    void CalcMetrics();
    void CalcFeatures();
    void LoadHistory();
    void ProcessTest(wchar_t* pwzFile);
    void ProcessFile(wchar_t* pwzFile);
};

//===================================================================
//
// Program Main
//
//===================================================================
int _tmain(int argc, _TCHAR* argv[])
{
    Engine* engine = new Engine();

    engine->LoadHistory();
    engine->CalcMetrics();
    engine->CalcFeatures();
    engine->ProcessTest(L"qualifying.txt");

    wprintf(L"\nDone\n");
    getchar();

    return 0;
}

//===================================================================
//
// Engine Class
//
//===================================================================

//-------------------------------------------------------------------
// Initialization
//-------------------------------------------------------------------

Engine::Engine(void)
{
    m_nRatingCount = 0;

    for (int f=0; f<MAX_FEATURES; f++)
    {
        for (int i=0; i<MAX_MOVIES; i++)    m_aMovieFeatures[f][i] = (float)INIT;
        for (int i=0; i<MAX_CUSTOMERS; i++) m_aCustFeatures[f][i] = (float)INIT;
    }
}

//-------------------------------------------------------------------
// Calculations - This Paragraph contains all of the relevant code
//-------------------------------------------------------------------

//
// CalcMetrics
// - Loop through the history and pre-calculate metrics used in the training
// - Also re-number the customer id's to fit in a fixed array
//
void Engine::CalcMetrics()
{
    int i, cid;
    IdItr itr;

    wprintf(L"\nCalculating intermediate metrics\n");

    // Process each row in the training set
    for (i=0; i<m_nRatingCount; i++)
    {
        Data* rating = m_aRatings + i;

        // Increment movie stats
        m_aMovies[rating->MovieId].RatingCount++;
        m_aMovies[rating->MovieId].RatingSum += rating->Rating;

        // Add customers (using a map to re-number id's to array indexes)
        itr = m_mCustIds.find(rating->CustId);
        if (itr == m_mCustIds.end())
        {
            cid = 1 + (int)m_mCustIds.size();

            // Reserve new id and add lookup
            m_mCustIds[rating->CustId] = cid;

            // Store off old sparse id for later
            m_aCustomers[cid].CustomerId = rating->CustId;

            // Init vars to zero
            m_aCustomers[cid].RatingCount = 0;
            m_aCustomers[cid].RatingSum = 0;
        }
        else
        {
            cid = itr->second;
        }

        // Swap sparse id for compact one
        rating->CustId = cid;

        m_aCustomers[cid].RatingCount++;
        m_aCustomers[cid].RatingSum += rating->Rating;
    }

    // Do a follow-up loop to calc movie averages
    for (i=0; i<MAX_MOVIES; i++)
    {
        Movie* movie = m_aMovies+i;
        movie->RatingAvg = movie->RatingSum / (1.0 * movie->RatingCount);
        movie->PseudoAvg = (3.23 * 25 + movie->RatingSum) / (25.0 + movie->RatingCount);
    }
}

//
// CalcFeatures
// - Iteratively train each feature on the entire data set
// - Once sufficient progress has been made, move on
//
void Engine::CalcFeatures()
{
    int f, e, i, custId, cnt = 0;
    Data* rating;
    double err, p, sq, rmse_last, rmse = 2.0;
    short movieId;
    float cf, mf;

    for (f=0; f<MAX_FEATURES; f++)
    {
        wprintf(L"\n--- Calculating feature: %d ---\n", f);

        // Keep looping until you have passed a minimum number
        // of epochs or have stopped making significant progress
        for (e=0; (e < MIN_EPOCHS) || (rmse <= rmse_last - MIN_IMPROVEMENT); e++)
        {
            cnt++;
            sq = 0;
            rmse_last = rmse;

            for (i=0; i<m_nRatingCount; i++)
            {
                rating = m_aRatings + i;
                movieId = rating->MovieId;
                custId = rating->CustId;

                // Predict rating and calc error
                p = PredictRating(movieId, custId, f, rating->Cache, true);
                err = (1.0 * rating->Rating - p);
                sq += err*err;

                // Cache off old feature values
                cf = m_aCustFeatures[f][custId];
                mf = m_aMovieFeatures[f][movieId];

                // Cross-train the features
                m_aCustFeatures[f][custId] += (float)(LRATE * (err * mf - K * cf));
                m_aMovieFeatures[f][movieId] += (float)(LRATE * (err * cf - K * mf));
            }

            rmse = sqrt(sq/m_nRatingCount);
            wprintf(L" <set x='%d' y='%f' />\n",cnt,rmse);
        }

        // Cache off old predictions
        for (i=0; i<m_nRatingCount; i++)
        {
            rating = m_aRatings + i;
            rating->Cache = (float)PredictRating(rating->MovieId, rating->CustId, f, rating->Cache, false);
        }
    }
}

//
// PredictRating
// - During training there is no need to loop through all of the features
// - Use a cache for the leading features and do a quick calculation for the trailing
// - The trailing can be optionally removed when calculating a new cache value
//
double Engine::PredictRating(short movieId, int custId, int feature, float cache, bool bTrailing)
{
    // Get cached value for old features or default to an average
    double sum = (cache > 0) ? cache : 1; //m_aMovies[movieId].PseudoAvg;

    // Add contribution of current feature
    sum += m_aMovieFeatures[feature][movieId] * m_aCustFeatures[feature][custId];
    if (sum > 5) sum = 5;
    if (sum < 1) sum = 1;

    // Add up trailing defaults values
    if (bTrailing)
    {
        sum += (MAX_FEATURES-feature-1) * (INIT * INIT);
        if (sum > 5) sum = 5;
        if (sum < 1) sum = 1;
    }

    return sum;
}

//
// PredictRating
// - This version is used for calculating the final results
// - It loops through the entire list of finished features
//
double Engine::PredictRating(short movieId, int custId)
{
    double sum = 1; //m_aMovies[movieId].PseudoAvg;

    for (int f=0; f<MAX_FEATURES; f++)
    {
        sum += m_aMovieFeatures[f][movieId] * m_aCustFeatures[f][custId];
        if (sum > 5) sum = 5;
        if (sum < 1) sum = 1;
    }

    return sum;
}

//-------------------------------------------------------------------
// Data Loading / Saving
//-------------------------------------------------------------------

//
// LoadHistory
// - Loop through all of the files in the training directory
//
void Engine::LoadHistory()
{
    WIN32_FIND_DATA FindFileData;
    HANDLE hFind;
    bool bContinue = true;
    int count = 0; // TEST

    // Loop through all of the files in the training directory
    hFind = FindFirstFile(TRAINING_PATH, &FindFileData);
    if (hFind == INVALID_HANDLE_VALUE) return;

    while (bContinue)
    {
        this->ProcessFile(FindFileData.cFileName);
        bContinue = (FindNextFile(hFind, &FindFileData) != 0);

        //if (++count > 999) break; // TEST: Uncomment to only test with the first X movies
    }

    FindClose(hFind);
}

//
// ProcessFile
// - Load a history file in the format:
//
//   <MovieId>:
//   <CustomerId>,<Rating>
//   <CustomerId>,<Rating>
//   ...
//
void Engine::ProcessFile(wchar_t* pwzFile)
{
    FILE *stream;
    wchar_t pwzBuffer[1000];
    wsprintf(pwzBuffer,TRAINING_FILE,pwzFile);
    int custId, movieId, rating, pos = 0;

    wprintf(L"Processing file: %s\n", pwzBuffer);

    if (_wfopen_s(&stream, pwzBuffer, L"r") != 0) return;

    // First line is the movie id
    fgetws(pwzBuffer, 1000, stream);
    ParseInt(pwzBuffer, (int)wcslen(pwzBuffer), pos, movieId);
    m_aMovies[movieId].RatingCount = 0;
    m_aMovies[movieId].RatingSum = 0;

    // Get all remaining rows
    fgetws(pwzBuffer, 1000, stream);
    while ( !feof( stream ) )
    {
        pos = 0;
        ParseInt(pwzBuffer, (int)wcslen(pwzBuffer), pos, custId);
        ParseInt(pwzBuffer, (int)wcslen(pwzBuffer), pos, rating);

        m_aRatings[m_nRatingCount].MovieId = (short)movieId;
        m_aRatings[m_nRatingCount].CustId = custId;
        m_aRatings[m_nRatingCount].Rating = (BYTE)rating;
        m_aRatings[m_nRatingCount].Cache = 0;
        m_nRatingCount++;

        fgetws(pwzBuffer, 1000, stream);
    }

    // Cleanup
    fclose( stream );
}

//
// ProcessTest
// - Load a sample set in the following format
//
//   <Movie1Id>:
//   <CustomerId>
//   <CustomerId>
//   ...
//   <Movie2Id>:
//   <CustomerId>
//
// - And write results:
//
//   <Movie1Id>:
//   <Rating>
//   <Rating>
//   ...
//
void Engine::ProcessTest(wchar_t* pwzFile)
{
    FILE *streamIn, *streamOut;
    wchar_t pwzBuffer[1000];
    int custId, movieId, pos = 0;
    double rating;
    bool bMovieRow;

    wsprintf(pwzBuffer, TEST_PATH, pwzFile);
    wprintf(L"\n\nProcessing test: %s\n", pwzBuffer);

    if (_wfopen_s(&streamIn, pwzBuffer, L"r") != 0) return;
    if (_wfopen_s(&streamOut, PREDICTION_FILE, L"w") != 0) return;

    fgetws(pwzBuffer, 1000, streamIn);
    while ( !feof( streamIn ) )
    {
        bMovieRow = false;
        for (int i=0; i<(int)wcslen(pwzBuffer); i++)
        {
            bMovieRow |= (pwzBuffer[i] == 58); // A colon marks a movie id row
        }

        pos = 0;
        if (bMovieRow)
        {
            ParseInt(pwzBuffer, (int)wcslen(pwzBuffer), pos, movieId);

            // Write same row to results
            fputws(pwzBuffer,streamOut);
        }
        else
        {
            ParseInt(pwzBuffer, (int)wcslen(pwzBuffer), pos, custId);
            custId = m_mCustIds[custId];
            rating = PredictRating(movieId, custId);

            // Write predicted value
            swprintf(pwzBuffer,1000,L"%5.3f\n",rating);
            fputws(pwzBuffer,streamOut);
        }

        //wprintf(L"Got Line: %d %d %d \n", movieId, custId, rating);
        fgetws(pwzBuffer, 1000, streamIn);
    }

    // Cleanup
    fclose( streamIn );
    fclose( streamOut );
}

//-------------------------------------------------------------------
// Helper Functions
//-------------------------------------------------------------------

bool Engine::ReadNumber(wchar_t* pwzBufferIn, int nLength, int &nPosition, wchar_t* pwzBufferOut)
{
    int count = 0;
    int start = nPosition;
    wchar_t wc = 0;

    // Find start of number
    while (start < nLength)
    {
        wc = pwzBufferIn[start];
        if ((wc >= 48 && wc <= 57) || (wc == 45)) break;
        start++;
    }

    // Copy each character into the output buffer
    nPosition = start;
    while (nPosition < nLength && ((wc >= 48 && wc <= 57) || wc == 69 || wc == 101 || wc == 45 || wc == 46))
    {
        pwzBufferOut[count++] = wc;
        wc = pwzBufferIn[++nPosition];
    }

    // Null terminate and return
    pwzBufferOut[count] = 0;
    return (count > 0);
}

bool Engine::ParseFloat(wchar_t* pwzBuffer, int nLength, int &nPosition, float& fValue)
{
    wchar_t pwzNumber[20];
    bool bResult = ReadNumber(pwzBuffer, nLength, nPosition, pwzNumber);
    fValue = (bResult) ? (float)_wtof(pwzNumber) : 0;
    return bResult;
}

bool Engine::ParseInt(wchar_t* pwzBuffer, int nLength, int &nPosition, int& nValue)
{
    wchar_t pwzNumber[20];
    bool bResult = ReadNumber(pwzBuffer, nLength, nPosition, pwzNumber);
    nValue = (bResult) ? _wtoi(pwzNumber) : 0;
    return bResult;
}
```

### Related Topics

- Collaborative Filtering
- Correlation
- Singular Value Decomposition (SVD)
- Principal Components Analysis (PCA)
- Latent Semantic Analysis (LSA)
- Pearson Correlation
- Dimensionality Reduction
- Slope One

### Related Books

- Numerical Recipes in C: The Art of Scientific Computing
- Statistics For People Who (Think They) Hate Statistics
- Principal Component Analysis
- Geometric Data Analysis: An Empirical Approach to Dimensionality Reduction and the Study of Patterns
- Data Mining with Computational Intelligence (Advanced Information and Knowledge Processing)
- Neural Networks for Pattern Recognition
- Pattern Classification