The challenge is finished.

Challenge Overview

Topcoder has a client that is developing the ability to provide television and online program recommendations to its viewing public. To do this, the client is building a recommendation engine that implements Collaborative Filtering on Apache Spark through a Python interface. The client hopes that the Topcoder community can help refine the recommendation engine and also develop visualizations that give insight into the viewing habits of its customers.

In this challenge, we’re going to lay the groundwork for future work by implementing a Collaborative Filtering application based on the Spark MLlib library:

http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html

This library uses the alternating least squares algorithm to learn latent factors. Fortunately, we don’t have to start from scratch when developing our solution. A movie recommendation system has been developed as a tutorial for MLlib collaborative filtering, and we can use it as a starting point for this system:

https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html
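
For orientation, the objective that alternating least squares minimizes is commonly written as below. The notation is illustrative and is not taken from the Spark documentation: r_{ui} is an observed rating (here, a view count) of program i by viewer u, x_u and y_i are the latent factor vectors, \Omega is the set of observed pairs, and \lambda is the regularization weight. MLlib's exact regularization scaling may differ.

```latex
\min_{X, Y} \sum_{(u, i) \in \Omega} \left( r_{ui} - x_u^{\top} y_i \right)^2
  + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)
```

The algorithm alternates between holding Y fixed and solving a least squares problem for each x_u, then holding X fixed and solving for each y_i.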

The goal of this challenge is to provide a program recommendation system using the movie-recommendation system as a starting point. Of course, you won’t be able to simply cut and paste code and get this to work: the datasets are similar but not the same. The original data set for the tutorial can be found here. The format of the movie-recommendation files can be found in <installation directory>/usb/data/movielens/medium.

The data for the new collaborative filtering system has been set up to make the transition as straightforward as possible.  Just as in the movie-recommendation tutorial, we’ve created three files for you:

programs.csv - lists the programs and the corresponding genres associated with the shows
viewers.csv - lists the anonymized viewer data along with their gender and birth dates
views.csv - lists the viewer ids along with the number of times each viewer has watched a show

Here is the data dictionary for the files:

Programs.csv

ProgramId: Integer -  PK for the Program
ProgramName: String - Name of the Program
Genre:  Name of the Genre - these have been obfuscated, but the original values were strings such as ‘Comedy’, ‘Drama’, etc.

Viewers.csv

ViewerId:  Integer - PK for a viewer
Gender:  Enum (‘m’, ‘f’, ‘u’) Male, Female, Unknown
DateOfBirth:  Date - Viewer Birth Date

Views.csv

ViewerId: Integer - FK to Viewer
Source: Enum(‘Source 1’, ‘Source 2’, ‘Source 3’, ‘Source 4’) - these values have been obfuscated but they represent the viewing mechanism/channel the viewers used to watch the show such as a web interface or through a mobile device.
MostRecentLocalDateKey:  Date of most recent view
MostRecentTimeOfVisit:  Timestamp of most recent view
ProgramId:  Integer - FK to Program
NumberOfViews: Number of times a viewer watched a particular program on a particular source in the year 2015.

You’ll need to adjust the data input processes to parse .csv files rather than working with the “::” delimiter that is used in the movie recommendation tutorial.
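
The parsing change might look like the sketch below. It assumes that views.csv has a header row whose column names match the data dictionary, which is not confirmed here; `parse_views` is a hypothetical helper name and the sample rows are made up.

```python
import csv
import io


def parse_views(lines):
    """Yield (viewer_id, program_id, number_of_views) tuples from CSV lines.

    Assumes a header row naming the columns as in the data dictionary.
    """
    reader = csv.DictReader(lines)
    for row in reader:
        yield (
            int(row["ViewerId"]),
            int(row["ProgramId"]),
            float(row["NumberOfViews"]),
        )


# Tiny in-memory sample standing in for views.csv (values are made up).
sample = io.StringIO(
    "ViewerId,Source,MostRecentLocalDateKey,MostRecentTimeOfVisit,ProgramId,NumberOfViews\n"
    "1,Source 1,2015-03-01,2015-03-01 20:15:00,10,3\n"
    "2,Source 2,2015-04-12,2015-04-12 08:00:00,10,1\n"
)
views = list(parse_views(sample))
```

Using the csv module rather than splitting each line on commas keeps quoted fields that contain commas (for example, program names) intact.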

Our client has more ambitious goals for this application than simply providing a set of program recommendations for an individual viewer.   Here are the requirements for the application:

Requirement #1:  As part of this challenge, you should write a function that displays a list of shows sorted in descending order by potential audience size. The list should display the show name, the genres, and the number of viewers who might have a preference for each show. Please also display the median viewer for each show in terms of gender and birth date. If this is computationally prohibitive, you can limit the list to the top X shows.
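
One possible shape for Requirement #1 is sketched below, under several assumptions that are not part of the requirement: a viewer counts toward a show's potential audience when the predicted score meets a chosen `threshold`, the "median" gender is taken to be the most common one (a median is not defined for a category), and the median birth date is the lower middle value. The function names and sample data are hypothetical, and joining in show names and genres from programs.csv is left out.

```python
from collections import Counter, defaultdict
from datetime import date
from statistics import median_low


def median_viewer(profiles):
    """Summarize (gender, birth_date) pairs as (modal gender, median birth date)."""
    genders = Counter(gender for gender, _ in profiles)
    # median_low returns an actual element, so it works on date objects.
    birth = median_low(birth_date for _, birth_date in profiles)
    return genders.most_common(1)[0][0], birth


def potential_audience(predictions, viewers, threshold):
    """Rank programs by how many viewers have a predicted score >= threshold."""
    audience = defaultdict(list)
    for viewer_id, program_id, score in predictions:
        if score >= threshold:
            audience[program_id].append(viewers[viewer_id])
    rows = [(pid, len(profs), median_viewer(profs)) for pid, profs in audience.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up viewers and model predictions: (viewer_id, program_id, score).
viewers = {
    1: ("f", date(1980, 5, 1)),
    2: ("m", date(1990, 1, 15)),
    3: ("f", date(1975, 7, 30)),
}
predictions = [(1, 10, 4.5), (2, 10, 3.9), (3, 10, 4.1), (1, 20, 4.2), (2, 20, 1.0)]
ranking = potential_audience(predictions, viewers, threshold=3.5)
```

Limiting the output to the top X shows is then a slice of the returned list, for example `ranking[:X]`.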

Requirement #2:  It would also be interesting to display the actual audience viewing statistics alongside the potential ones. Please also write a function that displays the list of shows based on the actual viewing statistics. The list should display the show name, the genres, and the number of viewers who have viewed each program. Please also display the median viewer for each show in terms of gender and birth date. If this is computationally prohibitive, you can limit the list to the top X shows.
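
A rough sketch of the aggregation behind Requirement #2 follows. Because views.csv has one row per viewer, source, and program, a viewer who used several sources must be counted once per program. The tuple layout, the tie-breaking by total views, and the name `actual_audience` are assumptions made for illustration.

```python
from collections import defaultdict


def actual_audience(views):
    """Rank programs by distinct viewers, with total views as a tie-breaker.

    `views` holds (viewer_id, source, program_id, number_of_views) tuples; a
    viewer who used several sources appears once per source.
    """
    viewers_by_program = defaultdict(set)
    total_views = defaultdict(int)
    for viewer_id, _source, program_id, count in views:
        viewers_by_program[program_id].add(viewer_id)
        total_views[program_id] += count
    rows = [
        (pid, len(viewer_ids), total_views[pid])
        for pid, viewer_ids in viewers_by_program.items()
    ]
    return sorted(rows, key=lambda row: (row[1], row[2]), reverse=True)


# Made-up rows standing in for views.csv.
sample_views = [
    (1, "Source 1", 10, 3),
    (1, "Source 2", 10, 2),
    (2, "Source 1", 10, 1),
    (3, "Source 3", 20, 7),
]
actual_ranking = actual_audience(sample_views)
```

Each row is (program id, distinct viewers, total views); here program 10 ranks first with two distinct viewers even though program 20 has more total views.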

Requirement #3:  As part of your submission, you should discuss what steps you’ve taken to minimize Root Mean Squared Error and how the submission compares to a naive baseline. For example, you could explain the data partitioning process you completed for your training, validation, and testing sets and how that process generated different error results.
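
The comparison described in Requirement #3 can be sketched as follows. The 60/20/20 random split and the choice of "always predict the training mean" as the naive baseline are assumptions for illustration, not values prescribed by the challenge; the function names are hypothetical.

```python
import math
import random


def rmse(pairs):
    """Root mean squared error over (predicted, actual) pairs."""
    pairs = list(pairs)
    return math.sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))


def split(ratings, seed=0, fractions=(0.6, 0.2)):
    """Shuffle ratings and split them into training, validation, and test lists."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_val = int(len(shuffled) * fractions[1])
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


def baseline_rmse(train, held_out):
    """RMSE of the naive baseline that always predicts the training mean."""
    mean = sum(rating for _, _, rating in train) / len(train)
    return rmse((mean, rating) for _, _, rating in held_out)


# Made-up (viewer_id, program_id, rating) triples.
train = [(1, 10, 2.0), (2, 10, 4.0)]
held_out = [(3, 10, 1.0), (4, 10, 5.0)]
baseline = baseline_rmse(train, held_out)
parts = split([(v, 10, 1.0) for v in range(10)])
```

A trained model is worth reporting only if its RMSE on the held-out set is lower than `baseline`; fixing the seed keeps the partition reproducible between runs.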

Requirement #4:  (Optional)  Create a new function which measures Mean Average Precision for the various data sets.  Here is another article which describes Mean Average Precision in more user-friendly terms.
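
A minimal sketch of a Mean Average Precision function is shown below. It normalizes each viewer's average precision by the number of relevant programs, which is one common convention (others divide by the smaller of the list length and the number of relevant items). What counts as "relevant" (for example, programs the viewer actually watched in the held-out set) is an assumption, as are the function names.

```python
def average_precision(recommended, relevant):
    """Average precision of one ranked list against a set of relevant items."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)


def mean_average_precision(recommendations, relevant_by_viewer):
    """Mean of average precision over all viewers that have relevant items."""
    scores = [
        average_precision(recommendations.get(viewer, []), relevant)
        for viewer, relevant in relevant_by_viewer.items()
        if relevant
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Made-up ranked recommendations and held-out relevant programs per viewer.
recommendations = {1: [10, 20, 30], 2: [20, 10]}
relevant_by_viewer = {1: {10, 30}, 2: {10}}
map_score = mean_average_precision(recommendations, relevant_by_viewer)
```

Viewer 1 scores (1/1 + 2/3) / 2 and viewer 2 scores 1/2, so the mean over both viewers is 2/3.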

Requirement #5:  (Optional)  The core requirement here is to implement a recommendation engine using Python.  You may use technologies other than the Python Spark MLlib library and sample code.  Of course, the recommendation engine must still process the data provided.



Final Submission Guidelines

  1. You should develop this application with the Python programming language.

  2. Please include all source code including data manipulation scripts with your submission zip file.

  3. If you’ve created test, training, and validation files, please provide a URL where those files can be downloaded.  You don’t need to provide those in the submission itself, but you should provide a submission.txt file in the root directory of your submission where links to your data files can be found.

  4. Please document your solution including analysis of the optimizations requested in Requirement #3 above.

  5. (Optional) If you decide to implement a MAP function and display those results, please include a description of how this was implemented in your documentation.

ELIGIBLE EVENTS:

2016 TopCoder(R) Open

REVIEW STYLE:

Final Review:

Community Review Board

Approval:

User Sign-Off

ID: 30052933