Challenge Overview

A Topcoder retail customer has both a network of brick-and-mortar stores and an online storefront. The customer wants to improve the quality of the search and listing results they provide to their online storefront visitors. Purchase data from the brick-and-mortar stores alone doesn’t help our customer understand which products are popular outside of their stores. Since they can sell a wider variety of products on their online storefront than in their stores, they seek a new model that can help them list the most popular and trending products first when people visit or search the storefront.

One important source of information that can help make sure highly popular results are listed first is the internet itself. Items that are popular on web stores, and that receive high positive sentiment on the web in general, are more likely to be relevant to buyers (for this contest you can assume the correlation is proven). Web data can be used to predict which products will be popular in the coming days, weeks, and months (though perhaps not longer than that).

The goal of this challenge is to use web data to predict the popularity and sentiment of certain products at the current time.

Task Detail

The objective of this challenge is to research and find dataset(s) and develop an algorithm that predicts the popularity of a selection of important retail products. The algorithm should be a functioning proof of concept (in R, Python, or Jupyter, your choice). For each item in the list, you are asked to predict a product popularity score and a product sentiment score at the current time.

Here is how we define popularity and sentiment in this challenge:

Popularity (Product Popularity Score): You are expected to develop a popularity score formula yourself. As a component of this score, you are required to include a Sentiment score. To illustrate what we THINK is needed: one implementation of a score could be an aggregate of metrics such as sentiment, the total number of sales for the product, the total number of reviews received for the product, perhaps the number of times the product is viewed, and others. You can assign weights to your selected metrics. A large component of the evaluation of your solution (see below) will be based on your choices for Popularity score.
Sentiment (Product Sentiment Score): The product sentiment score is a numeric value that predicts how positive or negative the sentiment toward a product is. It is usually the result of sentiment analysis of texts related to the product, such as product reviews.
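
To make the two definitions above concrete, here is a minimal sketch in Python of one possible scoring scheme. Everything in it is an assumption for illustration: the word lists, the weights, the scaling constants, and the function names sentiment_score and popularity_score are hypothetical, and a real submission would likely replace the toy word-count sentiment with a trained sentiment model.

```python
import math

# Hypothetical word lists for a toy lexicon-based sentiment scorer.
# These are illustrative assumptions, not part of the challenge data.
POSITIVE = {"great", "love", "good", "tasty", "best"}
NEGATIVE = {"bad", "awful", "stale", "worst", "hate"}


def sentiment_score(reviews):
    """Return a score in [-1, 1]: (positive - negative) / matched words."""
    pos = neg = 0
    for text in reviews:
        for word in text.lower().split():
            word = word.strip(".,!?")
            if word in POSITIVE:
                pos += 1
            elif word in NEGATIVE:
                neg += 1
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


def popularity_score(sentiment, sales, review_count, views, weights=None):
    """Weighted sum of metrics, each mapped to the range [0, 1] first."""
    # Assumed default weights; choosing them is part of the challenge.
    weights = weights or {
        "sentiment": 0.4, "sales": 0.3, "reviews": 0.2, "views": 0.1,
    }

    def squash(count, scale):
        # Log scaling dampens very large counts; the result is capped at 1.
        return min(1.0, math.log1p(count) / math.log1p(scale))

    features = {
        "sentiment": (sentiment + 1) / 2,        # map [-1, 1] to [0, 1]
        "sales": squash(sales, 10_000),          # assumed scale constants
        "reviews": squash(review_count, 1_000),
        "views": squash(views, 100_000),
    }
    return sum(weights[key] * features[key] for key in weights)
```

Normalizing every metric to [0, 1] before weighting keeps one large raw count (for example, page views) from dominating the score, and it keeps the required sentiment component visible in the final number.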

Input Data

Here are the definitions of the key terms of the input data:

Category: The “family” of the product type. This is the coarsest grouping, generally by store layout (“go to the beer section” or “go to the candy section”).
Tags: The “genus” of the product type. This groups products roughly by “kind-of-thing” (“what kind of beer would you like?” or “what kind of chocolate would you like?”).
Name: The name of the product.

The format of the input to this challenge is:

Category | Tags | Name
Beer | Singles | Steel Reserve 211 High Gravity 24oz Can
Beer | Singles | Mike's Harder Lemonade 16oz Can
Beer | Malt | Mike's Harder Strawberry Lemonade 16oz Can
Candy | Chocolate | Pokemon Candy Bar
Candy | Chocolate | Reese's PB King Size 2.8oz
Candy | Chocolate | Kit Kat King Size 3oz
... | ... | ...

The sample data can be downloaded here. Please see the checkpoint requirements below for dataset identification. Important note: while we are providing this sample data to make the problem more manageable, we also want this algorithm to find and rank items that are popular on the web but missing from our list. We want to add to our inventory too.
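
As a rough illustration of reading rows in the input format shown above, the following Python sketch splits a row into its three fields. It assumes that Category and Tags are each a single token, so that everything after the second field belongs to Name; the actual delimiter in the downloadable sample file may differ. The names Product and parse_row are hypothetical.

```python
from typing import NamedTuple


class Product(NamedTuple):
    category: str
    tags: str
    name: str


def parse_row(line):
    """Split one input row into (Category, Tags, Name).

    Assumption: Category and Tags contain no spaces, so the remainder
    of the row is the product name. Rows that use "|" as a separator
    (as in the expected output layout) are also accepted.
    """
    line = line.strip()
    if "|" in line:
        parts = [part.strip() for part in line.split("|", 2)]
    else:
        # Split on whitespace at most twice; the rest stays intact as Name.
        parts = line.split(None, 2)
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got: {line!r}")
    return Product(*parts)
```

For example, parse_row("Beer Singles Steel Reserve 211 High Gravity 24oz Can") yields the category "Beer", the tag "Singles", and the full product name as the third field.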

Best Dataset(s) Award and First Week Check Point 1:

Part of the problem statement is to identify the best dataset available, either as open source or as a commercial product with a free trial.

We will award prizes ($250+ each) for the best datasets, and when we do:
a) the datasets will be included for ALL competitors: we’ll post them in the forum
b) there’s no requirement to submit models to win the best-data award

Expected Output

The format of the expected output from this challenge is:

Category | Tags | Name | Popularity Score | Sentiment Score

You must generate an output file containing the popularity and sentiment scores calculated at the current time.
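
A minimal sketch of producing output in this layout as CSV, using Python's standard csv module, might look like the following. The function name write_predictions is hypothetical, and the scores in the example row are placeholders for illustration only, not real predictions.

```python
import csv
import io

FIELDS = ["Category", "Tags", "Name", "Popularity Score", "Sentiment Score"]


def write_predictions(rows, out):
    """Write prediction rows (dicts keyed by FIELDS) to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Placeholder scores for illustration only; they are not real predictions.
example_rows = [
    {
        "Category": "Candy",
        "Tags": "Chocolate",
        "Name": "Kit Kat King Size 3oz",
        "Popularity Score": 0.73,
        "Sentiment Score": 0.41,
    },
]

# Write to an in-memory buffer here so the example has no side effects.
buffer = io.StringIO()
write_predictions(example_rows, buffer)
csv_text = buffer.getvalue()
```

To write to disk instead, pass a file opened with newline="" (as the csv module documentation recommends), for example open("predictions.csv", "w", newline=""); the file name is an assumption.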

Results of our Previous Ideation Challenge

In a previous challenge, we asked contestants for ideas on how we might go about solving this problem. We learned that while existing libraries and best practices may provide building blocks to a solution of this problem, they aren’t rigorous enough to solve it alone. You will need to do more than use a tool to screen-scrape Amazon or other web results.

Final Submission Guidelines

Your submission must include the following items:

Working Prototype: Develop a prototype that predicts the popularity score and sentiment trends score of products on the web. Your solution must be easy to use and run. Reviewers will spend no more than 10 minutes attempting to set up your submission. Docker containers are highly encouraged, and solutions submitted in Docker will receive a $100 bonus on top of the list prize if you rank in the top 4.
Your Predictions: Output should be a .csv file in the format described above.
Detailed Research Whitepaper: We provide a template for your final report. You must use this template for your solution to be evaluated. In the Discussion section of your write-up, you must specifically address the items listed under Evaluation Criteria.

Research Whitepaper Template

Your white paper must be a text, .doc, PPT or PDF document that includes the following sections and descriptions:

Overview: describe your approach in “layman's terms”
Methods: describe what you did to come up with this approach, e.g., literature search, experimental testing, etc. In addition, you must describe your approach to:
- Feature Engineering - what did you use to calculate popularity?
- Machine Learning - Explain the techniques you used and modifications you made to them
Data: How did you collect data and which resources did you collect it from? Are you using derived data? Did you enrich it in some fashion? What other data should we consider? If you used trial/for-purchase data, list the trial terms and purchase info here (if possible). Also, what about the data described/provided - is it enough?
Materials: did your approach use a specific technology? Any libraries? List all tools and libraries you used.
Discussion: Explain what you attempted, considered, or reviewed that worked, and especially what didn’t work or what you rejected. For anything that didn’t work or that you rejected, briefly explain why (e.g., such-and-such needs more data than we have). If you are pointing to somebody else’s work (e.g., you’re citing a well-known implementation or literature), describe in detail how that work relates to this work and what would have to be modified.
Assumptions and Risks: what are the main risks of this approach, and what assumptions are you or the model making? What are the pitfalls of the dataset and approach?
Results: Defend your results to an expert audience. How well did your approach perform? Did it behave as expected?
Other: Discuss any other issues or attributes that don’t fit neatly above that you’d also like to include

Judging Criteria

You will be judged on the quality of your ideas, the quality of your description of the ideas, and how much benefit they can provide to the client. The winner will be chosen based on the most logical and convincing reasoning as to how and why the idea presented will meet the objective. Note that this contest will be judged subjectively by the client and Topcoder. However, the judging criteria below will largely be the basis for the judgement.

Feature Engineering (40%)
1. It is important for stakeholders to understand which features in the data to use to maximize the objective of a popularity prediction model.
2. Please mention any and all feature engineering or information extraction around the scope of the product catalog document that could be beneficial for the prototype.
Data Collection (25%)
1. It’s essential for us to understand how the framework collects the data and from what sources.
2. As noted above, we learned in an initial proof of concept that there is no easy-to-use platform for collecting data from competitive e-commerce websites such as Amazon. Hence, we are looking for detailed explanations and algorithms that address this concern.
3. It is permissible to use open source data.
4. It is permissible to use trial or proprietary data provided it is available for purchase.
Machine Learning (25%)
1. In addition to selecting the tools and libraries to be used in the project, you must specify the type of machine learning algorithm (e.g., classification, clustering, deep neural network).
2. More importantly, there should be an explanation of why and how the popularity model should be mapped to a particular ML model. Just mentioning tools such as IBM Watson or IBM NLU, or libraries such as Python’s spaCy, is NOT ENOUGH; we are looking for DETAILS of the ML algorithms and other specifics used from these applications.
Clarity (10%)
1. Please make sure your report is easy to read.
2. Figures, charts, and tables are welcome.

Submission Guideline

You can submit at most TWO solutions.

Retail Popularity & Sentiment Prediction POC Ideation Challenge

ID: 30095544