Challenge Overview
Welcome to the Infant Nutrition Product Reviews Analysis challenge. In this challenge, we aim to create a CLI tool that analyzes product review data to extract the topics commonly discussed in user reviews.
Project Overview
In this project we will be:
- Scraping retail sites for product info, ratings, reviews, nutrients and ingredients data
- Identifying competing products across brands based on ingredients and nutrients data
- Analyzing user reviews to identify topics, positives, and negatives for each product group and brand
- Looking for identified items in social media posts to estimate how popular/important each one of them is
- Providing reports that allow for drill-down per topic, brand, product group or individual product level
Technology Stack
- NodeJS
- Mongo
- Amazon, Walmart, FirstCry
Assets
We’re starting a new codebase. It’s up to you to create the base code for the tool.
A data extraction tool is available in the project repository (for reference only). See the forums for access to GitLab. Read-only access to the products database is also provided in the forums. Copy the data to your local database for development.
Individual requirements
Create a CLI tool that does three things:
- Extracts sentiment (positive/neutral/negative) for each of the product reviews (easy) and computes the average product sentiment (an aggregate of the sentiments of all reviews for the product)
- Extracts topics discussed in the product reviews (medium-hard)
- Extracts positives and negatives for each product (topics with positive or negative sentiments that are discussed in the product reviews) (medium-easy)
The first task should be fairly straightforward: you can use existing sentiment analysis libraries/frameworks and pretrained language models. All reviews are in English and there are quite a few products with 100+ reviews. For each review, update the database record to add sentiment details (3 percentage values for the confidence of positive/neutral/negative sentiment). Once that is done, compute the average product sentiment values and save them to the product record in the database (add sentiment:{positive:x, neutral:y, negative:z} attribute).
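As a rough illustration of the aggregation step, the sketch below averages per-review confidence values into a product-level sentiment object. The function name averageSentiment, the 0-100 percentage scale, and returning null for a product without reviews are assumptions made for this example; only the positive/neutral/negative attribute shape comes from the requirements above.

```javascript
// Sketch only: average per-review sentiment confidences for one product.
// Assumes each review carries { positive, neutral, negative } percentages.
function averageSentiment(reviewSentiments) {
  if (reviewSentiments.length === 0) {
    return null; // assumption: no reviews means no product sentiment
  }
  const totals = { positive: 0, neutral: 0, negative: 0 };
  for (const s of reviewSentiments) {
    totals.positive += s.positive;
    totals.neutral += s.neutral;
    totals.negative += s.negative;
  }
  const n = reviewSentiments.length;
  return {
    positive: totals.positive / n,
    neutral: totals.neutral / n,
    negative: totals.negative / n,
  };
}

// Hypothetical sample data, not taken from the products database.
const sampleReviews = [
  { positive: 80, neutral: 15, negative: 5 },
  { positive: 20, neutral: 30, negative: 50 },
];
console.log(averageSentiment(sampleReviews));
// prints { positive: 50, neutral: 22.5, negative: 27.5 }
```

The returned object could then be written to the product record as its sentiment attribute.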
The second task is more challenging: the goal is to identify the topics users discuss in the product reviews, such as product packaging, health effects, quality, etc. You are free to use existing language models or train new ones, and it’s up to you to choose appropriate algorithms (e.g. LDA, NMF), frameworks (BERT text summarization, UniLM models), or something entirely different (these are just suggestions). If you feel the number of reviews is low, you can combine the product details by product.upc (same products scraped from multiple retail sites), but this is not required for this challenge.
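If you do decide to combine records by product.upc, one possible approach is to pool the review texts of all records that share a UPC before running topic extraction. The sketch below is only an illustration: the reviews field (assumed to be an array of strings) and the function name groupReviewsByUpc are hypothetical and may not match the actual database schema.

```javascript
// Sketch only: pool reviews of the same product scraped from several sites.
// Assumes each product record looks like { upc, reviews: [text, ...] }.
function groupReviewsByUpc(products) {
  const byUpc = new Map();
  for (const product of products) {
    if (!product.upc) continue; // records without a UPC cannot be merged
    const pooled = byUpc.get(product.upc) || [];
    for (const review of product.reviews || []) {
      pooled.push(review);
    }
    byUpc.set(product.upc, pooled);
  }
  return byUpc;
}

// Hypothetical sample records.
const sampleProducts = [
  { upc: "000111", reviews: ["Great packaging"] },
  { upc: "000111", reviews: ["Baby loved it", "Too expensive"] },
  { upc: "000222", reviews: ["Arrived damaged"] },
];
console.log(groupReviewsByUpc(sampleProducts).get("000111").length); // prints 3
```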
To provide some more insight into how we will use these topics: we plan to use them to get more product review data from social media by looking for posts that mention some combination of product brand, product category, custom keywords and (the important part) the topics you identify in this challenge. A search for “Enfamil baby formula vitamin D benefits” would probably yield better results than searching for just the raw product name, like “Happy Baby Organic Infant Formula, 1-12 Months, 21 Ounce, 4 Count...”
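To make the intended use concrete, here is a minimal sketch of how such a search phrase might be assembled from the parts listed above. The function name buildSearchQuery and its parameter names are hypothetical; the social media search itself is outside the scope of this challenge.

```javascript
// Sketch only: join brand, category, custom keywords and a topic into one
// search phrase, skipping parts that are missing or blank.
function buildSearchQuery({ brand, category, keywords = [], topic }) {
  return [brand, category, ...keywords, topic]
    .filter((part) => typeof part === "string" && part.trim() !== "")
    .map((part) => part.trim())
    .join(" ");
}

console.log(
  buildSearchQuery({
    brand: "Enfamil",
    category: "baby formula",
    topic: "vitamin D benefits",
  })
);
// prints: Enfamil baby formula vitamin D benefits
```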
The final task in this challenge is to try to identify positives and negatives discussed in reviews for each product, based on the identified topics. For example, if for some product the topic “vitamin D benefits” is discussed with a negative sentiment in multiple reviews, “vitamin D benefits” would be a “negative” for that product. Again, it is up to you to choose the right algorithms/libraries/models/frameworks. Save the identified topics to the product record in the database (add topics:{positives:[p1,p2], negatives:[n1,n2]} attribute).
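One simple way to turn per-review topic mentions into positives and negatives is a counting rule, sketched below. The rule itself is an assumption for illustration (a topic counts as a positive or negative when at least minMentions reviews express that sentiment and it outnumbers the opposite one); the input shape and the name classifyTopics are likewise hypothetical, and you are free to use a different method.

```javascript
// Sketch only: derive topics:{positives, negatives} from topic mentions.
// Assumes one { topic, sentiment } entry per review that discusses the topic,
// with sentiment being "positive", "neutral" or "negative".
function classifyTopics(mentions, minMentions = 2) {
  const counts = new Map();
  for (const { topic, sentiment } of mentions) {
    const c = counts.get(topic) || { positive: 0, negative: 0 };
    if (sentiment === "positive") c.positive += 1;
    if (sentiment === "negative") c.negative += 1;
    counts.set(topic, c);
  }
  const topics = { positives: [], negatives: [] };
  for (const [topic, c] of counts) {
    if (c.positive >= minMentions && c.positive > c.negative) {
      topics.positives.push(topic);
    } else if (c.negative >= minMentions && c.negative > c.positive) {
      topics.negatives.push(topic);
    }
  }
  return topics;
}

// Hypothetical mentions for one product.
const sampleMentions = [
  { topic: "vitamin D benefits", sentiment: "negative" },
  { topic: "vitamin D benefits", sentiment: "negative" },
  { topic: "packaging", sentiment: "positive" },
  { topic: "packaging", sentiment: "positive" },
  { topic: "packaging", sentiment: "negative" },
  { topic: "price", sentiment: "positive" },
];
console.log(classifyTopics(sampleMentions));
// prints { positives: [ 'packaging' ], negatives: [ 'vitamin D benefits' ] }
```

With the default threshold of 2, “price” is left out because only one review mentions it.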
Log any errors to standard output.
Review notes
Reviewers will need to spend a bit more time reviewing this challenge. Besides reviewing the code changes to verify correctness, pick 10 random products from the database, compare the tool outputs with the actual data in the database, and note the results in the review scorecard. Make sure to use the same set of products for all submissions to keep the review consistent.
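One way a reviewer might keep the product set identical across submissions is to draw the sample with a fixed seed. The sketch below is only a suggestion: it uses a small seeded generator in the style of mulberry32 plus a Fisher-Yates shuffle, and the names seededRandom and pickReviewSample, the default seed, and the use of string product ids are all assumptions.

```javascript
// Sketch only: reproducible "random" sample of product ids for review.
function seededRandom(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // value in [0, 1)
  };
}

function pickReviewSample(productIds, count = 10, seed = 42) {
  // Sort a copy first so the result does not depend on query order.
  const ids = [...productIds].sort();
  const rand = seededRandom(seed);
  // Fisher-Yates shuffle driven by the seeded generator.
  for (let i = ids.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [ids[i], ids[j]] = [ids[j], ids[i]];
  }
  return ids.slice(0, count);
}

// Hypothetical ids; in practice these would come from the products database.
const allIds = Array.from({ length: 50 }, (_, i) => "product-" + i);
console.log(pickReviewSample(allIds));
```

Calling pickReviewSample with the same ids and seed returns the same 10 products every time, so the selection can be repeated for each submission.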
What to submit
- Submit the full source code for the tool and a README with configuration, deployment and verification steps