Challenge Overview

Welcome to the Infant Nutrition product reviews analysis updates challenge. We aim to update an existing CLI tool that extracts commonly discussed topics from product user reviews.

Technology Stack

NodeJS
Mongo
Amazon, Walmart, FirstCry

Assets

The existing tool is provided in the forums; see the forums for access to GitLab. Read-only access to the products database is also provided in the forums. Copy the data to your local database for development.

Individual requirements

We have a database of product info scraped from retail sites. Among that data are product reviews. Our tool does three things:

Extracts sentiment (positive/neutral/negative) for each of the product reviews
Extracts topics discussed in the product reviews
Extracts positives and negatives for each product (topics with positive or negative sentiments that are discussed in the product reviews)

The first task is straightforward and already works well: the tool uses existing sentiment analysis libraries/frameworks and pretrained language models. All reviews are in English, and quite a few products have 100+ reviews.

The second and third tasks are more challenging, and improving on the current results is the main focus of this challenge. The goal is to identify the topics users discuss in the product reviews (product packaging, health effects, quality, etc.) and group them into “positive” and “negative” topics, i.e. the ones that contribute to the positive or negative sentiment about the product.
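As a rough illustration of the grouping step, the sketch below tallies, for a single product, how often each topic appears in positive versus negative reviews. The input shape (`sentiment` and `topics` fields) and the name `splitTopicsBySentiment` are assumptions made for this example, not the existing tool's schema.

```javascript
// Hypothetical input: one entry per review, already labeled by the sentiment
// step and annotated with the topics found in that review.
function splitTopicsBySentiment(reviews) {
  const tally = new Map(); // topic -> { positive, negative }
  for (const { sentiment, topics } of reviews) {
    // Neutral reviews do not contribute to either list.
    if (sentiment !== 'positive' && sentiment !== 'negative') continue;
    // Count each topic at most once per review.
    for (const topic of new Set(topics)) {
      const counts = tally.get(topic) || { positive: 0, negative: 0 };
      counts[sentiment] += 1;
      tally.set(topic, counts);
    }
  }
  const positives = [];
  const negatives = [];
  for (const [topic, counts] of tally) {
    // A topic can land in both lists if reviewers disagree about it.
    if (counts.positive > 0) positives.push({ topic, mentions: counts.positive });
    if (counts.negative > 0) negatives.push({ topic, mentions: counts.negative });
  }
  const byMentions = (a, b) => b.mentions - a.mentions;
  return {
    positives: positives.sort(byMentions),
    negatives: negatives.sort(byMentions),
  };
}

const sampleReviews = [
  { sentiment: 'positive', topics: ['packaging', 'quality'] },
  { sentiment: 'negative', topics: ['packaging'] },
  { sentiment: 'positive', topics: ['quality'] },
  { sentiment: 'neutral', topics: ['price'] },
];
console.log(splitTopicsBySentiment(sampleReviews));
```

In this sample, “quality” is reported only as a positive, while “packaging” shows up on both sides with one mention each.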

The current approach uses a combination of LDA, NMF and n-gram collocation and produces plenty of output, but it does a poor job of pre-processing the raw text and of integrating the results from the individual algorithms. Specific issues include:

Similar word formations, e.g. the identified topics “package” and “packaging”. The values should be normalized and duplicate topics merged.
Identified topics that don’t make sense as output. These are mainly single-word topics like “open”, “drink”, “half”, “stuff”, or multi-word topics like “only issue”, “pre mix”, “other issue”, “many different”.
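One way the normalization and merging could look is sketched below. The suffix stripping in `naiveStem` is a deliberately crude stand-in for a real stemmer or lemmatizer, and the `{ topic, count }` input shape and function names are assumptions made for this example.

```javascript
// Naive suffix stripping, standing in for a real stemmer or lemmatizer.
function naiveStem(word) {
  let w = word.toLowerCase();
  for (const suffix of ['ing', 'ed', 'es', 's']) {
    if (w.length > suffix.length + 2 && w.endsWith(suffix)) {
      w = w.slice(0, -suffix.length);
      break;
    }
  }
  // Drop a trailing "e" so that "package" and "packaging" share a key.
  return w.endsWith('e') && w.length > 3 ? w.slice(0, -1) : w;
}

// Merge topics whose words reduce to the same stems. Counts are summed and
// the most frequent surface form is kept as the label.
function mergeTopics(topics) {
  const merged = new Map(); // normalized key -> { label, count, labelCount }
  for (const { topic, count } of topics) {
    const key = topic.trim().split(/\s+/).map(naiveStem).join(' ');
    const entry = merged.get(key);
    if (!entry) {
      merged.set(key, { label: topic, count, labelCount: count });
    } else {
      entry.count += count;
      if (count > entry.labelCount) {
        entry.label = topic;
        entry.labelCount = count;
      }
    }
  }
  return [...merged.values()].map(({ label, count }) => ({ topic: label, count }));
}

console.log(mergeTopics([
  { topic: 'package', count: 4 },
  { topic: 'packaging', count: 9 },
  { topic: 'quality', count: 5 },
]));
```

Here “package” and “packaging” collapse into a single “packaging” entry with a combined count of 13, while “quality” is left untouched.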

For the topics to “make sense” to the end users, each topic needs at least one noun and, optionally, one or more adjectives, adverbs or verbs. That means the final output of all individual algorithms should be filtered to make sure each topic is actually valid. You can use part-of-speech tagging for this purpose. The current tool uses POS tagging in a rudimentary form to filter word tags, but does not check whether the generated topics are valid.
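The validity rule above could be checked roughly as follows. This is a minimal sketch that assumes the topic has already been tagged with Penn Treebank style tags by whatever POS tagger the tool uses; the `[word, tag]` input shape and the name `isValidTopic` are hypothetical, and rejecting every other tag (determiners, prepositions, and so on) is one interpretation of the rule.

```javascript
// Penn Treebank tag prefixes: NN* = noun, JJ* = adjective, RB* = adverb, VB* = verb.
const NOUN = /^NN/;
const MODIFIER = /^(JJ|RB|VB)/;

// `tagged` is a list of [word, tag] pairs for a single candidate topic.
function isValidTopic(tagged) {
  if (tagged.length === 0) return false;
  const hasNoun = tagged.some(([, tag]) => NOUN.test(tag));
  const onlyAllowed = tagged.every(
    ([, tag]) => NOUN.test(tag) || MODIFIER.test(tag)
  );
  return hasNoun && onlyAllowed;
}

// Tags below are assigned by hand for illustration.
const candidates = [
  [['open', 'VB']],
  [['many', 'JJ'], ['different', 'JJ']],
  [['product', 'NN'], ['packaging', 'NN']],
  [['easy', 'JJ'], ['digestion', 'NN']],
];
for (const topic of candidates) {
  const text = topic.map(([word]) => word).join(' ');
  console.log(text, isValidTopic(topic));
}
```

With these hand-assigned tags, “open” and “many different” are rejected and the two noun-bearing topics are accepted. A phrase such as “only issue” (adjective plus noun) would still pass a purely tag-based check, so an additional stop-phrase list may be worth considering.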

You are free to combine additional topic extraction algorithms/models/frameworks (e.g. BERT text summarization, UniLM models) to get better results, but do consider the existing LDA, IDF, bigram and trigram models when aggregating the topics, and make sure to filter the list of topics to match the above criteria.
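One simple way to aggregate candidates from several algorithms is to count how many of them proposed each topic and keep the topics that reach a threshold. The sketch below assumes each algorithm's output has already been reduced to a plain list of topic strings; the source names, the `minVotes` parameter and the function name are illustrative choices, not part of the existing tool.

```javascript
// candidatesBySource: { sourceName: [topic, ...] }, e.g. keys 'lda', 'nmf', 'bigram'.
function aggregateTopics(candidatesBySource, minVotes = 2) {
  const votes = new Map(); // topic -> Set of sources that proposed it
  for (const [source, topics] of Object.entries(candidatesBySource)) {
    for (const topic of topics) {
      const key = topic.trim().toLowerCase();
      if (!votes.has(key)) votes.set(key, new Set());
      votes.get(key).add(source);
    }
  }
  return [...votes.entries()]
    .map(([topic, sources]) => ({ topic, votes: sources.size, sources: [...sources] }))
    .filter((t) => t.votes >= minVotes)
    // Most agreed-upon topics first; ties broken alphabetically.
    .sort((a, b) => b.votes - a.votes || a.topic.localeCompare(b.topic));
}

console.log(aggregateTopics({
  lda: ['packaging', 'taste', 'stuff'],
  nmf: ['taste', 'packaging'],
  bigram: ['taste', 'milk powder'],
}));
```

In this example “taste” (three sources) and “packaging” (two sources) survive, while “stuff” and “milk powder” are dropped. Because multi-word topics tend to come only from the n-gram models, a lower `minVotes` or per-source weighting may suit them better.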

Review notes

Reviewers will need to spend a bit more time on this challenge. Besides reviewing the code changes to verify correctness, pick 10 random products from the database, compare the tool's outputs with the actual data in the database, and note the results in the review scorecard. Use the same set of products for all submissions to keep the review consistent.
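To keep the same 10 products across all submissions, one option is to draw them with a seeded random generator so the selection can be reproduced. The sketch below is only a suggestion: it uses the small mulberry32 generator, and the product IDs, the seed value and the name `pickProducts` are placeholders.

```javascript
// mulberry32: a small seeded PRNG; returns floats in [0, 1).
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Return `n` product IDs chosen reproducibly for a given seed.
function pickProducts(productIds, seed, n = 10) {
  // Sort first so the result depends only on the seed, not on query order.
  const ids = [...productIds].map(String).sort();
  const rand = mulberry32(seed);
  // Fisher-Yates shuffle driven by the seeded generator.
  for (let i = ids.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [ids[i], ids[j]] = [ids[j], ids[i]];
  }
  return ids.slice(0, n);
}

// Placeholder IDs; in practice these would be read from the products database.
const allIds = Array.from({ length: 50 }, (_, i) => `product-${i}`);
console.log(pickProducts(allIds, 12345));
```

Running `pickProducts` again with the same seed and the same set of IDs returns the same 10 products, regardless of the order in which the IDs were fetched.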

What to submit

Submit the full source code for the tool and a README with configuration, deployment and verification steps.

Final Submission Guidelines

See above

Infant Nutrition - Review topic analysis updates

ID: 30136439