Challenge Overview

Welcome to the Infant Nutrition Social Media Scraper challenge. The goal is to build a Node.js CLI scraper that searches Twitter posts for specific keywords, filters the results, and saves them to a MongoDB database.

Project Overview

In this project we will be:

  • Scraping retail sites for product info, ratings, reviews, nutrients and ingredients data

  • Identifying competing products across brands based on ingredients and nutrients data

  • Analyzing user reviews to identify topics, positives, and negatives for each product group and brand

  • Searching social media posts for the identified items to estimate how popular or important each one is

  • Providing reports that allow for drill-down per topic, brand, product group or individual product level

Technology Stack

  • Node.js

  • MongoDB

  • Twitter

Assets

The existing retail site scraper tool is available in the forums. New code should be added to the same codebase, with a new npm target for scraping social media.

A backup of the products database is available in the forums.

Individual requirements

We have a collection of products saved in the Mongo database. Your task in this challenge is to create a CLI tool that searches Twitter (using the API) for specific keywords. The list of search keywords is specific to each product; see “Building the keywords” below for how to construct it.

The app should do the following:

  • Iterate through all the products in the database

  • Construct the search keywords for each product

  • Call the Twitter search API for each keyword

  • Filter the results (see below for filtering rules)

  • Save the search results to the product document as "socialMedia": {"twitter": ["tweet1 text", "tweet2 text", …]}. Other tweet details, such as author and date, are not needed; only the raw tweet text is required.

Log any errors to standard output.
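The steps listed above, with errors logged to standard output, might be wired together roughly as follows. This is only a sketch under stated assumptions: `processProducts` and its injected helpers (`buildQueries`, `search`, `filter`, `save`) are hypothetical names, not part of the existing codebase, and each tweet is assumed to expose its raw text as a `text` field. Injecting the helpers keeps the loop independent of any particular Twitter client or Mongo driver.

```javascript
// Hypothetical orchestration loop. All helpers are injected by the caller,
// so this function makes no direct Twitter or Mongo calls.
async function processProducts({ products, buildQueries, search, filter, save, limit }) {
  for (const product of products) {
    try {
      const texts = [];
      for (const query of buildQueries(product)) {
        const tweets = await search(query);
        // Keep at most `limit` tweets per keyword and store only the raw text.
        // Assumption: each tweet object has a `text` field.
        const kept = filter(tweets).slice(0, limit).map((tweet) => tweet.text);
        texts.push(...kept);
      }
      await save(product, { socialMedia: { twitter: texts } });
    } catch (err) {
      // Errors go to standard output; processing continues with the next product.
      console.log(`Failed to process product ${product._id}: ${err.message}`);
    }
  }
}
```

In a real run, `save` could wrap an update on the products collection that sets the `socialMedia` field, and `products` could be a database cursor instead of an array.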

Create a Dockerfile for the app and a docker-compose file that runs the app and starts a MongoDB instance.

The number of tweets to save per keyword should be configurable (default: 50).
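One possible way to expose this setting is an environment variable, which also fits docker-compose well. The sketch below assumes a variable named `TWEETS_PER_KEYWORD`; that name and the `getTweetsPerKeyword` helper are illustrative, not prescribed by the challenge.

```javascript
// Hypothetical config helper: reads the per-keyword tweet limit from the
// environment and falls back to 50 when the value is missing or invalid.
const DEFAULT_TWEETS_PER_KEYWORD = 50;

function getTweetsPerKeyword(env = process.env) {
  const parsed = Number.parseInt(env.TWEETS_PER_KEYWORD, 10);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : DEFAULT_TWEETS_PER_KEYWORD;
}
```

Falling back to the default on invalid input (instead of failing) is a design choice; a stricter tool might log the bad value and exit.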

Building the keywords

Each search query contains three parts: brand name, product group, and topic, combined using the OR operator. The brand name is already available as product.brand. Topics are the items of product.topics.positives and product.topics.negatives; build one query per topic, so the number of queries equals the number of topics.

The product group is not available in the product document. Parse it from the product name (product.name) according to the following rules:
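A minimal sketch of the query construction, taking the product group as an already computed string. The `buildQueries` and `quoteTerm` names are hypothetical, and wrapping multi-word terms in double quotes so they are searched as phrases is an assumption about the desired matching, not a stated requirement.

```javascript
// Wrap multi-word terms in double quotes so they are treated as one phrase.
// Assumption: phrase matching is wanted for terms like "Organic Baby Food".
function quoteTerm(term) {
  return /\s/.test(term) ? `"${term}"` : term;
}

// Hypothetical helper: returns one query string per topic, combining
// brand, product group, and topic with the OR operator.
function buildQueries(product, productGroup) {
  const topics = product.topics || {};
  const allTopics = [...(topics.positives || []), ...(topics.negatives || [])];
  return allTopics.map((topic) =>
    [product.brand, productGroup, topic]
      .filter(Boolean) // skip parts that are missing or empty
      .map(quoteTerm)
      .join(' OR ')
  );
}
```

For example, a product with brand “Gerber”, group “Organic Baby Food”, and a single topic “taste” would yield the query `Gerber OR "Organic Baby Food" OR taste`.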

  • Remove brand name from product.name

  • Remove any text in parentheses (e.g. “(12 pack)”)

  • Remove all numbers

  • Split the remaining text on “,” and take the first segment, keeping a maximum of 3 words

NOTE: These rules are not perfect and will not work for every product, but they are good enough for now.

Example: for the product name “(Pack of 12) Gerber Organic Baby Food Pear Peach Oatmeal 3.5 oz. Pouch”, the product group would be “Organic Baby Food”.

NOTE: Many products belong to the same product group and have similar topics, so you should cache the search results instead of searching for the same keywords again.
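The parsing rules could be implemented along these lines. The `parseProductGroup` and `escapeRegExp` names are hypothetical; matching the brand case-insensitively and skipping empty comma-separated segments are assumptions that go slightly beyond the stated rules.

```javascript
// Escape a string so it can be used literally inside a RegExp.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: derive the product group from product.name.
function parseProductGroup(name, brand) {
  let text = name;
  // 1. Remove the brand name (case-insensitive match is an assumption).
  if (brand) {
    text = text.replace(new RegExp(escapeRegExp(brand), 'gi'), ' ');
  }
  // 2. Remove any text in parentheses, e.g. "(12 pack)".
  text = text.replace(/\([^)]*\)/g, ' ');
  // 3. Remove all numbers.
  text = text.replace(/\d+/g, ' ');
  // 4. Split on commas and take the first non-empty segment
  //    (skipping empty segments is an assumption).
  const segment = text.split(',').map((part) => part.trim()).find(Boolean) || '';
  // 5. Keep at most 3 words.
  return segment.split(/\s+/).filter(Boolean).slice(0, 3).join(' ');
}
```

Called with the example name above and the brand “Gerber”, this returns “Organic Baby Food”.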
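One simple way to cache is an in-memory map keyed by the query string, wrapped around whatever function performs the actual search. In this sketch, `createCachedSearch` is a hypothetical helper and `searchFn` stands in for the real Twitter search call; the cache lives only for the duration of a single run.

```javascript
// Hypothetical helper: wraps a search function so that each distinct query
// string triggers at most one underlying search per run.
function createCachedSearch(searchFn) {
  const cache = new Map();
  return function cachedSearch(query) {
    const key = query.trim();
    if (!cache.has(key)) {
      // Store the promise itself, so repeated lookups of the same query
      // share one in-flight request.
      const pending = Promise.resolve(searchFn(query));
      // Drop failed lookups so a later call can retry.
      pending.catch(() => cache.delete(key));
      cache.set(key, pending);
    }
    return cache.get(key);
  };
}
```

Because products in the same group with the same brand and topic produce identical query strings, keying on the query text is enough to avoid the repeated searches.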

 

What to submit

  • Submit the full source code for the tool and a README with configuration, deployment and verification steps



Final Submission Guidelines

See above

ELIGIBLE EVENTS:

2020 Topcoder(R) Open

REVIEW STYLE:

Final Review:

Community Review Board

Approval:

User Sign-Off

ID: 30123849