Challenge Overview
Welcome to the Infant Nutrition Social Media Scraper challenge. In this challenge, we aim to create a CLI scraper tool (NodeJS) that searches Twitter posts for specific keywords, filters the results, and saves them to a database (Mongo).
Project Overview
In this project we will be:
- Scraping retail sites for product info, ratings, reviews, nutrients and ingredients data
- Identifying competing products across brands based on ingredients and nutrients data
- Analyzing user reviews to identify topics, positives, and negatives for each product group and brand
- Looking for identified items in social media posts to estimate how popular/important each one of them is
- Providing reports that allow for drill-down per topic, brand, product group or individual product level
Technology Stack
- NodeJS
- Mongo
- Twitter
Assets
The existing retail site scraper tool is available in the forums. New code should be added to the same codebase, with a new npm target for scraping social media.
A backup of the products database is available in the forums.
Individual requirements
We have a collection of products saved to the Mongo database. Your task in this challenge is to create a CLI tool that searches Twitter (using the API) for specific keywords. The list of search keywords is specific to each product. See below for how to construct the keywords.
The app should do the following:
- Iterate through all the products in the database
- Construct the search keywords for each product
- Call the Twitter search API for each keyword
- Filter the results (see below for filtering rules)
- Save the search results to the product document as "socialMedia": {"twitter": ["tweet1 text", "tweet2 text", …]}. Other tweet details such as author and date are not needed; we only need the raw tweet text.
Log any errors to standard output.
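The steps above could be wired together roughly as follows. This is a minimal sketch, not the existing tool's code: scrapeAll and the injected helpers buildQueries, search, keep and save are hypothetical names, and the filtering rules are represented only by the placeholder predicate keep. The update uses MongoDB's $set operator with dot notation to write the raw tweet texts to socialMedia.twitter.

```javascript
// Hypothetical orchestration sketch; every helper is injected by the caller.
// `products` may be an array or an async iterable such as a MongoDB cursor.
async function scrapeAll(products, { buildQueries, search, keep, save, limit = 50 }) {
  let failures = 0;
  for await (const product of products) {
    try {
      const texts = [];
      for (const query of buildQueries(product)) {
        const tweets = await search(query, limit); // expected: array of tweet texts
        // Apply the filter, then keep at most `limit` tweets for this keyword.
        texts.push(...tweets.filter(keep).slice(0, limit));
      }
      // Only the raw tweet text is stored, under socialMedia.twitter.
      await save(product._id, { $set: { "socialMedia.twitter": texts } });
    } catch (err) {
      failures += 1;
      // Errors go to standard output, so console.log rather than console.error.
      console.log(`Error processing product ${product._id}: ${err.message}`);
    }
  }
  return failures;
}
```

Catching errors per product means one failing product is logged and skipped instead of aborting the whole run.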
Create a Dockerfile for the app and a docker-compose file that runs the app and starts a Mongo database.
The number of tweets to save per keyword should be configurable (default: 50).
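One way to make the per-keyword limit configurable is an environment variable, which also works well with docker-compose. The sketch below assumes a variable named TWEETS_PER_KEYWORD; that name and the function name are illustrative only and do not come from the existing codebase.

```javascript
// Reads the per-keyword tweet limit from the environment.
// TWEETS_PER_KEYWORD is a hypothetical variable name chosen for this sketch.
function tweetsPerKeyword(env = process.env) {
  const parsed = Number.parseInt(env.TWEETS_PER_KEYWORD ?? "", 10);
  // Fall back to the default of 50 for missing, non-numeric, or non-positive values.
  return Number.isInteger(parsed) && parsed > 0 ? parsed : 50;
}
```

Passing the environment as a parameter keeps the function easy to test without changing process.env.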
Building the keywords
Each search query contains three parts: brand name, product group, and topic, combined using the OR operator. The brand name is already available as product.brand. Product topics are the items of product.topics.positives and product.topics.negatives; build one query per topic, so the number of queries equals the number of topics.
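As an illustration, the queries for one product might be built like this. The function name buildQueries is hypothetical, the product group is passed in as an already-parsed string, and wrapping each part in double quotes so that multi-word phrases are treated as a unit is an assumption of this sketch rather than a stated requirement.

```javascript
// Builds one "brand OR group OR topic" query per topic.
// Quoting each part is an assumption made so multi-word phrases stay together.
function buildQueries(product, productGroup) {
  const topics = [
    ...(product.topics?.positives ?? []),
    ...(product.topics?.negatives ?? []),
  ];
  const quote = (term) => `"${term}"`;
  return topics.map((topic) =>
    [product.brand, productGroup, topic].map(quote).join(" OR ")
  );
}
```

A product with two positive topics and one negative topic yields three queries, matching the rule that the number of queries equals the number of topics.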
The product group is not available in the product document; you need to parse it from the product name according to the following rules:
- Remove the brand name from product.name
- Remove any text in parentheses (e.g. “(12 pack)”)
- Remove all numbers
- Split the remaining text on ‘,’ and take the first part, up to a maximum of 3 words
NOTE: This set of rules is not perfect and will not work for every product, but it is good enough for now.
Example: for the product name “(Pack of 12) Gerber Organic Baby Food Pear Peach Oatmeal 3.5 oz. Pouch”, the product group would be “Organic Baby Food”.
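The parsing rules can be sketched as a small pure function. The names parseProductGroup and escapeRegExp are illustrative, and two details are assumptions of this sketch: the brand is matched case-insensitively, and "first part" is read as the first non-empty comma-separated segment.

```javascript
// Escapes characters that have special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Derives the product group from a product name using the four rules.
function parseProductGroup(name, brand) {
  // Rule 1: remove the brand name (case-insensitive match is an assumption).
  const withoutBrand = brand
    ? name.replace(new RegExp(escapeRegExp(brand), "gi"), " ")
    : name;
  const cleaned = withoutBrand
    .replace(/\([^)]*\)/g, " ") // Rule 2: remove text in parentheses
    .replace(/\d+/g, " "); // Rule 3: remove all numbers
  // Rule 4: first non-empty comma-separated segment, at most 3 words.
  const segment = cleaned.split(",").map((s) => s.trim()).find(Boolean) ?? "";
  return segment.split(/\s+/).filter(Boolean).slice(0, 3).join(" ");
}

console.log(
  parseProductGroup(
    "(Pack of 12) Gerber Organic Baby Food Pear Peach Oatmeal 3.5 oz. Pouch",
    "Gerber"
  )
); // prints "Organic Baby Food"
```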
NOTE: Many products will belong to the same product group and have similar topics, so you should cache the search results instead of searching for the same keywords again.
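A simple in-memory cache keyed by query string is one way to avoid repeated searches within a single run. In this sketch, cachedSearch is a hypothetical wrapper and search stands for whatever function actually calls the Twitter API; the cache stores promises, so identical queries issued while a request is still in flight also share one call.

```javascript
// Wraps a search function so each (limit, query) pair hits the API only once per run.
function cachedSearch(search) {
  const cache = new Map();
  return (query, limit) => {
    const key = `${limit}|${query}`;
    if (!cache.has(key)) {
      const pending = Promise.resolve()
        .then(() => search(query, limit))
        .catch((err) => {
          // Do not keep failed lookups, so a later product can retry the query.
          cache.delete(key);
          throw err;
        });
      cache.set(key, pending);
    }
    return cache.get(key);
  };
}
```

The cache lives only for the lifetime of the process; persisting results between runs would need a separate store.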
What to submit
- Submit the full source code for the tool and a README with configuration, deployment, and verification steps