Challenge Overview
Welcome to the Infant Nutrition Ingredients and Nutrients Extraction challenge. In this challenge, we aim to create a CLI tool that reads existing product images from the database, extracts nutrient and ingredient information, and saves it back to the database (Mongo).
Project Overview
In this project we will be:
- Scraping retail sites for product info, ratings, reviews, nutrients and ingredients data
- Identifying competing products across brands based on ingredients and nutrients data
- Analyzing user reviews to identify topics, positives, and negatives for each product group and brand
- Looking for identified items in social media posts to estimate how popular/important each one of them is
- Providing reports that allow for drill-down per topic, brand, product group or individual product level
Technology Stack
- NodeJS / Python
- Mongo
- OCR, Tesseract
Assets
The product scraper tool is available in the forums, just for reference. You should create a separate codebase for the OCR tool.
A product data backup is available in the forums. It contains data for about 2000 products, images included. Import the data into a “products” collection in MongoDB.
Individual requirements
Create a CLI tool that reads the images from product records in the database and extracts nutrients and ingredients data.
Each product has several images, and your first task is to determine how likely it is that an image contains any ingredients/nutrients info. Log this likelihood for each image to the console during processing. It is up to you to figure out the best strategy for identifying nutrients and ingredients data. Most of the products have at least one image fully focused on the ingredients/nutrients section of the product packaging. Here are two examples:
Some products may not have any images with nutrients and ingredients data. In that case, the tool should not attempt any further extraction (in the future, we might focus on looking up that data in product description fields).
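One possible way to estimate the likelihood is to run OCR on each image and score the resulting text. The sketch below is only an illustration under stated assumptions: the keyword list, the weights, and the function name `label_likelihood` are hypothetical and would need tuning against the real product images. It takes the OCR text as a plain string, so it works with whatever OCR wrapper you choose.

```python
import re

# Hypothetical keyword list; tune it against the real product images.
LABEL_KEYWORDS = (
    "ingredients", "nutrition", "nutrients", "energy", "protein",
    "carbohydrate", "fat", "sodium", "vitamin", "calories", "per 100",
)

# An amount followed by a unit, e.g. "1.5 g" or "60kcal".
AMOUNT_UNIT = re.compile(
    r"\d+(?:[.,]\d+)?\s*(?:mg|mcg|g|kg|ml|l|kcal|kj|iu)\b",
    re.IGNORECASE,
)


def label_likelihood(ocr_text: str) -> float:
    """Return a rough 0..1 score that the text comes from a label panel."""
    text = ocr_text.lower()
    keyword_hits = sum(1 for kw in LABEL_KEYWORDS if kw in text)
    amount_hits = len(AMOUNT_UNIT.findall(text))
    # Saturate each signal so one very long label does not dominate.
    keyword_score = min(keyword_hits / 4, 1.0)
    amount_score = min(amount_hits / 5, 1.0)
    # Assumed weights: keywords count slightly more than amounts.
    return round(0.6 * keyword_score + 0.4 * amount_score, 2)


if __name__ == "__main__":
    sample = "Nutrition Information per 100 ml: Energy 280 kJ, Protein 1.3 g"
    print(f"image score: {label_likelihood(sample)}")
```

Printing the score per image satisfies the logging requirement, and a threshold on the score (chosen empirically) could decide whether a product has any usable image at all.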
The data to extract for both ingredients and nutrients is:
- Name of nutrient/ingredient
- Amount (if available)
- Unit (if available)
Additionally, the tool should extract the reference value that the ingredient/nutrient amounts refer to: per X g, per X ml, per serving, per package, etc.
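As an illustration of reference value detection, the following sketch searches the OCR text for phrases such as "per 100 g" or "per serving". The pattern list and the function name `find_reference_value` are assumptions made for this example; real labels will likely need more variants (other languages, "per 100 kcal", and so on).

```python
import re
from typing import Optional

REFERENCE_PATTERNS = (
    # Quantity-based references: "per 100 g", "per 100ml", "per 30 g".
    re.compile(r"\bper\s+(\d+(?:[.,]\d+)?)\s*(g|ml|l|kg|oz)\b", re.IGNORECASE),
    # Portion-based references: "per serving", "per package", ...
    re.compile(
        r"\bper\s+(serving|package|portion|scoop|bottle|container)\b",
        re.IGNORECASE,
    ),
)


def find_reference_value(ocr_text: str) -> Optional[str]:
    """Return a normalised reference such as 'per 100 ml', or None."""
    # Quantity-based patterns are tried first over the whole text.
    for pattern in REFERENCE_PATTERNS:
        match = pattern.search(ocr_text)
        if match:
            # Normalise case and spacing: "Per 100ML" -> "per 100 ml".
            return "per " + " ".join(g.lower() for g in match.groups())
    return None
```

Returning `None` when nothing matches lets the caller store an empty reference value rather than guessing one.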
Extracted data should be added to the product entry in the DB with the following structure:
{
  ingredients: [{name, amount, unit, referenceValue}],
  nutrients: [{name, amount, unit, referenceValue}]
}
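To make the target structure concrete, here is a minimal sketch that turns parsed entries into a MongoDB `$set` update document. The helper name `build_update` and the shape of its inputs are assumptions for illustration; the resulting dictionary could be passed to an update call such as pymongo's `update_one`, which is not shown here.

```python
from typing import Dict, List, Optional


def build_update(
    ingredients: List[Dict[str, object]],
    nutrients: List[Dict[str, object]],
    reference_value: Optional[str],
) -> Dict[str, Dict[str, list]]:
    """Build a $set update holding the two arrays from the spec."""

    def entry(item: Dict[str, object]) -> Dict[str, object]:
        # Amount and unit are optional, so missing keys become None.
        return {
            "name": item["name"],
            "amount": item.get("amount"),
            "unit": item.get("unit"),
            "referenceValue": reference_value,
        }

    return {
        "$set": {
            "ingredients": [entry(i) for i in ingredients],
            "nutrients": [entry(n) for n in nutrients],
        }
    }
```

For example, `build_update([{"name": "lactose"}], [{"name": "Protein", "amount": 1.3, "unit": "g"}], "per 100 ml")` yields one ingredient without amount or unit and one nutrient with both, each carrying the same reference value.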
Again, it is up to you to find a good strategy for parsing the OCR output: some images contain the info in a single text paragraph, while others present it in a table. Handling these cases properly will be the central part of this challenge and will affect the tool's accuracy.
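The two layouts can be handled by separate parsers. The sketch below is one hedged starting point, not a complete solution: `parse_table_row` handles a single table-style line such as "Protein 1.3 g", and `split_ingredients` splits a paragraph-style ingredients list on commas that are not inside parentheses. Both function names, the unit list, and the "Ingredients:" prefix handling are assumptions; real OCR output will be noisier.

```python
import re
from typing import Dict, List, Optional

ROW_PATTERN = re.compile(
    r"^\s*(?P<name>[A-Za-z][A-Za-z0-9 ,()/-]*?)"      # name, matched lazily
    r"[\s:.]*"                                         # separators, dot leaders
    r"(?P<amount>\d+(?:[.,]\d+)?)\s*"                  # amount: 1.3 or 1,3
    r"(?P<unit>mg|mcg|g|kg|ml|l|kcal|kj|iu|%)?\s*$",   # optional unit
    re.IGNORECASE,
)


def parse_table_row(line: str) -> Optional[Dict[str, object]]:
    """Parse one table-style line into name, amount and unit."""
    match = ROW_PATTERN.match(line)
    if match is None:
        return None
    return {
        "name": match.group("name").strip(),
        # Accept a decimal comma as well as a decimal point.
        "amount": float(match.group("amount").replace(",", ".")),
        "unit": match.group("unit"),
    }


def split_ingredients(paragraph: str) -> List[str]:
    """Split a paragraph-style list on commas outside parentheses."""
    body = re.sub(r"^\s*ingredients\s*:\s*", "", paragraph, flags=re.IGNORECASE)
    items: List[str] = []
    current: List[str] = []
    depth = 0
    for ch in body:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)
        if ch == "," and depth == 0:
            items.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    items.append("".join(current).strip())
    return [item.rstrip(".") for item in items if item]
```

Keeping parenthesised sub-lists together, as in "vegetable oils (palm, rapeseed)", avoids splitting one compound ingredient into several entries.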
We don’t have separate test/train data, so your submission will be judged by manually reviewing the tool's output on a subset of the products in the database.
Log any errors to standard output. Add a parameter to the tool to process just one product from the database (e.g. -product 1782588). If it is not provided, process all the products in the database.
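A minimal sketch of the command-line handling with Python's `argparse`, which accepts single-dash long options such as `-product`. The query field name `productId` is hypothetical, since the actual field in the product records is not specified here, and the processing step itself is left out.

```python
import argparse
import logging
import sys
from typing import Dict, List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Extract nutrients and ingredients from product images."
    )
    parser.add_argument(
        "-product",
        dest="product_id",
        default=None,
        help="process only this product; omit to process all products",
    )
    return parser


def build_query(product_id: Optional[str]) -> Dict[str, str]:
    """Return a Mongo filter: empty for all products, else one product."""
    if product_id is None:
        return {}
    # "productId" is an assumed field name; adjust it to the real schema.
    return {"productId": product_id}


def main(argv: Optional[List[str]] = None) -> int:
    # Send log records, including errors, to standard output.
    logging.basicConfig(stream=sys.stdout, level=logging.INFO)
    args = build_parser().parse_args(argv)
    query = build_query(args.product_id)
    logging.info("Processing %s", query if query else "all products")
    return 0
```

Calling `main(["-product", "1782588"])` restricts the run to one product, while `main([])` selects everything. A real entry point would call `sys.exit(main())`.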
Create a Dockerfile for the app. You can use the existing docker-compose configuration to start the Mongo instance; just disable the product scraper container, as it is not needed in this challenge. The database connection string should be a configurable value.
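One common way to make the connection string configurable is an environment variable with a fallback, which also works well with Docker. In this sketch the variable name `MONGO_URI`, the default URI, and the function name `resolve_mongo_uri` are all assumptions, not values taken from the existing docker-compose configuration.

```python
import os
from typing import Mapping, Optional

# Hypothetical default for local development only.
DEFAULT_MONGO_URI = "mongodb://localhost:27017/products_db"


def resolve_mongo_uri(
    cli_value: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick the connection string: CLI value, then environment, then default."""
    env = os.environ if env is None else env
    return cli_value or env.get("MONGO_URI") or DEFAULT_MONGO_URI
```

The resolved string can then be handed to the database client, and the container can override it at start-up without rebuilding the image.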
What to submit
- Submit the full source code for the tool and a README with configuration, deployment, and verification steps.