The challenge is finished.

Challenge Overview


Welcome to the Infant Nutrition Ingredients and Nutrients Extraction challenge. In this challenge, we aim to create a CLI tool that reads existing product images from the database, extracts nutrient and ingredient information, and saves it back to the database (Mongo).

Project Overview

In this project we will be:

  • Scraping retail sites for product info, ratings, reviews, nutrients and ingredients data

  • Identifying competing products across brands based on ingredients and nutrients data

  • Analyzing user reviews to identify topics, positives, and negatives for each product group and brand

  • Looking for identified items in social media posts to estimate how popular/important each one of them is

  • Providing reports that allow for drill-down per topic, brand, product group or individual product level

Technology Stack

  • NodeJS / Python

  • Mongo

  • OCR, Tesseract

 

Assets

The product scraper tool is available in the forums, just for reference. You should create a separate codebase for the OCR tool.

Product data backup is available in the forums - it contains data for about 2000 products, images included. Import the data into a “products” collection in Mongo DB.

 

Individual requirements

Create a CLI tool that reads the images from product records in the database and extracts nutrients and ingredients data. 

Each product has several images, and your first task is to determine how likely it is that an image contains any ingredients/nutrients info. Log this likelihood for each image to the console during processing. It is up to you to figure out the best strategy for identifying nutrients and ingredients data. Most products have at least one image fully focused on the ingredients/nutrients section of the product packaging. Here are two examples:

There can also be some products that don’t have any images with nutrients and ingredients data. In that case, the tool will not try any further extraction (in the future, we might focus on looking up that data in product description fields).
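One possible way to estimate that likelihood is to score the OCR text of each image against label-specific keywords and amount/unit patterns. The sketch below is only an illustration under stated assumptions: the keyword lists, the weights, and the function name `label_likelihood` are hypothetical, and the input text is assumed to come from Tesseract (for example via `pytesseract.image_to_string`).

```python
import re

# Hypothetical keyword lists; they would need tuning against real OCR output.
NUTRIENT_HINTS = ("nutrition", "calories", "protein", "carbohydrate",
                  "sodium", "vitamin", "per 100")
INGREDIENT_HINTS = ("ingredients", "contains")

# Amount followed by a unit, e.g. "2.5 g", "120mg", "10%".
AMOUNT_RE = re.compile(r"\d+(?:[.,]\d+)?\s*(?:(?:mg|mcg|kcal|kj|g|ml|iu)\b|%)")


def label_likelihood(ocr_text: str) -> float:
    """Return a rough 0..1 score that the text comes from a nutrition/ingredients panel."""
    text = ocr_text.lower()
    hits = sum(1 for kw in NUTRIENT_HINTS + INGREDIENT_HINTS if kw in text)
    # Cap the amount count so one dense table cannot dominate the score.
    amounts = min(len(AMOUNT_RE.findall(text)), 8)
    # Arbitrary illustrative weights, clamped to 1.0.
    return min(0.15 * hits + 0.05 * amounts, 1.0)
```

The score for each image can be printed during processing, and images below a chosen cutoff (the cutoff itself is a tuning decision) can be skipped before any further extraction.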

The data to extract for both ingredients and nutrients is:

  • Name of nutrient/ingredient

  • Amount (if available)

  • Unit (if available)

Additionally, the tool should extract the reference value that the ingredient/nutrient amounts refer to: per X g, per X ml, per serving, per package, etc.
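A reference value such as "per 100 ml" or "per serving" can often be picked out of the OCR text with a regular expression. This is a minimal sketch, assuming English labels and only the handful of forms listed in the pattern; the function name `find_reference_value` and the normalised output format are hypothetical choices, not part of the requirements.

```python
import re
from typing import Optional

# Matches "per 100 g", "per 100ml", "per serving", "per package".
_REFERENCE_RE = re.compile(
    r"\bper\s+(\d+(?:[.,]\d+)?\s*(?:g|ml)\b|serving|package)",
    re.IGNORECASE,
)


def find_reference_value(ocr_text: str) -> Optional[str]:
    """Return a normalised reference value such as 'per 100ml', or None if absent."""
    match = _REFERENCE_RE.search(ocr_text)
    if match is None:
        return None
    # Drop inner whitespace so "per 100 ml" and "per 100ml" compare equal.
    return "per " + re.sub(r"\s+", "", match.group(1)).lower()
```

Real packaging uses more variants (scoops, bottles, prepared feed), so the pattern would need to grow as new cases appear in the data.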

Extracted data should be added to the product entry in the DB with the following structure:
{
    ingredients: [{name, amount, unit, referenceValue}],
    nutrients: [{name, amount, unit, referenceValue}]
}
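The structure above could be written to the product record with a `$set` update. The sketch below only builds the update document; the `Entry` type, the value types (float amount, string unit), and the helper name `build_update` are assumptions made for illustration.

```python
from typing import List, Optional, TypedDict


class Entry(TypedDict):
    """One extracted ingredient or nutrient (field types are assumed)."""
    name: str
    amount: Optional[float]
    unit: Optional[str]
    referenceValue: Optional[str]


def build_update(ingredients: List[Entry], nutrients: List[Entry]) -> dict:
    """Build a MongoDB update document that sets both arrays on a product."""
    return {"$set": {"ingredients": ingredients, "nutrients": nutrients}}


# With pymongo this could be applied as, for example:
#   products.update_one({"_id": product_id}, build_update(ingredients, nutrients))
# where `products` is the collection and the "_id" filter is an assumption
# about how products are keyed in the backup.
```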

Again, it is up to you to find a good strategy for parsing the OCR output: some images contain the info in a single text paragraph, while others present it in a table. Handling these cases properly will be the central part of this challenge and will affect the tool's accuracy.
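The two layouts mentioned above can be handled by separate parsers. The following is a rough sketch rather than a complete strategy: it assumes table rows look like "Protein 1.3 g" and that ingredient paragraphs are comma-separated with optional parenthesised sub-lists. Both function names are hypothetical, and names containing digits (such as "Vitamin B12") are not handled.

```python
import re
from typing import Dict, List, Optional

# A table row: a name without digits, then the first amount, then an optional unit.
_ROW_RE = re.compile(
    r"^\s*(?P<name>[^\d]+?)\s*(?P<amount>\d+(?:[.,]\d+)?)"
    r"(?:\s*(?P<unit>(?:mg|mcg|kcal|kj|g|ml|iu)\b|%))?",
    re.IGNORECASE,
)


def parse_table_row(line: str) -> Optional[Dict[str, object]]:
    """Parse one OCR line of a nutrient table, or return None if it does not fit."""
    match = _ROW_RE.match(line)
    if match is None:
        return None
    return {
        "name": match.group("name").strip(" :.-"),
        # Accept a decimal comma as well as a decimal point.
        "amount": float(match.group("amount").replace(",", ".")),
        "unit": match.group("unit"),
    }


def split_ingredient_paragraph(text: str) -> List[str]:
    """Split an ingredient paragraph on commas that are outside brackets."""
    body = re.sub(r"^\s*ingredients\s*:?\s*", "", text, flags=re.IGNORECASE)
    parts: List[str] = []
    current: List[str] = []
    depth = 0
    for ch in body:
        if ch in "([":
            depth += 1
        elif ch in ")]":
            depth = max(depth - 1, 0)
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    cleaned = [p.strip(" .\n\t") for p in parts]
    return [p for p in cleaned if p]
```

Tracking bracket depth keeps sub-lists such as "vegetable oils (palm, rapeseed)" together as a single ingredient instead of splitting them at the inner comma.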

We don’t have separate test/train data, so your submission will be judged by manually reviewing the tool's output on a subset of the products in the database.

Log any errors to standard output. Add a parameter to the tool to process just one product from the database (e.g. -product 1782588). If the parameter is not provided, process all the products in the database.
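The command-line behaviour described above could look roughly like this with Python's standard `argparse` and `logging` modules. It is a sketch only: the function names are hypothetical and the actual processing is left out.

```python
import argparse
import logging
import sys


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Extract ingredients and nutrients from product images."
    )
    # Single-dash long option, matching the "-product 1782588" form above.
    parser.add_argument(
        "-product",
        metavar="ID",
        default=None,
        help="process only this product; omit to process all products",
    )
    return parser


def main(argv=None) -> int:
    # Send log records, including errors, to standard output.
    logging.basicConfig(stream=sys.stdout, level=logging.INFO)
    args = build_parser().parse_args(argv)
    if args.product is None:
        logging.info("processing all products")
    else:
        logging.info("processing product %s", args.product)
    return 0
```

The script's entry point would call `main()` and pass its return value to `sys.exit`.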

Create a Dockerfile for the app. You can use the existing docker-compose configuration to start the Mongo instance; just disable the product scraper container, as it is not needed in this challenge. The database connection string should be a configurable value.
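One common way to make the connection string configurable is to read it from an environment variable, which also works well with Docker and docker-compose. In this sketch the variable name `MONGO_URI` and the default value are assumptions, not names taken from the existing configuration.

```python
import os
from typing import Mapping

# Assumed default for a local Mongo instance; adjust to the real setup.
DEFAULT_MONGO_URI = "mongodb://localhost:27017"


def get_mongo_uri(env: Mapping[str, str] = os.environ) -> str:
    """Return the connection string from MONGO_URI, falling back to a local default."""
    return env.get("MONGO_URI", DEFAULT_MONGO_URI)
```

Passing the environment as a parameter keeps the function easy to test; in a container the variable would be set in the compose file or with `docker run -e`.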

What to submit

  • Submit the full source code for the tool and a README with configuration, deployment, and verification steps.



Final Submission Guidelines

See above

ELIGIBLE EVENTS:

2020 Topcoder(R) Open

Review style

Final Review

Community Review Board

Approval

User Sign-Off

ID: 30119066