AthenaEC - Data Gathering Challenge Set 2

The challenge is finished.

Challenge Overview

Challenge Details

We previously ran a data gathering challenge to scrape data from a few websites. In this challenge, we need to add more websites to the existing code, which will:
  • Extract the medicine details from a list of URLs
  • Persist the extracted data in a database table

Project Background

The customer is a global healthcare company that researches, develops, and manufactures consumer healthcare products. The purpose of this project is to build a crawler that helps our client find the prices of specific products.

Technology Stack
  • Python 3.6
  • Scrapy (Recommended)
  • MS SQL Server 2017

Individual Requirements

Scope
The list of websites to be added, along with the keyword, is provided in the forum.

Your code should be intelligent enough to detect new or changed information and insert or update only the new or changed records. Dropping and reloading data (a truncate-and-load approach) for each URL is not acceptable. Your project should be configurable, so that more URLs can be added easily and an individual spider specific to each URL can be implemented.
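One possible way to meet the insert/update-only requirement is to store a content hash next to each record and compare it on every crawl. The sketch below is an illustration under stated assumptions, not the existing code: the `medicine` table, its columns, and the `upsert` helper are hypothetical, and the standard-library `sqlite3` module stands in for MS SQL Server 2017 so that the example runs without extra setup.

```python
import hashlib
import json
import sqlite3

# Hypothetical field names; the real entity is defined in the existing code.
TRACKED_FIELDS = ("name", "price", "currency")


def record_hash(record):
    """Return a stable SHA-256 digest of the tracked fields."""
    payload = json.dumps({k: record.get(k) for k in TRACKED_FIELDS}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def create_table(conn):
    """Create the illustrative table (sqlite3 stands in for MS SQL Server)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS medicine ("
        "url TEXT PRIMARY KEY, name TEXT, price REAL, "
        "currency TEXT, content_hash TEXT)"
    )


def upsert(conn, record):
    """Insert a new record, update a changed one, and skip an unchanged one.

    Returns 'inserted', 'updated', or 'unchanged'.
    """
    new_hash = record_hash(record)
    row = conn.execute(
        "SELECT content_hash FROM medicine WHERE url = ?", (record["url"],)
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO medicine (url, name, price, currency, content_hash) "
            "VALUES (?, ?, ?, ?, ?)",
            (record["url"], record["name"], record["price"],
             record["currency"], new_hash),
        )
        conn.commit()
        return "inserted"
    if row[0] == new_hash:
        # Nothing changed since the last crawl, so no write is issued.
        return "unchanged"
    conn.execute(
        "UPDATE medicine SET name = ?, price = ?, currency = ?, content_hash = ? "
        "WHERE url = ?",
        (record["name"], record["price"], record["currency"],
         new_hash, record["url"]),
    )
    conn.commit()
    return "updated"
```

Calling `upsert` twice with the same record writes once and then reports `unchanged`; changing the price and calling it again reports `updated` while the table still holds a single row for that URL.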

Database Entities
Use the existing entities in the code. If any change is required to an existing entity, it should be discussed in the forum.

Avoiding Getting Banned
Your crawler should follow best practices to avoid getting banned by the server. Refer to the common practices for Scrapy.
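As a rough illustration of what those practices can look like, the fragment below shows a few standard Scrapy settings that slow the crawler down and make it identifiable. The values and the `USER_AGENT` string are assumptions chosen for illustration, not requirements of this challenge; tune them for each target site.

```python
# settings.py (fragment); all values are illustrative assumptions.

# Respect each site's robots.txt rules.
ROBOTSTXT_OBEY = True

# Wait between requests to the same site, with random jitter.
DOWNLOAD_DELAY = 2.0
RANDOMIZE_DOWNLOAD_DELAY = True

# Keep per-domain concurrency low.
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Let the AutoThrottle extension adapt the delay to server latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2.0
AUTOTHROTTLE_MAX_DELAY = 30.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Some sites use cookies to spot crawler behaviour.
COOKIES_ENABLED = False

# Hypothetical identifier; replace with a real contact for the project.
USER_AGENT = "athenaec-crawler (+https://example.com/contact)"
```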

Deployment Guide and Validation Document

Two separate documents are required for validation.

A README.md that covers:
  • Deployment: how to build and test your submission.
  • Configuration: document every configuration option used by the submission.
  • Dependency Installation: a clear, up-to-date, step-by-step guide for installing dependencies.
A Validation.md that covers:
  • Validation of each requirement, so that reviewers can easily map the requirements to your submission.

Important Notes
The URLs provided are Polish websites. However, English keywords will work, and you can easily fetch the required data from each website.


Final Submission Guidelines

Submit a git patch.

ELIGIBLE EVENTS:

Topcoder Open 2019

Review style

Final Review

Community Review Board

Approval

User Sign-Off

ID: 30078202