Challenge Overview
- Write code that extracts the medicine details from a list of URLs
- Persist the extracted data in a database table
Challenge Details
Project Background
The customer is a global healthcare company that researches, develops, and manufactures consumer healthcare products. The purpose of this project is to build a crawler that helps the client find the prices of specific products.
Technology Stack
- Python 3.6
- Scrapy (Recommended)
- MS SQL Server 2017
Individual Requirements
Scope
You will be provided with a list of e-commerce sites on which you have to search for the keywords provided in the forum. From the search results, extract the following mandatory details (a minimal spider sketch follows the list):
- Name
- Price
- Category
- URL
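For illustration, here is a minimal sketch of one such spider. The domain, search URL pattern, CSS selectors, and keyword list are all placeholders, not part of the challenge; every real site will need its own selectors.

```python
# Hypothetical spider sketch: searches one shop for each keyword and
# yields the four mandatory fields. Domain, selectors, and the keyword
# list are placeholders; real sites each need their own selectors.
import scrapy


class ExampleShopSpider(scrapy.Spider):
    name = "exampleshop"
    # In the real project, keywords would come from the forum via a
    # config file rather than being hard-coded.
    keywords = ["paracetamol"]

    def start_requests(self):
        for kw in self.keywords:
            # Assumes the shop's search page accepts a "q" query parameter.
            yield scrapy.Request(f"https://www.example-shop.pl/szukaj?q={kw}")

    def parse(self, response):
        for product in response.css("div.product"):
            yield {
                "name": product.css("a.title::text").get(),
                "price": product.css("span.price::text").get(),
                "category": product.css("span.category::text").get(),
                "url": response.urljoin(product.css("a.title::attr(href)").get()),
            }
```

Because the selectors differ per site, this is also why the requirement below calls for an individual spider per URL.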
Your code should be intelligent enough to detect new or changed information and insert or update only new or changed records. Dropping and reloading the data (a truncate-and-load approach) for each URL is not acceptable. Your project should take a configurable approach so that more URLs can be added easily, with an individual spider implemented for each URL.
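As a sketch of this incremental insert/update requirement, the item pipeline below upserts each scraped item into MS SQL Server with a `MERGE` keyed on the product URL, so unchanged rows are skipped rather than truncated and reloaded. The table `dbo.Product`, its columns, and the connection string are assumptions for illustration only.

```python
# Hypothetical item pipeline sketch: upserts each scraped item into
# MS SQL Server so that only new or changed records are written.
# Table dbo.Product and the connection string are assumptions.
import pyodbc


class UpsertPipeline:
    def open_spider(self, spider):
        self.conn = pyodbc.connect(
            "DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=localhost;DATABASE=Crawler;UID=crawler;PWD=<password>",
            autocommit=True,
        )

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        # MERGE keyed on URL: insert new rows, update rows whose price or
        # category changed, and leave identical rows untouched.
        self.conn.cursor().execute(
            """
            MERGE dbo.Product AS t
            USING (SELECT ? AS Url, ? AS Name, ? AS Price, ? AS Category) AS s
                ON t.Url = s.Url
            WHEN MATCHED AND (t.Price <> s.Price OR t.Category <> s.Category)
                THEN UPDATE SET Name = s.Name, Price = s.Price,
                                Category = s.Category,
                                UpdatedAt = SYSUTCDATETIME()
            WHEN NOT MATCHED
                THEN INSERT (Url, Name, Price, Category, UpdatedAt)
                     VALUES (s.Url, s.Name, s.Price, s.Category, SYSUTCDATETIME());
            """,
            item["url"], item["name"], item["price"], item["category"],
        )
        return item
```

Registering a single pipeline like this in `ITEM_PIPELINES` keeps the per-site spiders free of any persistence logic.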
Scheduler
You should provide the steps to run the code using a scheduler so that the data is extracted at a scheduled interval.
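One possible approach, sketched below, is a small script using APScheduler that launches the crawl at a fixed interval. The spider name `medicines` and the six-hour interval are assumptions.

```python
# Hypothetical scheduler sketch using APScheduler. The spider name
# "medicines" and the six-hour interval are assumptions.
import subprocess

from apscheduler.schedulers.blocking import BlockingScheduler


def run_crawl():
    # Run Scrapy as a subprocess so every run starts a fresh reactor.
    subprocess.run(["scrapy", "crawl", "medicines"], check=True)


scheduler = BlockingScheduler()
scheduler.add_job(run_crawl, "interval", hours=6)
scheduler.start()
```

A cron entry (Linux) or a Windows Task Scheduler task that invokes `scrapy crawl` directly would work equally well; document whichever you choose.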
Database Entities
Create database entities in MS SQL Server for storing the fetched data.
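A minimal setup script might create a single product table matching the mandatory fields above; the table, column names, and types here are assumptions and should be adapted to your own design.

```python
# Hypothetical one-off setup script: creates the product table assumed
# by the upsert pipeline above. Column names and types are assumptions.
import pyodbc

DDL = """
IF OBJECT_ID('dbo.Product', 'U') IS NULL
CREATE TABLE dbo.Product (
    Id        INT IDENTITY(1, 1) PRIMARY KEY,
    Url       NVARCHAR(450) NOT NULL UNIQUE,  -- upsert key
    Name      NVARCHAR(255) NOT NULL,
    Price     DECIMAL(10, 2) NOT NULL,
    Category  NVARCHAR(255) NULL,
    UpdatedAt DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
)
"""

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=Crawler;UID=crawler;PWD=<password>",
    autocommit=True,
)
conn.cursor().execute(DDL)
conn.close()
```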
Avoiding Getting Banned
Your crawler should follow best practices to avoid getting banned by the server. Refer to the common practices documented for Scrapy.
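For reference, the excerpt below shows the kind of `settings.py` values Scrapy's Common Practices guide recommends for polite crawling; the exact numbers are illustrative and should be tuned per site.

```python
# Illustrative settings.py excerpt following Scrapy's Common Practices
# guidance on avoiding bans. The exact values are assumptions.
ROBOTSTXT_OBEY = True                # honour robots.txt
DOWNLOAD_DELAY = 2                   # seconds between requests to a site
CONCURRENT_REQUESTS_PER_DOMAIN = 2   # keep per-site load low
AUTOTHROTTLE_ENABLED = True          # adapt the delay to server latency
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
COOKIES_ENABLED = False              # avoid cookie-based crawler tracking
USER_AGENT = "Mozilla/5.0 (compatible; price-crawler)"  # identifiable UA
```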
Deployment Guide and Validation Document
Make sure to provide two separate documents for validation.
A README.md that covers:
- Deployment - how to build and test your submission.
- Configuration - document the configuration values used by the submission.
- Dependency Installation - a clear, up-to-date, step-by-step guide for installing dependencies.
Validation of each requirement should be covered in this document, which will make it easier for reviewers to map the requirements to your submission.
Important Notes
The URLs provided are Polish websites; however, English-based keywords will work, and you can easily fetch the required data from the websites.
Final Submission Guidelines
- Submit your source code as a zip file