Challenge Overview
Challenge Details
We ran a few challenges to scrape data from several websites; each scraper stores its results in a csv file. The goals of this challenge are to:
- Fetch the product images and save them in Azure Blob Storage
- Update the output csv with the Public URL for the stored image in Azure Blob Storage
- Add an Error Reporting csv file
Project Background
The customer is a global healthcare company that researches, develops and manufactures consumer healthcare products. The purpose of this project is to build a crawler that helps our client find the price of specific products.
Technology Stack
- Python 3.7
- Scrapy
- Azure Blob Storage
Individual Requirements
Challenge Input
The data collection code will be shared in the challenge forum.
Scope
- Update the data scraper to fetch the product images from the websites and store them in Azure Blob Storage (see the upload sketch after this list).
- Make sure that the storage account name, account key and connection string are configurable.
- Use the Azure SDK for Python.
- Update the generated output csv to also include the stored image's public URL; the image must be publicly accessible.
- Error Reporting (see the error-report sketch after this list)
- While scraping, some of the websites throw errors, timeouts, etc.
- A new csv file, error-report-*.csv, should be created containing the URL, Error Type and Error Details for each failure.
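The exact wiring depends on the data collection code shared in the forum, but a minimal sketch of the upload step could look like the following. The environment-variable names, container name, and helper function are assumptions for illustration; credentials are read from the environment so the account name, key and connection string stay configurable.

```python
import os

from azure.storage.blob import BlobServiceClient, ContentSettings

# Assumed environment variables — keep credentials out of the code so they
# remain configurable, as required by the scope above.
AZURE_CONNECTION_STRING = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
AZURE_CONTAINER = os.environ.get("AZURE_STORAGE_CONTAINER", "product-images")


def upload_product_image(image_bytes: bytes, blob_name: str) -> str:
    """Upload one product image and return its public URL.

    Assumes the container already exists and allows anonymous read
    access to blobs (see Important Notes).
    """
    service = BlobServiceClient.from_connection_string(AZURE_CONNECTION_STRING)
    blob_client = service.get_blob_client(container=AZURE_CONTAINER, blob=blob_name)
    blob_client.upload_blob(
        image_bytes,
        overwrite=True,
        content_settings=ContentSettings(content_type="image/jpeg"),
    )
    # blob_client.url is the value that goes into the output csv column
    # for the image's public URL.
    return blob_client.url
```

In a Scrapy project this helper would typically be called from an item pipeline after the image response is downloaded, with the returned URL added to the item before the csv feed export writes it out.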
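For the error report, one possible approach (again a sketch; the file-name pattern and function names are assumptions) is to route request failures through a Scrapy errback that appends rows to an error-report-*.csv file:

```python
import csv
import os
from datetime import datetime

# error-report-*.csv, timestamped per run (pattern is an assumption).
ERROR_REPORT_PATH = f"error-report-{datetime.now():%Y%m%d-%H%M%S}.csv"


def log_scrape_error(url: str, error_type: str, error_details: str) -> None:
    """Append one row (URL, Error Type, Error Details) to the error report."""
    write_header = not os.path.exists(ERROR_REPORT_PATH)
    with open(ERROR_REPORT_PATH, "a", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        if write_header:
            writer.writerow(["URL", "Error Type", "Error Details"])
        writer.writerow([url, error_type, error_details])


# In the spider, timeouts and HTTP errors can be captured with an errback:
#   yield scrapy.Request(url, callback=self.parse, errback=self.on_error)
#
#   def on_error(self, failure):
#       log_scrape_error(
#           failure.request.url,
#           failure.type.__name__,
#           failure.getErrorMessage(),
#       )
```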
Deployment Guide and Validation Document
You are not required to submit two separate documents for validation. Provide a README.md that covers:
- Deployment - that covers how to build and test your submission.
- Configuration - make sure to document the configuration values that are used by the submission.
- Dependency Installation - should clearly describe the step-by-step guide for installing dependencies and should be up to date.
Validation of each requirement can be described in this document, which makes it easier for reviewers to map the requirements to your submission.
Important Notes
- You can create a free Azure account for the purposes of this challenge.
- Make sure to document in the README file the process for setting up Azure Blob Storage and making it publicly accessible.
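If you prefer scripting the setup rather than clicking through the Azure portal, a one-time sketch like the one below (the container name and environment variable are assumptions) can create the container with anonymous read access to blobs; note that the storage account itself must also allow public blob access:

```python
import os

from azure.storage.blob import BlobServiceClient

# One-time setup: create the container with anonymous ("blob") read access
# so that uploaded image URLs are publicly reachable without a SAS token.
service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"]
)
service.create_container("product-images", public_access="blob")
```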