Challenge Overview
Challenge Objectives
- Create a Content Scraper CLI script that scrapes web content from predefined pages, with runtime performance as the priority.
Technology Stack
Possible tech stack (feel free to pick any other stack if it helps meet the objectives):
- Selenium
- Puppeteer
- PhantomJS
- BeautifulSoup
Code access
This is a new codebase; you create the project structure on your own.
Individual requirements
The main goal of this challenge is to create a CLI script that takes no input parameters and performs 3 steps:
- Navigate to the product listing page (URL available in the forums), load all the products, and for each one extract the product name, product ID, and product URL. Log the time taken for the extraction.
- Navigate to each product URL and again extract the same attributes (product name, ID, price). Log the total time taken for the extraction. If a URL redirects to another page, skip extraction for that product and continue with the next product.
- Export the data from steps 1 and 2 to two Excel sheets (“Product listing” and “Products”). Both sheets have the same columns (product name, ID, price, URL).
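The timing and redirect-skipping rules in the steps above can be sketched as follows. This is only an illustration under stated assumptions, not the existing script: `log_duration`, `was_redirected`, `scrape_products`, and the `fetch` callable are hypothetical names, and the rule that a trailing slash or host case does not count as a redirect is an assumption the challenge does not specify.

```python
import time
from contextlib import contextmanager
from urllib.parse import urlsplit


@contextmanager
def log_duration(label, log=print):
    """Log the wall-clock time spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        log(f"{label}: {time.perf_counter() - start:.2f}s")


def was_redirected(requested_url, final_url):
    """Return True if the response ended up on a different page.

    Assumption: differences in scheme/host case or a trailing slash
    do not count as a redirect to another page.
    """
    def normalize(url):
        parts = urlsplit(url)
        return (parts.scheme.lower(), parts.netloc.lower(),
                parts.path.rstrip("/"), parts.query)

    return normalize(requested_url) != normalize(final_url)


def scrape_products(listing, fetch):
    """Step 2: visit each product URL, skipping products that redirect.

    `listing` is a list of dicts with a "url" key (the output of step 1).
    `fetch(url)` is a hypothetical callable returning
    (final_url, product_dict) for one product page.
    """
    products = []
    for item in listing:
        final_url, product = fetch(item["url"])
        if was_redirected(item["url"], final_url):
            continue  # skip extraction and move on to the next product
        products.append(product)
    return products
```

A caller would wrap each step, for example `with log_duration("Product pages"): products = scrape_products(listing, fetch)`, so that the logged time covers the whole extraction rather than a single page.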
We already have a script that uses Selenium and Jsoup to perform the above actions; its total runtime is about 5 minutes for ~200 products. The goal of this challenge is to create a script that does the same in less time. The choice of technology stack, OS platform, and programming language is up to you, as long as you provide a CLI script and use only open source libraries and APIs. You can try scraping in parallel, caching assets, and executing (or not executing) JavaScript.
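One way to approach the parallel scraping suggested above is a thread pool, since fetching pages is mostly waiting on the network. The sketch below uses only the Python standard library; `scrape_in_parallel` and `fetch` are hypothetical names, and `max_workers=16` is an arbitrary starting value that would need tuning against the target site.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def scrape_in_parallel(urls, fetch, max_workers=16):
    """Fetch every URL concurrently and return results in input order.

    `fetch(url)` is a hypothetical callable that downloads and parses
    one product page.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `urls`
        return list(pool.map(fetch, urls))


if __name__ == "__main__":
    def fake_fetch(url):
        time.sleep(0.1)  # stand-in for network latency
        return {"url": url}

    urls = [f"https://shop.example/p/{i}" for i in range(20)]
    start = time.perf_counter()
    results = scrape_in_parallel(urls, fake_fetch)
    print(f"{len(results)} pages in {time.perf_counter() - start:.2f}s")
```

Threads fit this case only if each fetch is I/O-bound, for example a plain HTTP request followed by HTML parsing. If pages must be rendered in a headless browser, separate browser instances or processes may be a better fit.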
What To Submit
- All source code
- Deployment guide
- Verification guide along with total runtime info achieved in your testing
Scoring Methodology
Scoring will be based on code review and the achieved performance.