The challenge is finished.

Challenge Overview

Challenge Objectives

  • Create a content scraper CLI script that scrapes web content from predefined pages, with performance (runtime) as the priority.

 

Technology Stack

 Possible tech stack (feel free to pick any other stack if it helps meet the objectives):

  • Selenium

  • Puppeteer

  • PhantomJS

  • BeautifulSoup

 

Code access

 

We’re starting a new codebase, so you create the project structure on your own.

 

Individual requirements

 

The main goal for this challenge is to create a CLI script that takes no input parameters and performs three steps:

  1. Navigate to the products listing page (URL available in the forums), load all the products, and for each one extract the product name, product id, and product URL. Log the time taken for the extraction.

  2. Navigate to each product URL and again extract the product attributes (product name, id, price). Log the total time taken for the extraction. If a URL redirects to another page, skip extraction for that product and continue with the next product.

  3. Export the data from steps 1 and 2 to two Excel sheets (“Product listing” and “Products”). Both sheets have the same columns (product name, id, price, URL).
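
The timing and redirect-skip behavior described in steps 1 and 2 can be sketched in Python using only the standard library. This is a minimal sketch under stated assumptions, not a reference implementation: `fetch_no_redirect`, `scrape_products`, and the caller-supplied `parse` callable are hypothetical names, and the real page structure and URLs are not given in this description.

```python
import time
import urllib.error
import urllib.request
from typing import Any, Callable, Iterable, List, Optional, Tuple


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Refuse to follow redirects so a 3xx response surfaces as an HTTPError."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


# Building the opener does not touch the network.
_opener = urllib.request.build_opener(_NoRedirect)


def fetch_no_redirect(url: str, timeout: float = 10.0) -> Optional[str]:
    """Return the page body, or None when the server answers with a redirect."""
    try:
        with _opener.open(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        if 300 <= err.code < 400:
            return None  # redirected: the caller skips this product
        raise


def scrape_products(
    urls: Iterable[str],
    parse: Callable[[str, str], Any],
    fetch: Callable[[str], Optional[str]] = fetch_no_redirect,
) -> Tuple[List[Any], float]:
    """Fetch and parse each URL, skipping redirects, and time the whole run.

    `parse` is supplied by the caller (hypothetical signature: url, html -> row).
    """
    start = time.perf_counter()
    rows = []
    for url in urls:
        html = fetch(url)
        if html is None:
            continue  # skip redirected products and move on to the next one
        rows.append(parse(url, html))
    elapsed = time.perf_counter() - start
    print(f"Extracted {len(rows)} products in {elapsed:.2f}s")
    return rows, elapsed
```

Note that this only detects HTTP-level (3xx) redirects; a redirect triggered by JavaScript or a meta refresh tag would need a different check.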

 

We already have a script that uses Selenium and Jsoup to perform the above actions; its total runtime is about 5 minutes for ~200 products. The goal of this challenge is to create a script that does the same in less time. The technology stack, including OS platform and programming language, is up to you, as long as you provide a CLI script and use only open source libraries/APIs. You can try scraping in parallel, caching assets, or executing JavaScript (or not executing it).
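
As one illustration of the parallel-scraping idea, product pages could be fetched concurrently with a thread pool, since the work is dominated by network waits. The sketch below is an assumption about one possible approach, not the existing script's design; `scrape_parallel` and its `fetch` parameter are hypothetical names, and the worker count of 16 is an arbitrary starting point to tune.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def scrape_parallel(
    urls: Iterable[str],
    fetch: Callable[[str], T],
    max_workers: int = 16,
) -> Tuple[List[T], float]:
    """Run `fetch` over all URLs concurrently and time the whole batch.

    Results come back in the same order as the input URLs.
    """
    url_list = list(urls)
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order; an exception raised inside
        # `fetch` is re-raised here when its result is consumed.
        results = list(pool.map(fetch, url_list))
    elapsed = time.perf_counter() - start
    return results, elapsed
```

Here `fetch` would be whatever function downloads and parses a single product page. Threads suit I/O-bound work like this; a very high worker count may trigger rate limiting on the target site.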

 

What To Submit

 

  • All source code

  • Deployment guide

  • Verification guide, including the total runtime achieved in your testing

 

Scoring Methodology

Scoring will be based on code review and the achieved performance.


 

Final Submission Guidelines

See above

ELIGIBLE EVENTS:

2020 Topcoder(R) Open

Review style

Final Review

Community Review Board

Approval

User Sign-Off

ID: 30123913