Topcoder Challenge | Topcoder Community

Challenge Overview

Challenge Objectives

Create a content scraper CLI script that scrapes web content from predefined pages, with performance (runtime) as the priority.

Technology Stack

Possible tech stack (feel free to pick any other stack if it helps meet the objectives):

Selenium
Puppeteer
PhantomJS
BeautifulSoup

Code access

We’re starting a new codebase, so you create the project structure on your own.

Individual requirements

The main goal for this challenge is to create a CLI script that performs 3 steps (no input parameters):

1. Navigate to the product listing page (URL available in the forums), load all the products, and for each one extract the product name, product id, and product URL. Log the time taken for the extraction.
2. Navigate to each product URL and again extract the same attributes (product name, id, price). Log the total time taken for the extraction. If a URL redirects to another page, skip extraction for that product and continue with the next product.
3. Export the data from steps 1 and 2 to two Excel sheets (“Product listing” and “Products”). Both will have the same columns (product name, id, price, URL).
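
As a rough illustration of the second step (timing the extraction and skipping redirected product URLs), here is a minimal Python sketch using only the standard library. The names `fetch_unless_redirected`, `scrape_products`, and `parse` are hypothetical, and comparing the final URL with the requested one is only one naive way to detect a redirect; the real pages may need different handling (for example URL normalization).

```python
import time
import urllib.request
from typing import Callable, Iterable, List, Optional, Tuple


def fetch_unless_redirected(url: str, timeout: float = 10.0) -> Optional[str]:
    """Return the page HTML, or None if the request ended on a different URL.

    Assumption: a redirect is detected by comparing the final URL with the
    requested one. urllib follows redirects automatically.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        if response.url != url:
            return None
        return response.read().decode("utf-8", errors="replace")


def scrape_products(
    urls: Iterable[str],
    fetch: Callable[[str], Optional[str]],
    parse: Callable[[str, str], dict],
) -> Tuple[List[dict], float]:
    """Fetch and parse each product page, skipping redirected ones.

    `parse` is a hypothetical callback that turns (url, html) into a row
    such as {"name": ..., "id": ..., "price": ..., "url": ...}.
    """
    start = time.perf_counter()
    rows = []
    for url in urls:
        html = fetch(url)
        if html is None:
            # Redirected: skip this product and continue with the next one.
            continue
        rows.append(parse(url, html))
    elapsed = time.perf_counter() - start
    print(f"Extracted {len(rows)} products in {elapsed:.2f}s")
    return rows, elapsed
```

Passing `fetch` and `parse` in as parameters keeps the timing and skip logic independent of whichever HTTP client or HTML parser is chosen, so it can be exercised with canned pages before touching the real site.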

We already have a script that uses Selenium and Jsoup to perform the above actions, and its total runtime is about 5 minutes for ~200 products. The goal of this challenge is to create a script that does the same in less time. The technology stack is up to you, including the OS platform and programming language, as long as you provide a CLI script and use only open source libraries/APIs. You can try scraping in parallel, caching assets, and executing JavaScript (or skipping it).
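
One of the options mentioned above, scraping in parallel, could look roughly like the following Python sketch built on the standard library's `ThreadPoolExecutor`. The function name `scrape_in_parallel`, the `scrape_one` callback, and the default of 16 workers are assumptions for illustration, not part of the challenge; a suitable worker count depends on the target site and the network.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence, Tuple


def scrape_in_parallel(
    urls: Sequence[str],
    scrape_one: Callable[[str], Optional[dict]],
    max_workers: int = 16,
) -> Tuple[List[dict], float]:
    """Run `scrape_one` over all URLs using a pool of worker threads.

    `scrape_one` is a hypothetical callback that returns a product row,
    or None when the product should be skipped (e.g. the URL redirected).
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        results = list(pool.map(scrape_one, urls))
    rows = [row for row in results if row is not None]
    elapsed = time.perf_counter() - start
    return rows, elapsed


if __name__ == "__main__":
    def fake_scrape(url: str) -> Optional[dict]:
        # Stand-in for a real network request; sleeps to simulate latency.
        time.sleep(0.05)
        return None if url.endswith("/3") else {"url": url}

    demo_urls = [f"https://example.com/product/{i}" for i in range(8)]
    rows, elapsed = scrape_in_parallel(demo_urls, fake_scrape, max_workers=8)
    print(f"{len(rows)} rows in {elapsed:.2f}s")
```

Threads suit this workload because most of the time is spent waiting on network responses rather than computing, so requests can overlap even though each one is written as ordinary blocking code.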

What To Submit

All source code
Deployment guide
Verification guide, including the total runtime achieved in your testing

Scoring Methodology

Scoring will be based on code review and the achieved performance.

Final Submission Guidelines

See above

Content Scraper Tool Challenge

ID: 30123913