The challenge is finished.

Challenge Overview

The HP Data Audit group is looking to develop a toolset to validate and analyze HP products available on their web site as well as on other web sites. A significant part of developing this toolset is creating a web crawler that pulls raw web pages into a file system directory, where they can be analyzed and data extraction can be performed. In this challenge, we're requesting that you write or configure a web crawler to crawl the following web site:

http://www.hp.com/country/us/en/hho/welcome.html

The purpose of the crawl is to extract the product pages such as this one:

http://store.hp.com/us/en/pdp/Laptops/hp-spectre-x360---13-4002dx-%28energy-star%29

There are a number of open source web crawlers which can be used: Nutch 2.x, crawler4j, Scrapy, and many others. You are not expected to write a crawler from scratch. Rather, this challenge is about setting up a process which can spider a site. Caution: you should collect no more than 1 page per second from the HP site. Set a reasonable delay between HTTP requests so that collectively we don't put an unreasonable burden on the HP product site.
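As an illustration of the 1 page/second caution, here is a minimal Python sketch of a request throttle. It is not tied to any of the crawlers named above (each has its own delay setting); the `Throttle` class and its parameters are hypothetical names chosen for this example. The clock and sleep functions are injectable so the behavior can be checked without actually waiting.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests (hypothetical helper)."""

    def __init__(self, delay_seconds=1.0, clock=time.monotonic, sleep=time.sleep):
        self.delay_seconds = delay_seconds  # 1.0 matches the 1 page/second limit
        self._clock = clock
        self._sleep = sleep
        self._last_request = None  # time of the previous request, if any

    def wait(self):
        """Block until at least delay_seconds have passed since the last call."""
        now = self._clock()
        if self._last_request is not None:
            remaining = self.delay_seconds - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_request = now
```

A crawl loop would call `throttle.wait()` immediately before each HTTP request, so the delay applies no matter how long the previous page took to download.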

The crawling process should contain the following features:

1.  Ability to configure the HTTP request delay as described above.
2.  Ability to configure a limit on the total number of requests issued from the application.
3.  Ability to target the web crawl at a specific web site.
4.  Ability to separate the targeted web pages (in this case the product pages) into their own file system directory. This will be helpful when it is time to perform extraction activities. It's OK if all of the pages in the site need to be downloaded for spidering purposes, but we want to be able to designate certain pages for data extraction and have them loaded into their own directory.
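The four features above could be captured in a small configuration object plus a few helper functions. The following Python sketch is one possible shape, not a required design. All names (`CrawlConfig`, `in_scope`, `is_product_page`, `target_path`, `RequestBudget`) are hypothetical. The host list and the `/pdp/` path marker are assumptions inferred from the two example URLs in this overview, and the `products`/`other` directory names are arbitrary.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path
from urllib.parse import urlparse


@dataclass
class CrawlConfig:
    request_delay_seconds: float = 1.0                      # feature 1
    max_requests: int = 1000                                # feature 2
    allowed_hosts: tuple = ("www.hp.com", "store.hp.com")   # feature 3 (assumed hosts)
    product_path_marker: str = "/pdp/"                      # feature 4 (assumed marker)
    output_root: Path = Path("data")


def in_scope(url, config):
    """Return True if the URL belongs to one of the targeted hosts."""
    host = urlparse(url).hostname or ""
    return host in config.allowed_hosts


def is_product_page(url, config):
    """Return True if the URL path contains the product-page marker."""
    return config.product_path_marker in urlparse(url).path


def target_path(url, config):
    """Pick the directory and a stable file name for a downloaded page."""
    subdir = "products" if is_product_page(url, config) else "other"
    name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
    return config.output_root / subdir / name


class RequestBudget:
    """Count requests and refuse new ones once the limit is reached."""

    def __init__(self, limit):
        self.limit = limit
        self.used = 0

    def try_consume(self):
        if self.used >= self.limit:
            return False
        self.used += 1
        return True
```

In a crawl loop, each discovered link would be checked with `in_scope`, charged against the `RequestBudget`, fetched, and written to `target_path(url, config)`. Hashing the URL for the file name avoids problems with characters such as `%28` in product URLs, at the cost of less readable file names.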
 

 



Final Submission Guidelines

  • Upload all your configuration, source code and sample data (described below) as a zip file.
  • Provide documentation for your solution. Your documentation should provide precise deployment instructions, system requirements and dependencies, and instructions on how to run the application.
  • You should create a data directory in your submission.zip file. In this directory, please provide 100 product HTML pages from the site.
  • A screen-share video of your application is required, describing its features, basic design and the APIs that your application uses. Video entries are critical in the evaluation of your submission.

Review style

Final Review: Community Review Board

Approval: User Sign-Off

ID: 30050691