
Challenge Overview


The HP Data Audit group is looking to develop a toolset to validate and analyze HP products available on their web site as well as other web sites. A significant part of developing this toolset is creating a web crawler to pull raw web pages into a file system directory where they can be analyzed and data extraction can be performed. In a previous challenge, we developed a web crawler to extract html files from the following site:

http://www.hp.com/country/us/en/hho/welcome.html

We've also recently added a command line interface to the toolset, as well as the ability to designate various operational parameters by configuration file. This was handled in our HP Product Inventory Web Crawler - Additional Features - Part 2 challenge, which recently concluded.

The purpose of the crawl is to extract the html output from product pages such as this one:

http://store.hp.com/us/en/pdp/Laptops/hp-spectre-x360---13-4002dx-%28energy-star%29

One thing we're finding with our current crawler output, however, is that it is missing some critical data which we need to collect. The following data elements aren't being captured by simply pulling the raw html from product pages such as the one listed above:

Ratings info -- the number of rating "stars" associated with a product. If no review has been completed, we should capture the "Write the first review" text in the html output.
The number of reviews completed.
The retail prices of the products (these currency values typically have a line through them - strikethrough text).
Availability info: "Coming Soon" and "Out of Stock" notices are typically rendered dynamically.
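A quick way to confirm which of these fields are absent from a raw-html crawl is to scan a saved page for a marker of each one. The sketch below is only illustrative; the regular expressions are assumptions about how the rendered markup might look, not the actual HP page structure.

```python
import re

# Field name -> pattern that should match somewhere in a fully rendered
# product page. These patterns are placeholders, not the real HP markup.
DYNAMIC_FIELD_PATTERNS = {
    "ratings": re.compile(r'(\d+(\.\d+)?\s*stars?|Write the first review)', re.I),
    "review_count": re.compile(r'\d+\s+reviews?', re.I),
    "retail_price": re.compile(r'<(s|strike|del)[^>]*>\s*\$[\d,.]+', re.I),
    "availability": re.compile(r'(Coming Soon|Out of Stock)', re.I),
}

def missing_dynamic_fields(html):
    """Return the names of required dynamic fields not found in the page html."""
    return [name for name, pattern in DYNAMIC_FIELD_PATTERNS.items()
            if not pattern.search(html)]
```

Running this against the raw html of a product page versus a browser-rendered copy of the same page would show exactly which fields the current crawler drops.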

The reason these elements are missing is simple: they are being generated dynamically through JavaScript. We need the ability to capture these dynamically generated elements in the product html pages we're outputting. Some suggested (but not prescriptive) technologies for accomplishing this task:

Selenium
HTMLUnit
Crawljax
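As one possible (not prescriptive) approach, a Selenium-based fetch could load each product page in a real browser, wait for the dynamic elements to appear, and only then capture the page source. The CSS selectors and browser choice below are assumptions for illustration; the real selectors would come from inspecting the product pages.

```python
# Placeholder CSS selectors for the dynamically rendered elements; the
# real class names would come from inspecting the HP product pages.
DYNAMIC_SELECTORS = [
    ".product-rating",       # star rating / "Write the first review"
    ".review-count",         # number of reviews completed
    ".price-strikethrough",  # retail (crossed-out) price
    ".availability",         # "Coming Soon" / "Out of Stock"
]

def fetch_rendered_html(url, timeout=15):
    """Load a product page in a browser and return the post-JavaScript DOM."""
    # Imported lazily so this module still loads where Selenium isn't installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()  # Firefox or a headless driver also works
    try:
        driver.get(url)
        wait = WebDriverWait(driver, timeout)
        for selector in DYNAMIC_SELECTORS:
            try:
                wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, selector)))
            except Exception:
                pass  # some fields are legitimately absent (e.g. an unreviewed product)
        # Unlike a raw HTTP download, page_source here includes the JS-built DOM.
        return driver.page_source
    finally:
        driver.quit()
```

The returned page source can then be written to the data directory in place of the raw download, so the missing elements appear in the crawler's html output.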


You'll be using the code from the previous challenge as a starting point. The crawler.zip file can be found attached to the forums for this challenge. The zip file contains documentation and instructions for deploying and running the application.



Final Submission Guidelines

  • Upload all your configuration, source code, and sample data (described below) as a zip.
  • Your solution should be based on the previously developed code and build processes. The code from the previously developed challenges can be found in the Code Document forums attached to this challenge.
  • Provide documentation for your solution. You don't need to provide complete documentation for the application as you'll only be making changes, but please indicate what changes you've made in the code.
  • Your submission.zip file should include the application's data directory. Please run the application and populate the data directory with 100 product html pages from the site. The new source.csv file should be in this directory as well.

ELIGIBLE EVENTS:

2016 TopCoder(R) Open

Review style

Final Review

Community Review Board

Approval

User Sign-Off

ID: 30051385