Challenge Overview

The HP Quality group is looking to develop a toolset to validate and analyze HP Products available on their web site as well as other web sites.  Previously, we developed a web crawler to extract html files from the following site:

http://www.hp.com/country/us/en/hho/welcome.html

The next step is develop our data extraction processes and load data into our target database server:  Vertica.  To start with, we're going to target four product types for data extraction:  Laptops, Tablets, Desktops, and Printers.    Here are the specific requirements:

1.  Create a SQL script to create 5 tables in Vertica:  Products, Laptops, Tables, Desktops, and Printers.  The source.csv file can be used to populate URL, SourceFile, and AuditTimestamp.  That file can be found attached to the forums for this challenge in the data.zip.  The field definitions for the other tables can be found here:

https://docs.google.com/spreadsheets/d/1TuZzcP7yj-6RWQBKs74J-8T5cW3fG-vCsADDaVK4VoA/edit?usp=sharing

2.  The source.csv file can be used to populate the majority of the data for the Products table:  URL, SourceFile, and AuditTimestamp fields.

3. The rest of the data will need to be extracted from attached 3070 html files found in the data.zip file using XPath or Regular Expressions or other parsing technology.   You should create the parsing/data extraction app in Java as a command-line application.  The logic to do the parsing may ultimately be included in a web application so a GUI for this application isn't required.  If a product page isn't for a Laptop, Table, Desktop, or Printer, it can be ignored.  

4.  Each html page, provided it relates to Laptop, Table, Desktop, or Printer will represent one record in the representive tables.  You should update the data elements in Vertica with the data parsed from the html.  You'll also need to update the product number field in the products table as well.  A few of the values in the schema spreadsheet defined above are derived values rather than literal ones.  For example, Price on the site might be "$399.99".  In our data model, we going to save 399.99 as a numeric value along with the "USD"  for the currency values.  The derived values in the schema doc are highlighted in blue.  Parse Date should be populated with current system time when the data is loaded to the database -- this is distinct from the Extraction Time in the source.csv file which designates when a page was crawled.

5. At the end of the extraction process the application should present a short summary of the number of files processed along with the number of Laptop, Table, Desktop, or Printers records placed into each table.

6.  The majority of the data being scraped is from the "Spec" section of each application page.  The application should also attempt to identify fields from the Spec section/tab of each product page that may not be included in the current schema.  For example, there are a variety of printer types and different data elements associated to these   The application should generate a short report -- a csv file is fine -- comparing the current fields for each our product types with the fields that are actually available in the pages.    The focus here being on additional fields that might be found in the Spec section of the product pages.

7.  You can download a community edition of Vertica directly from HP:  http://www.vertica.com/.   You simply sign in for a free developer account.   However, a direct Vertica installation requires a Unix/Linux server.  The more straightforward way to standup Vertica is to use VMWare.  VMWare also has free trials available.  A server image can be found at my.vertica.com.  But Topcoder is providing a recent disk image file for Vertica at the following link.  This is a large download (~2 GB).

https://drive.google.com/file/d/0ByjxTGykXQjAWkkwTWUzcXJucjQ/view?usp=sharing

JDBC Jar files for Vertica can be found here:

http://www.vertica.com/resources/vertica-client-drivers/

 



Final Submission Guidelines

  • Upload all your data definition sql and java source code in a zip file.
  • Provide documentation for your solution. Your documentation should provide precise deployment instructions, system requirements and dependencies, and instructions on how to run the application.
  • A screen-share video of your application is required describing its features, basic design and the API’s that your application uses.  

Review style

Final Review

Community Review Board

Approval

User Sign-Off

ID: 30050923