Challenge Overview
The HP Data Audit group is looking to develop a toolset to validate and analyze HP Products available on their web site as well as other web sites. A significant part of developing this toolset is creating a web crawler to pull raw web pages into a file system directory where they can be analyzed and data extraction can be performed. In a previous challenge, we developed a web crawler to extract HTML files from the following site:
http://www.hp.com/country/us/en/hho/welcome.html
The purpose of the crawl is to extract the product pages such as this one:
http://store.hp.com/us/en/pdp/Laptops/hp-spectre-x360---13-4002dx-%28energy-star%29
In this code challenge, we're adding a couple of simple but important features to the crawler. You'll be using the code from the previous challenge as a starting point. The crawler.zip file can be found attached to the forums for this challenge. The zip file contains documentation with instructions for deploying and running the application. Here are the additional requirements:
1. Please add a command line interface to the application. This app is going to be run by a Unix script/cron job each night to pick up the latest set of product pages from a site. The application has a number of input parameters. On the command line, let's create parameters to designate the storage directory for intermediate data and the directory to store target web pages. The rest of the parameters should be in a configuration file.
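For illustration, here is a minimal sketch of parsing the two directory parameters; the flag names --intermediate-dir and --target-dir are hypothetical, not prescribed by this challenge:

    // Minimal sketch: read the two directory parameters from the command line.
    // The flag names below are illustrative only.
    public class CrawlerMain {
        public static void main(String[] args) {
            String intermediateDir = null;
            String targetDir = null;
            for (int i = 0; i < args.length - 1; i++) {
                if ("--intermediate-dir".equals(args[i])) {
                    intermediateDir = args[++i];
                } else if ("--target-dir".equals(args[i])) {
                    targetDir = args[++i];
                }
            }
            if (intermediateDir == null || targetDir == null) {
                System.err.println("Usage: CrawlerMain --intermediate-dir <path> --target-dir <path>");
                System.exit(1);
            }
            // Remaining parameters are loaded from the configuration file (see requirement 2).
        }
    }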
2. Create a configuration file for the following application input parameters: Target Web Site (domain name), Seeds (multiple URLs), Total Number of Requests (int), Politeness Delay (int), Number of Crawlers (int), Target Pages Keywords (there could be multiple HTML snippets here), and Max Depth of Crawling (int). We should also add HTTP proxy parameters to the parameter list to allow the application to work behind a corporate proxy server. The parameters for this are: Web Proxy Server (String for URL), Web Proxy Server Port (int), Proxy Server Password Required (boolean), Proxy Server Username (String), Proxy Server Password (String).
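One possible shape for this file, assuming a standard Java .properties format; all key names and values below are illustrative (including the indexed seeds.N / targetPagesKeywords.N convention for multi-valued parameters), not mandated by this challenge:

    # Sample crawler configuration (illustrative key names and values)
    targetWebSite=store.hp.com
    seeds.1=http://www.hp.com/country/us/en/hho/welcome.html
    seeds.2=http://store.hp.com/us/en/
    totalNumberOfRequests=10000
    politenessDelay=200
    numberOfCrawlers=4
    targetPagesKeywords.1=<div class="product-detail">
    targetPagesKeywords.2=<span itemprop="price">
    maxDepthOfCrawling=5
    # HTTP proxy settings for running behind a corporate proxy
    proxyServer=proxy.example.com
    proxyServerPort=8080
    proxyPasswordRequired=true
    proxyUsername=crawler
    proxyPassword=changeme

A file like this can be loaded with java.util.Properties, which keeps the command line limited to the two directory parameters from requirement 1.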
3. Instead of writing the intermediate data directly to the designated storage directory, we need to create a second level of directories based on the current date. Here is an example:
Folder to Store Intermediate Data (Current Setup): /Users/Ward/Code/ProductCrawler/crawled_pages
Folder to Store Intermediate Data (New Setup): /Users/Ward/Code/ProductCrawler/crawled_pages/20150901 (yyyyMMdd)
The command-line parameter in this case would be "/Users/Ward/Code/ProductCrawler/crawled_pages". The crawler app will automatically append a folder named with today's date to the path provided by the command-line parameter. Tomorrow when we run the app, it will create another folder named with that day's date: /Users/Ward/Code/ProductCrawler/crawled_pages/20150902 (a sketch of this date-folder logic follows requirement 4).
4. The same applies to the target web page directory. We need to create a second level of directories based on the current date. Here is an example:
Folder to Store Targeted Web Pages (Current Setup): /Users/Ward/Code/ProductCrawler/data
Folder to Store Targeted Web Pages (New Setup): /Users/Ward/Code/ProductCrawler/data/20150901 (yyyyMMdd)
The command-line parameter in this case would be "/Users/Ward/Code/ProductCrawler/data". The crawler app will automatically append a folder named with today's date to the path provided by the command-line parameter. The web pages for the current run will be written into the newly created directory, as shown in the sketch below.
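A minimal sketch of the date-folder logic serving both requirement 3 and requirement 4, assuming Java 8's java.time API; the class and method names are hypothetical:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.time.LocalDate;
    import java.time.format.DateTimeFormatter;

    public final class DatedDirectories {
        private static final DateTimeFormatter YYYYMMDD = DateTimeFormatter.ofPattern("yyyyMMdd");

        // Appends today's date (yyyyMMdd) to the base path and creates the directory if needed.
        public static Path resolveDatedDirectory(String baseDir) throws IOException {
            Path dated = Paths.get(baseDir, LocalDate.now().format(YYYYMMDD));
            Files.createDirectories(dated); // no-op when the directory already exists
            return dated;
        }
    }

Called with "/Users/Ward/Code/ProductCrawler/crawled_pages" on September 1, 2015, this returns /Users/Ward/Code/ProductCrawler/crawled_pages/20150901; the same helper applied to the target-page base directory yields the dated data folder.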
5. Implement proxy server functionality in the app to allow HTTP/HTTPS connectivity from behind a proxy server.
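The parameter names above (Politeness Delay, Number of Crawlers, Max Depth of Crawling) suggest the crawler is built on crawler4j; assuming that is the case, a sketch of wiring the proxy parameters from the configuration file into crawler4j's CrawlConfig might look like this:

    import java.util.Properties;
    import edu.uci.ics.crawler4j.crawler.CrawlConfig;

    public final class ProxySetup {
        // Sketch: apply the proxy parameters loaded from the configuration file.
        // Property key names match the illustrative sample under requirement 2.
        public static CrawlConfig applyProxySettings(CrawlConfig config, Properties props) {
            config.setProxyHost(props.getProperty("proxyServer"));
            config.setProxyPort(Integer.parseInt(props.getProperty("proxyServerPort")));
            if (Boolean.parseBoolean(props.getProperty("proxyPasswordRequired"))) {
                config.setProxyUsername(props.getProperty("proxyUsername"));
                config.setProxyPassword(props.getProperty("proxyPassword"));
            }
            return config;
        }
    }

crawler4j applies these settings to its underlying HTTP client, which should cover both HTTP and HTTPS requests made during the crawl.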
Final Submission Guidelines
- Upload all your configuration, source code and sample data (described below) as a zip.
- Base your solution on the previously developed code and build processes. The code from the previously developed challenges can be found in the Code Document forums attached to this challenge.
- Provide documentation for your solution. You don't need to provide complete documentation for the application, as you'll only be making changes, but please indicate what changes you've made in the code.
- Your submission.zip should include the data directory created by the application. Please run the application and populate the data directory with 100 product HTML pages from the site. The new source.csv file should be in this directory as well.