
Challenge Overview

CHALLENGE OBJECTIVES
  • Implement a configuration-based strategy in the scraper to avoid scraping detection by EC sites.
 
PROJECT BACKGROUND
The purpose of this project is to implement a web scraper that extracts the purchase history list from a specified E-commerce (EC) site.
The tool will be used by our client, who owns a representative EC site in Japan.
This is more of a PoC than a formal tool.
 
WORKFLOW
Right now the tool uses the standard com.gargoylesoftware.htmlunit.WebClient to send requests. In this challenge we want to extend or encapsulate it to gain better control over the scraping process, so that it's harder for the EC sites to detect our scraper.
  
DETAILED REQUIREMENTS
Please check Design_Flow_Control.pdf which we posted in the forum to better understand what we need to implement in this challenge. In summary this is what we want:
  1. Users will configure the contents of "tactics" used by the scraper. The configuration will cover things like:
    1. Users range: the users the scraper will use to scrape EC sites
    2. User agent list
    3. Request Interval
    4. Request Interval Random (Yes / No), defaults to No
    5. Proxy Server
    6. Retry Interval
    7. Retry Trial Count
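As a rough illustration, the tactics above could be held in a simple configuration holder like the sketch below. All field names and default values here are assumptions for illustration only; the actual names should follow the base code and Design_Flow_Control.pdf.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Hypothetical holder for the "tactics" configuration.
 * Field names and defaults are illustrative, not taken from the base code.
 */
public class ScraperTactics {
    List<String> users = Arrays.asList("user1", "user2");   // 1. users range
    List<String> userAgents = Arrays.asList(                // 2. user agent list
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12)");
    long requestIntervalMillis = 2000;                      // 3. request interval
    boolean requestIntervalRandom = false;                  // 4. defaults to No
    String proxyHost = "proxy.example.com";                 // 5. proxy server
    int proxyPort = 8080;
    long retryIntervalMillis = 5000;                        // 6. retry interval
    int retryTrialCount = 3;                                // 7. retry trial count

    public static void main(String[] args) {
        ScraperTactics tactics = new ScraperTactics();
        // Requirement 1.4: interval randomization is off unless configured.
        System.out.println("random interval by default: " + tactics.requestIntervalRandom);
    }
}
```

In a Spring Boot 1.5 project this would more likely be bound from application properties (e.g. via `@ConfigurationProperties`), but the plain POJO keeps the parameter set visible.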
  2. The standard WebClient needs to be extended or encapsulated so that it will properly use these configured parameters:
    1. It will only use the configured users to scrape EC sites
    2. It will randomly pick one of the configured user agents in the request for a specific user
    3. It will wait for the configured request interval between consecutive requests, and if the request interval random flag is Yes, it will randomize the request interval a bit instead of using fixed request intervals
    4. It will use the configured Proxy Server in the standard way
    5. It will wait for the specified interval before each retry, and will only retry for the specified count
  3. Update the scraper to use this new extended or encapsulated WebClient instead of the standard one. Please note you do NOT have to extend the WebClient, you can also just encapsulate it with extra logic.
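One way the encapsulation could look is sketched below. The real implementation would delegate to com.gargoylesoftware.htmlunit.WebClient (and set the proxy there, e.g. through its options); here the physical request is abstracted as a Callable so the tactic logic (random user agent, throttling with optional jitter, bounded retries) stays visible. Class and method names are hypothetical, and the +/-50% jitter is an assumption since the spec only says to randomize "a bit".

```java
import java.util.List;
import java.util.Random;
import java.util.concurrent.Callable;

/** Hypothetical wrapper that applies the configured tactics around each request. */
public class TacticalClient {
    private final List<String> userAgents;
    private final long requestIntervalMillis;
    private final boolean randomizeInterval;
    private final long retryIntervalMillis;
    private final int retryTrialCount;
    private final Random random = new Random();
    private long lastRequestTime = 0;

    public TacticalClient(List<String> userAgents, long requestIntervalMillis,
                          boolean randomizeInterval, long retryIntervalMillis,
                          int retryTrialCount) {
        this.userAgents = userAgents;
        this.requestIntervalMillis = requestIntervalMillis;
        this.randomizeInterval = randomizeInterval;
        this.retryIntervalMillis = retryIntervalMillis;
        this.retryTrialCount = retryTrialCount;
    }

    /** Requirement 2.2: pick a random user agent for each request. */
    String nextUserAgent() {
        return userAgents.get(random.nextInt(userAgents.size()));
    }

    /** Requirement 2.3: wait between consecutive requests, with optional jitter. */
    void throttle() throws InterruptedException {
        long interval = requestIntervalMillis;
        if (randomizeInterval) {
            // Assumed jitter of +/- 50% around the configured interval.
            interval = interval / 2 + (long) (random.nextDouble() * interval);
        }
        long elapsed = System.currentTimeMillis() - lastRequestTime;
        if (elapsed < interval) {
            Thread.sleep(interval - elapsed);
        }
        lastRequestTime = System.currentTimeMillis();
    }

    /** Requirement 2.5: retry at the configured interval, up to the configured count. */
    <T> T execute(Callable<T> request) throws Exception {
        Exception lastFailure = null;
        for (int trial = 0; trial <= retryTrialCount; trial++) {
            if (trial > 0) {
                Thread.sleep(retryIntervalMillis);
            }
            throttle();
            try {
                return request.call();
            } catch (Exception e) {
                lastFailure = e;
            }
        }
        throw lastFailure;
    }

    public static void main(String[] args) throws Exception {
        TacticalClient client = new TacticalClient(
                java.util.Arrays.asList("UA-1", "UA-2"), 100, true, 50, 2);
        final int[] attempts = {0};
        String result = client.execute(() -> {
            attempts[0]++;
            if (attempts[0] < 3) throw new RuntimeException("simulated failure");
            return "ok";
        });
        System.out.println(result + " after " + attempts[0] + " attempts");
    }
}
```

The main method simulates two failures followed by a success, showing that an initial attempt plus the configured retry count bounds the total number of trials.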
  4. It's OK if these tactics cannot be applied to every single physical request, but try to apply them as much as possible.
  5. Please note only the web-scraper-server folder needs to be updated.
 
REQUIREMENTS WEIGHT
Here's how we'd weight each of the requirements:
  1. Users range: the users the scraper will use to scrape EC sites (20%)
  2. User agent list (20%)
  3. Request Interval (10%)
  4. Request Interval Random (Yes / No), defaults to No (10%)
  5. Proxy Server (20%)
  6. Retry Interval (10%)
  7. Retry Trial Count (10%)
This should give you an idea about the priority of these requirements, and should also serve as a guide for review.

OPERATING SYSTEMS & BROWSERS
Linux / Mac OS / Windows
Must work on Chrome / Firefox / Safari (Mac only)
IE support is NOT mandatory
 
DEVELOPMENT ASSETS
Base code
 
TECHNOLOGY STACK
JDK 8
Gradle 3.5
Spring Boot 1.5.7
MySQL
Vue.js

Final Submission Guidelines

FINAL DELIVERABLES
  • All original source code.
  • A detailed readme in markdown format explaining how to build, configure and deploy your code.
  • A detailed verification document in markdown format showing how to verify that your submission works properly (note: video based verification is NOT acceptable).

ELIGIBLE EVENTS:

Topcoder Open 2019

Review style

Final Review

Community Review Board

Approval

User Sign-Off

ID: 30087533