Challenge Overview

CHALLENGE OBJECTIVES
  • Spring Boot Application Development.
  • Build a data processor that converts web-crawled data into an indexed, searchable Solr core.
  • Unit tests are required.
 
PROJECT BACKGROUND
  • The customer wants to build a system to search web pages of various products from the internet. This system provides a REST-based API through which users can search with keywords and obtain information about product pages.
  • The Data Conversion Processor will convert data from the web-crawled MySQL (RDBMS) database to an indexed and searchable Solr database. This is needed for fast and efficient searches of the manufacturers' products. The searchable data will also be filtered to remove any extraneous information, such as HTML tags and old, expired data.
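As a rough illustration of the tag filtering mentioned above, here is a minimal JDK-only sketch (the class name HtmlCleaner and the regex approach are assumptions; a production solution would more likely use an HTML parser such as Jsoup):

```java
// Hypothetical helper: strips HTML tags before the page text is indexed.
class HtmlCleaner {

    // Removes <script>/<style> blocks and all remaining tags,
    // then collapses whitespace into single spaces.
    static String stripTags(String html) {
        String text = html
                .replaceAll("(?is)<script.*?</script>", " ")
                .replaceAll("(?is)<style.*?</style>", " ")
                .replaceAll("<[^>]+>", " ");
        return text.replaceAll("\\s+", " ").trim();
    }
}
```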
 
TECHNOLOGY STACK
  • Java 8
  • Spring Boot 1.5.7
  • MySQL 8.0
  • Apache Solr 8.1.1
 
GENERAL REQUIREMENTS
  • The data processor should be built in a generic and reusable way such that it would be easy to add new websites in the future.
  • Submissions must include unit tests and brief documentation for setup & usage.
 
DETAILED REQUIREMENTS
Starting the Data Converter
The Data Conversion Processor may be run on all sites contained in the web-crawled database, or only on a specific site by specifying a site parameter.
  • Run with --site=<site id>
    • This will run the Data Converter only on the records contained in the pages table for the specific website as identified by the --site parameter.
  • Run without the --site parameter.  This will process all the records in the pages table.
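The two run modes above could be distinguished with a small argument parser. Below is a minimal sketch, assuming a hypothetical SiteArgs helper (in a Spring Boot application the same information is also available via ApplicationArguments.getOptionValues("site")):

```java
import java.util.Optional;

// Hypothetical helper for the optional --site=<site id> argument.
class SiteArgs {

    // Returns the site id if --site=<id> was given; an empty Optional
    // means all records in the pages table should be processed.
    static Optional<Long> parseSite(String[] args) {
        for (String arg : args) {
            if (arg.startsWith("--site=")) {
                return Optional.of(Long.parseLong(arg.substring("--site=".length())));
            }
        }
        return Optional.empty();
    }
}
```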
 
Process Newly Acquired Data Only:
  • Only process pages from the Web Crawl (MySQL db) that were updated (last_modified_at date-time field) after the last data conversion date-time (stored in the last_processed_at field) in the pages MySQL table. In other words, if last_processed_at is either empty or earlier than last_modified_at for a page, then process that page for conversion.
    • Process a page IF last_processed_at is NULL OR less than last_modified_at for that page.
    • Pre-condition: the last_modified_at date-time will never be NULL and will be equal to the created_at date-time when the page record is first created.
  • For this version it is assumed that the Data Conversion Processor will run only after the Web Crawler process has ended, not in parallel with it. This assumption avoids potential data-concurrency issues that could arise from running both processes at the same time.
  • Process Flow:
    • The Data Conversion Processor processes each record from the pages table.
    • For each record it will extract the “manufacturer_name” from the name field of the web_sites table. 
    • If the deleted field for a record in the pages table is set to true then the corresponding record, as referenced by the same URL (pages.url ⇔ manufacturer_product.product_url)  in the Solr Index will be deleted if it exists.
    • If the last_modified_at date-time for the record in the pages table is older than a certain threshold (e.g. 365 days), then the corresponding record in the Solr Index will be deleted, if it already exists.
    • For each record in the pages table the processor will compare its last_modified_at date-time with the last_processed_at date-time and convert the page data into the Solr indexed core. If last_modified_at is newer than last_processed_at, then the corresponding record in the Solr Index will be updated. If no corresponding record exists, a new record should be created.
      • Note: If the last_processed_at field is empty then there should be no corresponding index record in Solr Core and a new record will be created.  We need to avoid creating duplicate records in the Solr Index for each unique record in the pages table. (key: product_url)
    • After a record is processed, its last_processed_at field will be updated with the current time stamp.
    • For each record the data clean-up process will also be executed. See details under “Data Clean Up Process”.
    • This process will end once all the pages records have been processed.
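The process flow above can be condensed into a single per-record decision. The sketch below is a structural assumption (the PageDecision and Action names are invented), but the conditions mirror the deleted, last_modified_at, and last_processed_at rules described in the bullets:

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of the per-record decision described above.
// Field names mirror the pages table; the Action enum is an assumption.
class PageDecision {

    enum Action { DELETE_FROM_INDEX, UPSERT_INTO_INDEX, SKIP }

    static Action decide(boolean deleted,
                         Instant lastModifiedAt,
                         Instant lastProcessedAt,   // may be null
                         Instant now,
                         Duration expiryPeriod) {   // configurable, e.g. 365 days
        // Broken link: page removed from the website.
        if (deleted) {
            return Action.DELETE_FROM_INDEX;
        }
        // Expired page: not updated within the configured period.
        if (lastModifiedAt.isBefore(now.minus(expiryPeriod))) {
            return Action.DELETE_FROM_INDEX;
        }
        // New page, or modified since the last conversion run.
        if (lastProcessedAt == null || lastProcessedAt.isBefore(lastModifiedAt)) {
            return Action.UPSERT_INTO_INDEX;
        }
        return Action.SKIP;
    }
}
```

Keeping this decision as a pure function (no database or Solr access) makes it straightforward to unit-test with JUnit, as the requirements below ask.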
  • Configuration property
    • Time-lapse period for expired pages, in days.
  • Database Tables / Solr Core
    • web_sites    - table to store Web sites information and settings for crawling
    • pages        - table to store page contents which are obtained from web sites
    • manufacturer_product - Searchable Solr Core Index for the product pages.
  • Performance:
    • Developers should consider that the database can hold millions of records. Given this scenario, developers should optimize their solution for performance and efficiency.
      • Developers are allowed to make changes to the database schema, such as adding fields or indices to make the conversion process more efficient.
  • Unit Tests:
    • Developers are expected to write Unit Tests to verify the logic implemented.  Developers can use tools such as JUnit and Mockito to help them write such tests.
 
Data Clean Up Process:
  • Each record processed for conversion may also be deleted from the Solr search index if:
  1. The page has expired: it has not been updated since a specified period, e.g. 24 months. The last_modified_at field in the pages table will be used to determine the time of last update.
  2. The link is broken: the page has already been deleted from the website. Broken links are identified by the boolean flag “deleted” in the pages table.
  • This clean-up process may be run independently of the data conversion process, for example by running the Data Conversion program with a specific flag (--only-data-cleanup).
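Detecting the clean-up-only mode could be as simple as checking for the flag. A minimal sketch (RunMode is a hypothetical name; the flag spelling follows the --only-data-cleanup example above):

```java
import java.util.Arrays;

// Hypothetical run-mode check for the --only-data-cleanup flag.
class RunMode {

    // True when only the data clean-up process should run,
    // skipping the full data conversion.
    static boolean cleanupOnly(String[] args) {
        return Arrays.asList(args).contains("--only-data-cleanup");
    }
}
```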

Mapping from Web Crawler MySQL DB to Solr Core
item mapping

ERD

DOCUMENTATION

  • Docker Images (MySQL, Solr)
  • Base Code (includes Gradle build script)
  • Test data
 

Final Submission Guidelines

FINAL DELIVERABLES
  • Full code that covers all the requirements.
    • A detailed README file including information on how to setup and run your application.

ELIGIBLE EVENTS:

Topcoder Open 2019

REVIEW STYLE:

Final Review:

Community Review Board

Approval:

User Sign-Off

ID: 30096518