Challenge Overview

CHALLENGE OBJECTIVES
  • Spring Boot Application Development.
  • Build a web crawler to crawl product information from certain websites and store the obtained product data in the database.
  • Unit tests are needed.
 
PROJECT BACKGROUND
  • The customer wants to build a system to search web pages of various products from the internet. The system provides a REST-based API through which users can search with keywords and obtain information about product pages. The Web Crawler is one of the fundamental components of the system; it crawls a specified web site and obtains the pages there.
 
TECHNOLOGY STACK
  • Java 8
  • Spring Boot 1.5.7
  • MySQL 8.0
 
GENERAL REQUIREMENTS
  • The crawler should be built in a generic and reusable way such that it would be easy to add new websites to be crawled.
  • In this challenge we are only going to target a single website; the target website will be announced in the forum.
  • The submission must include unit tests and brief documentation for setup & usage
 
DETAILED REQUIREMENTS
Here is a detailed description of how the web crawler is supposed to work.
 
The basic design is inspired by the architecture of Scrapy, a popular web crawling/scraping framework written in Python. Reading Scrapy's architecture documentation should be useful.
 
1. Crawling for Website
  • Run with the parameter `site`, which specifies the ID of the target site
    • e.g. --site=1
  • Load the settings for crawling the web site from the `web_sites` table
    • Items in the settings are:
      • `url` - URL to start crawling
      • `content_url_patterns` - Patterns of URL to indicate which pages should be stored in the database
  • Start crawling the URL specified in the settings (a request for the URL is enqueued for subsequent processes)
  • Each request to get a page should be processed at regular intervals for politeness (e.g. a 1-second interval)
    • Each request is picked from the queue and processed one by one
    • A request to get a page should not be blocked by another download that takes longer than the specified “politeness” interval (each download should run in a separate thread; a minimal sketch of this scheduling loop follows this list)
    • The same destination page may be linked from several source pages. In this case, the page may already have been processed by a previous thread during the execution and should be skipped.
  • If the elapsed time from the start reaches the time limit for a crawling process, the Web Crawler should stop creating new requests and wait until all remaining requests are finished. (See: 4. Configurable properties)
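
The scheduling behaviour above could be implemented roughly as follows. This is a minimal sketch only, assuming hypothetical `CrawlRequest` and `Downloader` types; the provided base code may structure this differently:

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical types for illustration; the real base code may define them differently. */
interface CrawlRequest { String getUrl(); }
interface Downloader { void download(CrawlRequest request); }

public class CrawlScheduler {

    private final BlockingQueue<CrawlRequest> queue = new LinkedBlockingQueue<>();
    private final Set<String> processedUrls = ConcurrentHashMap.newKeySet();
    private final ExecutorService downloadPool = Executors.newCachedThreadPool();

    private final long intervalMillis;   // politeness interval between requests
    private final long timeLimitMillis;  // overall time limit for crawling one site

    public CrawlScheduler(long intervalMillis, long timeLimitMillis) {
        this.intervalMillis = intervalMillis;
        this.timeLimitMillis = timeLimitMillis;
    }

    /** Called by the downloader/parser when it discovers new URLs to crawl. */
    public void enqueue(CrawlRequest request) {
        queue.offer(request);
    }

    public void crawl(CrawlRequest start, Downloader downloader) throws InterruptedException {
        long startedAt = System.currentTimeMillis();
        enqueue(start);

        // Requests are picked one by one at a fixed interval; each download runs in its
        // own thread so that one slow page does not block the next request.
        while (System.currentTimeMillis() - startedAt < timeLimitMillis) {
            CrawlRequest request = queue.poll(intervalMillis, TimeUnit.MILLISECONDS);
            if (request == null) {
                continue; // nothing queued right now; keep waiting until the time limit
            }
            // A page linked from several sources may already have been processed
            // in this execution; skip it in that case.
            if (!processedUrls.add(request.getUrl())) {
                continue;
            }
            downloadPool.submit(() -> downloader.download(request));
            Thread.sleep(intervalMillis); // politeness interval before the next request
        }

        // Time limit reached: stop creating new requests and wait for the remaining ones.
        downloadPool.shutdown();
        downloadPool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```

In a real implementation the loop would also terminate early once the queue is empty and all downloads have completed, rather than always running up to the time limit.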
 
2. Downloading pages (Processing requests to get pages)
  • Try to get the contents of the URL specified in a request (a sketch of the response handling follows this list)
    • If there’s a record with the target URL in the `pages` table, add If-Modified-Since / If-None-Match headers to the request
      • Set `last_modified` as the If-Modified-Since value and `etag` as the If-None-Match value (if each value is not null)
      • When getting a 304 (Not Modified) response, skip further processing for this page and output a log message
    • When getting a 30x response with a "Location" header, enqueue a new request for the URL in the Location header
    • When getting a 40x response, dump error logs and the contents of the page
    • When getting a 50x response or timing out, dump error logs and retry after an interval (with a limit on the retry count [See: 4. Configurable properties])
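
A hedged sketch of the conditional download step, using plain `HttpURLConnection` (any HTTP client, such as Spring's `RestTemplate`, would work equally well; `StoredPage` is an illustrative stand-in for the existing `pages` record):

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical holder for a previously stored page (either field may be null).
class StoredPage {
    String etag;
    String lastModified;
}

public class PageDownloader {

    /**
     * Fetches a URL with a conditional GET when the page is already stored.
     * Returns the response code; further handling (parse/enqueue/retry) is up to the caller.
     */
    public int fetch(String url, StoredPage stored) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setInstanceFollowRedirects(false); // handle 30x ourselves by enqueuing the Location URL
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(10_000);

        // Conditional request headers from the existing `pages` record, if any.
        if (stored != null) {
            if (stored.lastModified != null) {
                conn.setRequestProperty("If-Modified-Since", stored.lastModified);
            }
            if (stored.etag != null) {
                conn.setRequestProperty("If-None-Match", stored.etag);
            }
        }

        int status = conn.getResponseCode();
        if (status == HttpURLConnection.HTTP_NOT_MODIFIED) {
            // 304: skip further processing for this page and log it.
        } else if (status >= 300 && status < 400 && conn.getHeaderField("Location") != null) {
            // 30x: enqueue a new request for the Location header value.
        } else if (status >= 400 && status < 500) {
            // 40x: dump error logs and the page contents.
        } else if (status >= 500) {
            // 50x: dump error logs and retry later, up to the configured retry limit.
        }
        return status;
    }
}
```

Redirect following is disabled so that 30x responses surface here and the Location URL can be enqueued as a new request, as required above.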
 
3. Handling page contents (Parsing and Storing page contents)
  • Extract all URLs from the page contents and then enqueue requests for these URLs to process them later (a sketch of this extraction is shown after this list)
    • URLs should be normalized
      • Convert relative URLs to absolute ones
      • Remove hash (fragment) information
    • Filter out external URLs
    • Filter out URLs that have already been processed in the same execution
    • If the depth of the request reaches the maximum depth, the Web Crawler should not create any more requests for URLs in the page. (See: 4. Configurable properties)
  • If the URL of a page matches one of `content_url_patterns` defined in the settings, the page contents should be parsed and then stored in the database
    • Create or update a record in the `pages` table with the following data
      • Extract values of "ETag" and "Last-Modified" from response headers
        • ETag -> `etag`
        • Last-Modified -> `last_modified`
      • Extract data in <title> and <body> from the page contents (HTML)
        • title -> `title`
        • body -> `body`
    • Create records in the `destination_urls` table with the following data
      • Extract all URLs
        • URLs should be normalized
          • Convert relative URLs to absolute ones
          • Remove hash (fragment) information
        • Filter out external URLs
      • If a record already exists in the database, ignore it.
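
A sketch of the URL extraction and normalization described above, assuming Jsoup for HTML parsing (Jsoup is not mandated by the challenge, and the same-host check is just one possible way to filter external URLs):

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.LinkedHashSet;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkExtractor {

    /** Extracts normalized, same-host URLs from an HTML page. */
    public Set<String> extractLinks(String html, String pageUrl) {
        Set<String> urls = new LinkedHashSet<>();
        String pageHost = URI.create(pageUrl).getHost();

        Document doc = Jsoup.parse(html, pageUrl); // the base URI lets absUrl() resolve relative links
        for (Element anchor : doc.select("a[href]")) {
            String absolute = anchor.absUrl("href"); // relative URL -> absolute (empty if unresolvable)
            if (absolute.isEmpty()) {
                continue;
            }
            int hash = absolute.indexOf('#');
            if (hash >= 0) {
                absolute = absolute.substring(0, hash); // remove hash/fragment information
            }
            try {
                URI uri = new URI(absolute);
                if (pageHost != null && pageHost.equalsIgnoreCase(uri.getHost())) {
                    urls.add(absolute); // keep only internal (same-host) URLs
                }
            } catch (URISyntaxException e) {
                // skip malformed URLs
            }
        }
        return urls;
    }
}
```

The same parsed `Document` also exposes `doc.title()` and `doc.body()`, which can supply the `title` and `body` values when the page URL matches one of the `content_url_patterns`.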
 
4. Configurable properties
  • Interval between each subsequent request (milliseconds)
  • Time limit for crawling an entire single site (seconds)
  • Timeout for downloading a page (minutes)
  • Max number of times to retry a single page.
  • Max depth that will be allowed to crawl for a site
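
In a Spring Boot application these could be exposed through a `@ConfigurationProperties` holder; the property names and defaults below are assumptions for illustration only:

```java
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.stereotype.Component;

/** Illustrative property holder; names and defaults are not prescribed by the challenge. */
@Component
@ConfigurationProperties(prefix = "crawler")
public class CrawlerProperties {

    private long requestIntervalMillis = 1000; // interval between subsequent requests
    private long siteTimeLimitSeconds = 3600;  // time limit for crawling a single site
    private long downloadTimeoutMinutes = 5;   // timeout for downloading one page
    private int maxRetries = 3;                // max number of retries for a single page
    private int maxDepth = 5;                  // max crawl depth for a site

    public long getRequestIntervalMillis() { return requestIntervalMillis; }
    public void setRequestIntervalMillis(long v) { this.requestIntervalMillis = v; }

    public long getSiteTimeLimitSeconds() { return siteTimeLimitSeconds; }
    public void setSiteTimeLimitSeconds(long v) { this.siteTimeLimitSeconds = v; }

    public long getDownloadTimeoutMinutes() { return downloadTimeoutMinutes; }
    public void setDownloadTimeoutMinutes(long v) { this.downloadTimeoutMinutes = v; }

    public int getMaxRetries() { return maxRetries; }
    public void setMaxRetries(int v) { this.maxRetries = v; }

    public int getMaxDepth() { return maxDepth; }
    public void setMaxDepth(int v) { this.maxDepth = v; }
}
```

With the `crawler` prefix, these bind to entries such as `crawler.request-interval-millis=1000` in `application.properties`.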
 
5. Database Tables
  • web_sites - table to store web site information and settings for crawling
  • pages - table to store page contents obtained from web sites
  • source_urls - table to store relations between source URLs and pages. *This is not used in the scope of this challenge.
  • destination_urls - table to store relations between pages and destination URLs.
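
For orientation only, a hedged sketch of how the `pages` table might map to a JPA entity, based on the columns mentioned in this specification (the provided DDL/ERD is authoritative; names and types here are assumptions):

```java
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.Lob;
import javax.persistence.Table;

@Entity
@Table(name = "pages")
public class Page {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @Column(name = "url")
    private String url;

    @Column(name = "etag")
    private String etag;

    @Column(name = "last_modified")
    private String lastModified; // actual type depends on the provided DDL

    @Column(name = "title")
    private String title;

    @Lob
    @Column(name = "body")
    private String body;

    // getters and setters omitted for brevity
}
```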

DOCUMENTATION
  • Docker Images (MySQL)
  • Base Code (includes Gradle build script)
  • Sample data / Data dictionary
  • DDL/ERD


Final Submission Guidelines

FINAL DELIVERABLES
  • Full code that covers all the requirements.
  • A detailed README file including information on how to set up and run your application.

ELIGIBLE EVENTS:
  • Topcoder Open 2019

REVIEW STYLE:
  • Final Review: Community Review Board
  • Approval: User Sign-Off

ID: 30095381