Challenge Overview

Introduction

Welcome to the challenge. In this challenge, we will implement Elasticsearch search using Bonsai.io in a NodeJS module. We have a lot of microsites hosted on Heroku and built with NodeJS. We need a NodeJS module that allows a microsite developer to quickly plug in search. This is the first part of the module, in which we will build a spider CLI tool to index a site in Bonsai.io.

Details

A microsite will have two types of users: normal users and expert users. Items that appear in the search index should be tagged by their audience type(s). Each item can be tagged as normal, expert, or both.

Configuration 

All configuration of the module should be populated from environment variables. The module should assume intelligent defaults where feasible. A developer should have the ability to override the environment variables by passing additional options to the module.
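
A minimal sketch of that precedence order, assuming hypothetical SPIDER_* environment variable names (choose your own):

    // config.js -- resolve settings in order of precedence:
    // explicit options > environment variables > built-in defaults.
    // The SPIDER_* variable names here are illustrative, not prescribed.
    const defaults = {
      basePath: './',
      obeyRobots: false,
      force: false,
      maxDepthLevel: 10,
      maxItems: 500,
    };

    function resolveConfig(options = {}) {
      const env = {
        basePath: process.env.SPIDER_BASE_PATH,
        index: process.env.SPIDER_INDEX,
        obeyRobots: process.env.SPIDER_OBEY_ROBOTS === undefined
          ? undefined : process.env.SPIDER_OBEY_ROBOTS === 'true',
        maxDepthLevel: process.env.SPIDER_MAX_DEPTH === undefined
          ? undefined : parseInt(process.env.SPIDER_MAX_DEPTH, 10),
      };
      // Drop undefined keys so they do not shadow the defaults.
      const defined = (obj) => Object.keys(obj)
        .filter((k) => obj[k] !== undefined)
        .reduce((acc, k) => Object.assign(acc, { [k]: obj[k] }), {});
      return Object.assign({}, defaults, defined(env), defined(options));
    }

    module.exports = { resolveConfig };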

Implementation Details

The module is called the Spider / Index Hydration Module.

1. This module will be invoked by the developer to index the pages. It should be a Node CLI tool.
2. The spider should use a parameter to specify the base page path. The spider will crawl the site (either on localhost or on Heroku) and transform the page text into a format acceptable to an Elasticsearch index.
3. The base path may be the root or a deeper folder within the site. If a deeper folder is specified, only pages found within that folder (or its children) should be included in the index.
4. The command should be able to exclude portions of the site by specifying page paths to exclude from spidering.
5. The command should be able to obey or ignore robots.txt (assumed to be located at /robots.txt).
6. Additional parameters should be considered for determining what portions of the page content are sent to the index and how they are segmented (e.g. the title tag, meta tags, and which HTML elements are included in the information sent to the index).
7. Meta tag elements, including title tags, should be segmented out from the combined “content” of the page, to allow the index to give preferential ranking/treatment to these items.
8. Pages discovered in the spidering process should have the text content of the elements specified in “includeElements” concatenated into a single field of the index (see the sketch after this list).
9. The command should have a flag to override any existing index.
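
As a rough sketch of the transformation described in items 6–8, assuming the cheerio library for HTML parsing; the field names and the audience parameter are illustrative, not prescribed:

    // transform.js -- illustrative page-to-document transformation (items 6-8).
    const cheerio = require('cheerio');

    // Strip the angle brackets from parameter values like '<div>' -> 'div'.
    const toSelector = (el) => el.replace(/[<>]/g, '');

    function pageToDocument(html, url, opts) {
      const $ = cheerio.load(html);

      // Remove excluded elements first, e.g. ['<nav>', '<header>', '<footer>'].
      opts.excludeElements.map(toSelector).forEach((sel) => $(sel).remove());

      // Meta tags are segmented out from the combined content (item 7).
      const meta = {};
      $('meta[name]').each((i, el) => {
        meta[$(el).attr('name')] = $(el).attr('content') || '';
      });

      // Concatenate the text of all includeElements into one field (item 8).
      const content = opts.includeElements
        .map(toSelector)
        .map((sel) => $(sel).text())
        .join(' ')
        .replace(/\s+/g, ' ')
        .trim();

      return {
        url,
        title: $('title').text(), // also segmented out, per item 7
        meta,
        content,
        audience: opts.audience, // e.g. ['normal'], ['expert'], or both (see Details)
      };
    }

    module.exports = { pageToDocument };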

Sample parameter values that must be supported. Please feel free to add additional parameters if needed or useful.
1. basePath: The base path used for the spider to discover pages, e.g. [./, http://www.example.com/, https://www.example.com/somefolder]
2. excludePaths: An optional array of page paths to exclude from spidering, e.g. ['/folder1', '/folder3']
3. index: The name of the Elasticsearch index to submit the content to, e.g. [example-search, example-search-expert]
4. includeElements: An optional array of HTML elements that should be included in the content transformation, e.g. ['<div>', '<li>', '<p>', '<h1>']. (Provide a sensible default set of HTML elements.)
5. excludeElements: An optional array of HTML elements that should be excluded from the content transformation, e.g. ['<nav>', '<header>', '<footer>']. (Provide a sensible default set of HTML elements.)
6. obeyRobots: An optional flag to determine whether the spider should reference and obey the robots.txt file on the site, e.g. [true, false]. (Default: false)
7. force: An optional flag to override existing values in the specified index. A use case for this flag would be to re-index a website after content has been updated and/or deleted. (Default: false)
8. maxDepthLevel: An optional parameter to specify the maximum depth of child pages to be spidered. (Default: 10)
9. An optional parameter to specify the maximum number of items to spider. (Default: 500)
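
For illustration, the parameters above might be wired to a CLI using the commander package; the flag spellings, and the max-items name chosen here for the unnamed parameter 9, are assumptions:

    // cli.js -- illustrative argument parsing with the commander package.
    const program = require('commander');

    const list = (val) => val.split(',');

    program
      .option('-b, --base-path <path>', 'base path for the spider', './')
      .option('-e, --exclude-paths <paths>', 'comma-separated page paths to exclude', list)
      .option('-i, --index <name>', 'name of the Elasticsearch index')
      .option('--include-elements <els>', 'HTML elements to include', list)
      .option('--exclude-elements <els>', 'HTML elements to exclude', list)
      .option('--obey-robots', 'reference and obey /robots.txt')
      .option('-f, --force', 'override existing values in the index')
      .option('--max-depth-level <n>', 'maximum depth of child pages', parseInt)
      .option('--max-items <n>', 'maximum number of items to spider', parseInt)
      .parse(process.argv);

A run against a deployed microsite could then look like:

    node cli.js --base-path https://www.example.com/ --index example-search --obey-robots --force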

Other requirements

1. The module should be built using the latest stable release of NodeJS.
2. The module should be configurable using environment variables, with default values that can be overridden.
3. The module must be delivered as a command line tool.
4. The tool should provide feedback to the user while running (e.g. number of pages processed, etc.).
5. The Elasticsearch index should be built in Bonsai.io. You can use the free sandbox version to develop and test your code.
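
For example, connecting to the cluster might look like the sketch below, using the official elasticsearch client and the BONSAI_URL environment variable that the Bonsai Heroku add-on sets (the indexPage helper and 'page' type are illustrative):

    // search-client.js -- connect to Bonsai and index a transformed page.
    // BONSAI_URL is set automatically by the Bonsai Heroku add-on;
    // when running locally, export it yourself (or fall back to localhost).
    const elasticsearch = require('elasticsearch');

    const client = new elasticsearch.Client({
      host: process.env.BONSAI_URL || 'http://localhost:9200',
    });

    // Index one document produced by the transformation sketch above.
    // The 'page' document type is illustrative.
    function indexPage(index, doc) {
      return client.index({ index, type: 'page', body: doc });
    }

    module.exports = { client, indexPage };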

Final Submission Guidelines

1. The CLI tool code must be well-documented.
2. Please write unit tests to test the functionality. The tool should be testable using a framework such as Mocha/Chai, with the tests runnable as npm scripts. (Grunt/Gulp scripts are not permitted.)
3. The code must be linted using a tool such as ESLint. Follow the guidelines at https://www.npmjs.com/package/eslint-config-airbnb
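
For instance, the npm scripts could be wired up along these lines, so that npm test and npm run lint exercise the suite and the linter (the airbnb-base variant is shown since the tool has no React code; versions are illustrative):

    {
      "scripts": {
        "test": "mocha test/ --recursive",
        "lint": "eslint ."
      },
      "devDependencies": {
        "chai": "^3.5.0",
        "eslint": "^3.19.0",
        "eslint-config-airbnb-base": "^11.2.0",
        "mocha": "^3.4.0"
      }
    }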

ELIGIBLE EVENTS:

2017 TopCoder(R) Open

REVIEW STYLE:

Final Review:

Community Review Board

Approval:

User Sign-Off

ID: 30056419