Challenge Overview

Our customer is currently using a SQL database for their application. The data in this database is constantly growing, and they are migrating to the MarkLogic NoSQL database. MarkLogic has a tiered data storage feature that allows archival data to be stored on cheaper storage. The customer wants to optimise performance by implementing the best tiering strategy, so that the new system is not slower than the existing SQL database.

In this challenge, we want to set up the environment for future challenges using Docker containers and import sample data into MarkLogic. This is the overall system diagram of what we will build:

Kafka and Logstash will only be used to import sample data into MarkLogic. Database storage will be configured with three tiers: Tier 1 for highly available data (for example, up to 6 months old), Tier 2 for data 6-18 months old, and Tier 3 for archived data over 18 months old, stored on HDFS.
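
In MarkLogic, tiering comes down to placing forests on different storage and assigning documents to them by date. As a rough illustration, a Tier 3 forest could be created on HDFS through the Management API. The sketch below is a minimal example only: the hostnames (ml-node-1, hadoop), the admin credentials, and the HDFS path are assumptions tied to the Docker setup described later.

    # Sketch only: hostnames, credentials and the HDFS path are assumptions.
    curl -s --anyauth -u admin:admin -X POST \
      -H "Content-Type: application/json" \
      -d '{
            "forest-name": "tier3-archive-1",
            "host": "ml-node-1",
            "data-directory": "hdfs://hadoop:8020/marklogic/tier3"
          }' \
      http://ml-node-1:8002/manage/v2/forests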

In a future challenge we will build a service that will search the MarkLogic database and provide reporting per data tier.

 

Environment setup

 

We're looking for a Docker Compose script that will manage these containers (a sketch follows the list):

  • 3-node MarkLogic cluster (3 separate containers). The HDFS and Kafka connectors should be installed on each cluster node, and the 3 tiering levels should be configured.

  • Hadoop container (HDFS will be used as the backing storage for Tier 3 data)

  • Kafka (the MarkLogic cluster should consume data from this container)

  • Logstash - will be used to push data from CSV files to a Kafka topic
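
As a starting point, a docker-compose.yml along these lines could wire the pieces together. This is a minimal sketch, not a complete solution: the image names, versions, ports, and volume paths are all assumptions (the MarkLogic image is assumed to be built locally following the blog post linked below).

    version: '3'
    services:
      ml-node-1:
        build: ./marklogic          # image built per the MarkLogic Docker blog post
        hostname: ml-node-1
        ports:
          - "8000-8002:8000-8002"   # app server, admin and management ports
      ml-node-2:
        build: ./marklogic
        hostname: ml-node-2
      ml-node-3:
        build: ./marklogic
        hostname: ml-node-3
      hadoop:
        image: sequenceiq/hadoop-docker:2.7.1   # assumption: any single-node HDFS image works
        hostname: hadoop
        ports:
          - "8020:8020"             # HDFS NameNode RPC, used by Tier 3 forests
      zookeeper:
        image: zookeeper:3.4        # required by Kafka
      kafka:
        image: wurstmeister/kafka:2.11-1.1.0    # assumption: any current Kafka image works
        environment:
          KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
          KAFKA_ADVERTISED_HOST_NAME: kafka
        depends_on:
          - zookeeper
      logstash:
        image: docker.elastic.co/logstash/logstash:6.2.4
        volumes:
          - ./data:/data                                       # the sample CSV exports
          - ./logstash/pipeline:/usr/share/logstash/pipeline   # pipeline configs
        depends_on:
          - kafka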

 

Here are some useful links to get started:

https://developer.marklogic.com/blog/building-a-marklogic-docker-container

https://docs.marklogic.com/guide/mapreduce/quickstart

https://developer.marklogic.com/products/mlcp
https://github.com/sanjuthomas/kafka-connect-marklogic/issues

 

 

Importing data from CSV files

 

Sample data is attached in the forums. The files are an export of a sample SQL database and we will import them into MarkLogic as-is - we should not perform any denormalization. Tier 3 data should be stored in HDFS as a MarkLogic forest.
There are 3 tables in the sample data: Accounts, Instruments, Positions.
  • Accounts are not transactional and should be stored entirely in Tier 1 storage. The primary key is AccountID.
  • Instruments data should be tiered by the AddedDate column. The primary key is InstrumentID. This table holds the information on the security that is traded. It does not record which account the instrument is traded under, since, logically, an instrument (say IBM) can be traded under any account.
  • Positions data should be tiered by the BusinessDate column. Each row is linked to an AccountID and an InstrumentID and holds the position of a particular instrument on a given business date (a sample imported document is sketched after this list).
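
Since rows are imported as-is, each CSV row should land in MarkLogic as one small document. A hypothetical Positions document might look like the following (the Quantity field is illustrative only - the actual columns come from the attached sample data):

    {
      "AccountID": "ACC-1001",
      "InstrumentID": "INS-0042",
      "BusinessDate": "2017-03-15",
      "Quantity": 500
    }

The BusinessDate value is what the tiering configuration would use to decide whether the document belongs in Tier 1, 2 or 3.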
Write a simple Logstash script to import the data into MarkLogic by pushing it to Kafka topics. To verify the data is imported correctly, provide a few MarkLogic queries that select the imported data and report query performance. A template for the required queries and performance data is available in the forums. A pipeline sketch and a sample verification query follow.
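
A minimal Logstash pipeline for one of the tables could look like the sketch below. The topic name, broker address, and column list are assumptions tied to the compose file sketched earlier; the Quantity column is illustrative.

    input {
      file {
        path => "/data/Positions.csv"
        start_position => "beginning"
        sincedb_path => "/dev/null"   # re-read the file on every run
      }
    }
    filter {
      csv {
        columns => ["AccountID", "InstrumentID", "BusinessDate", "Quantity"]
      }
      mutate {
        remove_field => ["message", "host", "path", "@version"]   # keep only the row fields
      }
    }
    output {
      kafka {
        bootstrap_servers => "kafka:9092"
        topic_id => "marklogic-positions"
        codec => json       # the Kafka-MarkLogic connector can then write JSON documents
      }
    }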

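For verification, a query along these lines could count recent Positions documents and report elapsed time. It is a sketch only: it assumes the connector writes into a "positions" collection and that a date range index exists on the BusinessDate property - both are configuration choices, not givens.

    xquery version "1.0-ml";
    (: Sketch: assumes a "positions" collection and a date range index on BusinessDate :)
    let $start := xdmp:elapsed-time()
    let $count := xdmp:estimate(
                    cts:search(fn:collection("positions"),
                               cts:json-property-range-query("BusinessDate", ">=",
                                                             xs:date("2017-01-01"))))
    return ($count, xdmp:elapsed-time() - $start)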
 



Final Submission Guidelines

  • Submit the Docker environment setup
  • Submit the data import script
  • Submit the query performance results
  • Submit a deployment/verification guide

ELIGIBLE EVENTS:

2018 Topcoder(R) Open

REVIEW STYLE:

Final Review: Community Review Board
Approval: User Sign-Off

ID: 30065817