Challenge Overview
In this challenge, we want to set up the environment for future challenges using Docker containers and import sample data into MarkLogic. This is the overall system diagram of what we will build.
Kafka and Logstash will only be used to import sample data into MarkLogic. Database storage will be configured with three tiers: tier 1 (highly available data, for example up to 6 months old), tier 2 (6-18 months old) and tier 3 (archived data over 18 months old, stored on HDFS).
In a future challenge we will build a service that searches the MarkLogic database and provides reporting per data tier.
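The tier boundaries above can be sketched as a small date classifier. This is only an illustration: the 6 and 18 month limits are the example values from the text, and the treatment of the exact boundary (a record that is exactly 6 months old goes to tier 2, exactly 18 months old goes to tier 3) is an assumption. The function names are hypothetical.

```python
from datetime import date

# Assumed boundaries, taken from the example ranges in the text.
TIER1_MAX_MONTHS = 6
TIER2_MAX_MONTHS = 18


def months_between(earlier: date, later: date) -> int:
    """Whole calendar months elapsed from `earlier` to `later`."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    if later.day < earlier.day:
        months -= 1  # the last month is not complete yet
    return months


def tier_for(record_date: date, today: date) -> int:
    """Return 1, 2 or 3 for a record dated `record_date`."""
    age = months_between(record_date, today)
    if age < TIER1_MAX_MONTHS:
        return 1  # recent, highly available data
    if age < TIER2_MAX_MONTHS:
        return 2
    return 3  # archived data, backed by HDFS in this setup
```

For example, `tier_for(date(2024, 1, 15), date(2024, 7, 15))` returns 2, because the record is exactly 6 whole months old under the boundary assumption above.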
Environment setup
We're looking for a Docker Compose script that will manage these containers:
- 3 node MarkLogic cluster (3 separate containers). HDFS and Kafka connectors should be installed on each cluster node and the 3 tiering levels should be configured.
- Hadoop container (HDFS will be used as backing storage for Tier3 data)
- Kafka (ML cluster should be using this container)
- Logstash - will be used to push data from csv files to a Kafka topic
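As a rough sketch of the container layout listed above, the following script builds a Compose-style service map and prints it as JSON. Everything specific here is an assumption: the image names are placeholders rather than real published images, the service names and the `./sample-data` mount are hypothetical, and a working setup would still need ports, volumes, cluster join steps and possibly extra services that Kafka depends on. Compose files are normally YAML; loading the JSON output with `-f` relies on JSON being close to a YAML subset and should be verified against your Compose version.

```python
import json

# Placeholder image names (hypothetical): replace with images you build or trust.
MARKLOGIC_IMAGE = "example/marklogic-with-connectors"
HADOOP_IMAGE = "example/hadoop"
KAFKA_IMAGE = "example/kafka"
LOGSTASH_IMAGE = "example/logstash"


def build_compose() -> dict:
    """Return a Compose-style dict with the six containers from the list."""
    services = {
        "hadoop": {"image": HADOOP_IMAGE, "hostname": "hadoop"},
        "kafka": {"image": KAFKA_IMAGE, "hostname": "kafka"},
    }
    for index in (1, 2, 3):
        name = f"marklogic{index}"
        depends_on = ["hadoop", "kafka"]
        if index > 1:
            # Assumption: node 1 acts as the bootstrap host the others join.
            depends_on.append("marklogic1")
        services[name] = {
            "image": MARKLOGIC_IMAGE,
            "hostname": name,
            "depends_on": depends_on,
        }
    services["logstash"] = {
        "image": LOGSTASH_IMAGE,
        "depends_on": ["kafka"],
        # Hypothetical host folder holding the sample csv files.
        "volumes": ["./sample-data:/data:ro"],
    }
    return {"services": services}


if __name__ == "__main__":
    print(json.dumps(build_compose(), indent=2))
```

Generating the file from a script keeps the three MarkLogic node definitions identical apart from their names, which is the main reason for the loop.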
Here are some useful links to get started:
https://developer.marklogic.com/blog/building-a-marklogic-docker-container
https://docs.marklogic.com/guide/mapreduce/quickstart
https://developer.marklogic.com/products/mlcp
https://github.com/sanjuthomas/kafka-connect-marklogic/issues
Importing data from csv files
Sample data is attached in the forums. The files are an export of a sample SQL database and we will import them into MarkLogic as is, without any denormalization. Tier 3 data should be stored in HDFS as a MarkLogic forest.
There are 3 tables in the sample data: Accounts, Instruments, Positions.
- Accounts are not transactional and should be stored completely in tier 1 storage. The primary key is AccountID.
- Instruments data should be tiered by the AddedDate column. The primary key is InstrumentID. This table describes the security that is traded. It does not record which account trades the instrument because, logically, an instrument (say IBM) can be traded under any account.
- Positions data should be tiered by the BusinessDate column. Each position is linked to an AccountId and an InstrumentId, and holds the position for a particular instrument held for a business.
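Since the tables are imported without denormalization, it can help to check the keys and links described above before loading. The sketch below is one possible pre-import check, not part of the required deliverables. The column names are taken from the text as written (`AccountID` and `InstrumentID` as primary keys, `AccountId` and `InstrumentId` in Positions); the real csv headers may differ, and the helper names are hypothetical.

```python
import csv
import io

# Column that drives tiering for each table; None means "always tier 1".
TIER_COLUMN = {
    "Accounts": None,
    "Instruments": "AddedDate",
    "Positions": "BusinessDate",
}


def read_rows(csv_text: str) -> list:
    """Parse csv text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def primary_keys(rows: list, key_column: str) -> set:
    """Collect primary key values, failing on duplicates."""
    keys = [row[key_column] for row in rows]
    if len(keys) != len(set(keys)):
        raise ValueError(f"duplicate values in {key_column}")
    return set(keys)


def dangling_positions(positions: list, account_ids: set, instrument_ids: set) -> list:
    """Return position rows whose account or instrument does not exist."""
    return [
        row
        for row in positions
        if row["AccountId"] not in account_ids
        or row["InstrumentId"] not in instrument_ids
    ]
```

A non-empty result from `dangling_positions` would point to rows that reference a missing account or instrument, which is worth resolving before the import runs.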
Final Submission Guidelines
Submit the Docker environment setup
Submit the data import script
Submit query performance results
Submit deployment/verification guide