Challenge Overview
Challenge Objectives
- Develop a Java application based on Kafka Streams
- Evaluate Drools rules on messages coming from Kafka and write the output to HDFS
- Load reference data for Kafka messages from HBase
Project Background
- A combination of Spark Streaming and Drools was used to create a (near) real-time rules engine, where messages are delivered from an incoming Kafka topic to a Spark Streaming job which applies the Drools rules, and the output is then placed onto an output Kafka topic.
- This approach has downsides. Applying Drools rules to data isn't something that requires an advanced streaming engine, so Spark is arguably overkill unless it's being used elsewhere in the solution. This is true of many ETL implementations built on Spark. Given that Cloudera's solution uses Kafka to move data to and from the Spark Streaming pipeline, it would be useful if Kafka itself could apply the Drools rules, cutting out Spark entirely.
Technology Stack
- Java 8
- Linux
- Kafka
- HDFS
- Hive
- HBase
Code access
You will create a new Java project using Maven as the build tool. A coding standards document is provided in the forums and must be followed.
Individual requirements
The end goal is to integrate Kafka Streams with Drools, allowing Drools rules to be applied in real time without the overhead and resource requirements of installing and maintaining Spark, especially in situations where Kafka is already being used as the messaging system.
In this challenge we will build an application that does the following:
- Creates a Kafka Streams pipeline to read data from a Kafka topic. We can assume that each message coming from Kafka is a JSON message, but it does not have fixed attributes. (A topology sketch follows this list.)
- Retrieves reference data from Apache HBase. Here we can assume that the message contains an attribute which can be used to select additional data from HBase; you can create any HBase table to demonstrate this feature. Example: suppose the original message contains the attributes transactionID, bankID, amount. We can use bankID to select bank details from a "Bank" table (e.g. SELECT * FROM BANK WHERE ID=bankID), so the combined message would contain transactionID, bankID, amount, bankName, bankCountry. (See the HBase lookup sketch after this list.)
- Evaluates Drools rules (.drl) on the combined data. Create a sample .drl file with a decision table to demonstrate this feature. (See the rules-evaluation sketch after this list.)
- Pushes the evaluated values and the original input message to a new Kafka topic.
- Configures Kafka to push the messages in the output topic to HDFS using the Kafka Connect HDFS connector. (A sample connector configuration follows this list.)
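As a starting point, the overall pipeline could look like the minimal Kafka Streams sketch below. It assumes String-serialized JSON values; the topic names (transactions-in, transactions-out, transactions-dlq), the environment variable names, and the ReferenceDataLookup / RulesEvaluator helper classes (sketched next) are hypothetical placeholders, not names prescribed by this challenge.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Map;
import java.util.Properties;

public class RulesPipeline {

    // Tiny holder so a failed record keeps its original payload for the dead letter queue.
    static final class ProcessingResult {
        final boolean ok;
        final String payload;
        ProcessingResult(boolean ok, String payload) { this.ok = ok; this.payload = payload; }
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        ReferenceDataLookup lookup = new ReferenceDataLookup(System.getenv("HBASE_ZOOKEEPER_QUORUM"));
        RulesEvaluator rules = new RulesEvaluator();

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input =
                builder.stream("transactions-in", Consumed.with(Serdes.String(), Serdes.String()));

        // Parse the JSON message, enrich it from HBase, and evaluate the Drools rules.
        KStream<String, ProcessingResult> results = input.mapValues(value -> {
            try {
                Map<String, Object> message = mapper.readValue(value, Map.class);
                message.putAll(lookup.lookupBank(String.valueOf(message.get("bankID"))));
                return new ProcessingResult(true, mapper.writeValueAsString(rules.evaluate(message)));
            } catch (Exception e) {
                return new ProcessingResult(false, value); // original payload goes to the DLQ
            }
        });

        // Successful records go to the output topic, failures to the dead letter topic.
        results.filter((key, r) -> r.ok).mapValues(r -> r.payload)
               .to("transactions-out", Produced.with(Serdes.String(), Serdes.String()));
        results.filter((key, r) -> !r.ok).mapValues(r -> r.payload)
               .to("transactions-dlq", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "drools-rules-pipeline");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, System.getenv("KAFKA_BOOTSTRAP_SERVERS"));
        new KafkaStreams(builder.build(), props).start();
    }
}
```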
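The HBase enrichment step could be a simple point lookup along these lines. The table name "bank", the column family "d", and the ZooKeeper quorum constructor parameter are illustrative assumptions; any table holding bank details keyed by bankID would do.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;

public class ReferenceDataLookup implements AutoCloseable {

    private final Connection connection;

    public ReferenceDataLookup(String zookeeperQuorum) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", zookeeperQuorum);
        this.connection = ConnectionFactory.createConnection(conf);
    }

    /** Returns bank attributes (e.g. bankName, bankCountry) for the given bankID, or an empty map. */
    public Map<String, String> lookupBank(String bankId) throws IOException {
        try (Table table = connection.getTable(TableName.valueOf("bank"))) {
            Result result = table.get(new Get(Bytes.toBytes(bankId)));
            Map<String, String> reference = new HashMap<>();
            NavigableMap<byte[], byte[]> family = result.getFamilyMap(Bytes.toBytes("d"));
            if (family != null) {
                family.forEach((qualifier, value) ->
                        reference.put(Bytes.toString(qualifier), Bytes.toString(value)));
            }
            return reference;
        }
    }

    @Override
    public void close() throws IOException {
        connection.close();
    }
}
```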
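Evaluating the Drools rules on the enriched message could then be as small as the sketch below. It assumes the .drl resources are packaged on the classpath with a META-INF/kmodule.xml defining a default KieSession, and that the rules are written against the enriched map (or against a POJO fact, if you prefer to map the JSON onto one); the class name is a placeholder.

```java
import org.kie.api.KieServices;
import org.kie.api.runtime.KieContainer;
import org.kie.api.runtime.KieSession;

import java.util.Map;

public class RulesEvaluator {

    private final KieContainer kieContainer;

    public RulesEvaluator() {
        // Builds a container from the classpath; expects META-INF/kmodule.xml and the .drl resources.
        this.kieContainer = KieServices.Factory.get().getKieClasspathContainer();
    }

    /** Inserts the enriched message as a fact, fires all rules, and returns the (possibly mutated) map. */
    public Map<String, Object> evaluate(Map<String, Object> enrichedMessage) {
        KieSession session = kieContainer.newKieSession();
        try {
            session.insert(enrichedMessage);
            session.fireAllRules();
            return enrichedMessage;
        } finally {
            session.dispose();
        }
    }
}
```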
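For the HDFS step, a minimal sink configuration along these lines could be used, assuming the Confluent Kafka Connect HDFS connector is installed; the connector name, topic, NameNode URL, and flush size are placeholder values to adjust for your environment.

```properties
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=transactions-out
hdfs.url=hdfs://namenode:8020
topics.dir=/topics
flush.size=3
format.class=io.confluent.connect.hdfs.json.JsonFormat
```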
All Kafka and HBase connection parameters, topic names, and table names should be configurable via a properties file and settable via environment variables (see the sample properties file below). Any errors should be logged to a log file, and the message that caused the error should be sent to a dead letter queue, i.e. a separate Kafka topic.
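As one possible layout, a properties file like the following could carry the configurable parameters, with each value overridable by an environment variable; every key name here is a suggestion, not a requirement.

```properties
kafka.bootstrap.servers=localhost:9092
kafka.input.topic=transactions-in
kafka.output.topic=transactions-out
kafka.dlq.topic=transactions-dlq
hbase.zookeeper.quorum=localhost
hbase.table.name=bank
```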
Deployment guide and validation document
Two documents are required for validation
A README.md that covers:
- <Deployment> How to deploy the app locally
- <Configuration> Document all the configurable parameters
- <Dependency installation> How to set up Kafka, HDFS, HBase, the Kafka Connect HDFS connector, etc. You can use Cloudera to set up the required environment, or create the necessary services using Docker.
A Validation.md that covers:
- How to run the application and verify the workflow. Create a simple producer to push some messages into the Kafka input topic and trigger the pipeline processing (a sample producer sketch follows this list).
- How to verify the data was written to HDFS, e.g. by rendering a Hive table over the output directory using Hue.
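For the validation producer, a throwaway Java client along these lines would be enough; the topic name, record key, and sample payload are placeholders matching the hypothetical names used in the pipeline sketch above.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class SampleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", System.getenv("KAFKA_BOOTSTRAP_SERVERS"));
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // A sample message with the attributes used in the requirements example.
        String message = "{\"transactionID\":\"t-001\",\"bankID\":\"b-42\",\"amount\":12500}";
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("transactions-in", "t-001", message));
        }
    }
}
```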
Final Submission Guidelines
- Submit the complete code base
- Submit the verification documents