Challenge Overview

Welcome to the Pioneer Data Ingestion and Delivery Ideation Challenge

In a previous challenge we gathered a couple of ideas for the architecture of the event streaming platform we intend to build. In this challenge we will expand on those ideas and look for more flexible solutions for ingesting the data and delivering it to clients reliably.

Background

We are building a scalable event streaming platform that will provide customers with real-time notifications about events that occurred in the system. The specific use case we are targeting is in the financial sector, but the platform will be designed as a generic event streaming solution. Scalability is a major concern, as the solution will be used to process millions of events daily.

The event streaming platform will consist of two parts:

  1. Producer - aggregates the source data streams and produces the events, and

  2. Consumer - consumes the generated events and provides integration points for downstream applications.

 

Most of the source data is generated in real time (e.g. Bob sent $5 to Alice) and some data is generated during nightly processing (e.g. the balance of Bob's account is $10). Regardless of how the data is generated, it is available in Kafka topics and will be used by our producer to send event notifications.

 

The consumer is a tool that will be installed in a customer's environment; it will pick up events from the producer and store them on the customer's infrastructure (e.g. in flat files).

 

Customers will be able to subscribe to the event types they are interested in and receive only those events through their consumer instance. Also, the source data comes from a multi-tenant platform, so the producer will need to filter the data for each particular customer.

 

Event data should be delivered to the consumer within a relatively short time - a few seconds in the common case, with no network issues or other edge cases. It should be transmitted only once, to avoid flooding the network, and all event notifications need to be streamed in the order they happened (out-of-order delivery is not acceptable).

 

Both the producer and the consumer will be able to tolerate platform downtimes of up to 72 hours (i.e. no events are lost in that time), and they will need to handle disaster recovery scenarios (e.g. retransmitting all the events from the last 24 hours).
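
One way to satisfy the downtime and replay requirements with the intermediate Kafka storage is simply to size topic retention accordingly. Below is a minimal sketch, assuming the per-subscription event data lives in plain Kafka topics; the broker address, topic name, partition/replication counts and exact retention period are illustrative, not decided:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;
    import org.apache.kafka.common.config.TopicConfig;

    import java.util.Map;
    import java.util.Properties;
    import java.util.Set;

    public class SubscriptionTopicSetup {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address

            try (AdminClient admin = AdminClient.create(props)) {
                // 7 days of retention comfortably covers a 72h outage plus a 24h replay window.
                NewTopic topic = new NewTopic("subscription-events-customerA", 6, (short) 3)
                        .configs(Map.of(
                                TopicConfig.RETENTION_MS_CONFIG, String.valueOf(7L * 24 * 60 * 60 * 1000),
                                TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_DELETE));
                admin.createTopics(Set.of(topic)).all().get();
            }
        }
    }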

 

The producer will be able to handle a load of ~100M events per day in total and will need to scale appropriately as the number of events increases (note that not all events need to be streamed - that depends on the event types each customer is subscribed to).
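
For a rough sense of scale, 100M events per day averages out to roughly 1,150 events per second, and real traffic will likely peak well above that average.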

 

The consumer will need to use minimal infrastructure, as it will be installed in client environments where we will have minimal access and where different clients have very different infrastructure in their data centers. It will define a generic interface so that customers can implement their own event handlers (e.g. sending data to APIs or message queues), and we will implement a default handler that simply stores the events in flat files.
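
A minimal sketch of what that handler contract might look like - the interface and class names are hypothetical, not part of any existing codebase:

    import java.io.IOException;
    import java.io.Writer;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Hypothetical contract a customer would implement to push events into their own systems
    // (HTTP APIs, message queues, databases, ...).
    interface EventHandler extends AutoCloseable {
        void handle(String eventType, String payloadJson) throws Exception;
    }

    // Possible default implementation: append each event as one line to a flat file.
    class FlatFileEventHandler implements EventHandler {
        private final Writer out;

        FlatFileEventHandler(Path file) throws IOException {
            this.out = Files.newBufferedWriter(file,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }

        @Override
        public void handle(String eventType, String payloadJson) throws IOException {
            out.write(payloadJson);
            out.write(System.lineSeparator());
            out.flush(); // flushing per event vs. batching is a tuning decision left open here
        }

        @Override
        public void close() throws IOException {
            out.close();
        }
    }

Custom handlers would implement the same interface and be plugged in through configuration.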

 

Data seeding will be supported by both the producer and the consumer - e.g. the producer exporting events for a time interval (e.g. 3 months) into a flat file and the consumer loading those events from that file, to avoid transmitting such large amounts of data over the network.
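
As a rough illustration of the producer-side export - the broker address, topic name and output format are placeholders, and offsetsForTimes is the standard Kafka consumer API for locating offsets by timestamp:

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
    import org.apache.kafka.common.TopicPartition;

    import java.io.BufferedWriter;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.time.Duration;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import java.util.stream.Collectors;

    public class SeedExport {

        // Export all events in [fromMillis, toMillis] to a newline-delimited flat file.
        static void export(String topic, long fromMillis, long toMillis, Path outFile) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            props.put("group.id", "seed-export");             // placeholder group id
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
                 BufferedWriter out = Files.newBufferedWriter(outFile)) {

                List<TopicPartition> partitions = consumer.partitionsFor(topic).stream()
                        .map(p -> new TopicPartition(topic, p.partition()))
                        .collect(Collectors.toList());
                consumer.assign(partitions);

                // Seek every partition to the first offset at or after fromMillis.
                Map<TopicPartition, OffsetAndTimestamp> startOffsets = consumer.offsetsForTimes(
                        partitions.stream().collect(Collectors.toMap(p -> p, p -> fromMillis)));
                startOffsets.forEach((tp, offset) ->
                        consumer.seek(tp, offset == null ? 0L : offset.offset()));

                // Poll until no more records arrive; a production version would track
                // per-partition end offsets instead of relying on an empty poll.
                ConsumerRecords<String, String> records;
                while (!(records = consumer.poll(Duration.ofSeconds(2))).isEmpty()) {
                    for (ConsumerRecord<String, String> record : records) {
                        if (record.timestamp() <= toMillis) {
                            out.write(record.value());
                            out.newLine();
                        }
                    }
                }
            }
        }
    }

The consumer would then load such a file locally, through the same flat-file handling path, instead of pulling the same data over the network.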

 

Previous work

In the previous challenge we received two submissions with different proposals for the overall (extremely high-level) system architecture:

  1. Load the input data into a new Kafka instance using Kafka Connect, and deliver it to the clients via a custom tool that reads data from Kafka and sends it to the client using the RSocket protocol.

  2. Load the input data into a new Kafka instance using Spring Cloud Data Flow streams, and deliver the data to the clients by exposing the raw Kafka topics and using ACLs for access management.

Both of these submissions are available in the challenge forums.

 

Kafka is a very stable and proven technology, so we will definitely use it for storing the event data for individual subscriptions; that part of the system architecture is fairly clear. On the other hand, we still want to explore the options for the two remaining parts of the system:

  1. Data ingestion and

  2. Data delivery

 

The main reason we are looking for more ideas on the data ingestion component is that we want to build a more generic solution that can be reused in the future for different systems - any scenario where clients subscribe to event data with different filtering criteria on the source data. To get there we need a good understanding of how to make the data filtering very flexible. For instance, instead of having fixed categories of source data to subscribe to, can we use some matching criteria to filter the event data, or even a query language to filter the data on the fly when reading from the data source? Or is there an even more flexible way to achieve the filtering?

The end goal is to let the end users define their own filtering criteria, while also ensuring we respect the multi-tenant nature of the underlying data. For example, if the user-provided filter is something like “account transactions where the amount is larger than $100”, we should be able to first filter the data per tenant and then apply the user-provided filter - effectively combining the two filters (a sketch of this composition follows the questions below). Additional questions that need to be answered for the data ingestion component are:

  • If a filtering criterion is complex or isn't efficient for filtering large amounts of data, can we fall back to building a custom data processing tool, i.e. a plugin to the data ingestion component? (For example, instead of filtering just the event data payload, the filter criteria might need to join it with some external lookup data.)

  • Are there any deployment or scaling constraints - how do we deploy all the workers that filter the source event data?
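
To make the filter composition concrete, here is a minimal sketch assuming the ingestion workers are built on Kafka Streams; the topic names, the AccountEvent shape and the hard-coded user filter are purely illustrative - the actual filter representation (matching criteria, query language, or plugin) is exactly what this challenge should propose:

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.Predicate;

    public class SubscriptionFilterTopology {

        // Illustrative event shape; real payloads would come from the source Kafka topics
        // (serde configuration is omitted here).
        record AccountEvent(String tenantId, String eventType, double amount) {}

        // Tenant isolation: always applied by the platform, never user-controllable.
        static Predicate<String, AccountEvent> tenantFilter(String tenantId) {
            return (key, event) -> tenantId.equals(event.tenantId());
        }

        // A user-defined filter, e.g. "account transactions where the amount is larger than $100".
        static Predicate<String, AccountEvent> userFilter() {
            return (key, event) -> "account-transaction".equals(event.eventType())
                    && event.amount() > 100.0;
        }

        static void buildTopology(StreamsBuilder builder, String tenantId) {
            KStream<String, AccountEvent> source = builder.stream("source-events"); // placeholder topic
            source
                    .filter(tenantFilter(tenantId))         // per-tenant filter first
                    .filter(userFilter())                   // user-supplied criteria on top
                    .to("subscription-events-" + tenantId); // per-subscription output topic
        }
    }

Whatever mechanism is chosen, the important property is that the tenant filter is applied by the platform itself and cannot be bypassed by the user-supplied criteria.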

The two provided submissions suggest using Kafka Connect or Spring Cloud Data Flow; each of those has its pros and cons, and neither is covered in great detail in the previous submissions. We don't want to limit the choice to just those two - we are looking for the most flexible solution for data ingestion.

 

The data delivery component architecture in the two provided submissions is very different: reading data from Kafka on the server side and delivering it via the RSocket protocol, vs exposing the raw Kafka topics to the end client. Exposing Kafka topics to the end clients would be a perfectly adequate solution were it not for the fact that we want to ensure exactly-once delivery: with Kafka consumers, the topic offset is maintained by the consumer, so clients can read the data many times, not just once, increasing the network bandwidth costs significantly. The RSocket solution, on the other hand, reads the data from Kafka on the server side, so it eliminates the multiple-delivery problem, but it is not investigated in depth (one possible acknowledgement-based approach is sketched after the questions below):

  • How do we ensure exactly-once delivery when reading the data in bulk - how do we confirm message reception and handle network errors?

  • What kind of performance can be expected, both for reading the data from Kafka and for delivering it to the client? What happens if a client subscribes to the same RSocket stream multiple times - do we retransmit all the data for that stream again, and do we have the same “multiple reads” issue?
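
On the first question, one possible approach is to keep offsets on the server side and commit them only after the client acknowledges that a batch has been durably stored. The sketch below uses the plain Kafka consumer API (with auto-commit disabled); the ClientChannel interface is hypothetical and the transport (RSocket or otherwise) is left abstract:

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    import java.time.Duration;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;

    public class AckThenCommitDelivery {

        // Hypothetical transport abstraction: send a batch to the client and block until it
        // confirms the events are durably stored (e.g. written to its flat file). Could be an
        // RSocket request/response, a gRPC call, etc.
        interface ClientChannel {
            boolean sendAndAwaitAck(List<String> batch, Duration timeout);
        }

        // Assumes the consumer was created with enable.auto.commit=false.
        static void deliverLoop(KafkaConsumer<String, String> consumer, ClientChannel channel) {
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                if (records.isEmpty()) {
                    continue;
                }
                List<String> batch = new ArrayList<>();
                for (ConsumerRecord<String, String> record : records) {
                    batch.add(record.value());
                }
                if (channel.sendAndAwaitAck(batch, Duration.ofSeconds(10))) {
                    // Commit only after the client confirmed it persisted the batch, so an
                    // unacknowledged batch will be re-sent after a failure.
                    consumer.commitSync();
                } else {
                    // No ack within the timeout: rewind each partition to the last committed
                    // offset (or the beginning if nothing was committed yet) and retry.
                    for (TopicPartition tp : consumer.assignment()) {
                        OffsetAndMetadata committed = consumer.committed(Set.of(tp)).get(tp);
                        if (committed != null) {
                            consumer.seek(tp, committed.offset());
                        } else {
                            consumer.seekToBeginning(Set.of(tp));
                        }
                    }
                }
            }
        }
    }

This gives at-least-once delivery on the wire and effectively-once delivery end to end if the consumer de-duplicates by event id, while preserving the per-partition ordering that Kafka already provides.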

Again, we don't want to limit the options to just these two approaches - we are looking for the most adequate solution.

 

Task Details

Your task in this challenge is to find suitable solutions for the issues mentioned above, document the pros and cons of each one, and finally recommend the most suitable ones for building the data ingestion and delivery components. Your final document should describe the complete system architecture end to end, including the intermediate Kafka storage, the subscriptions metadata storage, etc - we don't want to have to refer to the previous submissions for the details of the final architecture proposal. You can reuse sections from the previous submissions, but make sure to present a coherent document explaining the entire architecture.

 


Final Submission Guidelines

See above

ELIGIBLE EVENTS:

2021 Topcoder(R) Open

REVIEW STYLE:

Final Review:

Community Review Board

Approval:

User Sign-Off

ID: 30146247