Challenge Overview
Welcome to the Pioneer Technical Architecture Updates Challenge
In previous challenges we defined the high-level architecture for the event streaming platform. In this challenge we will expand the architecture with technical details on handling multi-tenancy, disaster recovery, and data seeding.
Background
We are building a scalable event streaming platform that will provide customers with real-time notifications about events that occurred in the system. The specific use case we’re targeting is in the financial sector, but the platform will be designed as a generic event streaming solution. Scalability is a major concern, as the solution will be used to process millions of events daily.
The event streaming platform will consist of three parts:
- Producer - ingests source data into the Kafka cluster
- Aggregation and filtering of the source data
- Delivery of the events to end users
Most of the source data is generated in real time (e.g. Bob sent $5 to Alice) and some data is generated during a nightly process (e.g. the balance of Bob’s account is $10). Regardless of how the data is generated, it is available in Kafka topics and will be used by our Producer to send event notifications.
See the project architecture document for more information (posted in the challenge forums). You should read and understand the existing architecture before reading the challenge requirements below.
Task Details
Your task in this challenge is to add the following features to the system architecture:
- Multi-tenancy - how to make sure the end client can only fetch data intended for them
- Data seeding - how to populate initial data values that were generated before the streaming process started (for example, how to get the account balance for John if we are only streaming updates to accounts and John hasn’t used his account in a while)
- Disaster recovery - define steps to recover from different disaster scenarios (e.g. Kafka cluster unavailable for a period of time, storage issues, network connectivity, etc.)
Multi-tenancy
Data sources for the Pioneer event streaming platform contain data for multiple end clients, and we need to make sure that the delivery topics to which the end clients can subscribe only contain data for that particular client. That in turn means that either the source topics (Tn, Wn) for the ksql queries have to be client specific, or the queries have to filter on a client identifier. The latter option is probably harder to implement, since the ksql queries are defined by the end clients and the subscription api would have to modify the queries on the fly. The first option in turn means that either the “Kafka Stream apps” have to push data into client-specific topics, or the source data for those apps (the “Ingested data” block) has to be client specific - which would then have to be handled in the Kafka Connect layer.
Your task is to propose the most suitable layer for separating the data by client, taking into account the pros and cons of each layer. Separating the data in the ingestion layer has the downside that all downstream jobs also have to run per client, so we end up with many ksql jobs, but it also provides some flexibility to simply not run some types of “Kafka Stream apps” queries for some clients (i.e. this could be managed with the data ingestion management tool). Separating the data later in the process makes managing the tasks easier (there are fewer tasks), but offers less flexibility - all tasks are run for all clients.
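To make the aggregation-layer option more concrete, here is a minimal Kafka Streams sketch (Java) that fans ingested records out into client-specific delivery topics. The topic names, the “delivery.<clientId>.” naming convention, and the assumption that the record key starts with the client identifier are all illustrative and not part of the existing architecture:

// Minimal sketch: route ingested records into client-specific delivery topics.
// Topic names and the client-id extraction are illustrative assumptions.
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class ClientFanOutApp {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "client-fan-out");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> ingested = builder.stream("ingested.balances");

        // Route each record to a per-client topic; here the record key is assumed
        // to carry the client identifier (e.g. "acme:account-42").
        ingested.to((key, value, ctx) -> "delivery." + key.split(":")[0] + ".balances");

        new KafkaStreams(builder.build(), props).start();
    }
}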
In addition to proposing the layer at which the data should be separated into client-specific topics, you also need to provide technical details on how that separation will be enforced and managed. Do we use Kafka Access Control Lists (ACLs) to control which query can access which topics? How do we know which client can access which topics in the first place? Do we need to manage a mapping of client->topics, or do we use client-specific topic names?
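If client-specific topic names (for example a per-client prefix) are chosen, access could be enforced with prefixed Kafka ACLs. The sketch below uses the Kafka AdminClient to grant one client read access to its own delivery topics; the principal name and the “delivery.acme.” prefix are assumptions made for illustration:

// Sketch: grant the "acme" client read access to all topics under its prefix.
// The principal name and the "delivery.acme." prefix are illustrative; the
// real naming convention is part of the design task.
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.acl.AccessControlEntry;
import org.apache.kafka.common.acl.AclBinding;
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.acl.AclPermissionType;
import org.apache.kafka.common.resource.PatternType;
import org.apache.kafka.common.resource.ResourcePattern;
import org.apache.kafka.common.resource.ResourceType;

public class GrantClientAcl {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            AclBinding readDeliveryTopics = new AclBinding(
                new ResourcePattern(ResourceType.TOPIC, "delivery.acme.", PatternType.PREFIXED),
                new AccessControlEntry("User:acme", "*", AclOperation.READ, AclPermissionType.ALLOW));

            admin.createAcls(Collections.singletonList(readDeliveryTopics)).all().get();
        }
    }
}

In practice the client’s consumer group would also need a READ ACL on its group resource, and the ACL creation would be driven by whatever client->topics management approach you propose.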
Data seeding
Streaming data updates are great for delivering actual updates, but what happens with the data that was generated before the data aggregation and transformation started? For example, if there is a data source that provides account balance updates, how do we ensure that we can retrieve the latest data for accounts that didn’t have any activity after the data aggregation job started? Seeding could be a job in the ingestion layer (Kafka Connect), a job in the aggregation layer (Kafka Stream apps), or a job in the client aggregation layer (client ksql) - and each of those layers can have different consequences for downstream layers (what data is available, when, and under what conditions the end client would first get the seed data and then the streaming data updates).
Your task is to explore the different options for generating the seed data. One raw idea is to let the users define a “seed job” that runs before the actual streaming data job and sends the data to the same sink topic. Your proposal should also consider how data seeding at each layer affects the propagation of the data to the end client - for example, if seed data is generated in the ingestion layer, the end client might not receive it if they create a subscription two months later (assuming the retention period for the Kafka topics is shorter than that).
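To make the “seed job” idea a bit more tangible, the following sketch publishes a one-off snapshot into the same sink topic that the streaming job writes to. The loadCurrentBalances() helper, the topic name, and the record format are hypothetical placeholders:

// Sketch of a one-off seed job: publish a snapshot of current values into the
// same sink topic that the streaming job writes to, before the streaming job
// starts. loadCurrentBalances() and the topic name are illustrative assumptions.
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class BalanceSeedJob {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // don't lose seed records

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (Map.Entry<String, String> balance : loadCurrentBalances().entrySet()) {
                // Keyed by account id so later streaming updates for the same
                // account land in the same partition, preserving ordering.
                producer.send(new ProducerRecord<>("account.balances", balance.getKey(), balance.getValue()));
            }
            producer.flush();
        }
    }

    // Placeholder for reading the snapshot from the source system
    // (e.g. a batch query against the system of record).
    static Map<String, String> loadCurrentBalances() {
        return Map.of("account-42", "{\"balance\": \"10.00\"}");
    }
}

Keying the seed records by account id means later streaming updates for the same account land in the same partition, which helps consumers see the seed value before the updates.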
Disaster recovery
System errors can happen at various layers and can have many different causes, but planning for disaster recovery can mitigate the effects of those errors. Your task is to define (list) the major disaster scenarios that are possible in the Pioneer Event Streaming platform and propose the recovery steps for them. It is up to you to analyze the architecture and explore which components of the system can fail and in what ways - network issues (temporary, short, long), storage issues (unavailable, corrupted), Kafka broker issues, Kafka rebalancing errors, etc. Each of them will have different effects on the system components, but ultimately it is important to know under which scenarios the platform will be able to fully recover and deliver the full data to the end clients (with possible delays or service interruptions), and which scenarios will lead to some streaming events not being delivered to the end client (and may require restarting the data flow and initiating the data seeding process again).
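Some failure modes can be partially mitigated through configuration rather than recovery procedures. As one hedged example (the values are illustrative, not taken from the existing architecture), the Producer component could be configured so that transient broker or network failures are retried without losing or duplicating events:

// Sketch of producer settings aimed at surviving transient broker/network
// failures without losing or duplicating events. The concrete values are
// illustrative; the right trade-offs are part of the disaster-recovery design.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class ResilientProducerConfig {

    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Wait for all in-sync replicas before considering a write successful.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Retry transient failures, but avoid duplicates via idempotence.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.toString(Integer.MAX_VALUE));
        // Fail fast enough that a long outage surfaces as an error we can act on.
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "120000");

        return new KafkaProducer<>(props);
    }
}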
Submission Guidelines
You should update the existing architecture document and add new sections for the above topics. Make sure to update the existing sections where needed.