Challenge Overview
RFIT - Data Convergence POC
We have a banking client with a large number of data sets in the form of CSV files (on the order of thousands of documents), exported from various systems they use. The data feeds have been independently specified, developed and maintained. As there are fewer sources of data than there are feeds, and there are a limited number of business entities contained within these feeds (Products, Trades, Settlements, P&L, Reference Data etc.), there is a large amount of known and expected overlap and redundancy between these feeds. In order to reduce that duplication, they have asked us to build a proof of concept Data Convergence Analysis tool that will do the following:
- Identify primary keys in each of the data feeds
- Merge the feeds by joining them on the primary key and removing duplicate columns, creating a master record
- Provide reports on the merged data feeds
In a previous ideation challenge we got some ideas on how to identify primary keys for a dataset and how to merge multiple datasets, while merging columns containing duplicate data.
In this challenge we’ll implement those algorithms and create a web application for demonstration purposes. In a nutshell the web app will have these features:
- Define a new entity (Products, Trades, Settlements, P&L, Reference Data etc.)
- Manage datasets for each of the entities (input files, feeds)
- Manage column mappings for each of the feeds
- Browse the master dataset
- Reporting section for comparing different data sets and displaying matching and non-matching attributes as well as matching and non-matching rows. Also, a list of potentially matching attributes will be displayed.
Both frontend and backend are in scope of this challenge. NOTE: The challenge spec is long, but the scope is not too large; there are a total of 5 frontend pages and a fairly straightforward backend. The core feature is defining the master record and merging the data.
The backend will be implemented in Python (Flask or Django), and MongoDB will be used for persistence.
Frontend should be implemented as a single page app using React or Angular. There are no designs for this challenge, but presentation is an important part for demonstrating this POC, so make it look good and stick to standard design elements. You can use any libraries/frameworks for layout, styling, etc.
There is no authentication/authorization and all pages will be accessible by direct link.
A lot of details have been left out so you have some leeway when defining them. That is especially true for database schema definition and backend API design. Ask questions in the forums for anything that isn’t clear.
Keep in mind that we are building a POC tool, where the main goal is to demonstrate the features of the tool and the implemented algorithms.
Manage entities
The user will be able to define entities the tool will process the data for. The user should be able to create a new entity and view a list of existing ones. Edit/Delete are not required. When adding a new entity, only the entity name is required.
Clicking any of the entity names in the list will show the dataset management page for that entity.
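As a rough illustration of the intended stack, here is a minimal sketch of the entity endpoints assuming Flask and PyMongo (the route paths, database name and "entities" collection are illustrative choices, not part of the spec):

```python
# Minimal sketch of the entity endpoints, assuming Flask + PyMongo.
# Route paths and collection names are illustrative only.
from flask import Flask, jsonify, request
from pymongo import MongoClient

app = Flask(__name__)
db = MongoClient("mongodb://localhost:27017")["data_convergence"]

@app.route("/api/entities", methods=["POST"])
def create_entity():
    # Only the entity name is required when creating a new entity.
    name = request.get_json()["name"]
    result = db.entities.insert_one({"name": name})
    return jsonify({"id": str(result.inserted_id), "name": name}), 201

@app.route("/api/entities", methods=["GET"])
def list_entities():
    return jsonify([
        {"id": str(e["_id"]), "name": e["name"]} for e in db.entities.find()
    ])
```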
Manage datasets
Each entity will have a list of source datasets. A dataset in this case is just a source file (CSV). For each of the datasets, show the file name, description, number of records, primary key column and a link to the column mapping page. Only the description is entered by the user; the other columns are populated by the backend.
When creating a new dataset, the user will select the input file and enter the description. The backend will process the file and determine column names (suggestions if the input file has no headers), data types, the primary key for the file (one or more columns), and the initial column mapping of the file to the master record. Once that is complete, the results will be shown to the user, who can make changes. For each of the columns show the following:
- Column name (can be edited by the user)
- Data type (can be edited by the user; options: Numeric, String, Date, Enumeration)
- Toggle to mark the column as a primary key column
- Mapped column - a dropdown with columns from the master record, plus an option to add it as a new column and an option to skip it entirely
Once the user has edited the dataset options, the values should be saved to the database and the master records should be updated.
When parsing the new file in the backend, use pandas to parse the CSV file. Determine the data type for each of the columns from the parsed data frame as one of three options: numeric, string, date. If the data type is numeric or string, check if the column is an enumeration. Use a simple heuristic for this: if the number of unique values is less than 5% of the rows, or less than a configurable value (default 10), consider the data type to be enumeration.
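As a rough sketch of this step, the per-column type detection and the enumeration heuristic could look like the following with pandas (the 5% ratio and the default of 10 unique values are the configurable thresholds mentioned above; the function names and the date-sniffing helper are illustrative):

```python
# Sketch of per-column type detection with the enumeration heuristic.
# A column is an enumeration if its unique count is below 5% of the rows
# or below a configurable absolute count (default 10).
import pandas as pd
from pandas.api.types import is_numeric_dtype, is_datetime64_any_dtype

def detect_column_type(series: pd.Series, enum_ratio=0.05, enum_max_unique=10) -> str:
    if is_numeric_dtype(series):
        base_type = "numeric"
    elif is_datetime64_any_dtype(series) or _looks_like_date(series):
        return "date"
    else:
        base_type = "string"
    # Enumeration check only applies to numeric and string columns.
    unique_count = series.nunique(dropna=True)
    if unique_count < enum_max_unique or unique_count < enum_ratio * len(series):
        return "enumeration"
    return base_type

def _looks_like_date(series: pd.Series) -> bool:
    # Try parsing a small sample as dates; treat the column as a date
    # column only if every sampled value parses.
    sample = series.dropna().head(20)
    parsed = pd.to_datetime(sample, errors="coerce")
    return len(sample) > 0 and not parsed.isna().any()
```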
Next, check if the file contains a header row, and if it does not, create starting suggestions for column names as ‘Data type X’, for example ‘Date 1’, ‘Date 2’, etc. We’ll refine these suggestions in a later step.
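A possible sketch of the header detection and the fallback name suggestions, using csv.Sniffer's built-in header heuristic (one possible approach, not mandated by the spec):

```python
# Sketch: detect a header row and, when absent, suggest names like "Date 1".
import csv
from collections import Counter

def has_header_row(path: str) -> bool:
    with open(path, newline="") as f:
        return csv.Sniffer().has_header(f.read(4096))

def suggest_column_names(column_types: list[str]) -> list[str]:
    # column_types is the per-column type list from the detection step,
    # e.g. ["date", "numeric", "string"] -> ["Date 1", "Numeric 1", "String 1"].
    counters = Counter()
    names = []
    for col_type in column_types:
        counters[col_type] += 1
        names.append(f"{col_type.capitalize()} {counters[col_type]}")
    return names
```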
The next step is to determine the primary key for the data set. Initially, all columns are candidates for the primary key. Discard date and enumeration columns. For the remaining columns, take an X% random sample from the data set and check for uniqueness (the percentage should be configurable). For those that are unique on the sample, check for uniqueness on the entire data set. Those that are still unique are the candidates for the primary key. If there is more than one, prefer string columns, columns that don't contain ['currency', 'net', 'gross', 'value'] (configurable list) in the header, and those that do contain ['id', 'identifier', 'number', 'num'] (configurable list) in the header.
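A sketch of how the primary key selection could be implemented, with the sample percentage and both keyword lists kept configurable as described above (single-column keys only, for brevity):

```python
# Sketch of primary key candidate selection from a parsed DataFrame.
import pandas as pd

AVOID_WORDS = ["currency", "net", "gross", "value"]    # configurable
PREFER_WORDS = ["id", "identifier", "number", "num"]   # configurable

def find_primary_key(df: pd.DataFrame, column_types: dict, sample_pct=0.1) -> list:
    # Start with all columns, then discard date and enumeration columns.
    candidates = [c for c in df.columns
                  if column_types[c] not in ("date", "enumeration")]
    # Cheap uniqueness check on a random sample first...
    sample = df.sample(frac=sample_pct) if len(df) > 10 else df
    candidates = [c for c in candidates if sample[c].is_unique]
    # ...then confirm uniqueness on the entire data set.
    candidates = [c for c in candidates if df[c].is_unique]
    if not candidates:
        return []
    # Prefer string columns, headers without the "avoid" words, and headers
    # containing the "prefer" words, in that order.
    def score(col):
        name = str(col).lower()
        return (column_types[col] == "string",
                not any(w in name for w in AVOID_WORDS),
                any(w in name for w in PREFER_WORDS))
    return [max(candidates, key=score)]
```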
Next, we need to find a suitable column in the master record that we can use to join the master record and the new file. We won't try to do anything complex here. Try to join the data set with the master record data on each of the master record columns of the same data type as the primary key, and choose the column with the largest number of matched records. Of course, if the master record is empty (no data sets defined yet), this step can be skipped, as the first data set will be the master record until another data set is added.
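The join column search could be sketched as follows (here `master` is assumed to already be a pandas DataFrame built from the existing master records; names are illustrative):

```python
# Sketch: pick the master record column that matches the most primary key
# values of the new data set.
import pandas as pd

def find_join_column(new_df: pd.DataFrame, pk_column: str, pk_type: str,
                     master: pd.DataFrame, master_types: dict):
    if master.empty:
        return None  # first data set simply becomes the master record
    pk_values = set(new_df[pk_column].dropna())
    best_col, best_matches = None, 0
    for col in master.columns:
        # Only consider master columns of the same data type as the key.
        if master_types.get(col) != pk_type:
            continue
        matches = master[col].isin(pk_values).sum()
        if matches > best_matches:
            best_col, best_matches = col, matches
    return best_col
```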
Now that we can join the new data set to the master record, we'll try to detect duplicate data columns and suggest excluding them from the merge. To generate suggestions for column mapping to the master record, use the following procedure (see the sketch after this list):
- If the master record is empty, simply suggest that all the columns should become new columns of the master record
- For each of the columns of the data set, each column of the master record is a match candidate if it is of the same type
- If the column data type is numeric or date, just compare the values to other numeric/date columns in the master record (again, first do an X% random sampling of records that appear both in the new data set and in the master records).
- If the data type is string, first perform an exact comparison to other string attributes of the master record (ignore: case, spaces, punctuation marks, underscores). If it is a match, then suggest that column as the mapped column.
- Determine if values in the column are uniquely determined by the values in one of the master record columns (for each unique value of the master record column, the uniquely determined column should have the same value). For example, column File3#Feed is uniquely determined by File1#Book. If a column is uniquely determined by a column in the master record, then suggest that column as the mapped column.
- Look if the column is a complex attribute formed by combining other columns, in the form A_B-C D, i.e. multiple attributes joined by separator characters (underscores, dashes or spaces). Split the values in the column on separator characters, and check if they all have the same number of sections. If so, for each of the sections try to find an exactly matching column in the master record. If all sections have an exact mapping, set the column mapping suggestion to the list of those columns. For example, if there is a column A in the data set, and its sections map to columns B and C in the master record, suggest that column A should be mapped to attributes B and C.
- For any columns that are matched to the master record and don't have a header name in the input file, override the column name suggestion to be the same as the matched column (first match)
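To make some of the checks above concrete, here is a sketch of the normalized string comparison, the "uniquely determined" check and the section split for complex attributes (`joined` is assumed to be the new data set already joined to the master record on the join column found earlier; helper names are illustrative):

```python
# Sketch of three of the mapping checks described above.
import re
import pandas as pd

def normalize(value) -> str:
    # Ignore case, spaces, punctuation marks and underscores when comparing.
    return re.sub(r"[\s_\W]+", "", str(value)).lower()

def columns_match_exactly(joined: pd.DataFrame, new_col: str, master_col: str) -> bool:
    # Each row of `joined` holds both the new column value and the master
    # record column value, so an exact match is a row-wise comparison.
    left = joined[new_col].map(normalize)
    right = joined[master_col].map(normalize)
    return bool((left == right).all())

def uniquely_determined_by(joined: pd.DataFrame, new_col: str, master_col: str) -> bool:
    # For every unique value of the master column, the new column must hold
    # a single value (e.g. File3#Feed is uniquely determined by File1#Book).
    return bool((joined.groupby(master_col)[new_col].nunique() <= 1).all())

def split_sections(series: pd.Series):
    # Complex attribute check: split on underscores, dashes and spaces and
    # require the same number of sections (more than one) in every row.
    parts = series.astype(str).apply(lambda v: re.split(r"[_\-\s]+", v))
    lengths = parts.map(len).unique()
    return parts if len(lengths) == 1 and lengths[0] > 1 else None
```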
Once the user updates the mappings, save the master record configuration to the database, and merge the new input file data with data from the master records. Save the list of potential matches for each of the columns as well.
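The merge itself could be sketched as an outer join on the detected join column, assuming that duplicate columns (those mapped to existing master attributes) have already been excluded and that the primary key column maps onto the join column:

```python
# Sketch: merge a newly mapped data set into the existing master records.
# Assumes `column_mapping` carries only the join key plus new attributes,
# with duplicate columns already excluded from the merge.
import pandas as pd

def merge_into_master(master: pd.DataFrame, new_df: pd.DataFrame,
                      join_column: str, column_mapping: dict) -> pd.DataFrame:
    # Keep only mapped columns, renamed to their master record attribute names.
    mapped = new_df[list(column_mapping)].rename(columns=column_mapping)
    if master.empty:
        return mapped  # the first data set becomes the master record
    # Outer join so rows present in only one side are kept as well.
    return master.merge(mapped, how="outer", on=join_column)
```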
Manage column mappings
This page will display the current master record definition. For each attribute of the master record display attribute name, number of mapped columns and a list of columns from datasets that are mapped to that attribute. Show options to remove mapped columns or to add more mapped columns.
After the last attribute, add an option to create a new attribute and map columns to it.
If any mappings are changed, the master records should be recreated. As this is a POC, we can just recreate all the records from scratch, but in the future this can be expanded to do an intelligent merge (override just the changed columns).
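A sketch of the full rebuild, assuming each dataset document stores its file path, join column and column mapping in MongoDB, and reusing the merge_into_master helper sketched earlier (field and collection names are illustrative):

```python
# Sketch: rebuild all master records from scratch after a mapping change.
import pandas as pd

def rebuild_master_records(db, entity_id):
    master = pd.DataFrame()
    # Re-merge every data set of the entity in the order it was added.
    for ds in db.datasets.find({"entity_id": entity_id}).sort("created_at"):
        new_df = pd.read_csv(ds["file_path"])
        master = merge_into_master(master, new_df,
                                   ds["join_column"], ds["column_mapping"])
    # Replace the stored master records for this entity wholesale.
    db.master_records.delete_many({"entity_id": entity_id})
    records = master.to_dict("records")
    for r in records:
        r["entity_id"] = entity_id
    if records:
        db.master_records.insert_many(records)
```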
Browse master record
This page should display a paginated list of entries in the master record, along with a search box to filter the results. Filtering will just search all columns of the master record for a match of the filter term (partial matches should be included too)
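One way to implement the partial-match filter is a case-insensitive regex $or across the master record fields (a sketch; the document layout is an assumption, and regex matching only applies to fields stored as strings):

```python
# Sketch: paginated browse of master records with a partial-match filter
# applied across all columns of the stored documents.
import re

def browse_master_records(db, entity_id, search_term="", page=1, page_size=50):
    query = {"entity_id": entity_id}
    if search_term:
        # Derive the column list from one stored record (layout assumption).
        sample = db.master_records.find_one({"entity_id": entity_id}) or {}
        fields = [c for c in sample if c not in ("_id", "entity_id")]
        if fields:
            regex = re.compile(re.escape(search_term), re.IGNORECASE)
            query["$or"] = [{c: regex} for c in fields]
    cursor = (db.master_records.find(query)
              .skip((page - 1) * page_size)
              .limit(page_size))
    return list(cursor), db.master_records.count_documents(query)
```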
Reporting
Reporting page should allow the user to select two datasets and compare them. The comparison should display which columns are common for both datasets, which columns are only in the first one and which columns are only in the second one.
For each column of both data sets, display all the potentially matching columns in the master record.
Additionally add 3 paginated tables showing rows that are common to both data sets, rows that appear only in the first one, and rows that appear only in the second one.
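The row-level split can lean on pandas' merge indicator, assuming both data sets can be joined on the master record attribute their primary keys map to (a sketch):

```python
# Sketch: split rows into "in both", "only in first" and "only in second"
# using an outer merge with an indicator column.
import pandas as pd

def compare_datasets(df1: pd.DataFrame, df2: pd.DataFrame, key: str):
    merged = df1.merge(df2, on=key, how="outer", indicator=True,
                       suffixes=("_first", "_second"))
    return {
        "common": merged[merged["_merge"] == "both"],
        "only_first": merged[merged["_merge"] == "left_only"],
        "only_second": merged[merged["_merge"] == "right_only"],
    }
```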
Final Submission Guidelines
Submit the full source code for the POC
Submit a verification video
Submit a verification document explaining the merge process on the provided sample files.