Challenge Overview
We have a banking client with a large number of data sets in the form of CSV files (on the order of thousands of files), exported from the various systems they use. The data feeds have been independently specified, developed and maintained. As there are fewer sources of data than there are feeds, and a limited number of business entities contained within these feeds (Products, Trades, Risk, Settlements, P&L, Reference Data, etc.), there is a large amount of known and expected overlap and redundancy between the feeds. In order to reduce that duplication, they have asked us to build a proof-of-concept Data Convergence Analysis tool that will do the following:
- Identify primary keys in each of the data feeds
- Merge the feeds by joining them on their primary keys and removing duplicate columns - creating a master record
- Provide reports on the merged data feeds
General considerations
The proposed solution will need to account for the large number of input files (tens of thousands) and the large number of records to be processed (tens of GB of data). These considerations are especially relevant when detecting matching/non-matching records, so the solution should scale both vertically and horizontally.
The client has provided us with sample data files - 9 files with 100 rows each - and they are available here.
Identifying primary keys
This is the first step in building the tool, so we're asking you to design an algorithm for identifying the primary key in a data set. The primary key is assumed to consist of one or more columns, so the proposed algorithm must be flexible enough to detect primary keys reliably. No user intervention is expected to correct the output, i.e. the primary keys should be detected correctly every time.
The proposed algorithm should be deterministic and should rely on (but not be limited to) features found in the data set, such as those listed below; a minimal sketch of one possible approach follows the list:
- Column headers
- Data types
- Data variance/deviation/covariance of columns
- Sampling small data sets from the larger data set
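One possible shape for such an algorithm is to rank columns by uniqueness on a sample, test small column combinations for uniqueness, and confirm any candidate against the full file. The sketch below assumes pandas is available; the uniqueness threshold, sample size and maximum key size are illustrative assumptions, not part of the requirements.

```python
# Sketch of a deterministic primary-key detector, assuming pandas is available.
# The thresholds and the max_key_size limit are illustrative assumptions.
from itertools import combinations

import pandas as pd


def candidate_key_columns(df: pd.DataFrame, min_uniqueness: float = 0.5) -> list[str]:
    """Rank columns by uniqueness ratio; low-cardinality columns are unlikely key parts."""
    ratios = {col: df[col].nunique(dropna=False) / len(df) for col in df.columns}
    return [col for col, r in sorted(ratios.items(), key=lambda kv: -kv[1]) if r >= min_uniqueness]


def find_primary_key(path: str, sample_rows: int = 10_000, max_key_size: int = 3) -> list[str] | None:
    """Detect a primary key on a sample, then confirm it on the full file."""
    sample = pd.read_csv(path, nrows=sample_rows)
    candidates = candidate_key_columns(sample)

    for size in range(1, max_key_size + 1):
        for combo in combinations(candidates, size):
            # A key must be non-null and unique on the sample before the full check.
            if sample[list(combo)].isna().any().any():
                continue
            if sample.duplicated(subset=list(combo)).any():
                continue
            # Confirm on the full file (only the candidate columns are loaded).
            full = pd.read_csv(path, usecols=list(combo))
            if not full.duplicated().any():
                return list(combo)
    return None  # no combination within the size limit forms a key
```

Because columns are ranked and combinations enumerated in a fixed order, the result is deterministic for a given file; the sampling step keeps the combinatorial search off the full data set.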
Creating a master record
Once all the primary keys are identified, the next step is to merge the data and create a master record - one that contains all the relevant data with as little duplication as possible. This is not an easy task, as columns in two feeds might have different headers (or no headers at all, see file No 8), different formats (for example comma vs dot as decimal separator), conflicting data (full company name vs short name), etc. To solve these issues you should propose a second algorithm for creating the master record.

You can assume that the initial master record will contain all the columns from the individual files and that your algorithm should refine it by joining duplicate columns. Joining the columns should be an iterative procedure that combines an algorithm for matching columns with steps for incorporating user feedback to improve the matching. For example, if the initial master record contains 100 columns, 20 of which are joined automatically in the first run (primary keys, duplicate data, etc.), the end user can then specify rules for merging additional columns, and that information should improve the algorithm's detection of candidate columns for merging (incorporating user feedback is a very important part of the overall task).

If the algorithm can't join one of the files on its primary keys, a warning should be generated, since no records would actually be merged in that case and user intervention is required to avoid total data duplication. Bear in mind that the amount of data is large, so any brute force approach will probably not be a good fit.
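As a rough illustration of the column-matching step only, the sketch below assumes two feeds have already been joined on their detected primary keys; it scores column pairs by the fraction of rows on which their normalized values agree, and lets user-confirmed rules override the score. The function names, the normalization rules and the 0.95 threshold are assumptions, and at scale the pairwise loop would need pruning (for example by data type or column signatures) rather than comparing every pair.

```python
# Sketch of the column-matching step, assuming the feeds are already joined on
# their detected primary keys. Names, normalization and threshold are illustrative.
import pandas as pd


def normalize(series: pd.Series) -> pd.Series:
    """Smooth over formatting differences, e.g. comma vs dot as decimal separator."""
    s = series.astype(str).str.strip().str.lower()
    return s.str.replace(",", ".", regex=False)


def value_overlap(a: pd.Series, b: pd.Series) -> float:
    """Fraction of joined rows on which two columns carry the same normalized value."""
    a_n, b_n = normalize(a), normalize(b)
    return float((a_n == b_n).mean())


def propose_merges(joined: pd.DataFrame, left_cols, right_cols,
                   confirmed_rules: set[tuple[str, str]], threshold: float = 0.95):
    """Return column pairs to merge: user-confirmed rules first, then high-overlap pairs."""
    proposals = []
    for lc in left_cols:
        for rc in right_cols:
            if (lc, rc) in confirmed_rules:
                proposals.append((lc, rc, 1.0))        # user feedback wins outright
            else:
                score = value_overlap(joined[lc], joined[rc])
                if score >= threshold:
                    proposals.append((lc, rc, score))  # candidate for automatic merge
    return proposals
```

In an iterative setup, each rule the user confirms or rejects would be added to the rule set (and could also adjust the scoring, e.g. by weighting header-name similarity), and a separate check would emit the required warning whenever a file cannot be joined on its primary keys at all.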
Creating comparison reports
This section is here just to provide insight into what the above algorithms will be used for. You are not required to provide any recommendations for items in this section.
We'll build a dataset comparison tool (web app) that will provide the following insight into the data (a minimal sketch of these operations follows the list):
- row intersection - rows that are present in both data sets A and B
- row complement - rows that are present in data set A but not in B
- attribute intersection - attributes of set A that are present in set B (these would be merged in the master record)
- attribute complement - attributes of set A that are not present in set B
- attribute differences - common attributes that contain different data values
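Although no recommendation is required for this section, these reports correspond to straightforward set operations over keyed rows and column names. A minimal sketch, assuming both feeds share a single detected key column (the column name "key" is an assumption):

```python
# Minimal sketch of the comparison reports as set operations over two keyed feeds.
# The shared key column name is an assumption for illustration.
import pandas as pd


def compare_feeds(a: pd.DataFrame, b: pd.DataFrame, key: str):
    keys_a, keys_b = set(a[key]), set(b[key])

    row_intersection = keys_a & keys_b          # rows present in both A and B
    row_complement = keys_a - keys_b            # rows present in A but not in B

    attr_intersection = set(a.columns) & set(b.columns)   # attributes of A present in B
    attr_complement = set(a.columns) - set(b.columns)     # attributes of A not in B

    # Attribute differences: common attributes whose values differ on shared rows.
    merged = a.merge(b, on=key, suffixes=("_a", "_b"))
    attr_differences = {
        col for col in attr_intersection - {key}
        if (merged[f"{col}_a"] != merged[f"{col}_b"]).any()
    }
    return row_intersection, row_complement, attr_intersection, attr_complement, attr_differences
```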