Challenge Overview
Challenge Summary
Welcome to the "HP Data Quality ETL App Architecture Challenge". We want to build an Extract, Transform and Load (ETL) application that will allow users to move data between a source file or database and a destination directory or database. The goal of this challenge is to create the underlying architecture for this Data Quality app.
Challenge Details
Competition Task Overview
The main task of this competition is to build the underlying architecture for the HP Data Quality ETL App. The design for the app has been provided for reference, and a prototype challenge to implement it is currently in progress.
NOTE - We’re making a small change to the design assets to add schema entry functionality to the Create New Job and Edit Job workflow. We’ll provide the updated design assets in the forums as soon as they are available.
Background Overview
The HP Data Quality team is tasked with moving large amounts of data on a regular basis. They would like to develop an Extract, Transform and Load tool that allows them to move file and relational data on a regular basis, and to create a monitoring service which can display the status of currently scheduled and recently completed jobs. Often they are moving data between regular and HDFS file systems.
The application will need to handle the following data source types:
- Fixed Width Text Files
- Delimited Text Files (such as csv files)
- JSON Files
- XML Files
- Excel Files
- Relational Databases (Oracle, SQL Server, MySQL)
- Columnar Databases (Redshift and Vertica)
- NoSQL Databases (MongoDB)
- Hadoop Databases (Hive)
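As an illustration only, the source types above might be modeled in the Java back end as a simple enumeration (the name and values below are assumptions, not part of the provided design):

```java
/** Illustrative enumeration of the supported data source types. */
public enum DataSourceType {
    FIXED_WIDTH_TEXT,
    DELIMITED_TEXT,   // e.g. CSV
    JSON,
    XML,
    EXCEL,
    RELATIONAL_DB,    // Oracle, SQL Server, MySQL
    COLUMNAR_DB,      // Redshift, Vertica
    NOSQL_DB,         // MongoDB
    HADOOP_DB         // Hive
}
```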
Technology Considerations
1. Java should be the primary development language.
2. Use Maven as the build and dependency management tool.
3. Build back-end services as a separate REST-based application.
4. Build front-end web interfaces using AngularJS.
5. Use MySQL or Amazon Aurora as the application database.
6. We don't need to worry about SSO for now; basic or token-based authentication is sufficient.
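To illustrate the intended split between the AngularJS front end and the REST back end, here is a minimal sketch of a token-checked endpoint. It assumes Spring Boot as the REST framework and a bearer-token header; both are assumptions, since the requirements only mandate a separate REST-based application with basic or token-based authentication.

```java
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestHeader;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

import java.util.List;

/** Hypothetical endpoint consumed by the AngularJS front end. */
@RestController
@RequestMapping("/api/projects")
public class ProjectController {

    @GetMapping
    public ResponseEntity<List<String>> listProjects(
            @RequestHeader(value = "Authorization", required = false) String token) {
        // Placeholder token check; a real implementation would validate
        // the token against the user store (basic or token-based auth).
        if (token == null || !token.startsWith("Bearer ")) {
            return ResponseEntity.status(HttpStatus.UNAUTHORIZED).build();
        }
        return ResponseEntity.ok(List.of("Sample Project"));
    }
}
```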
Scope
We are interested in building the following components.
Delimited File Input Component
Delimited File Output Component
Excel Input
Excel Output
Database Input
Database Output
Vertica Input
Vertica Output
Please also note that for the initial release the tool is expected to have only one input source and one destination target, although that is bound to change sooner or later; the architecture must therefore be flexible enough to support multiple sources and targets.
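One way to satisfy this flexibility requirement is to hide every component behind common input/output interfaces and let a job hold collections of components rather than single references. A minimal sketch, with all names assumed:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Any source component: delimited file, Excel, database, Vertica, ... */
interface InputComponent {
    Iterable<Map<String, Object>> read() throws Exception;
}

/** Any destination component. */
interface OutputComponent {
    void write(Iterable<Map<String, Object>> records) throws Exception;
}

/**
 * A job holds lists of inputs and outputs even though the initial release
 * restricts it to exactly one of each, so later releases can relax the
 * restriction without changing the model.
 */
class EtlJob {
    private final List<InputComponent> inputs = new ArrayList<>();
    private final List<OutputComponent> outputs = new ArrayList<>();

    void addInput(InputComponent in) { inputs.add(in); }
    void addOutput(OutputComponent out) { outputs.add(out); }
}
```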
Detailed Application Notes
Users
There will be two kinds of users of this app (User and Admin):
- Admin Users will have visibility to all Projects and Jobs in the System. Normal Users will only have access to their own projects and jobs.
- It should be possible to view a summary of all activity in the app, such as:
-- Projects
-- Jobs
-- Currently Running Jobs
-- Upcoming Jobs
-- Completed Jobs
-- List of Active and Archived Projects.
- Users should be able to see a summary of each project:
-- Project Name
-- Project Description
-- Created On Timestamp
-- Modified On Timestamp
-- Created By
-- Modified By
-- Status: Running/Total Jobs
-- Action buttons (Open project details, Edit Project, Archive Project)
Job Description
It is expected that this model includes:
-- Job Name
-- Job Description
-- Creation Date
-- Created By
-- Modification Date
-- Modified By
-- Once the Job Description is complete, the user clicks Next.
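A minimal sketch of the Project and Job models implied by the two lists above; the field names and types are assumptions derived from the listed attributes, and persistence annotations are omitted:

```java
import java.time.Instant;

/** Audit fields shared by projects and jobs. */
abstract class Auditable {
    Instant createdOn;
    Instant modifiedOn;
    String createdBy;
    String modifiedBy;
}

class Project extends Auditable {
    String name;
    String description;
    boolean archived;   // Active vs. Archived projects
    int runningJobs;    // for the "Running/Total Jobs" status
    int totalJobs;
}

class Job extends Auditable {
    String name;
    String description;
    long projectId;     // owning project
}
```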
Workspace
- Once the basic details of a job have been declared, the user should be able to enter a workspace where they can define the details of their data migration job. It has several parts:
-- Component List - The ETL tool will also need a list of components. In the toolbox there should be input and output versions of each of the components listed above (e.g. Input Delimited Text File, Output Delimited Text File, Input XML File, Output XML File, etc.). We don’t need to detail out all of the components for the initial release of the application, but let’s define the following fields for the following components:
Delimited File Input Component:
File Path (String)
Encoding (ISO-8859-15 | UTF-8)
Row Separator (LF(“\n”) | CR (“\r”) | CRLF (“\r\n”))
Field Separator (String)
Is CSV? (checkbox)
Escape Character (visible if Is CSV? is checked) (String)
Text Enclosure (visible if Is CSV? is checked) (String)
Header (Integer)
Footer (Integer)
Limit (Integer)
Trim Input? (checkbox)
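The field list above maps naturally onto a configuration object. A sketch, with all names assumed:

```java
/** Illustrative configuration object for the Delimited File Input component. */
public class DelimitedFileInputConfig {
    public enum Encoding { ISO_8859_15, UTF_8 }
    public enum RowSeparator { LF, CR, CRLF }

    private String filePath;
    private Encoding encoding = Encoding.UTF_8;
    private RowSeparator rowSeparator = RowSeparator.LF;
    private String fieldSeparator = ",";
    private boolean csv;              // Is CSV?
    private String escapeCharacter;   // only meaningful when csv == true
    private String textEnclosure;     // only meaningful when csv == true
    private int header;               // rows to skip at the top
    private int footer;               // rows to ignore at the bottom
    private int limit;                // max rows to read (0 = no limit)
    private boolean trimInput;

    // Getters and setters omitted for brevity.
}
```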
Delimited File Output Component:
File Name (String)
Row Separator (String)
Field Separator (String)
Append (checkbox/bool)
Include Header (checkbox/bool)
Compress as Zip File (checkbox/bool)
Is CSV File? (checkbox/bool)
CSV Escape Character (String)
CSV Text Enclosure (String)
Encoding (ISO-8859-15 | UTF-8)
Excel Input:
Excel Version (2007 - xlsx)
File Name (String)
Sheet Name (String)
Header (Int)
Footer (Int)
Limit (Int)
First Column (Int)
Last Column (Int)
Encoding (ISO-8859-15 | UTF-8)
Excel Output:
Excel Version (2007 - xlsx)
File Name (String)
Sheet Name (String)
Include Header (bool)
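As a sketch of how the Excel Input component might read an .xlsx sheet while honoring the Header, Footer, and Limit fields, assuming Apache POI as the library (a library choice the requirements do not mandate):

```java
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

import java.io.FileInputStream;
import java.io.IOException;

public class ExcelInputReader {

    /** Reads rows from the named sheet, honoring header/footer/limit. */
    public void read(String fileName, String sheetName,
                     int header, int footer, int limit) throws IOException {
        try (Workbook workbook = new XSSFWorkbook(new FileInputStream(fileName))) {
            Sheet sheet = workbook.getSheet(sheetName);
            int first = sheet.getFirstRowNum() + header;  // skip header rows
            int last = sheet.getLastRowNum() - footer;    // ignore footer rows
            int read = 0;
            for (int i = first; i <= last && (limit <= 0 || read < limit); i++) {
                Row row = sheet.getRow(i);
                if (row == null) continue;
                // Hand the row to the mapping/transformation stage here.
                read++;
            }
        }
    }
}
```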
Database Input
JDBC URL (String)
Database Version (enumeration from Settings - pairs JAR Files and Database Types/Versions)
Class Name (String)
Username (String)
Password (String)
Table Name (String)
Query (String - Text Area)
Database Output
JDBC URL (String)
Database Version (enumeration from Settings - pairs JAR Files and Database Types/Versions)
Class Name (String)
Username (String)
Password (String)
Table Name (String)
Action on Data (Insert | Update | Insert or Update | Update or Insert | Delete)
Clear Data in Table? (Boolean)
Query (String - Text Area)
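Both database components reduce to standard JDBC calls: the Class Name loads the driver paired with the selected Database Version, and either the free-form Query or a SELECT generated from the Table Name produces the rows. A minimal sketch of the read side:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class DatabaseInputReader {

    /** Reads rows using the component's JDBC settings. */
    public void read(String jdbcUrl, String className, String username,
                     String password, String query) throws Exception {
        Class.forName(className); // load the driver paired with the Database Version
        try (Connection conn = DriverManager.getConnection(jdbcUrl, username, password);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(query)) {
            ResultSetMetaData meta = rs.getMetaData();
            while (rs.next()) {
                for (int i = 1; i <= meta.getColumnCount(); i++) {
                    Object value = rs.getObject(i);
                    // Hand each field to the mapping/transformation stage here.
                }
            }
        }
    }
}
```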
Vertica Input:
Version: (Vertica 7.0.X)
Host: (String)
Port: (String)
Database: (String)
Schema: (String)
Username: (String)
Password: (Masked String)
Table: (String)
Query (String)
Vertica Output:
Version: (Vertica 7.0.X)
Host: (String)
Port: (String)
Database: (String)
Schema: (String)
Username: (String)
Password: (Masked String)
Table: (String)
Action on Table: (Create Table if it does not exist | Create Table | Drop and Create Table | Drop Table if it exists and Create Table | Clear Table)
Action on Data: Insert | Update | Insert or Update | Update or Insert | Delete
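The two action lists above enumerate cleanly; a sketch with assumed names:

```java
/** What to do to the target table before loading (Vertica Output). */
enum ActionOnTable {
    CREATE_IF_NOT_EXISTS,
    CREATE,
    DROP_AND_CREATE,
    DROP_IF_EXISTS_AND_CREATE,
    CLEAR
}

/** How each incoming record is applied (Database and Vertica Output). */
enum ActionOnData {
    INSERT,
    UPDATE,
    INSERT_OR_UPDATE,
    UPDATE_OR_INSERT,
    DELETE
}
```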
Each component will have an associated schema definition.
-- Workspace - A job is created by dragging components into the workspace. Each job will have two components - a source and a destination.
Mapping Interface
- One of the most important elements of the job is the mapping interface where source and destination fields are mapped to each other.
- When an input and output component both have a schema definition, users should be able to bring up a mapping screen which will allow users to map input fields to output fields and do some basic transformations.
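A minimal sketch of one way to represent a field mapping with an optional basic transformation (the names and the use of java.util.function are assumptions):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Maps one input field to one output field, with an optional transform. */
class FieldMapping {
    final String inputField;
    final String outputField;
    final Function<Object, Object> transform; // e.g. trim, uppercase, parse

    FieldMapping(String in, String out, Function<Object, Object> fn) {
        this.inputField = in;
        this.outputField = out;
        this.transform = fn != null ? fn : Function.identity();
    }

    /** Applies every mapping to a single input record. */
    static Map<String, Object> apply(List<FieldMapping> mappings,
                                     Map<String, Object> inputRecord) {
        Map<String, Object> out = new HashMap<>();
        for (FieldMapping m : mappings) {
            out.put(m.outputField, m.transform.apply(inputRecord.get(m.inputField)));
        }
        return out;
    }
}
```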
Job Runner
- As jobs are created, it will be important to execute a job in real time to verify results.
- The app should provide progress indications as data is read from the source and written to the destination.
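Progress reporting can be decoupled from the components with a listener that the runner invokes as records are copied. A sketch with assumed names:

```java
import java.util.Iterator;
import java.util.Map;
import java.util.function.Consumer;

/** Callback the job runner invokes while a job executes. */
interface ProgressListener {
    void onProgress(long recordsProcessed);
    void onComplete(long totalRecords);
    void onError(Exception e);
}

/** Minimal runner loop: read from source, write to destination, report. */
class JobRunner {
    void run(Iterator<Map<String, Object>> source,
             Consumer<Map<String, Object>> destination,
             ProgressListener listener) {
        long count = 0;
        try {
            while (source.hasNext()) {
                destination.accept(source.next());
                count++;
                if (count % 1000 == 0) {   // report every 1000 records
                    listener.onProgress(count);
                }
            }
            listener.onComplete(count);
        } catch (Exception e) {
            listener.onError(e);
        }
    }
}
```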
Documentation Provided:
- Design assets can be found in the forums.
Final Submission Guidelines
For this challenge, we need to come up with a full design / architecture that addresses the above requirements.
The expected deliverables:
- TCUML containing all necessary diagrams (Classes/Models and Sequence diagrams)
- RESTful API Endpoint Documentation (Swagger documentation preferred)
- Code Challenges Specifications
- ERD / SQL Scripts
- Application Design Specification
Please refer to the documentation provided (posted in the challenge forums) to address the new requirements.
Final Submission:
- For each member, the final submission should be uploaded to: Online Review Tool (https://software.topcoder.com/review)