Challenge Overview

Challenge Summary

 

Welcome to "HP Data Quality ETL App Architecture Challenge". We want to build an Extract Transform and Load (ETL) application that will allow user to moves data between a source file or database and a destination directory or database. The goal of this challenge is to create the underlying architecture for this Data Quality app.

 

Challenge Details

Competition Task Overview

The main task of this competition is to build the underlying architecture for the HP Data Quality ETL App. The design for the app has been provided for reference, and a prototype challenge to implement it is currently in progress.

 

NOTE - We’re making a small change to the design assets to add schema entry functionality to the Create New Job and Edit Job workflows. We’ll provide the updated design assets in the forums as soon as they’re available.

 

Background Overview

The HP Data Quality team is tasked with moving large amounts of data on a regular basis. They would like to develop an Extract, Transform, and Load tool that allows them to move file and relational data on a regular basis, and to create a monitoring service which can display the status of the currently scheduled and recently completed jobs. Often they are moving data between regular and HDFS file systems.

 

The application will need to handle the following data source types:  

- Fixed Width Text Files

- Delimited Text Files (such as csv files)

- JSON Files

- XML Files

- Excel Files

- Relational Databases (Oracle, SQL Server, MySQL)

- Columnar Databases (Redshift and Vertica)

- NoSQL Databases (MongoDB)

- Hadoop Databases (Hive)

 

Technology Considerations

 

1.  Java should be the primary development language.

2.  Use Maven as the build and dependency management tool.

3.  Build back-end services as a separate REST-based application.

4.  Build front-end web interfaces using AngularJS.

5.  Use MySQL / Amazon Aurora as the database.

6.  We don't need to worry about SSO for now; basic or token-based authentication is sufficient.
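
For illustration only, a back-end endpoint under item 3 might look like the sketch below. Spring Boot is an assumed framework choice (the spec does not mandate one), and the path, record shape, and token handling are hypothetical:

```java
import java.util.List;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestHeader;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical projects endpoint; names and paths are illustrative only.
@RestController
@RequestMapping("/api/projects")
public class ProjectController {

    // Minimal summary shape, echoing the project summary fields listed later.
    public record ProjectSummary(String name, String description, String createdBy) {}

    @GetMapping
    public List<ProjectSummary> list(@RequestHeader("Authorization") String token) {
        // A real service would validate the token (basic or token-based auth)
        // and filter results by ownership: admins see all, users see their own.
        return List.of(new ProjectSummary("Sample Project", "Demo", "user1"));
    }
}
```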

 

Scope

 

We are interested in building the following components:

 

Delimited File Input Component

Delimited File Output Component

Excel Input

Excel Output

Database Input

Database Output

Vertica Input

Vertica Output

 

Please also note that for the initial release the tool is expected to have only one input source and one destination target. That is bound to change sooner or later, so the architecture must be flexible enough to support multiple sources and targets.
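
One way to keep that flexibility is to hide every source and target behind small interfaces and let a job hold lists of each, so the one-source/one-target rule stays a policy rather than a structural limit. A minimal sketch, with illustrative names not taken from the spec:

```java
import java.util.List;
import java.util.Map;

// Illustrative component abstractions: each concrete component (Delimited File,
// Excel, Database, Vertica, ...) would implement one of these interfaces.
interface InputComponent {
    Iterable<Map<String, Object>> read();           // stream of records keyed by field name
}

interface OutputComponent {
    void write(Iterable<Map<String, Object>> rows); // consume the mapped records
}

// A job holds lists of sources and targets even though release one allows
// exactly one of each, so the constraint can be relaxed without redesign.
class EtlJob {
    private final List<InputComponent> sources;
    private final List<OutputComponent> targets;

    EtlJob(List<InputComponent> sources, List<OutputComponent> targets) {
        this.sources = sources;
        this.targets = targets;
    }

    void run() {
        for (InputComponent in : sources) {
            Iterable<Map<String, Object>> rows = in.read();
            // Note: feeding multiple targets assumes the Iterable is re-readable.
            for (OutputComponent out : targets) {
                out.write(rows);
            }
        }
    }
}
```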

 

Detailed Application Notes

 

Users

 

There will be two kinds of users of this app: User and Admin.

 

- Admin Users will have visibility to all Projects and Jobs in the System.  Normal Users will only have access to their own projects and jobs.

- It should be possible to view a summary of all activity in the app, such as:

-- Projects  

-- Jobs

-- Currently Running Jobs

-- Upcoming Jobs

-- Completed Jobs

-- List of Active and Archived Projects.  

 

- Users should be able to see a summary of each project:

-- Project Name

-- Project Description

-- Created On Timestamp

-- Modified On Timestamp

-- Created By

-- Modified By

-- Status: Running/Total Jobs

-- Action buttons (Open project details, Edit Project, Archive Project)

 

Job Description

 

It is expected that this model includes:

-- Job Name

-- Job Description

-- Creation Date

-- Created By

-- Modification Date

-- Modified By

-- Once the Job Description is complete, the user clicks Next

 

Workspace

- Once the basic details of a job have been declared, the user should be able to enter a workspace where they can enter the details of their data migration job. It has several parts:

-- Component List - The ETL tool will also need a list of components. In the toolbox there should be input and output versions of each of the components listed above (e.g. Input Delimited Text File, Output Delimited Text File, Input XML File, Output XML File, etc.). We don't need to detail all of the components in the initial release of the application, but let's define the following fields for these components:

 

Delimited File Input Component:

File Path (String)

Encoding (ISO-8859-15 | UTF-8)

Row Separator (LF (“\n”) | CR (“\r”) | CRLF (“\r\n”))

Field Separator (String)

Is CSV? (checkbox)

Escape Character (visible if Is CSV is checked) (String)

Text Enclosure (visible if Is CSV is checked) (String)

Header (Integer)

Footer (Integer)

Limit (Integer)

Trim Input? (checkbox)
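
For reference, these fields map one-to-one onto a simple configuration model. The class below is a hypothetical sketch, not a required API:

```java
// Hypothetical configuration model for the Delimited File Input component,
// mirroring the field list above one-to-one.
public class DelimitedFileInputConfig {
    public enum Encoding { ISO_8859_15, UTF_8 }
    public enum RowSeparator { LF, CR, CRLF }

    String filePath;
    Encoding encoding = Encoding.UTF_8;
    RowSeparator rowSeparator = RowSeparator.LF;
    String fieldSeparator = ",";
    boolean isCsv;
    String escapeCharacter;   // only meaningful when isCsv is true
    String textEnclosure;     // only meaningful when isCsv is true
    int header;               // rows to skip at the top of the file
    int footer;               // rows to ignore at the bottom of the file
    int limit;                // maximum rows to read (0 = no limit)
    boolean trimInput;
}
```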

 

Delimited File Output Component:

File Name (String)

Row Separator  (String)

Field Separator  (String)

Append (checkbox/bool)

Include Header (checkbox/bool)

Compress as Zip File (checkbox/bool)

Is CSV File? (checkbox/bool)

CSV Escape Character   (String)

CSV Text Enclosure   (String)

Encoding (ISO-8859-15 | UTF-8)

 

Excel Input:

Excel Version (2007 - xlsx)

File Name (String)

Sheet Name (String)

Header (Int)

Footer (Int)

Limit (Int)

First Column (Int)

Last Column (Int)

Encoding (ISO-8859-15 | UTF-8)
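
Reading such a sheet could look like the sketch below, assuming Apache POI as the Excel library; POI, the file and sheet names, and the hard-coded field values are all assumptions, not requirements:

```java
import java.io.File;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

public class ExcelInputSketch {
    public static void main(String[] args) throws Exception {
        // File and sheet names are placeholders for the component's settings.
        try (Workbook wb = WorkbookFactory.create(new File("data.xlsx"))) {
            Sheet sheet = wb.getSheet("Sheet1");
            DataFormatter fmt = new DataFormatter();
            int header = 1, footer = 0, limit = 100;      // Header / Footer / Limit
            int firstCol = 0, lastCol = 3;                // First / Last Column
            int start = sheet.getFirstRowNum() + header;
            int end = Math.min(sheet.getLastRowNum() - footer, start + limit - 1);
            for (int r = start; r <= end; r++) {
                Row row = sheet.getRow(r);
                if (row == null) continue;                // skip physically empty rows
                StringBuilder line = new StringBuilder();
                for (int c = firstCol; c <= lastCol; c++) {
                    if (c > firstCol) line.append('|');
                    line.append(fmt.formatCellValue(row.getCell(c)));
                }
                System.out.println(line);
            }
        }
    }
}
```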

 

Excel Output:

Excel Version (2007 - xlsx)

File Name (String)

Sheet Name (String)

Include Header (bool)

 

Database Input

JDBC URL  (String)

Database Version (enumeration from Settings - pairs JAR Files and Database Types/Versions)

Class Name  (String)

Username (String)

Password (String)

Table Name (String)

Query  (String - Text Area)

 

Database Output

JDBC URL (String)

Database Version (enumeration from Settings - pairs JAR Files and Database Types/Versions)

Class Name (String)

Username (String)

Password (String)

Table Name (String)

Action on Data: Insert | Update | Insert or Update | Update or Insert | Delete

Clear Data in Table?  (Boolean)

Query  (String - Text Area)
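
Since the relational settings above ultimately describe a JDBC connection, a hypothetical input reader might look like this (URL, credentials, and query are placeholders):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class DatabaseInputSketch {
    public static void main(String[] args) throws Exception {
        // JDBC URL, credentials, and query correspond to the component fields
        // above; the values here are placeholders only.
        String jdbcUrl = "jdbc:mysql://localhost:3306/etl_demo";
        String query = "SELECT * FROM source_table";
        try (Connection conn = DriverManager.getConnection(jdbcUrl, "user", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(query)) {
            ResultSetMetaData meta = rs.getMetaData();
            while (rs.next()) {
                // Each row becomes a record keyed by column name, ready for mapping.
                for (int i = 1; i <= meta.getColumnCount(); i++) {
                    System.out.printf("%s=%s ", meta.getColumnLabel(i), rs.getObject(i));
                }
                System.out.println();
            }
        }
    }
}
```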

 

Vertica Input:

Version: (Vertica 7.0.X)

Host:  (String)

Port:  (String)

Database:  (String)

Schema: (String)

Username: (String)

Password: (Masked String)

Table: (String)

Query (String)

 

Vertica Output:

Version: (Vertica 7.0.X)

Host:  (String)

Port:  (String)

Database:  (String)

Schema: (String)

Username: (String)

Password: (Masked String)

Table: (String)

Action on Table: (Create Table if it does not exist | Create Table | Drop and Create Table | Drop Table if exists and Create Table | Clear Table)

Action on Data: Insert | Update | Insert or Update | Update or Insert | Delete
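
The Action on Table choices map naturally onto DDL issued before the load. A hedged sketch of that mapping, with illustrative SQL (the exact statements would depend on the user's column definitions):

```java
import java.util.List;

// Illustrative mapping from the "Action on Table" enumeration to the DDL
// statements run before loading data; columnDdl stands for the user's
// column definitions.
enum ActionOnTable {
    CREATE_IF_NOT_EXISTS, CREATE, DROP_AND_CREATE, DROP_IF_EXISTS_AND_CREATE, CLEAR;

    // Returns the DDL statements to execute, in order, before the load.
    List<String> toSql(String table, String columnDdl) {
        String create = "CREATE TABLE " + table + " (" + columnDdl + ")";
        switch (this) {
            case CREATE_IF_NOT_EXISTS:
                return List.of("CREATE TABLE IF NOT EXISTS " + table + " (" + columnDdl + ")");
            case CREATE:
                return List.of(create);
            case DROP_AND_CREATE:
                return List.of("DROP TABLE " + table, create);
            case DROP_IF_EXISTS_AND_CREATE:
                return List.of("DROP TABLE IF EXISTS " + table, create);
            case CLEAR:
                return List.of("TRUNCATE TABLE " + table);
            default:
                throw new IllegalStateException();
        }
    }
}
```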

 

 

Each component will have the set of configuration fields defined above.

 

-- Workspace - A job is created by dragging components into the workspace.  Each job will have two components - a source and a destination.

 

Mapping Interface

 

- One of the most important elements of the job is the mapping interface where source and destination fields are mapped to each other.

- When an input and output component both have a schema definition, the user should be able to bring up a mapping screen that allows them to map input fields to output fields and apply some basic transformations.
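
One hypothetical shape for that model is a list of source-to-destination pairs, each carrying an optional transformation:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical mapping model: each entry ties a source field to a destination
// field with an optional basic transformation (trim, uppercase, etc.).
class FieldMapping {
    final String sourceField;
    final String destinationField;
    final Function<Object, Object> transform;

    FieldMapping(String src, String dst, Function<Object, Object> transform) {
        this.sourceField = src;
        this.destinationField = dst;
        this.transform = transform;
    }

    // Applies a set of mappings to one input record, producing the output record.
    static Map<String, Object> apply(List<FieldMapping> mappings, Map<String, Object> row) {
        Map<String, Object> out = new HashMap<>();
        for (FieldMapping m : mappings) {
            out.put(m.destinationField, m.transform.apply(row.get(m.sourceField)));
        }
        return out;
    }
}
```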

 

Job Runner

- As jobs are created, it will be important to actually execute a job in real time to verify results.

- The app should provide some progress indication as data is read from the source and written to the destination.
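
Progress reporting could be as simple as a listener the runner invokes every so many rows. A minimal sketch, with illustrative interface names (the source and sink stand in for the components sketched earlier):

```java
import java.util.Iterator;
import java.util.Map;
import java.util.function.Consumer;

// Illustrative progress hook: the runner reports status as rows move from
// source to destination so the UI can display live job progress.
interface ProgressListener {
    void onProgress(long rowsProcessed);
    void onComplete(long totalRows);
    void onError(Exception cause);
}

class JobRunner {
    void run(Iterator<Map<String, Object>> source,
             Consumer<Map<String, Object>> sink,
             ProgressListener listener) {
        long rows = 0;
        try {
            while (source.hasNext()) {
                sink.accept(source.next());
                rows++;
                if (rows % 1_000 == 0) {
                    listener.onProgress(rows);   // report every 1,000 rows
                }
            }
            listener.onComplete(rows);
        } catch (Exception e) {
            listener.onError(e);
        }
    }
}
```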

 

Documentation Provided:

- Design assets can be found in forums.

Final Submission Guidelines

For this challenge, we need to come up with a full design / architecture that addresses the above requirements.

 

The expected deliverables:

- TCUML containing all necessary diagrams (Classes/Models and Sequence diagrams)

- RESTful API Endpoint Documentation (Swagger documentation preferred)

- Code Challenge Specifications

- ERD / SQL Scripts

- Application Design Specification

 

Please refer to the documentation provided (posted in the challenge forums) to address the new requirements.

 

Final Submission:

- For each member, the final submission should be uploaded to: Online Review Tool (https://software.topcoder.com/review)

 

Review style

- Final Review: Community Review Board

- Approval: User Sign-Off

ID: 30056613