Challenge Overview
In this challenge series we are building a simple tool that helps users clean a database by eliminating duplicate data according to criteria defined by the user. The data is stored in MapR-DB tables, and we will build various MapReduce/Hadoop jobs to manage that data.
In a previous challenge we built data loading jobs (CSV to MapR-DB table), and instructions are available for setting up the environment and submitting jobs to a MapR VM.
In this challenge we want to build a MapReduce job that filters the data based on a simple criterion. Input to the job is the data from the MapR-DB table (imported using the tool from the previous challenge). The job should go through all the data rows and insert each of them into one of two new tables:
- Clean data table (table name Final_#tablename) or
- Manual audit table (table name Stage_#tablename)
- We have a CSV file (the data definition file) describing the columns in the input data, including column names and valid ranges.
- For some of those columns, the input data has duplicate columns named "original_name_dup_#number". Only one of those values is expected to be in the allowed range (specified by PLOT_MIN and PLOT_MAX in the data definition file), and that is the value that should be used in the clean data table.
- If more than one value is in the allowed range, we can't decide on the correct one, so the entire row should be added to the manual audit table, which will later be reviewed by users (that review step will be built in later challenges); a sketch of this routing logic follows this list.
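To make the routing rule concrete, here is a minimal mapper sketch, assuming the HBase 1.x client API exposed by MapR-DB binary tables and a map-only job using MultiTableOutputFormat so a single mapper can write to both destination tables. The class name FilterMapper, the column family "cf", and the configuration keys filter.clean.table, filter.audit.table, and filter.ranges are illustrative placeholders rather than part of the posted base code, and cell values are assumed to be stored as text. This is one possible shape for the job, not the required implementation.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;

public class FilterMapper extends TableMapper<ImmutableBytesWritable, Put> {

    private static final byte[] CF = Bytes.toBytes("cf");      // assumed column family

    // Column name -> {PLOT_MIN, PLOT_MAX}, taken from the data definition file.
    private final Map<String, double[]> ranges = new HashMap<String, double[]>();
    private byte[] cleanTable;                                  // e.g. Final_rtc_well
    private byte[] auditTable;                                  // e.g. Stage_rtc_well

    @Override
    protected void setup(Context context) {
        cleanTable = Bytes.toBytes(context.getConfiguration().get("filter.clean.table"));
        auditTable = Bytes.toBytes(context.getConfiguration().get("filter.audit.table"));
        // Ranges are passed in as "col1:min:max,col2:min:max" (see the driver sketch below).
        for (String spec : context.getConfiguration().get("filter.ranges").split(",")) {
            String[] p = spec.split(":");
            ranges.put(p[0], new double[] { Double.parseDouble(p[1]), Double.parseDouble(p[2]) });
        }
    }

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
        Put clean = new Put(row.get());
        boolean ambiguous = false;

        for (Cell cell : value.rawCells()) {
            String name = Bytes.toString(CellUtil.cloneQualifier(cell));
            String base = name.replaceAll("_dup_\\d+$", "");    // original_name_dup_#n -> original_name
            double[] range = ranges.get(base);

            if (range == null) {
                // No range defined for this column: copy it through unchanged.
                clean.addColumn(CF, CellUtil.cloneQualifier(cell), CellUtil.cloneValue(cell));
                continue;
            }

            double v = Double.parseDouble(Bytes.toString(CellUtil.cloneValue(cell)));
            if (v >= range[0] && v <= range[1]) {
                if (clean.has(CF, Bytes.toBytes(base))) {
                    // A second in-range duplicate for the same column: we can't decide.
                    ambiguous = true;
                    break;
                }
                // Keep the single in-range value under the original column name.
                clean.addColumn(CF, Bytes.toBytes(base), CellUtil.cloneValue(cell));
            }
        }

        if (ambiguous) {
            // Copy the entire original row to the manual audit table.
            Put audit = new Put(row.get());
            for (Cell cell : value.rawCells()) {
                audit.addColumn(CF, CellUtil.cloneQualifier(cell), CellUtil.cloneValue(cell));
            }
            context.write(new ImmutableBytesWritable(auditTable), audit);
        } else {
            // With MultiTableOutputFormat the output key names the destination table.
            context.write(new ImmutableBytesWritable(cleanTable), clean);
        }
    }
}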
Note that we're interested in processing data from the rtc_well table, not rtc_stage_data.
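A matching driver might look like the sketch below, again assuming the HBase client API and MultiTableOutputFormat. The table names apply the Final_/Stage_ convention from the list above to rtc_well; if the MapR-DB tables are addressed by path on your VM, use the full table path instead. DataDefinition.loadRanges is a hypothetical helper (sketched further below) that turns the data definition CSV into the string FilterMapper expects.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.MultiTableOutputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

public class FilterJob {

    public static void main(String[] args) throws Exception {
        // args[0]: path to the data definition CSV on the local file system.
        Configuration conf = HBaseConfiguration.create();
        conf.set("filter.clean.table", "Final_rtc_well");
        conf.set("filter.audit.table", "Stage_rtc_well");
        conf.set("filter.ranges", DataDefinition.loadRanges(args[0]));

        Job job = Job.getInstance(conf, "rtc_well filter");
        job.setJarByClass(FilterJob.class);

        Scan scan = new Scan();
        scan.setCaching(500);
        scan.setCacheBlocks(false);     // recommended for MapReduce scans

        // Read every row of the imported table; the mapper emits Puts keyed by destination table.
        TableMapReduceUtil.initTableMapperJob(
                "rtc_well", scan, FilterMapper.class,
                ImmutableBytesWritable.class, Put.class, job);
        job.setOutputFormatClass(MultiTableOutputFormat.class);
        job.setNumReduceTasks(0);       // map-only job

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that the Final_rtc_well and Stage_rtc_well tables must already exist before the job runs; MultiTableOutputFormat writes to existing tables and does not create them.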
The data definition file, sample data, and base code are posted in the forums.
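For completeness, here is a tiny sketch of the loadRanges helper referenced in the driver above. It assumes the data definition CSV has a header row with columns named NAME, PLOT_MIN, and PLOT_MAX; those header names are guesses, so adjust them to the actual file posted in the forums.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DataDefinition {

    // Builds the "col:min:max,col:min:max" string consumed by FilterMapper.
    public static String loadRanges(String csvPath) throws IOException {
        List<String> specs = new ArrayList<String>();
        try (BufferedReader in = new BufferedReader(new FileReader(csvPath))) {
            List<String> header = Arrays.asList(in.readLine().split(","));
            int name = header.indexOf("NAME");          // assumed header names
            int min = header.indexOf("PLOT_MIN");
            int max = header.indexOf("PLOT_MAX");

            String line;
            while ((line = in.readLine()) != null) {
                String[] f = line.split(",");
                // Skip columns that don't define a usable range.
                if (f.length <= Math.max(min, max) || f[min].isEmpty() || f[max].isEmpty()) {
                    continue;
                }
                specs.add(f[name] + ":" + f[min] + ":" + f[max]);
            }
        }
        return String.join(",", specs);
    }
}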
Final Submission Guidelines
Submit the source code for the MapReduce job
Submit a short deployment/verification guide
Submit a short demo video (running the sample job) as an unlisted YouTube link