Challenge Overview
In this challenge series we are building a simple tool that helps users clean a database by eliminating duplicate data according to criteria defined by the user. The data is stored in MapR-DB tables, and we will build various MapReduce/Hadoop jobs to manage that data.
In a previous challenge we built data loading jobs (CSV to MapR-DB table), and instructions are available for setting up the environment and submitting jobs to a MapR VM.
In this challenge we want to build a MapReduce job that filters the data based on a simple criterion. Input to the job is the data from the MapR-DB table (imported using the tool from the previous challenge). The job should go through all the data rows and insert each of them into one of two new tables:
- Clean data table (table name Final_#tablename) or
- Manual audit table (table name Stage_#tablename)
- We have a CSV file (the data definition file) describing the columns in the input data, including column names and valid ranges.
- For some of those columns, the input data has duplicate columns named "original_name_dup_#number". Only one of those values is expected to be in the allowed range (specified by PLOT_MIN and PLOT_MAX in the data definition file), and that is the value that should be used in the clean data table.
- If more than one value is in the allowed range, we can't decide on the correct one, so the entire row should be added to the manual audit table, which will later be reviewed by users (that review step will be built in later challenges); a sketch of this routing logic follows this list.
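To make the routing rule concrete, here is a minimal mapper sketch, assuming the HBase 1.x client API exposed by MapR-DB binary tables and a map-only job using MultiTableOutputFormat so a single mapper can write to both destination tables. The class name FilterMapper, the column family "cf", and the configuration keys filter.clean.table, filter.audit.table, and filter.ranges are illustrative placeholders rather than part of the posted base code, and cell values are assumed to be stored as text. This is one possible shape for the job, not the required implementation.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;

public class FilterMapper extends TableMapper<ImmutableBytesWritable, Put> {

    private static final byte[] CF = Bytes.toBytes("cf");      // assumed column family

    // Column name -> {PLOT_MIN, PLOT_MAX}, taken from the data definition file.
    private final Map<String, double[]> ranges = new HashMap<String, double[]>();
    private byte[] cleanTable;                                  // e.g. Final_rtc_well
    private byte[] auditTable;                                  // e.g. Stage_rtc_well

    @Override
    protected void setup(Context context) {
        cleanTable = Bytes.toBytes(context.getConfiguration().get("filter.clean.table"));
        auditTable = Bytes.toBytes(context.getConfiguration().get("filter.audit.table"));
        // Ranges are passed in as "col1:min:max,col2:min:max" (see the driver sketch below).
        for (String spec : context.getConfiguration().get("filter.ranges").split(",")) {
            String[] p = spec.split(":");
            ranges.put(p[0], new double[] { Double.parseDouble(p[1]), Double.parseDouble(p[2]) });
        }
    }

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
        Put clean = new Put(row.get());
        boolean ambiguous = false;

        for (Cell cell : value.rawCells()) {
            String name = Bytes.toString(CellUtil.cloneQualifier(cell));
            String base = name.replaceAll("_dup_\\d+$", "");    // original_name_dup_#n -> original_name
            double[] range = ranges.get(base);

            if (range == null) {
                // No range defined for this column: copy it through unchanged.
                clean.addColumn(CF, CellUtil.cloneQualifier(cell), CellUtil.cloneValue(cell));
                continue;
            }

            double v = Double.parseDouble(Bytes.toString(CellUtil.cloneValue(cell)));
            if (v >= range[0] && v <= range[1]) {
                if (clean.has(CF, Bytes.toBytes(base))) {
                    // A second in-range duplicate for the same column: we can't decide.
                    ambiguous = true;
                    break;
                }
                // Keep the single in-range value under the original column name.
                clean.addColumn(CF, Bytes.toBytes(base), CellUtil.cloneValue(cell));
            }
        }

        if (ambiguous) {
            // Copy the entire original row to the manual audit table.
            Put audit = new Put(row.get());
            for (Cell cell : value.rawCells()) {
                audit.addColumn(CF, CellUtil.cloneQualifier(cell), CellUtil.cloneValue(cell));
            }
            context.write(new ImmutableBytesWritable(auditTable), audit);
        } else {
            // With MultiTableOutputFormat the output key names the destination table.
            context.write(new ImmutableBytesWritable(cleanTable), clean);
        }
    }
}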
Note that we're interested in processing data from the rtc_well table, not rtc_stage_data.
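A matching driver might look like the sketch below, again assuming the HBase client API and MultiTableOutputFormat. The table names apply the Final_/Stage_ convention from the list above to rtc_well; if the MapR-DB tables are addressed by path on your VM, use the full table path instead. DataDefinition.loadRanges is a hypothetical helper (sketched further below) that turns the data definition CSV into the string FilterMapper expects.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.MultiTableOutputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

public class FilterJob {

    public static void main(String[] args) throws Exception {
        // args[0]: path to the data definition CSV on the local file system.
        Configuration conf = HBaseConfiguration.create();
        conf.set("filter.clean.table", "Final_rtc_well");
        conf.set("filter.audit.table", "Stage_rtc_well");
        conf.set("filter.ranges", DataDefinition.loadRanges(args[0]));

        Job job = Job.getInstance(conf, "rtc_well filter");
        job.setJarByClass(FilterJob.class);

        Scan scan = new Scan();
        scan.setCaching(500);
        scan.setCacheBlocks(false);     // recommended for MapReduce scans

        // Read every row of the imported table; the mapper emits Puts keyed by destination table.
        TableMapReduceUtil.initTableMapperJob(
                "rtc_well", scan, FilterMapper.class,
                ImmutableBytesWritable.class, Put.class, job);
        job.setOutputFormatClass(MultiTableOutputFormat.class);
        job.setNumReduceTasks(0);       // map-only job

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that the Final_rtc_well and Stage_rtc_well tables must already exist before the job runs; MultiTableOutputFormat writes to existing tables and does not create them.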
The data definition file, sample data, and base code are posted in the forums.
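For completeness, here is a tiny sketch of the loadRanges helper referenced in the driver above. It assumes the data definition CSV has a header row with columns named NAME, PLOT_MIN, and PLOT_MAX; those header names are guesses, so adjust them to the actual file posted in the forums.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DataDefinition {

    // Builds the "col:min:max,col:min:max" string consumed by FilterMapper.
    public static String loadRanges(String csvPath) throws IOException {
        List<String> specs = new ArrayList<String>();
        try (BufferedReader in = new BufferedReader(new FileReader(csvPath))) {
            List<String> header = Arrays.asList(in.readLine().split(","));
            int name = header.indexOf("NAME");          // assumed header names
            int min = header.indexOf("PLOT_MIN");
            int max = header.indexOf("PLOT_MAX");

            String line;
            while ((line = in.readLine()) != null) {
                String[] f = line.split(",");
                // Skip columns that don't define a usable range.
                if (f.length <= Math.max(min, max) || f[min].isEmpty() || f[max].isEmpty()) {
                    continue;
                }
                specs.add(f[name] + ":" + f[min] + ":" + f[max]);
            }
        }
        return String.join(",", specs);
    }
}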
Final Submission Guidelines
Submit the source code for the MapReduce job
Submit a short deployment/verification guide
Submit a short demo video (running the sample job) as an unlisted YouTube link