Challenge Overview

Background

The Bill & Melinda Gates Foundation’s (BMGF) Healthy Birth, Growth, and Development (HBGD) program addresses the dual problem of growth faltering/stunting and poor neurocognitive development, including contributing factors such as fetal growth restriction and preterm birth. The HBGD program is creating a unified strategy for integrated interventions to solve complex questions about (1) life cycle, (2) pathophysiology, (3) interventions, and (4) scaling intervention delivery. The Healthy Birth, Growth, and Development knowledge integration (HBGDki) Open Innovation platform was developed to mobilize the global “unusual suspects” data science community to better understand how to improve neurocognitive and physical health for children worldwide. The data science contests are aimed at developing predictive models and tools that quantify geographic, regional, cultural, socioeconomic, and nutritional trends that contribute to poor neurocognitive and physical growth outcomes in children. The tools and scripts developed by this challenge will support the data analysis efforts of the HBGDki Open Innovation initiative.

Description

The Gates Foundation wants to help its SAS programmers become more productive. In this challenge stream, we're going to develop an application that can dynamically read source data from SAS binary files and external data files, and generate SAS scripts to read and transform those files. Before we can process the input files and generate SAS scripts, however, we need to prepare some documents and validate that the metadata files and binary files are in sync with each other and that the metadata files accurately describe the state of the input data. This challenge will produce a command-line Java tool that meets the following requirements:

1. Shell File Creation

Background: SAS analysts focused on medical research often have to generate "code" files from relational records. For example, say we have a subject record with ID 101 and a family income of $1000 per year, living in a home with a dirt floor. This can be represented as a single row with the columns Subject ID, Family Income, and Floor Type. However, it could also be represented in key/value format, with two rows instead of one (a short sketch of this transposition follows the table):
+---------+------------------+----------+---------+-----------+--------+
| Subject | Variable         | Variable | Numeric | Character | Units  |
| ID      | Category         | Code     | Value   | Value     |        |
+---------+------------------+----------+---------+-----------+--------+
|     101 | Financial        | INCOME   |    1000 |           | $/year |
|     101 | Home Environment | FLOOR    |         | Dirt      |        |
+---------+------------------+----------+---------+-----------+--------+
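
To make the transposition concrete, here is a minimal, self-contained Java sketch. The class and its row layout are illustrative only (they are not part of the required tool); the column order mirrors the table above.

    import java.util.ArrayList;
    import java.util.List;

    public class KeyValueExample {
        public static void main(String[] args) {
            // Wide form: SubjectId=101, Family Income=$1000/year, Floor Type=Dirt.
            int subjectId = 101;

            // Key/value form: one row per variable, filling either the
            // numeric-value or the character-value column depending on type.
            List<String[]> rows = new ArrayList<>();
            rows.add(new String[] {String.valueOf(subjectId), "Financial", "INCOME", "1000", "", "$/year"});
            rows.add(new String[] {String.valueOf(subjectId), "Home Environment", "FLOOR", "", "Dirt", ""});

            for (String[] row : rows) {
                System.out.println(String.join(" | ", row));
            }
        }
    }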

In our metadata file (ex01_DDF.csv), there are two flags in Columns J and K:
IS_TARGET_FIELD_CODE_NAME
IS_TARGET_FIELD_CODE_VALUE
A ‘Y’ value in either of these columns indicates that the field is going to be used in key/value (or code/value) form. The first step in converting from one type of document to the other is establishing the codes that are going to be used as the code names. Attached is the output file that our application should create when the createCodeShellFile function is executed (SS_CodeShellFile.csv). This should be one of the command-line options for the app. The purpose of this process is to create a document that an analyst can easily populate with the appropriate codes. Here are the requirements for the file:
1. The naming convention for these files is the output filename prefix (e.g. ‘SS’) followed by the string "_CodeShellFile.csv".
2. Copy columns A through E from the DDF document for the rows flagged as ‘Y’ in the IS_TARGET_FIELD_CODE_NAME column.
3. Add a column (starting with Column F) for each Code Value field designated in the DDF file. The column header will be the string "CODE_" followed by the name of the output field (a sketch of this step follows the list).
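
To make the expected behavior concrete, here is a minimal sketch of how createCodeShellFile might be implemented. The column positions (A-E copied, J/K as flags) follow the layout described above, but the position of the output field name and the naive comma splitting are assumptions for illustration; a real implementation should use a proper CSV parser so quoted fields survive.

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.stream.Collectors;

    public class CodeShellFileWriter {

        // 0-based column indexes per the DDF layout described above.
        private static final int COL_A = 0, COL_E = 4;    // columns A..E are copied
        private static final int COL_IS_CODE_NAME = 9;    // column J
        private static final int COL_IS_CODE_VALUE = 10;  // column K
        private static final int COL_TARGET_FIELD = 1;    // assumed position of the output field name

        /** Writes <outputPrefix>_CodeShellFile.csv from the DDF metadata file. */
        public static void createCodeShellFile(Path ddfFile, String outputPrefix) throws IOException {
            // Simplification: split(",") breaks on quoted fields; use a CSV parser in practice.
            List<String[]> rows = Files.readAllLines(ddfFile, StandardCharsets.UTF_8).stream()
                    .map(line -> line.split(",", -1))
                    .collect(Collectors.toList());

            List<String> out = new ArrayList<>();

            // Header: columns A..E, then one CODE_<field> column per Code Value field.
            StringBuilder header = new StringBuilder(joinRange(rows.get(0), COL_A, COL_E));
            for (String[] row : rows.subList(1, rows.size())) {
                if (hasFlag(row, COL_IS_CODE_VALUE)) {
                    header.append(",CODE_").append(row[COL_TARGET_FIELD]);
                }
            }
            out.add(header.toString());

            // Body: copy columns A..E for every row flagged as a code-name field.
            for (String[] row : rows.subList(1, rows.size())) {
                if (hasFlag(row, COL_IS_CODE_NAME)) {
                    out.add(joinRange(row, COL_A, COL_E));
                }
            }

            Files.write(Paths.get(outputPrefix + "_CodeShellFile.csv"), out, StandardCharsets.UTF_8);
        }

        private static boolean hasFlag(String[] row, int index) {
            return row.length > index && "Y".equalsIgnoreCase(row[index].trim());
        }

        private static String joinRange(String[] row, int from, int toInclusive) {
            StringBuilder sb = new StringBuilder();
            for (int i = from; i <= toInclusive && i < row.length; i++) {
                if (i > from) sb.append(',');
                sb.append(row[i]);
            }
            return sb.toString();
        }
    }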

2. Validation Reporting

Background: We recently ran an F2F challenge which identified an open-source library (Parso) that can interrogate the SAS binary format. The existing code does some basic reporting, in log4j-style output, of the column names and types.
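
For reference, a minimal sketch of reading a sas7bdat file with Parso (package com.epam.parso) looks like the following; the calls shown (SasFileReaderImpl, getSasFileProperties, getColumns) are the library's published API, but the exact version should be checked against the jar bundled with the existing code.

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    import com.epam.parso.Column;
    import com.epam.parso.SasFileReader;
    import com.epam.parso.impl.SasFileReaderImpl;

    public class BinaryFileInspector {
        public static void main(String[] args) throws IOException {
            try (InputStream in = new FileInputStream(args[0])) {
                SasFileReader reader = new SasFileReaderImpl(in);

                // The record count comes from the file header, without scanning rows.
                System.out.println("Rows: " + reader.getSasFileProperties().getRowCount());

                // Parso reports each column's type as Number.class or String.class.
                for (Column c : reader.getColumns()) {
                    System.out.println(c.getName() + " : " + c.getType().getSimpleName());
                }
            }
        }
    }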

However, in this challenge we're going to produce a true validation report that is a bit more user friendly for non-Java programmers. The validate functionality should be one of the command-line options. (The three command-line functions in the final application will be 1. Create Shell File, 2. Validate, 3. Generate SAS Script; we're only tackling #1 and #2 in this challenge.) The validation report should include the following information (a sketch of the column checks appears after the list below):
1. The name and full path of each binary input file found.
2. The number of records in each binary file.
3. The name and full path of the metadata file.
4. A report of column names that are found in the metadata file but not found in the source binary file.
5. A report of column types which don't match between the metadata file and the source binary file.
6. The name and full path of each mapping file identified in the MAPPING_FILE column of the metadata file.
7. The number of records present in each mapping file.
8. A report for each mapping file validating the CSV format of the file. Please provide the line number and the first invalid record found in each file.
9. Please ensure that there are no open quotes in the Code Snippet column. For example, "Example 1 (missing its closing quote) rather than "Example 1".
10. The validation process should look for the shell files described in Section 1 above. If there are 'Y' values in the IS_TARGET_FIELD_CODE_NAME column of the DDF file, there should be a corresponding shell file for the target file which contains the appropriate code values.
11. The validation solution should also verify that the names of the code fields in the shell files match up with the target field names in the metadata document. The code fields in the shell document have a prefix of "CODE_" + the actual field name.
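
As a sketch of checks #4 and #5, the following compares metadata column names and types against what Parso reports from the binary file. The metadata representation here (a simple name-to-type map with "NUM"/"CHAR" values) is an assumption for illustration; the actual DDF layout in the challenge forums should drive the real parsing.

    import java.util.LinkedHashMap;
    import java.util.Map;

    import com.epam.parso.Column;
    import com.epam.parso.SasFileReader;

    public class ColumnValidator {

        /**
         * Reports metadata columns missing from the binary file (#4) and
         * type mismatches between metadata and binary (#5).
         * Metadata types are assumed to be "NUM" or "CHAR" in this sketch.
         */
        public static void validateColumns(Map<String, String> metadataColumns, SasFileReader reader) {
            // Index the binary file's columns by upper-cased name for lookup.
            Map<String, Column> binaryColumns = new LinkedHashMap<>();
            for (Column c : reader.getColumns()) {
                binaryColumns.put(c.getName().toUpperCase(), c);
            }

            for (Map.Entry<String, String> meta : metadataColumns.entrySet()) {
                Column binary = binaryColumns.get(meta.getKey().toUpperCase());
                if (binary == null) {
                    System.out.println("MISSING: " + meta.getKey()
                            + " is in the metadata file but not the binary file");
                    continue;
                }
                // Parso reports column types as Number.class or String.class.
                String binaryType = Number.class.equals(binary.getType()) ? "NUM" : "CHAR";
                if (!binaryType.equalsIgnoreCase(meta.getValue())) {
                    System.out.println("TYPE MISMATCH: " + meta.getKey()
                            + " metadata=" + meta.getValue() + " binary=" + binaryType);
                }
            }
        }
    }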

Additional Requirements:
1. For console output, it would be helpful to have a status update between each of the steps, especially during validation; for example, "Locating input files…" followed by a list of the files found, then "Counting records in the input files…" followed by the counts. This is especially helpful when the files are large and processing may take some time.
2. Please use the existing code as a starting point.
3. Please create a maven build script for the app.


Final Submission Guidelines

- A Java application that operates on the provided data and generates the required output.
- A Deployment Guide.
- The previous application code, data, and metadata files can be found in the Code Document forums attached to this challenge.
- SAS has a functional University Edition which may be helpful for learning about SAS scripting.

ELIGIBLE EVENTS: 2017 TopCoder(R) Open

REVIEW STYLE:

Final Review: Community Review Board
Approval: User Sign-Off

ID: 30055498