
Challenge Overview

Background

The Bill & Melinda Gates Foundation’s (BMGF) Healthy Birth, Growth, and Development (HBGD) program addresses the dual problem of growth faltering/stunting and poor neurocognitive development, including contributing factors such as fetal growth restriction and preterm birth. The HBGD program is creating a unified strategy for integrated interventions to solve complex questions about (1) life cycle, (2) pathophysiology, (3) interventions, and (4) scaling intervention delivery.  The Healthy Birth, Growth, and Development knowledge integration (HBGDki) Open Innovation platform was developed to mobilize the global “unusual suspects” data science community to better understand how to improve neurocognitive and physical health for children worldwide. The data science contests are aimed at developing predictive models and tools that quantify geographic, regional, cultural, socioeconomic, and nutritional trends that contribute to poor neurocognitive and physical growth outcomes in children.  The tools and scripts developed by this challenge will support the data analysis efforts of the HBGDki Open Innovation initiative.

Description
The Gates Foundation wants to develop capabilities that make its SAS programmers more productive. In this challenge stream, we are developing an application that can dynamically read source data from SAS binary files and external data files and generate SAS scripts to read and transform those files. In the previous challenge we developed a simple command line application that prepares some documents and validates that the metadata files and binary files are in sync with each other and that the metadata files accurately describe the state of the input data.
Now we're going to expand that application to actually generate SAS transformation scripts based on a data file in SAS binary format and a metadata file in CSV format.
We’re providing two sets of data with this challenge:
1. The Example 1 data set has the following input files:
    - ex01_subj.sas7bdat
    - ex01_visit.sas7bdat
The metadata file for the data set is ex01_DDF.csv.  The purpose of the Generate Script command described below in the requirements is to generate a Program.sas script.  The Program.sas file is provided as an example.  When executed, this script generates 4 binary files, as described in ex01_DDF.csv:
    - subj.sas7bdat
    - anthro.sas7bdat
    - parents.sas7bdat
    - ss.sas7bdat
2. The other data set provided is the NYPD_Collisions data set.  Here only the nypd_collisions.sas7bdat file is provided, along with nypd_collisions_ddf.csv.  Your application should generate the SAS script for these files, which will partition the data into summary.sas7bdat and accident_code.sas7bdat files.

It’s helpful to have the ex01_DDF file open while reading through the requirements listed below.

Requirements

1. The script will write data from the source data files (Column A - ORIGINAL_DATANAME) and source columns (Column B - ORIGINAL_VARNAME) to the output data files designated in Column F - TARGET_DATANAME, using the column names designated in Column G - TARGET_VARNAME.
2. The target data types should match those designated in the file (see the sketch below).
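For illustration only, a minimal sketch of the kind of step the generated script might emit for requirements 1 and 2.  The data set and variable names here (subjid, gender, country, SUBJIDC) are assumptions for the example and are not taken from the actual DDF; the provided Program.sas shows the real expected output:

data subj;
    /* Requirement 1: copy the mapped source columns into the target data set,
       renaming ORIGINAL_VARNAME to TARGET_VARNAME (names are illustrative). */
    set ex01_subj(keep=subjid gender country);
    rename gender=SEX country=COUNTRY;
    /* Requirement 2: coerce the type/length to match the DDF, e.g. a numeric
       id written out as a character value (purely illustrative).            */
    length SUBJIDC $10;
    SUBJIDC = strip(put(subjid, best.));
    drop subjid;
run;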
3. Column E declares the cardinality of the output relative to the input.  For example, in ex01, the keys show that there should be one row in each of the SUBJ, PARENTS, and SS files for each row in the ex01_subj.sas7bdat source file, and one row in the ANTHRO output for each subject id and age-in-days pair in the visit source file.
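One way the generated script could enforce that ANTHRO cardinality is a keyed sort that keeps a single row per key pair.  This is a sketch only; the key variable names subjid and agedays are assumptions, not taken from the DDF:

proc sort data=ex01_visit out=anthro nodupkey;
    by subjid agedays;   /* one ANTHRO row per (subject id, age in days) pair */
run;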
4. Column I designates a SASFMT specifier for certain fields.  Please incorporate this into the scripts.
5. Please include the labels described in the “TARGET_VARLABEL” column.
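Requirements 4 and 5 would typically surface in the generated script as FORMAT and LABEL statements, roughly like the sketch below.  The variable names, format, and label text are illustrative and not taken from ex01_DDF.csv:

data anthro;
    set anthro;
    format visitdt date9.;                  /* SASFMT specifier from Column I   */
    label agedays = "Age at visit (days)"   /* text from the TARGET_VARLABEL column */
          htcm    = "Height (cm)";
run;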
6. Column M lists mapping files -- CSV files that allow for simple value substitutions in the output files.  Your process should interrogate the provided mapping files to determine the field lengths required when processing the input files.  A mapping file may include a null value in the second row of the document.  For example, a gender mapping file could be read as follows:
data subj_gender;
    /* Read the gender mapping CSV, skipping the header row.  &gender_csv is
       assumed to resolve to the (quoted) path of the mapping file.          */
    length Original $15;
    infile &gender_csv dlm=',' firstobs=2 dsd;
    input Original $ MappedGender;
run;
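Once the mapping data set has been loaded as above, the generated script might apply the substitution with a simple left join.  This is only one possible approach, and the source column name gender (assumed here to be character, like Original) is an assumption for the example:

proc sql;
    /* Attach the mapped value to each source row where a mapping exists. */
    create table subj_mapped as
    select a.*, b.MappedGender
    from ex01_subj as a
    left join subj_gender as b
        on a.gender = b.Original;
quit;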

7. Column N designates SAS code snippets that should alter the output format or default the output values in some way.
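For illustration, the generator might splice a Column N snippet verbatim into the body of the target data step; the snippet and variable name below are purely illustrative and do not come from the DDF:

data parents;
    set ex01_subj;
    /* Column N snippet inserted verbatim by the generator, e.g. a snippet
       that defaults a missing value (variable name is illustrative):      */
    if missing(mage) then mage = .U;
run;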

8. The two columns -- IS_TARGET_FIELD_CODE_NAME and IS_TARGET_FIELD_CODE_VALUE -- require a little background.  SAS analysts focused on medical research often have to generate “code” files from relational records.  For example, let’s say we have a subject record with id 101 and a family income of $1000, who lives in a home with a dirt floor.  This can be represented as a single row with the columns SubjectId, Family Income, and Floor Type.  However, it could also be represented in key/value format, with 2 rows instead of one:

    SubjectId   Code     Value
    101         INCOME   1000
    101         FLOOR    DIRT

In our metadata file (ex01_DDF.csv), there are two flags in Columns J and K: 
IS_TARGET_FIELD_CODE_NAME
IS_TARGET_FIELD_CODE_VALUE
A ‘Y’ value in either of these columns indicates that the field is going to be used in key/value (or code/value) form.  The code values (‘INCOME’, ‘FLOOR’) in the data shown above are designated in external files called shell files.



The skeleton CodeShellFile files are created from the DDF file by executing the “-c” option of our command line application.  A user then fills out the created template file -- in this case columns F and G -- with the appropriate codes for each code-type field.  The example above shows what the CodeShellFiles will look like after they have been completed.  If our DDF file has a flag of ‘Y’ in the ‘IS_TARGET_FIELD_CODE_NAME’ column, the SAS script generation process should look for the Shell File, which follows the naming convention TARGET_DATANAME + “_CodeShellFile.csv”; in this case the CodeShellFile is called SS_CodeShellFile.csv.  The values for the fields flagged with IS_TARGET_FIELD_CODE_VALUE = ‘Y’ should come from the corresponding original data field mapped in the DDF file.  A sketch of the kind of output this implies is shown below.
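Putting requirement 8 together, the generated code for a code/value target like SS might look roughly like the following.  The code names would really come from SS_CodeShellFile.csv, and the source variable names (income, floortype) are assumptions for the sake of the example:

data ss;
    set ex01_subj;
    length SSCODE $20 SSVALUE $40;
    /* One output row per code: SSCODE comes from the CodeShellFile,
       SSVALUE from the original data field mapped in the DDF.       */
    SSCODE = 'INCOME'; SSVALUE = strip(put(income, best.)); output;
    SSCODE = 'FLOOR';  SSVALUE = strip(floortype);          output;
    keep subjid SSCODE SSVALUE;
run;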
9. Please update the main method of the SasTool class (lines 61-63) to reflect the latest usage and jar file name.
 

Final Submission Guidelines

1. Please use the existing code as a starting point:  https://github.com/topcoderinc/SAS-Data-Preparation
2. To access the repository listed above, please provide your GitHub Id in the Forums attached to this challenge; the Challenge Administrator will give you the necessary access.  Please fork the repository.  Winners will be asked to submit pull requests.
3. Please use the existing maven script for the app.
4. You can find the sample SAS output code and the nypd_collision data and metadata files in the Code Document forum attached to this challenge.

REVIEW STYLE:
Final Review: Community Review Board
Approval: User Sign-Off

ID: 30055483