Topcoder Challenge | Topcoder Community

Challenge Overview

Background

The Bill & Melinda Gates Foundation’s (BMGF) Healthy Birth, Growth, and Development (HBGD) program addresses the dual problem of growth faltering/stunting and poor neurocognitive development, including contributing factors such as fetal growth restriction and preterm birth. The HBGD program is creating a unified strategy for integrated interventions to solve complex questions about (1) life cycle, (2) pathophysiology, (3) interventions, and (4) scaling intervention delivery. The Healthy Birth, Growth, and Development knowledge integration (HBGDki) Open Innovation platform was developed to mobilize the global “unusual suspects” of the data science community to better understand how to improve neurocognitive and physical health for children worldwide. The data science contests are aimed at developing predictive models and tools that quantify geographic, regional, cultural, socioeconomic, and nutritional trends that contribute to poor neurocognitive and physical growth outcomes in children. The tools and scripts developed by this challenge will support the data analysis efforts of the HBGDki Open Innovation initiative.

Description

The Gates Foundation wants to make its SAS programmers more productive. In this challenge stream, we’re developing an application that dynamically reads source data from SAS binary files and generates SAS scripts to read and transform those files. In a previous challenge, we developed a basic SAS script generation tool that creates those scripts from metadata files in CSV format. The Java command-line application can also create template documents for analyst users to populate, and it can validate that the metadata files and binary files are in sync and that the metadata files accurately describe the state of the input data. Through command-line arguments, users of the app can create the shell files, generate a validation report, or output the required script.

We have a new Example 2 data set that we’re using to test our application. This data can be found in the Code Document forum of this challenge.

The Example 2 data set has the following input files:

- ex02_biochem.sas7bdat
- ex02_hematology.sas7bdat
- ex02_mother.sas7bdat
- ex02_subject.sas7bdat

The metadata file for the data set is ex02_DDF.csv. The purpose of the Generate Script command, described in the requirements below, is to generate a “.sas” script. When executed, the generated script should read the binary files listed above and produce the 4 binary files described in ex02_DDF.csv:

- lb.sas7bdat
- parents.sas7bdat
- preghx.sas7bdat
- subj.sas7bdat

Most of the requirements for the script generation functionality of the application can be found in the previous challenge specification. However, there have been several changes in the DDF file that you’ll need to handle:

1. We’ve added Column F - TARGET_SORT. This defines more clearly the target order expected from the records in the output file. Column E defines the key value and order of the input documents. Column F describes the order that is expected in the target files.

2. A few columns in the current document have different data types in the input and output files. The application will need to generate a conversion step that migrates this data from “char” to “num” type.

3. There is a merge process required between two of the files. The parents.sas7bdat dataset is going to be created by merging the mother and subject input files by the key variable(s) in those files. When two or more input datasets are mapped to the same target dataset, the target dataset is to be created by merging the input datasets by the key variables that are common to the input datasets, except as noted below in #4.

4. There is a concatenation process required between two of the files. (Concatenation is the term SAS uses for combining 2 or more datasets with the “set” statement so that the number of rows in the new dataset equals the sum of the rows in the input datasets; this is equivalent to the rbind function in R.) The lb.sas7bdat output is going to be a combination of the biochem and hematology input files. Each of these files is a code value file as well, so the combination makes logical sense: we’re combining codes and values from different sources into a consolidated code value file. The code value format is explained in detail in the previous challenge. The key to this file will be the subject id and the visits, followed by a series of codes from both files.

5. There are a number of mapping files (now found in column N) that perform value substitutions. The script generation process should dynamically identify these files and generate the code to perform the required value substitutions. Some of the substituted values are involved in the sorting of the output file.

6. Please create a command-line parameter that adds a new column/field and a corresponding value to every record in all output files produced by the generated script. The parameter should have two parts: “global field name” and “global field value”. For example, if the “global field name” = “StudyID” and the “global field value” = “Example 02”, then every row of each of our output files will have a new column called “StudyID” with a value of “Example 02”.
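
As a rough illustration of items 2, 3, 4, and 6 above, the following Java sketch shows one way a generator might emit the corresponding SAS statements. It is not taken from the existing repository: the class DataStepSketch, its method names, the key variable SUBJID, and the use of the best32. informat are all assumptions made for this example, and the real DDF layout may call for something different.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not part of the existing code base. */
final class DataStepSketch {

    /** How several input datasets are combined into one target dataset. */
    enum CombineMode { MERGE, CONCATENATE }

    /**
     * Builds a SAS DATA step that combines the inputs into the target.
     * MERGE emits "merge ...; by ...;" (SAS expects the inputs to be sorted by the keys).
     * CONCATENATE emits "set ...;" so the rows of the inputs are stacked.
     * Pass null for globalName or globalValue to skip the constant column.
     */
    static String dataStep(String target, List<String> inputs, List<String> keys,
                           CombineMode mode, String globalName, String globalValue) {
        StringBuilder sas = new StringBuilder();
        sas.append("data ").append(target).append(";\n");
        String verb = (mode == CombineMode.MERGE) ? "merge" : "set";
        sas.append("  ").append(verb).append(' ')
           .append(String.join(" ", inputs)).append(";\n");
        if (mode == CombineMode.MERGE) {
            sas.append("  by ").append(String.join(" ", keys)).append(";\n");
        }
        if (globalName != null && globalValue != null) {
            // Double embedded quotes so the SAS string literal stays valid.
            sas.append("  ").append(globalName).append(" = \"")
               .append(globalValue.replace("\"", "\"\"")).append("\";\n");
        }
        sas.append("run;\n");
        return sas.toString();
    }

    /**
     * Emits DATA step statements that convert a char variable to num while
     * keeping its name. The temporary name and the best32. informat are
     * assumptions; long variable names could exceed the SAS name length limit.
     */
    static String charToNum(String var) {
        String tmp = "_" + var + "_num";
        return "  " + tmp + " = input(" + var + ", best32.);\n"
             + "  drop " + var + ";\n"
             + "  rename " + tmp + " = " + var + ";\n";
    }

    public static void main(String[] args) {
        // Item 3 plus item 6: merge mother and subject, tagging every row.
        System.out.print(dataStep("parents",
                Arrays.asList("ex02_mother", "ex02_subject"),
                Arrays.asList("SUBJID"), // hypothetical key variable
                CombineMode.MERGE, "StudyID", "Example 02"));
    }
}
```

Run as written, the sketch prints a DATA step that merges ex02_mother and ex02_subject by SUBJID and assigns StudyID = "Example 02" on every row. Calling dataStep with CombineMode.CONCATENATE and the biochem and hematology datasets would produce the “set” form described in item 4. The sort order from TARGET_SORT and the mapping-file substitutions are left out here.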

Final Submission Guidelines

1. Please use the existing code as a starting point: https://github.com/topcoderinc/SAS-Data-Preparation
2. You should provide your GitHub ID in the forums attached to this challenge to access the repository listed above. Please fork the repository. The Challenge Administrator will give you the necessary access. Winners will be asked to submit pull requests.
3. Please use the existing Maven script for the app.
4. You can find the sample data and metadata files in the code document forum attached to this challenge.

SAS Data Preparation Tool - Data Merging Enhancements

ID: 30055485